Web Content Fetcher: Web Page Main-Content Extraction

Extract clean Markdown content from any URL, using the Scrapling script as the primary method and Jina Reader (called through web_fetch) as the fallback.

Publisher: shirenchuang (GitHub)

Files

- Skill: n/a
- Size: 55.4 KB
- Entrypoint: SKILL.md
- Format: git-repo

SKILL.md

---
name: web-content-fetcher
description: >
  Extract article content from any URL as clean Markdown.
  Uses Scrapling script as primary method (with auto fast→stealth fallback),
  Jina Reader as alternative for simple pages.
  Preserves headings, links, images, lists, and code blocks.
  Use this skill whenever the user wants to fetch, read, extract, scrape, or summarize
  content from a URL, including blog posts, news articles, WeChat articles (微信公众号),
  documentation pages, or any web page. Also trigger when the user says things like
  "帮我读一下这篇文章", "抓取这个网页", "提取正文", or "read this page for me".
---

# Web Content Fetcher

Given a URL, return its main content as clean Markdown, with headings, links, images, lists, and code blocks all preserved.

## Extraction Strategy

Always try **one method per URL**; don't cascade blindly. Pick the right one upfront.

```
URL
 │
 ├─ 1. Scrapling script (preferred)
 │     Run fetch.py; check the domain routing table to decide fast vs --stealth.
 │     Works for most sites. Returns clean Markdown directly.
 │
 └─ 2. Jina Reader (fallback, only if Scrapling fails or dependencies are not installed)
       web_fetch("https://r.jina.ai/<url>")
       Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
       Does NOT work for: WeChat (403), some Chinese platforms.
```
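
The Jina Reader fallback amounts to prefixing the target URL with the reader endpoint shown in the diagram. A minimal sketch, assuming the URL is passed through unmodified; `jina_reader_url` is a hypothetical helper name, not part of the skill:

```python
JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(url: str) -> str:
    """Build the Jina Reader URL for a page by prefixing the reader endpoint."""
    # Assumption: only absolute http(s) URLs are meaningful here.
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"expected an absolute http(s) URL, got: {url!r}")
    return JINA_READER_PREFIX + url


print(jina_reader_url("https://example.com/post"))
```

The resulting string is what would be handed to `web_fetch`.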

### Scrapling script

```bash
python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth] [--json]
```

`<SKILL_DIR>` is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:
- **Default (fast):** HTTP fetch, ~1-3s, works for most sites
- **`--stealth`:** Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without `--stealth`, the script automatically falls back to stealth if the fast result has too little content, so you rarely need to pass `--stealth` manually. The only reason to force it is when you already know the site needs it (see the routing table), which skips the initial fast attempt.
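
The fast-then-stealth behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the actual logic in `fetch.py`: the threshold `MIN_CONTENT_CHARS`, the function `fetch_with_fallback`, and the stand-in fetchers are all hypothetical, and the source does not say how little content counts as "too little".

```python
from typing import Callable

# Hypothetical threshold; the real script's cutoff is not documented here.
MIN_CONTENT_CHARS = 200


def fetch_with_fallback(
    url: str,
    fast_fetch: Callable[[str], str],
    stealth_fetch: Callable[[str], str],
    force_stealth: bool = False,
    min_chars: int = MIN_CONTENT_CHARS,
) -> tuple[str, str]:
    """Return (mode, markdown), escalating to stealth only when fast output is thin."""
    if force_stealth:
        # Known JS-heavy site: skip the fast attempt entirely.
        return "stealth", stealth_fetch(url)
    markdown = fast_fetch(url)
    if len(markdown.strip()) >= min_chars:
        return "fast", markdown
    # The fast result had too little content, so use the headless browser.
    return "stealth", stealth_fetch(url)


# Stand-in fetchers so the sketch runs without network access.
def thin_fast(url: str) -> str:
    return "Loading..."


def full_stealth(url: str) -> str:
    return "# Title\n\n" + "Body text. " * 50


mode, text = fetch_with_fallback("https://example.com", thin_fast, full_stealth)
print(mode, len(text))
```

Forcing stealth saves one round trip only when the fast attempt is known to fail, which is why the default path tries fast first.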

## Domain Routing

Use this table to pick the right mode on the first call:

| Domain | Command | Why |
|--------|---------|-----|
| `mp.weixin.qq.com` | `fetch.py <url> --stealth` | JS-rendered content |
| `zhuanlan.zhihu.com` | `fetch.py <url> --stealth` | Anti-scraping + JS |
| `juejin.cn` | `fetch.py <url> --stealth` | JS-rendered SPA |
| `sspai.com` | `fetch.py <url>` | Static HTML |
| `blog.csdn.net` | `fetch.py <url>` | Static HTML |
| `ruanyifeng.com` | `fetch.py <url>` | Static blog |
| `openai.com` | `fetch.py <url>` | Static HTML |
| `blog.google` | `fetch.py <url>` | Static HTML |
| Everything else | `fetch.py <url>` | Auto-fallback handles it |
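
The routing table reduces to a hostname lookup: three domains get `--stealth`, everything else runs in the default mode. A minimal sketch; `needs_stealth` is a hypothetical helper, and treating subdomains of a listed domain the same way is an assumption the table does not state:

```python
from urllib.parse import urlparse

# Domains the routing table marks as needing --stealth on the first call.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True when the URL's host is routed to --stealth."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: subdomains of a listed domain are routed the same way.
    return any(host == d or host.endswith("." + d) for d in STEALTH_DOMAINS)


for sample in ("https://mp.weixin.qq.com/s/xxx", "https://sspai.com/post/73145"):
    print(sample, "--stealth" if needs_stealth(sample) else "(default)")
```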

## Script Options

```bash
# Basic: auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
```
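
When the script is driven from Python rather than a shell, the same invocations can be assembled as an argument list. This is a sketch only: `build_fetch_command` is a hypothetical helper, and it assumes the options may be combined in the order shown in the usage line, which the examples above do not demonstrate.

```python
import os
from typing import Optional


def build_fetch_command(
    skill_dir: str,
    url: str,
    max_chars: Optional[int] = None,
    stealth: bool = False,
    json_output: bool = False,
) -> list[str]:
    """Assemble the argv list for fetch.py: url, then max_chars, then flags."""
    cmd = ["python3", os.path.join(skill_dir, "scripts", "fetch.py"), url]
    if max_chars is not None:
        cmd.append(str(max_chars))
    if stealth:
        cmd.append("--stealth")
    if json_output:
        cmd.append("--json")
    return cmd


# "/path/to/skill" is a placeholder for the resolved <SKILL_DIR>.
print(build_fetch_command("/path/to/skill", "https://example.com/article", max_chars=15000))
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting problems with URLs that contain `&` or `?`.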

## Install Dependencies

Needed on first use only. The script checks for missing dependencies and reports them:

```bash
pip install scrapling html2text
```

On a system-managed Python (macOS/Linux), add `--break-system-packages` or use a venv.
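
A dependency check of this kind can be done without importing the packages. The sketch below is an assumption about how such a check might look, not the code in `fetch.py`; `missing_modules` is a hypothetical name, and it relies on the import names matching the pip package names (`scrapling`, `html2text`).

```python
import importlib.util

# Assumption: import names match the pip package names.
REQUIRED_MODULES = ("scrapling", "html2text")


def missing_modules(modules=REQUIRED_MODULES) -> list[str]:
    """Return the names of modules that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing dependencies. Install with: pip install " + " ".join(missing))
else:
    print("All dependencies available.")
```

If anything is reported missing, the Jina Reader fallback remains available without installing anything.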

## Failure Rules

- Same URL fails once → give up, tell the user "unable to extract content from this URL"
- Do not retry: each failed call wastes context tokens
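
The fail-once rule can be enforced by remembering which URLs have already failed. A minimal sketch under stated assumptions: `fetch_once`, the `failed` set, and the stand-in fetcher are hypothetical, and treating empty output as a failure is an assumption.

```python
from typing import Callable

FAILURE_MESSAGE = "unable to extract content from this URL"


def fetch_once(url: str, fetcher: Callable[[str], str], failed: set[str]) -> str:
    """Call the fetcher at most once per failing URL; never retry a failure."""
    if url in failed:
        # Already failed once: do not spend another call on it.
        return FAILURE_MESSAGE
    try:
        content = fetcher(url)
    except Exception:
        content = ""
    if not content.strip():
        # Assumption: empty output counts as a failure.
        failed.add(url)
        return FAILURE_MESSAGE
    return content


calls: list[str] = []


def always_fails(url: str) -> str:
    # Stand-in fetcher that records each call and then raises.
    calls.append(url)
    raise RuntimeError("blocked")


failed_urls: set[str] = set()
first = fetch_once("https://example.com/a", always_fails, failed_urls)
second = fetch_once("https://example.com/a", always_fails, failed_urls)
print(first, len(calls))
```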
