Extract clean Markdown content from any URL using a three-tier strategy: Jina Reader, Scrapling, or web_fetch.
SKILL.md
---
name: web-content-fetcher
description: >
  Extract article content from any URL as clean Markdown.
  Uses Scrapling script as primary method (with auto fast→stealth fallback),
  Jina Reader as alternative for simple pages.
  Preserves headings, links, images, lists, and code blocks.
  Use this skill whenever the user wants to fetch, read, extract, scrape, or summarize
  content from a URL, including blog posts, news articles, WeChat articles (微信公众号),
  documentation pages, or any web page. Also trigger when the user says things like
  "帮我读一下这篇文章" (read this article for me), "抓取这个网页" (scrape this web page),
  "提取正文" (extract the main text), or "read this page for me".
---

# Web Content Fetcher

Given a URL, return its main content as clean Markdown, with headings, links, images, lists, and code blocks all preserved.

## Extraction Strategy

Always try **one method per URL**; don't cascade blindly. Pick the right one upfront.

```
URL
│
├─ 1. Scrapling script (preferred)
│     Run fetch.py; check the domain routing table to decide fast vs --stealth.
│     Works for most sites. Returns clean Markdown directly.
│
└─ 2. Jina Reader (fallback: only if Scrapling fails or dependencies are not installed)
      web_fetch("https://r.jina.ai/<url>")
      Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
      Does NOT work for: WeChat (403), some Chinese platforms.
```

### Scrapling script

```bash
python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth] [--json]
```

`<SKILL_DIR>` is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:
- **Default (fast):** HTTP fetch, ~1-3s, works for most sites
- **`--stealth`:** Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without `--stealth`, the script automatically falls back to stealth if the fast result has too little content. So you rarely need to specify `--stealth` manually. The only reason to force it is when you already know the site needs it (see the routing table), which skips the initial fast attempt.

## Domain Routing

Use this table to pick the right mode on the first call:

| Domain | Command | Why |
|--------|---------|-----|
| `mp.weixin.qq.com` | `fetch.py <url> --stealth` | JS-rendered content |
| `zhuanlan.zhihu.com` | `fetch.py <url> --stealth` | Anti-scraping + JS |
| `juejin.cn` | `fetch.py <url> --stealth` | JS-rendered SPA |
| `sspai.com` | `fetch.py <url>` | Static HTML |
| `blog.csdn.net` | `fetch.py <url>` | Static HTML |
| `ruanyifeng.com` | `fetch.py <url>` | Static blog |
| `openai.com` | `fetch.py <url>` | Static HTML |
| `blog.google` | `fetch.py <url>` | Static HTML |
| Everything else | `fetch.py <url>` | Auto-fallback handles it |

## Script Options

```bash
# Basic: auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
```

## Install Dependencies

First use only. The script checks for missing dependencies and tells you if anything is missing:

```bash
pip install scrapling html2text
```

On a system-managed Python (macOS/Linux), add `--break-system-packages` or use a venv.

## Failure Rules

- Same URL fails once → give up, tell the user "unable to extract content from this URL"
- Do not retry: each failed call wastes context tokens
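
## Routing Sketch

The domain routing table and the Jina Reader fallback URL above can be sketched as a few small helpers. This is only an illustration, not part of `fetch.py`: the names `needs_stealth`, `build_fetch_command`, and `jina_reader_url` are hypothetical, and exact hostname matching (no subdomain handling) is an assumption. The argument order follows the usage line shown earlier.

```python
from typing import List, Optional
from urllib.parse import urlsplit

# Domains the routing table marks as needing --stealth on the first call.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True when the routing table says to force --stealth.

    Assumption: hostnames are compared exactly, so subdomains of a
    listed domain are treated as "everything else".
    """
    host = (urlsplit(url).hostname or "").lower()
    return host in STEALTH_DOMAINS


def build_fetch_command(
    url: str, skill_dir: str, max_chars: Optional[int] = None
) -> List[str]:
    """Build the argv list for fetch.py (hypothetical helper)."""
    cmd = ["python3", f"{skill_dir}/scripts/fetch.py", url]
    if max_chars is not None:
        cmd.append(str(max_chars))
    if needs_stealth(url):
        cmd.append("--stealth")
    return cmd


def jina_reader_url(url: str) -> str:
    """Return the Jina Reader URL used as the fallback method."""
    return f"https://r.jina.ai/{url}"
```

The returned list can be passed to `subprocess.run` as-is. Per the failure rules, a caller would run it once and report failure rather than retrying the same URL.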