---
name: web-scraping
description: Web scraping with anti-bot bypass, content extraction, undocumented APIs and poison pill detection. Use when extracting content from websites, handling paywalls, implementing scraping cascades or processing social media. Covers requests, trafilatura, Playwright with stealth mode, yt-dlp and instaloader patterns.
---

# Web scraping methodology

Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.

## Scraping cascade architecture

Implement multiple extraction strategies with automatic fallback:

```python
from abc import ABC, abstractmethod
from typing import Optional

import requests
import trafilatura
from bs4 import BeautifulSoup

# For .py files (sync API)
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

# For .ipynb files (async API; Jupyter runs its own event loop)
import asyncio
from playwright.async_api import async_playwright


class ScrapingResult:
    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method  # Track which method succeeded


class Scraper(ABC):
    @abstractmethod
    def fetch(self, url: str) -> Optional[ScrapingResult]: ...


class TrafilaturaScraper(Scraper):
    """Fast, lightweight extraction for standard articles."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            downloaded = trafilatura.fetch_url(url)
            if not downloaded:
                return None

            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=True,
                favor_recall=True
            )

            if not content or len(content) < 100:
                return None

            # Extract title separately
            soup = BeautifulSoup(downloaded, 'html.parser')
            title = soup.find('title')
            title_text = title.get_text() if title else ''

            return ScrapingResult(content, title_text, 'trafilatura')
        except Exception:
            return None


class RequestsScraper(Scraper):
    """HTTP requests with rotating user agents."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        import random

        headers = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove script/style elements
            for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
                element.decompose()

            # Find main content
            main = soup.find('main') or soup.find('article') or soup.find('body')
            content = main.get_text(separator='\n', strip=True) if main else ''

            title = soup.find('title')
            title_text = title.get_text() if title else ''

            if len(content) < 100:
                return None

            return ScrapingResult(content, title_text, 'requests')
        except Exception:
            return None


class PlaywrightScraper(Scraper):
    """Heavy JavaScript rendering with stealth mode for anti-bot bypass."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                context = browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = context.new_page()

                # Apply stealth to avoid detection
                stealth_sync(page)

                page.goto(url, wait_until='networkidle', timeout=60000)

                # Wait for content to load
                page.wait_for_timeout(2000)

                # Extract content
                content = page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')

                title = page.title()

                browser.close()

                if len(content) < 100:
                    return None

                return ScrapingResult(content, title, 'playwright')
        except Exception:
            return None


class PlaywrightScraperAsync:
    """Async Playwright scraper for Jupyter notebooks (.ipynb files).

    Jupyter notebooks run their own event loop, so sync Playwright won't work.
    Use this async version with `await` in notebook cells.
    """

    async def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            async with async_playwright() as p:
                browser = await p.chromium.launch(headless=True)
                context = await browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = await context.new_page()

                # Note: playwright-stealth async version
                # from playwright_stealth import stealth_async
                # await stealth_async(page)

                await page.goto(url, wait_until='networkidle', timeout=60000)

                # Wait for content to load
                await page.wait_for_timeout(2000)

                # Extract content
                content = await page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')

                title = await page.title()

                await browser.close()

                if len(content) < 100:
                    return None

                return ScrapingResult(content, title, 'playwright_async')
        except Exception:
            return None


# Usage in Jupyter notebook cells:
# scraper = PlaywrightScraperAsync()
# result = await scraper.fetch('https://example.com')


class ScrapingCascade:
    """Try multiple scrapers in order until one succeeds."""

    def __init__(self):
        self.scrapers = [
            TrafilaturaScraper(),
            RequestsScraper(),
            PlaywrightScraper(),
        ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        for scraper in self.scrapers:
            result = scraper.fetch(url)
            if result:
                return result
        return None
```
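A minimal usage sketch (the URL is a placeholder): the cascade tries the cheapest extractor first, and `result.method` records which rung succeeded.

```python
# Sketch only; the URL is a placeholder.
cascade = ScrapingCascade()
result = cascade.fetch('https://example.com/article')

if result:
    print(f'{result.method}: {result.title!r}, {len(result.content)} chars')
else:
    print('All scrapers failed; escalate or flag for manual review.')
```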
## Anti-bot landscape (as of 2026-05)

The cascade above (`trafilatura` → `requests` → Playwright + `playwright-stealth`) handles plain HTML and lightly protected JS sites. Modern anti-bot stacks (Cloudflare Bot Management / Turnstile, DataDome, Akamai Bot Manager, PerimeterX) layer multiple detection signals: TLS / HTTP-2 fingerprints, browser fingerprints, JS-execution proofs, residential-IP reputation, session behavior. No single tool defeats all of them.

`playwright-stealth` (2.0+, current) patches obvious detection vectors — `navigator.webdriver`, `chrome.runtime`, plugin enumeration, language settings, WebGL fingerprints. Treat it as the floor, not the ceiling. If a target fingerprints TLS or runs Turnstile, stealth alone won't pass.

| Tool | Layer it addresses | Notes |
|---|---|---|
| `curl_cffi` | TLS / HTTP-2 fingerprint | Drop-in replacement for `requests` that mimics Chrome/Safari/Edge JA3+ALPN. Can't run JS — pair with a parsed-HTML extractor when JS isn't required. |
| `playwright-stealth` 2.x | JS-runtime fingerprint | The starting line for Playwright/Chromium. Updates lag the bot stacks; expect to combine with rotation. |
| Camoufox | JS + browser fingerprint at C++ level | Firefox-based stealth browser. Spoofs fingerprint values low enough that JS-side checks can't see through them. Use when Chromium-based stealth is detected. |
| SeleniumBase UC Mode | Turnstile + browser fingerprint | The closest thing to a one-shot Turnstile solver in 2026, but heavier than playwright-stealth. |
| Residential proxy pool | IP reputation | Datacenter IPs (DigitalOcean, AWS) get challenged on first request. Residential pools cost more but bypass the cheapest layer of defense. |

**Use the lightest tool that works.** Targets without aggressive defense don't need Camoufox or proxy pools — `curl_cffi` plus a sleep is usually enough. Reserve heavier tools for sites that explicitly serve a Turnstile challenge or DataDome interstitial.
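A sketch of that lightest-tool path using `curl_cffi` (the URLs and delay are placeholders; `impersonate='chrome'` targets the newest Chrome fingerprint the installed version supports):

```python
import time
from curl_cffi import requests as cffi_requests

urls = ['https://example.com/a', 'https://example.com/b']  # placeholders

for url in urls:
    # Mimic a current Chrome TLS/HTTP-2 fingerprint; no JS is executed.
    resp = cffi_requests.get(url, impersonate='chrome', timeout=30)
    if resp.status_code == 200:
        html = resp.text  # hand off to trafilatura/BeautifulSoup for extraction
    time.sleep(1.5)  # polite placeholder delay; tune per target
```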
## Undocumented APIs

### Finding undocumented APIs

Use browser developer tools to discover APIs:

1. **Open developer tools** (right-click → Inspect, or F12)
2. **Go to the Network tab** to monitor all requests
3. **Filter by Fetch/XHR** to show only API calls
4. **Trigger the action** you want to capture (search, scroll, click)
5. **Analyze the response** — usually JSON with key-value pairs
6. **Copy as cURL** (right-click the request)
7. **Convert to code** using [curlconverter.com](https://curlconverter.com/)

### Stripping down API requests

When you copy a request as cURL from dev tools, it includes many parameters. Strip it down:

1. **Remove unnecessary cookies** — test without them first
2. **Keep authentication tokens** if required
3. **Identify the input parameters** you can modify (like `prefix` for search terms)
4. **Test parameter values** — some expire, so verify them periodically

### Example: Reverse-engineering an autocomplete API

```python
import requests
import time

def search_suggestions(keyword: str) -> dict:
    """
    Get autocomplete search suggestions from an undocumented API.
    Stripped down from a browser dev tools capture.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    params = {
        'prefix': keyword,
        'suggestion-type': ['WIDGET', 'KEYWORD'],
        'alias': 'aps',
        'plain-mid': '1',
    }

    response = requests.get(
        'https://completion.amazon.com/api/2017/suggestions',
        params=params,
        headers=headers
    )
    return response.json()

# Collect suggestions for multiple keywords
keywords = ['a', 'b', 'cookie', 'sock']
data = []

for keyword in keywords:
    result = search_suggestions(keyword)
    for suggestion in result.get('suggestions', []):
        suggestion['search_word'] = keyword  # track the seed keyword
        data.append(suggestion)
    time.sleep(1)  # rate limit yourself
```
*Source: [Leon Yin, "Finding Undocumented APIs," Inspect Element](https://inspectelement.org/apis.html), 2023*

## Poison pill detection

Detect paywalls, anti-bot pages, and other failures:

```python
import re
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse

class PoisonPillType(Enum):
    PAYWALL = 'paywall'
    CAPTCHA = 'captcha'
    RATE_LIMIT = 'rate_limit'
    CLOUDFLARE = 'cloudflare'
    LOGIN_REQUIRED = 'login_required'
    NOT_FOUND = 'not_found'
    NONE = 'none'

@dataclass
class PoisonPillResult:
    detected: bool
    type: PoisonPillType
    confidence: float
    details: str

class PoisonPillDetector:
    PATTERNS = {
        PoisonPillType.PAYWALL: [
            r'subscribe to continue',
            r'subscription required',
            r'become a member',
            r'sign up to read',
            r'you\'ve reached your limit',
            r'article limit reached',
        ],
        PoisonPillType.CAPTCHA: [
            r'verify you are human',
            r'captcha',
            r'robot verification',
            r'prove you\'re not a robot',
        ],
        PoisonPillType.RATE_LIMIT: [
            r'too many requests',
            r'rate limit exceeded',
            r'slow down',
            r'429',
        ],
        PoisonPillType.CLOUDFLARE: [
            r'checking your browser',
            r'cloudflare',
            r'ddos protection',
            r'please wait while we verify',
        ],
        PoisonPillType.LOGIN_REQUIRED: [
            r'sign in to continue',
            r'log in required',
            r'create an account',
        ],
    }

    PAYWALL_DOMAINS = {
        'nytimes.com': PoisonPillType.PAYWALL,
        'wsj.com': PoisonPillType.PAYWALL,
        'washingtonpost.com': PoisonPillType.PAYWALL,
        'ft.com': PoisonPillType.PAYWALL,
        'bloomberg.com': PoisonPillType.PAYWALL,
    }

    def detect(self, url: str, content: str, status_code: int = 200) -> PoisonPillResult:
        # Check status code
        if status_code == 429:
            return PoisonPillResult(True, PoisonPillType.RATE_LIMIT, 1.0, 'HTTP 429')
        if status_code == 403:
            return PoisonPillResult(True, PoisonPillType.CLOUDFLARE, 0.8, 'HTTP 403')
        if status_code == 404:
            return PoisonPillResult(True, PoisonPillType.NOT_FOUND, 1.0, 'HTTP 404')

        # Check known paywall domains
        domain = urlparse(url).netloc.replace('www.', '')
        for paywall_domain, pill_type in self.PAYWALL_DOMAINS.items():
            if paywall_domain in domain:
                # Check if content is suspiciously short (paywall truncation)
                if len(content) < 500:
                    return PoisonPillResult(True, pill_type, 0.9, f'Short content from {domain}')

        # Pattern matching
        content_lower = content.lower()
        for pill_type, patterns in self.PATTERNS.items():
            for pattern in patterns:
                if re.search(pattern, content_lower):
                    return PoisonPillResult(True, pill_type, 0.7, f'Pattern match: {pattern}')

        return PoisonPillResult(False, PoisonPillType.NONE, 0.0, '')
```
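A sketch (not part of the original classes) of wiring the detector into the cascade: a fetch can succeed at the HTTP level and still return unusable content.

```python
# Sketch only; the URL is a placeholder.
cascade = ScrapingCascade()
detector = PoisonPillDetector()

url = 'https://example.com/article'
result = cascade.fetch(url)

if result:
    pill = detector.detect(url, result.content)
    if pill.detected:
        # The page loaded, but it is a paywall/captcha/challenge shell.
        print(f'{pill.type.value} (confidence {pill.confidence:.0%}): {pill.details}')
    else:
        print(f'Clean extraction via {result.method}')
```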
## Social media scraping

### YouTube with yt-dlp

```python
import yt_dlp
from pathlib import Path

def download_video_metadata(url: str) -> dict:
    """Extract metadata without downloading video."""
    ydl_opts = {
        'skip_download': True,
        'quiet': True,
        'no_warnings': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            'title': info.get('title'),
            'description': info.get('description'),
            'duration': info.get('duration'),
            'upload_date': info.get('upload_date'),
            'view_count': info.get('view_count'),
            'channel': info.get('channel'),
            'thumbnail': info.get('thumbnail'),
        }

def download_video(url: str, output_dir: Path, audio_only: bool = False) -> Path:
    """Download video or audio."""
    output_template = str(output_dir / '%(title)s.%(ext)s')

    ydl_opts = {
        'outtmpl': output_template,
        'quiet': True,
    }

    if audio_only:
        ydl_opts['format'] = 'bestaudio/best'
        ydl_opts['postprocessors'] = [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
        }]

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        filename = ydl.prepare_filename(info)
        if audio_only:
            filename = filename.rsplit('.', 1)[0] + '.mp3'
        return Path(filename)

def get_transcript(url: str) -> list[dict]:
    """Extract auto-generated or manual subtitles."""
    ydl_opts = {
        'skip_download': True,
        'writesubtitles': True,
        'writeautomaticsub': True,
        'subtitleslangs': ['en'],
        'quiet': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)

    # Check for subtitles
    subtitles = info.get('subtitles', {})
    auto_captions = info.get('automatic_captions', {})

    # Prefer manual subtitles over auto-generated
    subs = subtitles.get('en') or auto_captions.get('en')
    if not subs:
        return []

    # Get the vtt or json3 format
    for sub in subs:
        if sub['ext'] in ['vtt', 'json3']:
            # Download and parse subtitle file
            # ... implementation depends on format
            pass

    return []
```
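The subtitle parsing above is left as a stub. Below is a minimal sketch for the `json3` branch, assuming each yt-dlp subtitle entry carries `ext` and `url` keys and YouTube's json3 layout (`events` → `segs` → `utf8`); both hold in current yt-dlp but are not a stable contract.

```python
import requests

def parse_json3_subtitles(subs: list[dict]) -> list[dict]:
    """Flatten a json3 subtitle track into timed text segments (sketch)."""
    for sub in subs:
        if sub.get('ext') == 'json3':
            data = requests.get(sub['url'], timeout=30).json()
            return [
                {
                    'start_ms': event.get('tStartMs'),
                    'duration_ms': event.get('dDurationMs'),
                    'text': ''.join(seg.get('utf8', '') for seg in event.get('segs', [])),
                }
                for event in data.get('events', [])
                if event.get('segs')
            ]
    return []
```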
### Instagram with instaloader

```python
import instaloader
from pathlib import Path

class InstagramScraper:
    def __init__(self, username: str | None = None, session_file: str | None = None):
        self.loader = instaloader.Instaloader(
            download_videos=True,
            download_video_thumbnails=False,
            download_geotags=False,
            download_comments=False,
            save_metadata=True,
            compress_json=False,
        )

        if session_file and Path(session_file).exists():
            self.loader.load_session_from_file(username, session_file)

    def get_profile_posts(self, username: str, limit: int = 50) -> list[dict]:
        """Get recent posts from a profile."""
        profile = instaloader.Profile.from_username(self.loader.context, username)
        posts = []

        for i, post in enumerate(profile.get_posts()):
            if i >= limit:
                break

            posts.append({
                'shortcode': post.shortcode,
                'url': f'https://instagram.com/p/{post.shortcode}/',
                'caption': post.caption,
                'timestamp': post.date_utc.isoformat(),
                'likes': post.likes,
                'comments': post.comments,
                'is_video': post.is_video,
                'video_url': post.video_url if post.is_video else None,
            })

        return posts

    def download_post(self, shortcode: str, output_dir: Path):
        """Download a single post's media."""
        post = instaloader.Post.from_shortcode(self.loader.context, shortcode)
        self.loader.download_post(post, target=str(output_dir))
```

### TikTok with yt-dlp

```python
import yt_dlp
from pathlib import Path

def scrape_tiktok_profile(username: str, output_dir: Path, limit: int = 50) -> list[dict]:
    """Scrape TikTok profile videos."""
    profile_url = f'https://tiktok.com/@{username}'

    ydl_opts = {
        'quiet': True,
        'extract_flat': True,  # Don't download, just get info
        'playlistend': limit,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(profile_url, download=False)
        videos = []

        for entry in info.get('entries', []):
            videos.append({
                'id': entry.get('id'),
                'title': entry.get('title'),
                'url': entry.get('url'),
                'timestamp': entry.get('timestamp'),
                'view_count': entry.get('view_count'),
            })

        return videos

def download_tiktok_video(url: str, output_dir: Path) -> Path:
    """Download a single TikTok video."""
    ydl_opts = {
        'outtmpl': str(output_dir / '%(id)s.%(ext)s'),
        'quiet': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return Path(ydl.prepare_filename(info))
```

## Request patterns

### Rotating user agents and headers

```python
import time

import requests
from fake_useragent import UserAgent

class RequestManager:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()

    def get_headers(self) -> dict:
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }

    def fetch(self, url: str, retry_count: int = 3) -> requests.Response:
        for attempt in range(retry_count):
            try:
                response = self.session.get(
                    url,
                    headers=self.get_headers(),
                    timeout=30
                )
                response.raise_for_status()
                return response
            except requests.RequestException:
                if attempt == retry_count - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff
```

### Respectful scraping with delays

```python
import time
import random
from urllib.parse import urlparse

class PoliteRequester:
    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request_per_domain = {}

    def wait_for_domain(self, url: str):
        domain = urlparse(url).netloc
        last_request = self.last_request_per_domain.get(domain, 0)

        elapsed = time.time() - last_request
        delay = random.uniform(self.min_delay, self.max_delay)

        if elapsed < delay:
            time.sleep(delay - elapsed)

        self.last_request_per_domain[domain] = time.time()
```
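The two classes compose naturally. A short sketch (placeholder URLs) that throttles per domain before each retried fetch:

```python
manager = RequestManager()
polite = PoliteRequester(min_delay=1.0, max_delay=3.0)

for url in ['https://example.com/a', 'https://example.com/b']:  # placeholders
    polite.wait_for_domain(url)    # jittered per-domain delay
    response = manager.fetch(url)  # rotating UA + exponential backoff
```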
## Ethics, robots.txt, and the legal landscape

Scraping is technically simple, ethically nuanced, and legally a moving target. The current state in the US (2026):

**Computer Fraud and Abuse Act (CFAA).** *Van Buren v. United States* (2021) and *hiQ Labs v. LinkedIn* (2022) narrowed the CFAA so that scraping public, non-credentialed pages does NOT constitute "unauthorized access." Logging in (or using credentials), bypassing technical access controls, or scraping after an explicit cease-and-desist letter remains legally fraught. State equivalents (e.g., California's CDAFA) sometimes go further than federal law.

**Terms of service.** Many sites' ToS forbid scraping. ToS is a contract, not a criminal statute — breach exposes you to civil claims (breach of contract, tortious interference, trespass to chattels in some jurisdictions), not jail. The risk profile differs sharply from the CFAA's.

**robots.txt** is a polite request, not a legal mandate. Ignoring it doesn't make you criminally liable, but courts have cited it as evidence of intent. For journalism in the public interest, that intent can be defensible; for commercial use, it's harder.

**EU GDPR / UK DPA.** If your scraping pulls personal data of EU/UK residents, GDPR/DPA apply regardless of where you run the scraper. Public availability does NOT exempt personal data from these regimes — *Lloyd v. Google* (UK Supreme Court 2021) and the CJEU's *Schrems II* lineage make scraping personal data without a lawful basis a real liability.

**Practical baseline:**
- Always read `robots.txt`. Honor crawl delays. Honor `Disallow:` (see the sketch after this list).
- Respect rate limits; add jitter; back off on `429`.
- Don't scrape behind authentication unless you have explicit permission.
- Don't scrape personal data (names, emails, photos) without a lawful basis.
- Identify yourself with a descriptive User-Agent and a contact URL when crawling at volume.
- Cache aggressively to avoid redundant requests.
- Stop if you receive a cease-and-desist or explicit blocking signal — escalating past one is the move that turns a civil dispute into a CFAA case.
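The robots.txt check is already covered by the standard library. A sketch using `urllib.robotparser` (the User-Agent string is a placeholder):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def robots_check(url: str, user_agent: str = 'ExampleNewsBot/1.0 (+https://example.com/bot)'):
    """Return (allowed, crawl_delay) for this URL and user agent."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f'{parts.scheme}://{parts.netloc}/robots.txt')
    rp.read()  # fetch and parse robots.txt
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)
```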
**Notes on specific platforms.** Instagram's `instaloader` and TikTok scraping via `yt-dlp` work today but break frequently — Meta and TikTok roll out anti-bot updates monthly. Bans on the account whose credentials you use are common. For journalism, the official APIs (Meta Content Library, TikTok Research API) are slower but more durable.