---
name: web-scraping
description: Web scraping with anti-bot bypass, content extraction, undocumented APIs and poison pill detection. Use when extracting content from websites, handling paywalls, implementing scraping cascades or processing social media. Covers requests, trafilatura, Playwright with stealth mode, yt-dlp and instaloader patterns.
---

# Web scraping methodology

Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.

## Scraping cascade architecture

Implement multiple extraction strategies with automatic fallback:

```python
from abc import ABC, abstractmethod
from typing import Optional

import requests
import trafilatura
from bs4 import BeautifulSoup

# For .py files (sync API)
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

# For .ipynb files (async API; Jupyter runs its own event loop)
import asyncio
from playwright.async_api import async_playwright


class ScrapingResult:
    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method  # Track which method succeeded


class Scraper(ABC):
    @abstractmethod
    def fetch(self, url: str) -> Optional[ScrapingResult]: ...


class TrafilaturaScraper(Scraper):
    """Fast, lightweight extraction for standard articles."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            downloaded = trafilatura.fetch_url(url)
            if not downloaded:
                return None

            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=True,
                favor_recall=True
            )

            if not content or len(content) < 100:
                return None

            # Extract title separately
            soup = BeautifulSoup(downloaded, 'html.parser')
            title = soup.find('title')
            title_text = title.get_text() if title else ''

            return ScrapingResult(content, title_text, 'trafilatura')
        except Exception:
            return None


class RequestsScraper(Scraper):
    """HTTP requests with rotating user agents."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        import random

        headers = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove script/style elements
            for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
                element.decompose()

            # Find main content
            main = soup.find('main') or soup.find('article') or soup.find('body')
            content = main.get_text(separator='\n', strip=True) if main else ''

            title = soup.find('title')
            title_text = title.get_text() if title else ''

            if len(content) < 100:
                return None

            return ScrapingResult(content, title_text, 'requests')
        except Exception:
            return None


class PlaywrightScraper(Scraper):
    """Heavy JavaScript rendering with stealth mode for anti-bot bypass."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                context = browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = context.new_page()

                # Apply stealth to avoid detection
                stealth_sync(page)

                page.goto(url, wait_until='networkidle', timeout=60000)

                # Wait for content to load
                page.wait_for_timeout(2000)

                # Extract content
                content = page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')

                title = page.title()

                browser.close()

                if len(content) < 100:
                    return None

                return ScrapingResult(content, title, 'playwright')
        except Exception:
            return None


class PlaywrightScraperAsync:
    """Async Playwright scraper for Jupyter notebooks (.ipynb files).

    Jupyter notebooks run their own event loop, so sync Playwright won't work.
    Use this async version with `await` in notebook cells.
    """

    async def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            async with async_playwright() as p:
                browser = await p.chromium.launch(headless=True)
                context = await browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                )
                page = await context.new_page()

                # Note: playwright-stealth async version
                # from playwright_stealth import stealth_async
                # await stealth_async(page)

                await page.goto(url, wait_until='networkidle', timeout=60000)

                # Wait for content to load
                await page.wait_for_timeout(2000)

                # Extract content
                content = await page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')

                title = await page.title()

                await browser.close()

                if len(content) < 100:
                    return None

                return ScrapingResult(content, title, 'playwright_async')
        except Exception:
            return None


# Usage in Jupyter notebook cells:
# scraper = PlaywrightScraperAsync()
# result = await scraper.fetch('https://example.com')


class ScrapingCascade:
    """Try multiple scrapers in order until one succeeds."""

    def __init__(self):
        self.scrapers = [
            TrafilaturaScraper(),
            RequestsScraper(),
            PlaywrightScraper(),
        ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        for scraper in self.scrapers:
            result = scraper.fetch(url)
            if result:
                return result
        return None
```
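A minimal usage sketch (the URL is a placeholder): the cascade tries the cheapest extractor first, and `result.method` records which rung succeeded.

```python
# Sketch only; the URL is a placeholder.
cascade = ScrapingCascade()
result = cascade.fetch('https://example.com/article')

if result:
    print(f'{result.method}: {result.title!r}, {len(result.content)} chars')
else:
    print('All scrapers failed; escalate or flag for manual review.')
```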
## Anti-bot landscape (as of 2026-05)

The cascade above (`trafilatura` → `requests` → Playwright + `playwright-stealth`) handles plain HTML and lightly protected JS sites. Modern anti-bot stacks (Cloudflare Bot Management / Turnstile, DataDome, Akamai Bot Manager, PerimeterX) layer multiple detection signals: TLS / HTTP-2 fingerprints, browser fingerprints, JS-execution proofs, residential-IP reputation, session behavior. No single tool defeats all of them.

`playwright-stealth` (2.0+, current) patches obvious detection vectors — `navigator.webdriver`, `chrome.runtime`, plugin enumeration, language settings, WebGL fingerprints. Treat it as the floor, not the ceiling. If a target fingerprints TLS or runs Turnstile, stealth alone won't pass.

| Tool | Layer it addresses | Notes |
|---|---|---|
| `curl_cffi` | TLS / HTTP-2 fingerprint | Drop-in replacement for `requests` that mimics Chrome/Safari/Edge JA3+ALPN. Can't run JS — pair with a parsed-HTML extractor when JS isn't required. |
| `playwright-stealth` 2.x | JS-runtime fingerprint | The starting line for Playwright/Chromium. Updates lag the bot stacks; expect to combine with rotation. |
| Camoufox | JS + browser fingerprint at C++ level | Firefox-based stealth browser. Spoofs fingerprint values low enough that JS-side checks can't see through them. Use when Chromium-based stealth is detected. |
| SeleniumBase UC Mode | Turnstile + browser fingerprint | The closest thing to a one-shot Turnstile solver in 2026, but heavier than playwright-stealth. |
| Residential proxy pool | IP reputation | Datacenter IPs (DigitalOcean, AWS) get challenged on first request. Residential pools cost more but bypass the cheapest layer of defense. |

**Use the lightest tool that works.** Targets without aggressive defense don't need Camoufox or proxy pools — `curl_cffi` plus a sleep is usually enough. Reserve heavier tools for sites that explicitly serve a Turnstile challenge or DataDome interstitial.
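A sketch of that lightest-tool path using `curl_cffi` (the URLs and delay are placeholders; `impersonate='chrome'` targets the newest Chrome fingerprint the installed version supports):

```python
import time
from curl_cffi import requests as cffi_requests

urls = ['https://example.com/a', 'https://example.com/b']  # placeholders

for url in urls:
    # Mimic a current Chrome TLS/HTTP-2 fingerprint; no JS is executed.
    resp = cffi_requests.get(url, impersonate='chrome', timeout=30)
    if resp.status_code == 200:
        html = resp.text  # hand off to trafilatura/BeautifulSoup for extraction
    time.sleep(1.5)  # polite placeholder delay; tune per target
```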
## Undocumented APIs

### Finding undocumented APIs

Use browser developer tools to discover APIs:

1. **Open developer tools** (right-click → Inspect, or F12)
2. **Go to the Network tab** to monitor all requests
3. **Filter by Fetch/XHR** to show only API calls
4. **Trigger the action** you want to capture (search, scroll, click)
5. **Analyze the response** — usually JSON with key-value pairs
6. **Copy as cURL** (right-click the request)
7. **Convert to code** using [curlconverter.com](https://curlconverter.com/)

### Stripping down API requests

When you copy a request as cURL from dev tools, it includes many parameters. Strip it down:

1. **Remove unnecessary cookies** — test without them first
2. **Keep authentication tokens** if required
3. **Identify the input parameters** you can modify (like `prefix` for search terms)
4. **Test parameter values** — some expire, so verify them periodically

### Example: Reverse-engineering an autocomplete API

```python
import requests
import time

def search_suggestions(keyword: str) -> dict:
    """
    Get autocomplete search suggestions from an undocumented API.
    Stripped down from a browser dev tools capture.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    params = {
        'prefix': keyword,
        'suggestion-type': ['WIDGET', 'KEYWORD'],
        'alias': 'aps',
        'plain-mid': '1',
    }

    response = requests.get(
        'https://completion.amazon.com/api/2017/suggestions',
        params=params,
        headers=headers
    )
    return response.json()

# Collect suggestions for multiple keywords
keywords = ['a', 'b', 'cookie', 'sock']
data = []

for keyword in keywords:
    result = search_suggestions(keyword)
    for suggestion in result.get('suggestions', []):
        suggestion['search_word'] = keyword  # track the seed keyword
        data.append(suggestion)
    time.sleep(1)  # rate limit yourself
```
*Source: [Leon Yin, "Finding Undocumented APIs," Inspect Element](https://inspectelement.org/apis.html), 2023*

## Poison pill detection

Detect paywalls, anti-bot pages, and other failures:

```python
import re
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse

class PoisonPillType(Enum):
    PAYWALL = 'paywall'
    CAPTCHA = 'captcha'
    RATE_LIMIT = 'rate_limit'
    CLOUDFLARE = 'cloudflare'
    LOGIN_REQUIRED = 'login_required'
    NOT_FOUND = 'not_found'
    NONE = 'none'

@dataclass
class PoisonPillResult:
    detected: bool
    type: PoisonPillType
    confidence: float
    details: str

class PoisonPillDetector:
    PATTERNS = {
        PoisonPillType.PAYWALL: [
            r'subscribe to continue',
            r'subscription required',
            r'become a member',
            r'sign up to read',
            r'you\'ve reached your limit',
            r'article limit reached',
        ],
        PoisonPillType.CAPTCHA: [
            r'verify you are human',
            r'captcha',
            r'robot verification',
            r'prove you\'re not a robot',
        ],
        PoisonPillType.RATE_LIMIT: [
            r'too many requests',
            r'rate limit exceeded',
            r'slow down',
            r'429',
        ],
        PoisonPillType.CLOUDFLARE: [
            r'checking your browser',
            r'cloudflare',
            r'ddos protection',
            r'please wait while we verify',
        ],
        PoisonPillType.LOGIN_REQUIRED: [
            r'sign in to continue',
            r'log in required',
            r'create an account',
        ],
    }

    PAYWALL_DOMAINS = {
        'nytimes.com': PoisonPillType.PAYWALL,
        'wsj.com': PoisonPillType.PAYWALL,
        'washingtonpost.com': PoisonPillType.PAYWALL,
        'ft.com': PoisonPillType.PAYWALL,
        'bloomberg.com': PoisonPillType.PAYWALL,
    }

    def detect(self, url: str, content: str, status_code: int = 200) -> PoisonPillResult:
        # Check status code
        if status_code == 429:
            return PoisonPillResult(True, PoisonPillType.RATE_LIMIT, 1.0, 'HTTP 429')
        if status_code == 403:
            return PoisonPillResult(True, PoisonPillType.CLOUDFLARE, 0.8, 'HTTP 403')
        if status_code == 404:
            return PoisonPillResult(True, PoisonPillType.NOT_FOUND, 1.0, 'HTTP 404')

        # Check known paywall domains
        domain = urlparse(url).netloc.replace('www.', '')
        for paywall_domain, pill_type in self.PAYWALL_DOMAINS.items():
            if paywall_domain in domain:
                # Check if content is suspiciously short (paywall truncation)
                if len(content) < 500:
                    return PoisonPillResult(True, pill_type, 0.9, f'Short content from {domain}')

        # Pattern matching
        content_lower = content.lower()
        for pill_type, patterns in self.PATTERNS.items():
            for pattern in patterns:
                if re.search(pattern, content_lower):
                    return PoisonPillResult(True, pill_type, 0.7, f'Pattern match: {pattern}')

        return PoisonPillResult(False, PoisonPillType.NONE, 0.0, '')
```
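A sketch (not part of the original classes) of wiring the detector into the cascade: a fetch can succeed at the HTTP level and still return unusable content.

```python
# Sketch only; the URL is a placeholder.
cascade = ScrapingCascade()
detector = PoisonPillDetector()

url = 'https://example.com/article'
result = cascade.fetch(url)

if result:
    pill = detector.detect(url, result.content)
    if pill.detected:
        # The page loaded, but it is a paywall/captcha/challenge shell.
        print(f'{pill.type.value} (confidence {pill.confidence:.0%}): {pill.details}')
    else:
        print(f'Clean extraction via {result.method}')
```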
## Social media scraping

### YouTube with yt-dlp

```python
import yt_dlp
from pathlib import Path

def download_video_metadata(url: str) -> dict:
    """Extract metadata without downloading video."""
    ydl_opts = {
        'skip_download': True,
        'quiet': True,
        'no_warnings': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            'title': info.get('title'),
            'description': info.get('description'),
            'duration': info.get('duration'),
            'upload_date': info.get('upload_date'),
            'view_count': info.get('view_count'),
            'channel': info.get('channel'),
            'thumbnail': info.get('thumbnail'),
        }

def download_video(url: str, output_dir: Path, audio_only: bool = False) -> Path:
    """Download video or audio."""
    output_template = str(output_dir / '%(title)s.%(ext)s')

    ydl_opts = {
        'outtmpl': output_template,
        'quiet': True,
    }

    if audio_only:
        ydl_opts['format'] = 'bestaudio/best'
        ydl_opts['postprocessors'] = [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
        }]

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        filename = ydl.prepare_filename(info)
        if audio_only:
            filename = filename.rsplit('.', 1)[0] + '.mp3'
        return Path(filename)

def get_transcript(url: str) -> list[dict]:
    """Extract auto-generated or manual subtitles."""
    ydl_opts = {
        'skip_download': True,
        'writesubtitles': True,
        'writeautomaticsub': True,
        'subtitleslangs': ['en'],
        'quiet': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)

    # Check for subtitles
    subtitles = info.get('subtitles', {})
    auto_captions = info.get('automatic_captions', {})

    # Prefer manual subtitles over auto-generated
    subs = subtitles.get('en') or auto_captions.get('en')
    if not subs:
        return []

    # Get the vtt or json3 format
    for sub in subs:
        if sub['ext'] in ['vtt', 'json3']:
            # Download and parse subtitle file
            # ... implementation depends on format
            pass

    return []
```
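The subtitle parsing above is left as a stub. Below is a minimal sketch for the `json3` branch, assuming each yt-dlp subtitle entry carries `ext` and `url` keys and YouTube's json3 layout (`events` → `segs` → `utf8`); both hold in current yt-dlp but are not a stable contract.

```python
import requests

def parse_json3_subtitles(subs: list[dict]) -> list[dict]:
    """Flatten a json3 subtitle track into timed text segments (sketch)."""
    for sub in subs:
        if sub.get('ext') == 'json3':
            data = requests.get(sub['url'], timeout=30).json()
            return [
                {
                    'start_ms': event.get('tStartMs'),
                    'duration_ms': event.get('dDurationMs'),
                    'text': ''.join(seg.get('utf8', '') for seg in event.get('segs', [])),
                }
                for event in data.get('events', [])
                if event.get('segs')
            ]
    return []
```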
### Instagram with instaloader

```python
import instaloader
from pathlib import Path

class InstagramScraper:
    def __init__(self, username: str | None = None, session_file: str | None = None):
        self.loader = instaloader.Instaloader(
            download_videos=True,
            download_video_thumbnails=False,
            download_geotags=False,
            download_comments=False,
            save_metadata=True,
            compress_json=False,
        )

        if session_file and Path(session_file).exists():
            self.loader.load_session_from_file(username, session_file)

    def get_profile_posts(self, username: str, limit: int = 50) -> list[dict]:
        """Get recent posts from a profile."""
        profile = instaloader.Profile.from_username(self.loader.context, username)
        posts = []

        for i, post in enumerate(profile.get_posts()):
            if i >= limit:
                break

            posts.append({
                'shortcode': post.shortcode,
                'url': f'https://instagram.com/p/{post.shortcode}/',
                'caption': post.caption,
                'timestamp': post.date_utc.isoformat(),
                'likes': post.likes,
                'comments': post.comments,
                'is_video': post.is_video,
                'video_url': post.video_url if post.is_video else None,
            })

        return posts

    def download_post(self, shortcode: str, output_dir: Path):
        """Download a single post's media."""
        post = instaloader.Post.from_shortcode(self.loader.context, shortcode)
        self.loader.download_post(post, target=str(output_dir))
```

### TikTok with yt-dlp

```python
import yt_dlp
from pathlib import Path

def scrape_tiktok_profile(username: str, output_dir: Path, limit: int = 50) -> list[dict]:
    """Scrape TikTok profile videos."""
    profile_url = f'https://tiktok.com/@{username}'

    ydl_opts = {
        'quiet': True,
        'extract_flat': True,  # Don't download, just get info
        'playlistend': limit,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(profile_url, download=False)
        videos = []

        for entry in info.get('entries', []):
            videos.append({
                'id': entry.get('id'),
                'title': entry.get('title'),
                'url': entry.get('url'),
                'timestamp': entry.get('timestamp'),
                'view_count': entry.get('view_count'),
            })

        return videos

def download_tiktok_video(url: str, output_dir: Path) -> Path:
    """Download a single TikTok video."""
    ydl_opts = {
        'outtmpl': str(output_dir / '%(id)s.%(ext)s'),
        'quiet': True,
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return Path(ydl.prepare_filename(info))
```

## Request patterns

### Rotating user agents and headers

```python
import time

import requests
from fake_useragent import UserAgent

class RequestManager:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()

    def get_headers(self) -> dict:
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }

    def fetch(self, url: str, retry_count: int = 3) -> requests.Response:
        for attempt in range(retry_count):
            try:
                response = self.session.get(
                    url,
                    headers=self.get_headers(),
                    timeout=30
                )
                response.raise_for_status()
                return response
            except requests.RequestException:
                if attempt == retry_count - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff
```

### Respectful scraping with delays

```python
import time
import random
from urllib.parse import urlparse

class PoliteRequester:
    def __init__(self, min_delay: float = 1.0, max_delay: float = 3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request_per_domain = {}

    def wait_for_domain(self, url: str):
        domain = urlparse(url).netloc
        last_request = self.last_request_per_domain.get(domain, 0)

        elapsed = time.time() - last_request
        delay = random.uniform(self.min_delay, self.max_delay)

        if elapsed < delay:
            time.sleep(delay - elapsed)

        self.last_request_per_domain[domain] = time.time()
```
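The two classes compose naturally. A short sketch (placeholder URLs) that throttles per domain before each retried fetch:

```python
manager = RequestManager()
polite = PoliteRequester(min_delay=1.0, max_delay=3.0)

for url in ['https://example.com/a', 'https://example.com/b']:  # placeholders
    polite.wait_for_domain(url)    # jittered per-domain delay
    response = manager.fetch(url)  # rotating UA + exponential backoff
```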
## Ethics, robots.txt, and the legal landscape

Scraping is technically simple, ethically nuanced, and legally a moving target. The current state in the US (2026):

**Computer Fraud and Abuse Act (CFAA).** *Van Buren v. United States* (2021) and *hiQ Labs v. LinkedIn* (2022) narrowed the CFAA so that scraping public, non-credentialed pages does NOT constitute "unauthorized access." Logging in (or using credentials), bypassing technical access controls, or scraping after an explicit cease-and-desist letter remains legally fraught. State equivalents (e.g., California's CDAFA) sometimes go further than federal law.

**Terms of service.** Many sites' ToS forbid scraping. ToS is a contract, not a criminal statute — breach exposes you to civil claims (breach of contract, tortious interference, trespass to chattels in some jurisdictions), not jail. The risk profile differs sharply from the CFAA's.

**robots.txt** is a polite request, not a legal mandate. Ignoring it doesn't make you criminally liable, but courts have cited it as evidence of intent. For journalism in the public interest, that intent can be defensible; for commercial use, it's harder.

**EU GDPR / UK DPA.** If your scraping pulls personal data of EU/UK residents, GDPR/DPA apply regardless of where you run the scraper. Public availability does NOT exempt personal data from these regimes — *Lloyd v. Google* (UK Supreme Court 2021) and the CJEU's *Schrems II* lineage make scraping personal data without a lawful basis a real liability.

**Practical baseline:**
- Always read `robots.txt`. Honor crawl delays. Honor `Disallow:` (see the sketch after this list).
- Respect rate limits; add jitter; back off on `429`.
- Don't scrape behind authentication unless you have explicit permission.
- Don't scrape personal data (names, emails, photos) without a lawful basis.
- Identify yourself with a descriptive User-Agent and a contact URL when crawling at volume.
- Cache aggressively to avoid redundant requests.
- Stop if you receive a cease-and-desist or explicit blocking signal — escalating past one is the move that turns a civil dispute into a CFAA case.
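The robots.txt check is already covered by the standard library. A sketch using `urllib.robotparser` (the User-Agent string is a placeholder):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def robots_check(url: str, user_agent: str = 'ExampleNewsBot/1.0 (+https://example.com/bot)'):
    """Return (allowed, crawl_delay) for this URL and user agent."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f'{parts.scheme}://{parts.netloc}/robots.txt')
    rp.read()  # fetch and parse robots.txt
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)
```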
**Notes on specific platforms.** Instagram's `instaloader` and TikTok scraping via `yt-dlp` work today but break frequently — Meta and TikTok roll out anti-bot updates monthly. Bans on the account whose credentials you use are common. For journalism, the official APIs (Meta Content Library, TikTok Research API) are slower but more durable.