Production-ready Tavily API integration patterns for search, extract, crawl, map, and research in Python and JavaScript.
references/extract.md
# Extract API Reference

## Table of Contents

- [Extraction Approaches](#extraction-approaches)
- [Key Parameters](#key-parameters)
- [Query and Chunks](#query-and-chunks)
- [Extract Depth](#extract-depth)
- [Advanced Filtering Strategies](#advanced-filtering-strategies)
- [Response Fields](#response-fields)
- [Summary](#summary)

---

## Extraction Approaches

### Search with include_raw_content

Get search results and content in one call:

```python
response = client.search(
    query="AI healthcare applications",
    include_raw_content=True,
    max_results=5
)
```

**When to use:**
- Quick prototyping
- Simple queries where search results are likely relevant
- Single API call convenience

### Direct Extract API (Recommended)

Search-then-extract pattern for more control:

```python
# Step 1: Search
search_results = client.search(
    query="Python async best practices",
    max_results=10
)

# Step 2: Filter by relevance score
relevant_urls = [
    r["url"] for r in search_results["results"]
    if r["score"] > 0.5
]

# Step 3: Extract with targeting
extracted = client.extract(
    urls=relevant_urls[:20],
    query="async patterns and concurrency",  # Reranks chunks
    chunks_per_source=3  # Prevents context explosion
)

for item in extracted["results"]:
    print(f"URL: {item['url']}")
    print(f"Content: {item['raw_content'][:500]}...")
```

**When to use:**
- You want control over which URLs to extract
- You need to filter/curate URLs before extraction
- You want targeted extraction with `query` and `chunks_per_source`

---

## Key Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `urls` | string/array | Required | Single URL or list (max 20) |
| `extract_depth` | enum | `"basic"` | `"basic"` or `"advanced"` (for complex/JS pages) |
| `query` | string | null | Reranks chunks by relevance to this query |
| `chunks_per_source` | integer | 3 | Chunks per source (1-5, max 500 chars each). Only with `query` |
| `format` | enum | `"markdown"` | Output: `"markdown"` or `"text"` |
| `include_images` | boolean | false | Include image URLs |
| `include_favicon` | boolean | false | Include favicon URL |
| `include_usage` | boolean | false | Include credit consumption data in response |
| `timeout` | float | varies | Max wait time (1.0-60.0 seconds) |

---

## Query and Chunks

Use `query` and `chunks_per_source` to get only relevant content and prevent context window explosion:

```python
extracted = client.extract(
    urls=[
        "https://example.com/ml-healthcare",
        "https://example.com/ai-diagnostics",
        "https://example.com/medical-ai"
    ],
    query="AI diagnostic tools accuracy",
    chunks_per_source=2  # 2 most relevant chunks per URL
)
```

**When to use query:**
- To extract only relevant portions of long documents
- When you need focused content instead of full page extraction
- For targeted information retrieval from specific URLs

**Key benefits of chunks_per_source:**
- Returns only relevant snippets (max 500 chars each) instead of the full page
- Chunks appear in `raw_content` as: `<chunk 1> [...] <chunk 2> [...] <chunk 3>`
- Prevents the context window from exploding in agentic use cases

**Note:** `chunks_per_source` only works when `query` is provided.

---

## Extract Depth

| Depth | When to use |
|-------|-------------|
| `basic` (default) | Simple text extraction, faster |
| `advanced` | Dynamic/JS-rendered pages, tables, structured data, embedded media |

```python
# For complex pages
extracted = client.extract(
    urls=["https://example.com/complex-page"],
    extract_depth="advanced"
)
```

**Fallback strategy:** If `basic` fails, retry with `advanced`:

```python
result = client.extract(urls=[url], extract_depth="basic")
if url in [f["url"] for f in result.get("failed_results", [])]:
    result = client.extract(urls=[url], extract_depth="advanced")
```

**Timeout tuning:** If latency isn't critical, set `timeout=60.0` for better success on slow pages.

---

## Advanced Filtering Strategies

Beyond query-based filtering, consider these approaches before extraction:

| Strategy | When to use |
|----------|-------------|
| Score-based | Filter search results by relevance score |
| Domain-based | Filter by trusted domains |
| Re-ranking | Use dedicated re-ranking models for precision |
| LLM-based | Let an LLM assess relevance before extraction |
| Clustering | Group similar documents, extract from clusters |

### Optimal Workflow

1. **Search** to discover relevant URLs
2. **Filter** by relevance score, domain, or content snippet
3. **Re-rank** if needed using specialized models
4. **Extract** from top-ranked sources with `query` and `chunks_per_source`
5. **Validate** extracted content quality
6. **Process** for your AI application

### Example: Complete Pipeline

```python
import asyncio
from tavily import AsyncTavilyClient

client = AsyncTavilyClient()

async def content_pipeline(topic):
    # 1. Search with sub-queries for breadth
    queries = [
        f"{topic} overview",
        f"{topic} best practices",
        f"{topic} recent developments"
    ]
    responses = await asyncio.gather(
        *(client.search(q, search_depth="advanced", max_results=10) for q in queries)
    )

    # 2. Filter and aggregate by score
    urls = []
    for response in responses:
        urls.extend([
            r['url'] for r in response['results']
            if r['score'] > 0.5
        ])

    # 3. Deduplicate
    urls = list(set(urls))[:20]

    # 4. Extract with error handling
    extracted = await asyncio.gather(
        *(client.extract(urls=[url], query=topic, extract_depth="advanced")
          for url in urls),
        return_exceptions=True
    )

    # 5. Filter successful extractions
    return [e for e in extracted if not isinstance(e, Exception)]

asyncio.run(content_pipeline("machine learning in healthcare"))
```

---

## Response Fields

**Top-level response:**

| Field | Description |
|-------|-------------|
| `results` | Array of successfully extracted content |
| `failed_results` | Array of URLs that failed extraction |
| `response_time` | Time in seconds |
| `request_id` | Unique identifier for support reference |
| `usage` | Credit usage info (if `include_usage=True`) |

**Each result object:**

| Field | Description |
|-------|-------------|
| `url` | The URL extracted from |
| `raw_content` | Full content, or top-ranked chunks joined by `[...]` when `query` is provided |
| `images` | Array of image URLs (if `include_images=true`) |
| `favicon` | Favicon URL (if `include_favicon=true`) |

**Each failed_results object:**

| Field | Description |
|-------|-------------|
| `url` | The URL that failed |
| `error` | Error message |

---

## Summary

1. **Use query and chunks_per_source** for targeted, focused extraction
2. **Choose the Extract API** when you need control over which URLs to extract from
3. **Filter URLs** before extraction using scores, re-ranking, or domain trust
4. **Choose the appropriate extract_depth** based on content complexity
5. **Process URLs concurrently** with async operations for better performance
6. **Implement error handling** to manage failed extractions gracefully
7. **Validate extracted content** before downstream processing

For more details, see the [full API reference](https://docs.tavily.com/documentation/api-reference/endpoint/extract).
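As a final worked illustration, the fallback strategy and the response fields above can be combined into one batch helper. This is a sketch, not part of the API: `extract_with_fallback` is a hypothetical helper name, and the only assumption it makes is a client object exposing the `extract(urls=..., query=..., extract_depth=...)` call and the `results`/`failed_results` response shape documented above.

```python
def extract_with_fallback(client, urls, query=None):
    """Return {url: raw_content}, retrying failed URLs at advanced depth.

    `client` is any Tavily-style client whose extract() accepts
    urls, query, and extract_depth, and returns the response shape
    described in the Response Fields section.
    """
    # First pass at the cheaper basic depth
    response = client.extract(urls=urls, query=query, extract_depth="basic")
    contents = {r["url"]: r["raw_content"] for r in response["results"]}

    # failed_results carries one {"url": ..., "error": ...} object per failure
    failed = [f["url"] for f in response.get("failed_results", [])]
    if failed:
        retry = client.extract(urls=failed, query=query, extract_depth="advanced")
        contents.update({r["url"]: r["raw_content"] for r in retry["results"]})

    return contents
```

Passing the client in as an argument keeps the helper easy to unit-test with a stub before pointing it at a live `TavilyClient`.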