Production-ready Tavily API integration patterns for search, extract, crawl, map, and research in Python and JavaScript.
references/extract.md
# Extract API Reference

## Table of Contents

- [Extraction Approaches](#extraction-approaches)
- [Key Parameters](#key-parameters)
- [Query and Chunks](#query-and-chunks)
- [Extract Depth](#extract-depth)
- [Advanced Filtering Strategies](#advanced-filtering-strategies)
- [Response Fields](#response-fields)
- [Summary](#summary)

---

## Extraction Approaches

### Search with include_raw_content

Get search results and content in one call:

```python
response = client.search(
    query="AI healthcare applications",
    include_raw_content=True,
    max_results=5
)
```

**When to use:**
- Quick prototyping
- Simple queries where search results are likely relevant
- Single API call convenience

### Direct Extract API (Recommended)

Search-then-extract pattern for more control:

```python
# Step 1: Search
search_results = client.search(
    query="Python async best practices",
    max_results=10
)

# Step 2: Filter by relevance score
relevant_urls = [
    r["url"] for r in search_results["results"]
    if r["score"] > 0.5
]

# Step 3: Extract with targeting
extracted = client.extract(
    urls=relevant_urls[:20],
    query="async patterns and concurrency",  # Reranks chunks
    chunks_per_source=3  # Prevents context explosion
)

for item in extracted["results"]:
    print(f"URL: {item['url']}")
    print(f"Content: {item['raw_content'][:500]}...")
```

**When to use:**
- You want control over which URLs to extract
- You need to filter/curate URLs before extraction
- You want targeted extraction with `query` and `chunks_per_source`

---

## Key Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `urls` | string/array | Required | Single URL or list (max 20) |
| `extract_depth` | enum | `"basic"` | `"basic"` or `"advanced"` (for complex/JS pages) |
| `query` | string | null | Reranks chunks by relevance to this query |
| `chunks_per_source` | integer | 3 | Chunks per source (1-5, max 500 chars each). Only with `query` |
| `format` | enum | `"markdown"` | Output: `"markdown"` or `"text"` |
| `include_images` | boolean | false | Include image URLs |
| `include_favicon` | boolean | false | Include favicon URL |
| `include_usage` | boolean | false | Include credit consumption data in response |
| `timeout` | float | varies | Max wait time (1.0-60.0 seconds) |

---

## Query and Chunks

Use `query` and `chunks_per_source` to get only relevant content and prevent context window explosion:

```python
extracted = client.extract(
    urls=[
        "https://example.com/ml-healthcare",
        "https://example.com/ai-diagnostics",
        "https://example.com/medical-ai"
    ],
    query="AI diagnostic tools accuracy",
    chunks_per_source=2  # 2 most relevant chunks per URL
)
```

**When to use query:**
- To extract only relevant portions of long documents
- When you need focused content instead of full page extraction
- For targeted information retrieval from specific URLs

**Key benefits of chunks_per_source:**
- Returns only relevant snippets (max 500 chars each) instead of the full page
- Chunks appear in `raw_content` as: `<chunk 1> [...] <chunk 2> [...] <chunk 3>`
- Prevents the context window from exploding in agentic use cases

**Note:** `chunks_per_source` only works when `query` is provided.

---

## Extract Depth

| Depth | When to use |
|-------|-------------|
| `basic` (default) | Simple text extraction, faster |
| `advanced` | Dynamic/JS-rendered pages, tables, structured data, embedded media |

```python
# For complex pages
extracted = client.extract(
    urls=["https://example.com/complex-page"],
    extract_depth="advanced"
)
```

**Fallback strategy:** If `basic` fails, retry with `advanced`:

```python
result = client.extract(urls=[url], extract_depth="basic")
if url in [f["url"] for f in result.get("failed_results", [])]:
    result = client.extract(urls=[url], extract_depth="advanced")
```

**Timeout tuning:** If latency isn't critical, set `timeout=60.0` for better success on slow pages.

---

## Advanced Filtering Strategies

Beyond query-based filtering, consider these approaches before extraction:

| Strategy | When to use |
|----------|-------------|
| Score-based | Filter search results by relevance score |
| Domain-based | Filter by trusted domains |
| Re-ranking | Use dedicated re-ranking models for precision |
| LLM-based | Let an LLM assess relevance before extraction |
| Clustering | Group similar documents, extract from clusters |

### Optimal Workflow

1. **Search** to discover relevant URLs
2. **Filter** by relevance score, domain, or content snippet
3. **Re-rank** if needed using specialized models
4. **Extract** from top-ranked sources with `query` and `chunks_per_source`
5. **Validate** extracted content quality
6. **Process** for your AI application

### Example: Complete Pipeline

```python
import asyncio
from tavily import AsyncTavilyClient

client = AsyncTavilyClient()

async def content_pipeline(topic):
    # 1. Search with sub-queries for breadth
    queries = [
        f"{topic} overview",
        f"{topic} best practices",
        f"{topic} recent developments"
    ]
    responses = await asyncio.gather(
        *(client.search(q, search_depth="advanced", max_results=10) for q in queries)
    )

    # 2. Filter and aggregate by score
    urls = []
    for response in responses:
        urls.extend([
            r['url'] for r in response['results']
            if r['score'] > 0.5
        ])

    # 3. Deduplicate
    urls = list(set(urls))[:20]

    # 4. Extract with error handling
    extracted = await asyncio.gather(
        *(client.extract(urls=[url], query=topic, extract_depth="advanced")
          for url in urls),
        return_exceptions=True
    )

    # 5. Filter successful extractions
    return [e for e in extracted if not isinstance(e, Exception)]

asyncio.run(content_pipeline("machine learning in healthcare"))
```

---

## Response Fields

**Top-level response:**

| Field | Description |
|-------|-------------|
| `results` | Array of successfully extracted content |
| `failed_results` | Array of URLs that failed extraction |
| `response_time` | Time in seconds |
| `request_id` | Unique identifier for support reference |
| `usage` | Credit usage info (if `include_usage=True`) |

**Each result object:**

| Field | Description |
|-------|-------------|
| `url` | The URL extracted from |
| `raw_content` | Full content, or top-ranked chunks joined by `[...]` when `query` is provided |
| `images` | Array of image URLs (if `include_images=true`) |
| `favicon` | Favicon URL (if `include_favicon=true`) |

**Each failed_results object:**

| Field | Description |
|-------|-------------|
| `url` | The URL that failed |
| `error` | Error message |

---

## Summary

1. **Use query and chunks_per_source** for targeted, focused extraction
2. **Choose the Extract API** when you need control over which URLs to extract from
3. **Filter URLs** before extraction using scores, re-ranking, or domain trust
4. **Choose the appropriate extract_depth** based on content complexity
5. **Process URLs concurrently** with async operations for better performance
6. **Implement error handling** to manage failed extractions gracefully
7. **Validate extracted content** before downstream processing

For more details, see the [full API reference](https://docs.tavily.com/documentation/api-reference/endpoint/extract).
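As a final worked illustration, the fallback strategy and the response fields above can be combined into one batch helper. This is a sketch, not part of the API: `extract_with_fallback` is a hypothetical helper name, and the only assumption it makes is a client object exposing the `extract(urls=..., query=..., extract_depth=...)` call and the `results`/`failed_results` response shape documented above.

```python
def extract_with_fallback(client, urls, query=None):
    """Return {url: raw_content}, retrying failed URLs at advanced depth.

    `client` is any Tavily-style client whose extract() accepts
    urls, query, and extract_depth, and returns the response shape
    described in the Response Fields section.
    """
    # First pass at the cheaper basic depth
    response = client.extract(urls=urls, query=query, extract_depth="basic")
    contents = {r["url"]: r["raw_content"] for r in response["results"]}

    # failed_results carries one {"url": ..., "error": ...} object per failure
    failed = [f["url"] for f in response.get("failed_results", [])]
    if failed:
        retry = client.extract(urls=failed, query=query, extract_depth="advanced")
        contents.update({r["url"]: r["raw_content"] for r in retry["results"]})

    return contents
```

Passing the client in as an argument keeps the helper easy to unit-test with a stub before pointing it at a live `TavilyClient`.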