# Extract API Reference
## Table of Contents
- Extraction Approaches
- Key Parameters
- Query and Chunks
- Extract Depth
- Advanced Filtering Strategies
- Response Fields
- Summary
## Extraction Approaches
### Search with `include_raw_content`
Get search results and content in one call:

```python
from tavily import TavilyClient

client = TavilyClient()

response = client.search(
    query="AI healthcare applications",
    include_raw_content=True,
    max_results=5
)
```

When to use:
- Quick prototyping
- Simple queries where search results are likely relevant
- Single API call convenience
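With `include_raw_content=True`, each search result carries its page content in a `raw_content` field; a minimal sketch of reading it (the value may be empty when a page can't be fetched):

```python
# Each result now includes page content alongside the usual fields
for result in response["results"]:
    print(result["url"], result["score"])
    if result.get("raw_content"):  # may be empty if the page couldn't be fetched
        print(result["raw_content"][:300])  # preview the first 300 characters
```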
### Direct Extract API (Recommended)
Two-step pattern for more control:

```python
# Step 1: Search
search_results = client.search(
    query="Python async best practices",
    max_results=10
)

# Step 2: Filter by relevance score
relevant_urls = [
    r["url"] for r in search_results["results"]
    if r["score"] > 0.5
]

# Step 3: Extract with targeting
extracted = client.extract(
    urls=relevant_urls[:20],
    query="async patterns and concurrency",  # reranks chunks
    chunks_per_source=3  # prevents context explosion
)

for item in extracted["results"]:
    print(f"URL: {item['url']}")
    print(f"Content: {item['raw_content'][:500]}...")
```

When to use:
- You want control over which URLs to extract
- You need to filter/curate URLs before extraction
- You want targeted extraction with `query` and `chunks_per_source`
## Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | string/array | Required | Single URL or list (max 20) |
| `extract_depth` | enum | `"basic"` | `"basic"` or `"advanced"` (for complex/JS-rendered pages) |
| `query` | string | null | Reranks chunks by relevance to this query |
| `chunks_per_source` | integer | 3 | Chunks per source (1-5, max 500 chars each); only applies with `query` |
| `format` | enum | `"markdown"` | Output format: `"markdown"` or `"text"` |
| `include_images` | boolean | false | Include image URLs |
| `include_favicon` | boolean | false | Include favicon URL |
| `include_usage` | boolean | false | Include credit consumption data in the response |
| `timeout` | float | varies | Max wait time (1.0-60.0 seconds) |
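A sketch combining several of these parameters in a single call (the URL is a placeholder):

```python
extracted = client.extract(
    urls=["https://example.com/article"],  # up to 20 URLs per call
    extract_depth="advanced",              # handle JS-rendered content
    format="text",                         # plain text instead of markdown
    include_images=True,                   # also return image URLs
    include_usage=True,                    # report credit consumption
    timeout=30.0                           # wait at most 30 seconds
)
```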
## Query and Chunks
Use `query` and `chunks_per_source` to get only relevant content and prevent context window explosion:
```python
extracted = client.extract(
    urls=[
        "https://example.com/ml-healthcare",
        "https://example.com/ai-diagnostics",
        "https://example.com/medical-ai"
    ],
    query="AI diagnostic tools accuracy",
    chunks_per_source=2  # the 2 most relevant chunks per URL
)
```

When to use `query`:
- To extract only relevant portions of long documents
- When you need focused content instead of full page extraction
- For targeted information retrieval from specific URLs
Key benefits of `chunks_per_source`:
- Returns only relevant snippets (max 500 chars each) instead of the full page
- Chunks appear in `raw_content` as `<chunk 1> [...] <chunk 2> [...] <chunk 3>`
- Prevents the context window from exploding in agentic use cases

Note: `chunks_per_source` only works when `query` is provided.
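Because chunks arrive joined by a literal `[...]` separator, you can split them back into individual snippets if needed; a minimal sketch assuming that separator format:

```python
for item in extracted["results"]:
    # Split the joined chunks back into individual snippets
    chunks = [c.strip() for c in item["raw_content"].split("[...]")]
    for i, chunk in enumerate(chunks, start=1):
        print(f"Chunk {i} ({len(chunk)} chars): {chunk[:100]}...")
```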
## Extract Depth
| Depth | When to use |
|---|---|
| `basic` (default) | Simple text extraction; faster |
| `advanced` | Dynamic/JS-rendered pages, tables, structured data, embedded media |
```python
# For complex pages
extracted = client.extract(
    urls=["https://example.com/complex-page"],
    extract_depth="advanced"
)
```

Fallback strategy: if `basic` fails, retry with `advanced`:

```python
result = client.extract(urls=[url], extract_depth="basic")
if url in [f["url"] for f in result.get("failed_results", [])]:
    result = client.extract(urls=[url], extract_depth="advanced")
```

Timeout tuning: if latency isn't critical, set `timeout=60.0` for better success on slow pages.
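For example (the URL is a placeholder):

```python
# Trade latency for reliability on slow or heavy pages
extracted = client.extract(
    urls=["https://example.com/slow-page"],
    extract_depth="advanced",
    timeout=60.0  # the maximum allowed wait
)
```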
## Advanced Filtering Strategies
Beyond query-based filtering, consider these approaches before extraction:
| Strategy | When to use |
|---|---|
| Score-based | Filter search results by relevance score |
| Domain-based | Filter by trusted domains |
| Re-ranking | Use dedicated re-ranking models for precision |
| LLM-based | Let an LLM assess relevance before extraction |
| Clustering | Group similar documents, extract from clusters |
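As an illustration, a simple domain-based filter might look like the following (the allowlist is hypothetical):

```python
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"nih.gov", "who.int", "nature.com"}  # hypothetical allowlist

def filter_by_domain(results):
    """Keep only results whose host matches or ends with a trusted domain."""
    return [
        r["url"] for r in results
        if any(urlparse(r["url"]).netloc.endswith(domain) for domain in TRUSTED_DOMAINS)
    ]

trusted_urls = filter_by_domain(search_results["results"])
```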
### Optimal Workflow
1. Search to discover relevant URLs
2. Filter by relevance score, domain, or content snippet
3. Re-rank if needed using specialized models
4. Extract from the top-ranked sources with `query` and `chunks_per_source`
5. Validate extracted content quality
6. Process for your AI application
### Example: Complete Pipeline
```python
import asyncio

from tavily import AsyncTavilyClient

client = AsyncTavilyClient()

async def content_pipeline(topic):
    # 1. Search with sub-queries for breadth
    queries = [
        f"{topic} overview",
        f"{topic} best practices",
        f"{topic} recent developments"
    ]
    responses = await asyncio.gather(
        *(client.search(q, search_depth="advanced", max_results=10) for q in queries)
    )

    # 2. Filter and aggregate by score
    urls = []
    for response in responses:
        urls.extend([
            r['url'] for r in response['results']
            if r['score'] > 0.5
        ])

    # 3. Deduplicate
    urls = list(set(urls))[:20]

    # 4. Extract with error handling
    extracted = await asyncio.gather(
        *(client.extract(urls=[url], query=topic, extract_depth="advanced")
          for url in urls),
        return_exceptions=True
    )

    # 5. Keep only the successful extractions
    return [e for e in extracted if not isinstance(e, Exception)]

asyncio.run(content_pipeline("machine learning in healthcare"))
```

## Response Fields
Top-level response:
| Field | Description |
|---|---|
| `results` | Array of successfully extracted content |
| `failed_results` | Array of URLs that failed extraction |
| `response_time` | Response time in seconds |
| `request_id` | Unique identifier for support reference |
| `usage` | Credit usage info (if `include_usage=True`) |
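For example, with `include_usage=True` you can inspect credits alongside the other top-level fields (the exact shape of `usage` may vary; treat this as a sketch):

```python
response = client.extract(
    urls=["https://example.com"],
    include_usage=True
)

print(f"Request ID: {response['request_id']}")         # quote this when contacting support
print(f"Response time: {response['response_time']}s")
print(f"Usage: {response.get('usage')}")               # credit consumption data
```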
Each result object:
| Field | Description |
|---|---|
| `url` | The URL the content was extracted from |
| `raw_content` | Full page content, or the top-ranked chunks joined by `[...]` when `query` is provided |
| `images` | Array of image URLs (if `include_images=True`) |
| `favicon` | Favicon URL (if `include_favicon=True`) |
Each `failed_results` object:
| Field | Description |
|---|---|
| `url` | The URL that failed |
| `error` | Error message |
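A minimal pattern for handling both arrays after a call (`handle_content` stands in for your own downstream processing):

```python
result = client.extract(urls=urls)

for item in result["results"]:
    handle_content(item["url"], item["raw_content"])  # hypothetical downstream handler

for failure in result.get("failed_results", []):
    print(f"Extraction failed for {failure['url']}: {failure['error']}")
```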
## Summary
- Use `query` and `chunks_per_source` for targeted, focused extraction
- Choose the direct Extract API when you need control over which URLs to extract from
- Filter URLs before extraction using scores, re-ranking, or domain trust
- Choose the appropriate `extract_depth` for the content's complexity
- Process URLs concurrently with async operations for better performance
- Implement error handling to manage failed extractions gracefully
- Validate extracted content before downstream processing
For more details, see the full API reference.