Comprehensive Cloudflare platform skill covering Workers, D1, R2, KV, AI, Durable Objects, and security.
references/r2-sql/SKILL.md.backup
# Cloudflare R2 SQL Skill

Guide for using Cloudflare R2 SQL - the serverless distributed query engine for Apache Iceberg tables in R2 Data Catalog.

## Overview

R2 SQL is Cloudflare's serverless distributed analytics query engine for querying Apache Iceberg tables in R2 Data Catalog. Features:

- Serverless - no clusters to manage
- Distributed - leverages Cloudflare's global network
- Zero egress fees - query from any cloud/region
- Open beta - free during beta (standard R2 storage costs apply)

## Core Concepts

### Apache Iceberg Table Format

- Open table format for large-scale analytics datasets
- ACID transactions for reliable concurrent reads/writes
- Schema evolution - add/rename/drop columns without rewriting data
- Optimized metadata - avoids full table scans via indexed metadata
- Supported by Spark, Trino, Snowflake, DuckDB, ClickHouse, PyIceberg

### R2 Data Catalog

- Managed Apache Iceberg catalog built into an R2 bucket
- Exposes the standard Iceberg REST catalog interface
- Single source of truth for table metadata
- Tracks table state via immutable snapshots
- Supports multiple query engines safely accessing the same tables

### Architecture

**Query Planner**:
- Top-down metadata investigation
- Multi-layer pruning (partition-level, column-level, row-group level)
- Streaming pipeline - execution starts before planning completes
- Early termination - stops when the result is complete, without a full scan
- Uses partition stats and column stats (min/max, null counts)

**Query Execution**:
- Coordinator distributes work to workers across the Cloudflare network
- Workers run Apache DataFusion for parallel query execution
- Arrow IPC format for inter-process communication
- Parquet column pruning - reads only required columns
- Ranged reads from R2 for efficiency

**Aggregation Strategies**:
- Scatter-gather - for simple aggregations (sum, count, avg)
- Shuffling - for ORDER BY/HAVING on aggregates, via hash partitioning
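The stats-based pruning described under **Query Planner** can be sketched in a few lines. This is a toy model, not R2 SQL internals: each file carries min/max stats for one column, and any file whose range cannot overlap the query predicate is skipped without ever being read.

```python
# Toy illustration of min/max stat pruning (ISO date strings compare
# correctly as plain strings, so no date parsing is needed here).
files = [
    {"path": "a.parquet", "min_ts": "2025-01-01", "max_ts": "2025-01-10"},
    {"path": "b.parquet", "min_ts": "2025-01-11", "max_ts": "2025-01-20"},
    {"path": "c.parquet", "min_ts": "2025-01-21", "max_ts": "2025-01-31"},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# Only a.parquet and b.parquet can contain rows in this range; c is skipped.
print(prune(files, "2025-01-05", "2025-01-12"))
```

The same overlap test applies at every level of the hierarchy (partitions, files, row groups), which is why tight WHERE filters on the partition key prune so effectively.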
## Setup & Configuration

### 1. Enable R2 Data Catalog

CLI:
```bash
npx wrangler r2 bucket catalog enable <bucket-name>
```

Note the Warehouse name and Catalog URI from the output.

Dashboard:
1. R2 Object Storage → Select bucket
2. Settings tab → R2 Data Catalog → Enable
3. Note Catalog URI and Warehouse name

### 2. Create API Token

Required permissions: R2 Admin Read & Write (includes R2 SQL Read)

Dashboard:
1. R2 Object Storage → Manage API tokens
2. Create API token → Admin Read & Write
3. Save token value

### 3. Configure Environment

```bash
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
```

Or `.env` file:
```
WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
```

## Common Code Patterns

### Wrangler CLI Query

```bash
npx wrangler r2 sql query "<warehouse-name>" "
SELECT *
FROM namespace.table_name
WHERE condition
LIMIT 10"
```

### PyIceberg Setup

```python
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog",
    warehouse="<WAREHOUSE>",
    uri="<CATALOG_URI>",
    token="<TOKEN>",
)

# Create namespace
catalog.create_namespace_if_not_exists("default")
```

### Create Table

```python
import pyarrow as pa

# Define schema
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [80.0, 92.5, 88.0],
})

# Create table
table = catalog.create_table(
    ("default", "people"),
    schema=df.schema,
)
```

### Append Data

```python
table.append(df)
```

### Query Table

```python
# Scan and convert to Pandas
scanned = table.scan().to_arrow()
print(scanned.to_pandas())
```

## SQL Reference

### Query Structure

```sql
SELECT column_list | aggregation_function
FROM table_name
WHERE conditions
[GROUP BY column_list]
[HAVING conditions]
[ORDER BY partition_key [DESC | ASC]]
[LIMIT number]
```

### Schema Discovery

```sql
-- List namespaces
SHOW DATABASES;
SHOW NAMESPACES;

-- List tables
SHOW TABLES IN namespace_name;

-- Describe table
DESCRIBE namespace_name.table_name;
```

### SELECT Patterns

```sql
-- All columns
SELECT * FROM ns.table;

-- Specific columns
SELECT user_id, timestamp, status FROM ns.table;

-- With conditions
SELECT * FROM ns.table
WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'
  AND status = 200
LIMIT 100;

-- Complex conditions
SELECT * FROM ns.table
WHERE (status = 404 OR status = 500)
  AND method = 'POST'
  AND user_agent IS NOT NULL
ORDER BY timestamp DESC;
```

### Aggregations

Supported functions: COUNT(*), SUM(col), AVG(col), MIN(col), MAX(col)

```sql
-- Count by group
SELECT department, COUNT(*)
FROM ns.sales_data
GROUP BY department;

-- Multiple aggregates
SELECT region, MIN(price), MAX(price), AVG(price)
FROM ns.products
GROUP BY region
ORDER BY AVG(price) DESC;

-- With HAVING filter
SELECT category, SUM(amount)
FROM ns.sales
WHERE sale_date >= '2024-01-01'
GROUP BY category
HAVING SUM(amount) > 10000
LIMIT 10;
```

### Data Types

| Type | Description | Example |
|------|-------------|---------|
| integer | Whole numbers | 1, 42, -10 |
| float | Decimals | 1.5, 3.14 |
| string | Text (quoted) | 'hello', 'GET' |
| boolean | true/false | true, false |
| timestamp | RFC3339 | '2025-01-01T00:00:00Z' |
| date | YYYY-MM-DD | '2025-01-01' |

### Operators

Comparison: =, !=, <, <=, >, >=, LIKE, BETWEEN, IS NULL, IS NOT NULL
Logical: AND (higher precedence), OR (lower precedence)

### ORDER BY Limitations

**CRITICAL**: ORDER BY only supports partition key columns

```sql
-- Valid if timestamp is partition key
SELECT * FROM ns.logs ORDER BY timestamp DESC LIMIT 100;

-- Invalid if column not in partition key
SELECT * FROM ns.logs ORDER BY user_id; -- ERROR
```

### LIMIT Defaults

- Range: 1 to 10,000
- Default: 500 if not specified
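For scripting, the Wrangler CLI query shown above can be wrapped from Python. This is a hedged sketch, not an official API: it assumes `npx wrangler` is on PATH and WRANGLER_R2_SQL_AUTH_TOKEN is exported, and `my-warehouse` and `default.events` are placeholder names.

```python
# Sketch: call R2 SQL from Python by shelling out to the Wrangler CLI.
import subprocess

def build_cmd(warehouse: str, sql: str) -> list:
    """Assemble the same invocation as `npx wrangler r2 sql query ...`."""
    return ["npx", "wrangler", "r2", "sql", "query", warehouse, sql]

def r2_sql(warehouse: str, sql: str) -> str:
    """Run the query and return wrangler's stdout; raises on a non-zero exit."""
    return subprocess.run(
        build_cmd(warehouse, sql), capture_output=True, text=True, check=True
    ).stdout

# Example (requires a real warehouse and token). Keep LIMIT explicit:
# the default is 500 and the maximum is 10,000.
# print(r2_sql("my-warehouse", "SELECT * FROM default.events LIMIT 10"))
```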
## Pipelines Integration

### Create Pipeline with Data Catalog Sink

Schema file (`schema.json`):
```json
{
  "fields": [
    {"name": "user_id", "type": "string", "required": true},
    {"name": "event_type", "type": "string", "required": true},
    {"name": "amount", "type": "float64", "required": false}
  ]
}
```

Setup:
```bash
npx wrangler pipelines setup
```

Configuration:
- Pipeline name: ecommerce
- Enable HTTP endpoint: yes
- Schema: Load from file → schema.json
- Destination: Data Catalog Table
- R2 bucket: your-bucket
- Namespace: default
- Table name: events
- Catalog token: <your-token>
- Compression: zstd
- Roll file time: 10 seconds (dev), 300+ (prod)

### Send Data to Pipeline

```bash
curl -X POST https://{stream-id}.ingest.cloudflare.com \
  -H "Content-Type: application/json" \
  -d '[
    {
      "user_id": "user_123",
      "event_type": "purchase",
      "amount": 29.99
    }
  ]'
```

## Common Use Cases

### Log Analytics

- Ingest logs via Pipelines to an Iceberg table
- Partition by day(timestamp) for efficient queries
- Query specific time ranges with automatic pruning
- Aggregate by status codes, endpoints, user agents

```sql
SELECT status, COUNT(*)
FROM logs.http_requests
WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'
  AND method = 'GET'
GROUP BY status
ORDER BY COUNT(*) DESC;
```

### Fraud Detection

- Stream transaction events to the catalog
- Query suspicious patterns with WHERE filters
- Aggregate by location, merchant, time windows

```sql
SELECT location, COUNT(*), AVG(amount)
FROM fraud.transactions
WHERE is_fraud = true
  AND transaction_timestamp >= '2025-01-01'
GROUP BY location
HAVING COUNT(*) > 10;
```

### Business Intelligence

- ETL data into partitioned Iceberg tables
- Run analytical queries across large datasets
- Generate reports with GROUP BY aggregations
- No egress fees when querying from BI tools

```sql
SELECT
  department,
  SUM(revenue),
  AVG(revenue)
FROM sales.transactions
WHERE sale_date >= '2024-01-01'
GROUP BY department
ORDER BY SUM(revenue) DESC
LIMIT 10;
```

## Performance Optimization

### Partitioning Strategy

- Choose the partition key based on common query patterns
- Typical: day(timestamp), hour(timestamp), region, category
- Enables metadata pruning to skip entire partitions
- Required for ORDER BY optimization

### Query Optimization

- Use WHERE filters to leverage partition/column stats
- Specify LIMIT to enable early termination
- ORDER BY partition key columns only
- Filter on high-selectivity columns first

### Data Organization

- Smaller files → slower queries (more per-file overhead)
- Larger files → better compression, fewer metadata ops
- Recommended: 100-500MB Parquet files after compression
- Use appropriate roll intervals in Pipelines (300+ seconds for prod)

### File Pruning

Automatic at three levels:
1. Partition-level: Skip manifests not matching the query
2. File-level: Skip Parquet files via column stats
3. Row-group level: Skip row groups within files

## Iceberg Metadata Structure

```
bucket/
  metadata/
    snap-{id}.avro          # Snapshot (points to manifest list)
    {uuid}-m0.avro          # Manifest file (lists data files + stats)
    version-hint.text       # Current metadata version
    v{n}.metadata.json      # Table metadata (schema, snapshots)
  data/
    00000-0-{uuid}.parquet  # Data files
```

**Metadata hierarchy**:
1. Table metadata JSON - schema, partition spec, snapshot log
2. Snapshot - points to manifest list
3. Manifest list - partition stats for each manifest
4. Manifest files - column stats for each data file
5. Parquet files - row group stats in footer
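The five-level hierarchy above can be illustrated with a toy walk from table metadata to the current snapshot's manifest list (step 1 → step 2). The field names are simplified stand-ins for the real v{n}.metadata.json structure, and the file names are made up.

```python
# Hand-built stand-in for a table metadata JSON file.
metadata = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "snap-2.avro"},
    ],
}

def current_manifest_list(meta: dict) -> str:
    """Table metadata names the current snapshot; the snapshot points
    at a manifest list, which in turn holds per-manifest partition stats."""
    current = meta["current-snapshot-id"]
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == current)
    return snap["manifest-list"]

print(current_manifest_list(metadata))  # snap-2.avro
```

Because snapshots are immutable, time travel is just this same walk starting from an older snapshot-id.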
## Limitations & Best Practices

### Current Limitations (Open Beta)

- ORDER BY only on partition key columns
- COUNT(*) only - COUNT(column) not supported
- No aliases in SELECT
- No subqueries, joins, or CTEs
- No nested column access
- LIMIT max 10,000

### Best Practices

- Partition by a time dimension for time-series data
- Use BETWEEN for time ranges (leverages partition pruning)
- Combine filters with AND for better pruning
- Set an appropriate LIMIT based on use case
- Use compression (zstd recommended)
- Monitor query performance and adjust partitioning

### Type Safety

- Quote string values: 'value'
- Use RFC3339 for timestamps: '2025-01-01T00:00:00Z'
- Use YYYY-MM-DD for dates: '2025-01-01'
- No implicit type conversions

## Connecting Other Engines

R2 Data Catalog supports the standard Iceberg REST catalog API.

### Spark (Scala)

```scala
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
  .config("spark.sql.catalog.my_catalog.uri", catalogUri)
  .config("spark.sql.catalog.my_catalog.token", token)
  .config("spark.sql.catalog.my_catalog.warehouse", warehouse)
  .getOrCreate()
```

### Snowflake

- Create an external Iceberg catalog connection
- Configure with the Catalog URI and R2 credentials
- Query tables via the SQL interface

### DuckDB, Trino, ClickHouse

- Supported via the Iceberg REST catalog protocol
- Refer to engine-specific documentation for configuration

## Pricing (Future)

Currently in open beta - no charges beyond standard R2 costs.

Planned future pricing:
- R2 storage: $0.015/GB-month
- Class A operations: $4.50/million
- Class B operations: $0.36/million
- Catalog operations: $9.00/million (create table, get metadata, etc.)
- Compaction: $0.05/GB + $4.00/million objects processed
- Egress: $0 (always free)

At least 30 days' notice will be given before billing begins.

## Troubleshooting

### Common Errors

**"ORDER BY column not in partition key"**
- Only partition key columns can be used in ORDER BY
- Check the table's partition spec with DESCRIBE
- Remove ORDER BY or adjust table partitioning

**"Token authentication failed"**
- Verify WRANGLER_R2_SQL_AUTH_TOKEN is set
- Ensure the token has R2 Admin Read & Write + SQL Read permissions
- The token may be expired - create a new one

**"Table not found"**
- Verify the namespace exists: SHOW DATABASES
- Check the table name: SHOW TABLES IN namespace
- Ensure the catalog is enabled on the bucket

**"No data returned"**
- Check that WHERE conditions match the data
- Verify the time range in the BETWEEN clause
- Try removing filters to confirm data exists

### Performance Issues

**Slow queries**:
- Check partition pruning effectiveness
- Reduce LIMIT if scanning too much data
- Ensure filters are on partition key columns
- Review Parquet file sizes (aim for 100-500MB)

**Query timeout**:
- Add more restrictive WHERE filters
- Reduce LIMIT
- Consider a better partitioning strategy

## Resources

- Docs: https://developers.cloudflare.com/r2-sql/
- Data Catalog: https://developers.cloudflare.com/r2/data-catalog/
- Blog: https://blog.cloudflare.com/r2-sql-deep-dive/
- Discord: https://discord.cloudflare.com/

## Key Reminders

1. R2 SQL queries ONLY Apache Iceberg tables in R2 Data Catalog
2. Enable the catalog on the bucket before use
3. Create an API token with R2 + catalog permissions
4. Partition by time for time-series data
5. ORDER BY limited to partition key columns
6. Use LIMIT and WHERE for optimal performance
7. Zero egress fees - query from anywhere
8. Open beta - free during testing phase
9. Serverless - no infrastructure management
10. Leverage Cloudflare's global network for distributed execution
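The reminders above can be tied together in a small smoke test. This is a hedged sketch: the R2_SQL_WAREHOUSE and R2_SQL_CATALOG_URI variable names and the `default.events` table are placeholders chosen for this example (only WRANGLER_R2_SQL_AUTH_TOKEN comes from this guide), and the PyIceberg calls follow the patterns shown earlier.

```python
# Sketch: verify config, connect to the catalog, and count rows in a table.
import os

REQUIRED = ("R2_SQL_WAREHOUSE", "R2_SQL_CATALOG_URI", "WRANGLER_R2_SQL_AUTH_TOKEN")

def missing_config(env: dict) -> list:
    """Return the required settings that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]

def smoke_test() -> int:
    """Connect and return the row count of a (placeholder) table."""
    problems = missing_config(dict(os.environ))
    if problems:
        raise RuntimeError(f"set these first: {problems}")
    from pyiceberg.catalog.rest import RestCatalog
    catalog = RestCatalog(
        name="smoke",
        warehouse=os.environ["R2_SQL_WAREHOUSE"],
        uri=os.environ["R2_SQL_CATALOG_URI"],
        token=os.environ["WRANGLER_R2_SQL_AUTH_TOKEN"],
    )
    table = catalog.load_table(("default", "events"))  # placeholder table
    return table.scan().to_arrow().num_rows

print(missing_config({}))  # with an empty env, all three names are reported
```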