Comprehensive Cloudflare platform skill covering Workers, D1, R2, KV, AI, Durable Objects, and security.
references/r2-sql/SKILL.md.backup
# Cloudflare R2 SQL Skill

Guide for using Cloudflare R2 SQL - the serverless distributed query engine for Apache Iceberg tables in R2 Data Catalog.

## Overview

R2 SQL is Cloudflare's serverless distributed analytics query engine for querying Apache Iceberg tables in R2 Data Catalog. Features:

- Serverless - no clusters to manage
- Distributed - leverages Cloudflare's global network
- Zero egress fees - query from any cloud/region
- Open beta - free during beta (standard R2 storage costs apply)

## Core Concepts

### Apache Iceberg Table Format

- Open table format for large-scale analytics datasets
- ACID transactions for reliable concurrent reads/writes
- Schema evolution - add/rename/drop columns without rewriting data
- Optimized metadata - avoids full table scans via indexed metadata
- Supported by Spark, Trino, Snowflake, DuckDB, ClickHouse, PyIceberg

### R2 Data Catalog

- Managed Apache Iceberg catalog built into an R2 bucket
- Exposes the standard Iceberg REST catalog interface
- Single source of truth for table metadata
- Tracks table state via immutable snapshots
- Supports multiple query engines safely accessing the same tables

### Architecture

**Query Planner**:
- Top-down metadata investigation
- Multi-layer pruning (partition-level, column-level, row-group level)
- Streaming pipeline - execution starts before planning completes
- Early termination - stops when the result is complete, without a full scan
- Uses partition stats and column stats (min/max, null counts)

**Query Execution**:
- Coordinator distributes work to workers across the Cloudflare network
- Workers run Apache DataFusion for parallel query execution
- Arrow IPC format for inter-process communication
- Parquet column pruning - reads only required columns
- Ranged reads from R2 for efficiency

**Aggregation Strategies**:
- Scatter-gather - for simple aggregations (sum, count, avg)
- Shuffling - for ORDER BY/HAVING on aggregates, via hash partitioning
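The stats-based pruning described under **Query Planner** can be sketched in a few lines. This is a toy model, not R2 SQL internals: each file carries min/max stats for one column, and any file whose range cannot overlap the query predicate is skipped without ever being read.

```python
# Toy illustration of min/max stat pruning (ISO date strings compare
# correctly as plain strings, so no date parsing is needed here).
files = [
    {"path": "a.parquet", "min_ts": "2025-01-01", "max_ts": "2025-01-10"},
    {"path": "b.parquet", "min_ts": "2025-01-11", "max_ts": "2025-01-20"},
    {"path": "c.parquet", "min_ts": "2025-01-21", "max_ts": "2025-01-31"},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# Only a.parquet and b.parquet can contain rows in this range; c is skipped.
print(prune(files, "2025-01-05", "2025-01-12"))
```

The same overlap test applies at every level of the hierarchy (partitions, files, row groups), which is why tight WHERE filters on the partition key prune so effectively.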
## Setup & Configuration

### 1. Enable R2 Data Catalog

CLI:
```bash
npx wrangler r2 bucket catalog enable <bucket-name>
```

Note the Warehouse name and Catalog URI from the output.

Dashboard:
1. R2 Object Storage → Select bucket
2. Settings tab → R2 Data Catalog → Enable
3. Note Catalog URI and Warehouse name

### 2. Create API Token

Required permissions: R2 Admin Read & Write (includes R2 SQL Read)

Dashboard:
1. R2 Object Storage → Manage API tokens
2. Create API token → Admin Read & Write
3. Save token value

### 3. Configure Environment

```bash
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
```

Or `.env` file:
```
WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
```

## Common Code Patterns

### Wrangler CLI Query

```bash
npx wrangler r2 sql query "<warehouse-name>" "
SELECT *
FROM namespace.table_name
WHERE condition
LIMIT 10"
```

### PyIceberg Setup

```python
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog",
    warehouse="<WAREHOUSE>",
    uri="<CATALOG_URI>",
    token="<TOKEN>",
)

# Create namespace
catalog.create_namespace_if_not_exists("default")
```

### Create Table

```python
import pyarrow as pa

# Define schema
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "score": [80.0, 92.5, 88.0],
})

# Create table
table = catalog.create_table(
    ("default", "people"),
    schema=df.schema,
)
```

### Append Data

```python
table.append(df)
```

### Query Table

```python
# Scan and convert to Pandas
scanned = table.scan().to_arrow()
print(scanned.to_pandas())
```

## SQL Reference

### Query Structure

```sql
SELECT column_list | aggregation_function
FROM table_name
WHERE conditions
[GROUP BY column_list]
[HAVING conditions]
[ORDER BY partition_key [DESC | ASC]]
[LIMIT number]
```

### Schema Discovery

```sql
-- List namespaces
SHOW DATABASES;
SHOW NAMESPACES;

-- List tables
SHOW TABLES IN namespace_name;

-- Describe table
DESCRIBE namespace_name.table_name;
```

### SELECT Patterns

```sql
-- All columns
SELECT * FROM ns.table;

-- Specific columns
SELECT user_id, timestamp, status FROM ns.table;

-- With conditions
SELECT * FROM ns.table
WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'
  AND status = 200
LIMIT 100;

-- Complex conditions
SELECT * FROM ns.table
WHERE (status = 404 OR status = 500)
  AND method = 'POST'
  AND user_agent IS NOT NULL
ORDER BY timestamp DESC;
```

### Aggregations

Supported functions: COUNT(*), SUM(col), AVG(col), MIN(col), MAX(col)

```sql
-- Count by group
SELECT department, COUNT(*)
FROM ns.sales_data
GROUP BY department;

-- Multiple aggregates
SELECT region, MIN(price), MAX(price), AVG(price)
FROM ns.products
GROUP BY region
ORDER BY AVG(price) DESC;

-- With HAVING filter
SELECT category, SUM(amount)
FROM ns.sales
WHERE sale_date >= '2024-01-01'
GROUP BY category
HAVING SUM(amount) > 10000
LIMIT 10;
```

### Data Types

| Type | Description | Example |
|------|-------------|---------|
| integer | Whole numbers | 1, 42, -10 |
| float | Decimals | 1.5, 3.14 |
| string | Text (quoted) | 'hello', 'GET' |
| boolean | true/false | true, false |
| timestamp | RFC3339 | '2025-01-01T00:00:00Z' |
| date | YYYY-MM-DD | '2025-01-01' |

### Operators

Comparison: =, !=, <, <=, >, >=, LIKE, BETWEEN, IS NULL, IS NOT NULL
Logical: AND (higher precedence), OR (lower precedence)

### ORDER BY Limitations

**CRITICAL**: ORDER BY only supports partition key columns

```sql
-- Valid if timestamp is partition key
SELECT * FROM ns.logs ORDER BY timestamp DESC LIMIT 100;

-- Invalid if column not in partition key
SELECT * FROM ns.logs ORDER BY user_id; -- ERROR
```

### LIMIT Defaults

- Range: 1 to 10,000
- Default: 500 if not specified
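For scripting, the Wrangler CLI query shown above can be wrapped from Python. This is a hedged sketch, not an official API: it assumes `npx wrangler` is on PATH and WRANGLER_R2_SQL_AUTH_TOKEN is exported, and `my-warehouse` and `default.events` are placeholder names.

```python
# Sketch: call R2 SQL from Python by shelling out to the Wrangler CLI.
import subprocess

def build_cmd(warehouse: str, sql: str) -> list:
    """Assemble the same invocation as `npx wrangler r2 sql query ...`."""
    return ["npx", "wrangler", "r2", "sql", "query", warehouse, sql]

def r2_sql(warehouse: str, sql: str) -> str:
    """Run the query and return wrangler's stdout; raises on a non-zero exit."""
    return subprocess.run(
        build_cmd(warehouse, sql), capture_output=True, text=True, check=True
    ).stdout

# Example (requires a real warehouse and token). Keep LIMIT explicit:
# the default is 500 and the maximum is 10,000.
# print(r2_sql("my-warehouse", "SELECT * FROM default.events LIMIT 10"))
```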
## Pipelines Integration

### Create Pipeline with Data Catalog Sink

Schema file (`schema.json`):
```json
{
  "fields": [
    {"name": "user_id", "type": "string", "required": true},
    {"name": "event_type", "type": "string", "required": true},
    {"name": "amount", "type": "float64", "required": false}
  ]
}
```

Setup:
```bash
npx wrangler pipelines setup
```

Configuration:
- Pipeline name: ecommerce
- Enable HTTP endpoint: yes
- Schema: Load from file → schema.json
- Destination: Data Catalog Table
- R2 bucket: your-bucket
- Namespace: default
- Table name: events
- Catalog token: <your-token>
- Compression: zstd
- Roll file time: 10 seconds (dev), 300+ (prod)

### Send Data to Pipeline

```bash
curl -X POST https://{stream-id}.ingest.cloudflare.com \
  -H "Content-Type: application/json" \
  -d '[
    {
      "user_id": "user_123",
      "event_type": "purchase",
      "amount": 29.99
    }
  ]'
```

## Common Use Cases

### Log Analytics

- Ingest logs via Pipelines to an Iceberg table
- Partition by day(timestamp) for efficient queries
- Query specific time ranges with automatic pruning
- Aggregate by status codes, endpoints, user agents

```sql
SELECT status, COUNT(*)
FROM logs.http_requests
WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'
  AND method = 'GET'
GROUP BY status
ORDER BY COUNT(*) DESC;
```

### Fraud Detection

- Stream transaction events to the catalog
- Query suspicious patterns with WHERE filters
- Aggregate by location, merchant, time windows

```sql
SELECT location, COUNT(*), AVG(amount)
FROM fraud.transactions
WHERE is_fraud = true
  AND transaction_timestamp >= '2025-01-01'
GROUP BY location
HAVING COUNT(*) > 10;
```

### Business Intelligence

- ETL data into partitioned Iceberg tables
- Run analytical queries across large datasets
- Generate reports with GROUP BY aggregations
- No egress fees when querying from BI tools

```sql
SELECT
  department,
  SUM(revenue),
  AVG(revenue)
FROM sales.transactions
WHERE sale_date >= '2024-01-01'
GROUP BY department
ORDER BY SUM(revenue) DESC
LIMIT 10;
```

## Performance Optimization

### Partitioning Strategy

- Choose the partition key based on common query patterns
- Typical: day(timestamp), hour(timestamp), region, category
- Enables metadata pruning to skip entire partitions
- Required for ORDER BY optimization

### Query Optimization

- Use WHERE filters to leverage partition/column stats
- Specify LIMIT to enable early termination
- ORDER BY partition key columns only
- Filter on high-selectivity columns first

### Data Organization

- Smaller files → slower queries (more per-file overhead)
- Larger files → better compression, fewer metadata ops
- Recommended: 100-500MB Parquet files after compression
- Use appropriate roll intervals in Pipelines (300+ seconds for prod)

### File Pruning

Automatic at three levels:
1. Partition-level: Skip manifests not matching the query
2. File-level: Skip Parquet files via column stats
3. Row-group level: Skip row groups within files

## Iceberg Metadata Structure

```
bucket/
  metadata/
    snap-{id}.avro          # Snapshot (points to manifest list)
    {uuid}-m0.avro          # Manifest file (lists data files + stats)
    version-hint.text       # Current metadata version
    v{n}.metadata.json      # Table metadata (schema, snapshots)
  data/
    00000-0-{uuid}.parquet  # Data files
```

**Metadata hierarchy**:
1. Table metadata JSON - schema, partition spec, snapshot log
2. Snapshot - points to manifest list
3. Manifest list - partition stats for each manifest
4. Manifest files - column stats for each data file
5. Parquet files - row group stats in footer
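The five-level hierarchy above can be illustrated with a toy walk from table metadata to the current snapshot's manifest list (step 1 → step 2). The field names are simplified stand-ins for the real v{n}.metadata.json structure, and the file names are made up.

```python
# Hand-built stand-in for a table metadata JSON file.
metadata = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "snap-2.avro"},
    ],
}

def current_manifest_list(meta: dict) -> str:
    """Table metadata names the current snapshot; the snapshot points
    at a manifest list, which in turn holds per-manifest partition stats."""
    current = meta["current-snapshot-id"]
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == current)
    return snap["manifest-list"]

print(current_manifest_list(metadata))  # snap-2.avro
```

Because snapshots are immutable, time travel is just this same walk starting from an older snapshot-id.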
## Limitations & Best Practices

### Current Limitations (Open Beta)

- ORDER BY only on partition key columns
- COUNT(*) only - COUNT(column) not supported
- No aliases in SELECT
- No subqueries, joins, or CTEs
- No nested column access
- LIMIT max 10,000

### Best Practices

- Partition by a time dimension for time-series data
- Use BETWEEN for time ranges (leverages partition pruning)
- Combine filters with AND for better pruning
- Set an appropriate LIMIT based on use case
- Use compression (zstd recommended)
- Monitor query performance and adjust partitioning

### Type Safety

- Quote string values: 'value'
- Use RFC3339 for timestamps: '2025-01-01T00:00:00Z'
- Use YYYY-MM-DD for dates: '2025-01-01'
- No implicit type conversions

## Connecting Other Engines

R2 Data Catalog supports the standard Iceberg REST catalog API.

### Spark (Scala)

```scala
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
  .config("spark.sql.catalog.my_catalog.uri", catalogUri)
  .config("spark.sql.catalog.my_catalog.token", token)
  .config("spark.sql.catalog.my_catalog.warehouse", warehouse)
  .getOrCreate()
```

### Snowflake

- Create an external Iceberg catalog connection
- Configure with the Catalog URI and R2 credentials
- Query tables via the SQL interface

### DuckDB, Trino, ClickHouse

- Supported via the Iceberg REST catalog protocol
- Refer to engine-specific documentation for configuration

## Pricing (Future)

Currently in open beta - no charges beyond standard R2 costs.

Planned future pricing:
- R2 storage: $0.015/GB-month
- Class A operations: $4.50/million
- Class B operations: $0.36/million
- Catalog operations: $9.00/million (create table, get metadata, etc.)
- Compaction: $0.05/GB + $4.00/million objects processed
- Egress: $0 (always free)

At least 30 days' notice will be given before billing begins.

## Troubleshooting

### Common Errors

**"ORDER BY column not in partition key"**
- Only partition key columns can be used in ORDER BY
- Check the table's partition spec with DESCRIBE
- Remove ORDER BY or adjust table partitioning

**"Token authentication failed"**
- Verify WRANGLER_R2_SQL_AUTH_TOKEN is set
- Ensure the token has R2 Admin Read & Write + SQL Read permissions
- The token may be expired - create a new one

**"Table not found"**
- Verify the namespace exists: SHOW DATABASES
- Check the table name: SHOW TABLES IN namespace
- Ensure the catalog is enabled on the bucket

**"No data returned"**
- Check that WHERE conditions match the data
- Verify the time range in the BETWEEN clause
- Try removing filters to confirm data exists

### Performance Issues

**Slow queries**:
- Check partition pruning effectiveness
- Reduce LIMIT if scanning too much data
- Ensure filters are on partition key columns
- Review Parquet file sizes (aim for 100-500MB)

**Query timeout**:
- Add more restrictive WHERE filters
- Reduce LIMIT
- Consider a better partitioning strategy

## Resources

- Docs: https://developers.cloudflare.com/r2-sql/
- Data Catalog: https://developers.cloudflare.com/r2/data-catalog/
- Blog: https://blog.cloudflare.com/r2-sql-deep-dive/
- Discord: https://discord.cloudflare.com/

## Key Reminders

1. R2 SQL queries ONLY Apache Iceberg tables in R2 Data Catalog
2. Enable the catalog on the bucket before use
3. Create an API token with R2 + catalog permissions
4. Partition by time for time-series data
5. ORDER BY limited to partition key columns
6. Use LIMIT and WHERE for optimal performance
7. Zero egress fees - query from anywhere
8. Open beta - free during testing phase
9. Serverless - no infrastructure management
10. Leverage Cloudflare's global network for distributed execution
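The reminders above can be tied together in a small smoke test. This is a hedged sketch: the R2_SQL_WAREHOUSE and R2_SQL_CATALOG_URI variable names and the `default.events` table are placeholders chosen for this example (only WRANGLER_R2_SQL_AUTH_TOKEN comes from this guide), and the PyIceberg calls follow the patterns shown earlier.

```python
# Sketch: verify config, connect to the catalog, and count rows in a table.
import os

REQUIRED = ("R2_SQL_WAREHOUSE", "R2_SQL_CATALOG_URI", "WRANGLER_R2_SQL_AUTH_TOKEN")

def missing_config(env: dict) -> list:
    """Return the required settings that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]

def smoke_test() -> int:
    """Connect and return the row count of a (placeholder) table."""
    problems = missing_config(dict(os.environ))
    if problems:
        raise RuntimeError(f"set these first: {problems}")
    from pyiceberg.catalog.rest import RestCatalog
    catalog = RestCatalog(
        name="smoke",
        warehouse=os.environ["R2_SQL_WAREHOUSE"],
        uri=os.environ["R2_SQL_CATALOG_URI"],
        token=os.environ["WRANGLER_R2_SQL_AUTH_TOKEN"],
    )
    table = catalog.load_table(("default", "events"))  # placeholder table
    return table.scan().to_arrow().num_rows

print(missing_config({}))  # with an empty env, all three names are reported
```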