Cloudflare Platform Skill

Comprehensive Cloudflare platform skill covering Workers, D1, R2, KV, AI, Durable Objects, and security.
references/r2-data-catalog/patterns.md
# Common Patterns

Practical patterns for R2 Data Catalog with PyIceberg.

## PyIceberg Connection

```python
import os
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

catalog = RestCatalog(
    name="r2_catalog",
    warehouse=os.getenv("R2_WAREHOUSE"),      # warehouse name for the catalog-enabled bucket
    uri=os.getenv("R2_CATALOG_URI"),          # catalog endpoint
    token=os.getenv("R2_TOKEN"),              # API token
)

# Create namespace (idempotent)
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass
```

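`os.getenv` returns `None` for an unset variable, so a missing setting only shows up later as a less obvious connection error. A minimal sketch of an up-front check, assuming the same three variable names as above; `load_catalog_settings` is a hypothetical helper, not part of PyIceberg:

```python
import os

# Variable names taken from the connection example above.
REQUIRED_VARS = ("R2_WAREHOUSE", "R2_CATALOG_URI", "R2_TOKEN")


def load_catalog_settings(env=None):
    """Return keyword arguments for the catalog, or raise if a variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "warehouse": env["R2_WAREHOUSE"],
        "uri": env["R2_CATALOG_URI"],
        "token": env["R2_TOKEN"],
    }
```

With PyIceberg installed, the result can be passed straight through, for example `RestCatalog(name="r2_catalog", **load_catalog_settings())`.
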
## Pattern 1: Log Analytics Pipeline

Ingest logs incrementally, query by time/level.

```python
import pyarrow as pa
from datetime import datetime
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestampType, StringType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

# Create partitioned table (once)
schema = Schema(
    NestedField(1, "timestamp", TimestampType(), required=True),
    NestedField(2, "level", StringType(), required=True),
    NestedField(3, "service", StringType(), required=True),
    NestedField(4, "message", StringType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)

catalog.create_namespace("logs")
table = catalog.create_table(("logs", "app_logs"), schema=schema, partition_spec=partition_spec)

# Append logs (incremental). Required Iceberg fields need non-nullable Arrow
# fields, so the Arrow schema is spelled out instead of inferred.
arrow_schema = pa.schema([
    pa.field("timestamp", pa.timestamp("us"), nullable=False),
    pa.field("level", pa.string(), nullable=False),
    pa.field("service", pa.string(), nullable=False),
    pa.field("message", pa.string(), nullable=True),
])
data = pa.table({
    "timestamp": [datetime(2026, 1, 27, 10, 30, 0)],
    "level": ["ERROR"],
    "service": ["auth-service"],
    "message": ["Failed login"],
}, schema=arrow_schema)
table.append(data)

# Query by time + level. Filter on the source column: "day" is a hidden
# partition field, not a table column, and is pruned from the timestamp range.
scan = table.scan(
    row_filter="level = 'ERROR' AND timestamp >= '2026-01-27T00:00:00' AND timestamp < '2026-01-28T00:00:00'"
)
errors = scan.to_pandas()
```

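The `day` partition is derived from `timestamp` by `DayTransform`, so a half-open range on `timestamp` covers exactly one calendar day. A small sketch that builds such a filter string; `day_range_filter` is a hypothetical helper, and it assumes the string filter syntax accepts ISO-8601 timestamp literals:

```python
from datetime import date, datetime, time, timedelta


def day_range_filter(column, day):
    """Build a half-open [day, day + 1) range filter on a timestamp column."""
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1)
    return f"{column} >= '{start.isoformat()}' AND {column} < '{end.isoformat()}'"


print(day_range_filter("timestamp", date(2026, 1, 27)))
# timestamp >= '2026-01-27T00:00:00' AND timestamp < '2026-01-28T00:00:00'
```

The result can be combined with other predicates, for example `"level = 'ERROR' AND " + day_range_filter("timestamp", some_day)`.
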
## Pattern 2: Time-Travel Queries

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Query specific snapshot
snapshot_id = table.current_snapshot().snapshot_id
data = table.scan(snapshot_id=snapshot_id).to_pandas()

# Query as of timestamp (yesterday): resolve the timestamp to a snapshot first,
# since scan() selects by snapshot_id (helper available in recent PyIceberg releases)
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
snapshot = table.snapshot_as_of_timestamp(yesterday_ms)
if snapshot is not None:  # None when no snapshot existed at that time
    data = table.scan(snapshot_id=snapshot.snapshot_id).to_pandas()
```

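Conceptually, an as-of query resolves to the latest snapshot committed at or before the requested time. A plain-Python sketch of that rule over hypothetical `(snapshot_id, committed_at_ms)` pairs; it illustrates the selection logic only and does not call PyIceberg:

```python
def snapshot_as_of(snapshots, timestamp_ms):
    """Return the id of the latest snapshot committed at or before timestamp_ms."""
    eligible = [snap for snap in snapshots if snap[1] <= timestamp_ms]
    if not eligible:
        return None  # the table had no snapshot yet at that time
    return max(eligible, key=lambda snap: snap[1])[0]


# Hypothetical history: (snapshot_id, committed_at_ms)
history = [(101, 1_000), (102, 2_000), (103, 3_000)]
print(snapshot_as_of(history, 2_500))  # 102
print(snapshot_as_of(history, 500))    # None
```
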
## Pattern 3: Schema Evolution

```python
from pyiceberg.types import StringType

table = catalog.load_table(("users", "profiles"))

with table.update_schema() as update:
    update.add_column("email", StringType(), required=False)
    update.rename_column("name", "full_name")
# Old readers ignore new columns, new readers see nulls for old data
```

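## Pattern 4: Partitioned Tables

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform, IdentityTransform
from pyiceberg.types import NestedField, StringType, TimestampType

# Example schema (column names are illustrative); source_id below refers to these field IDs
schema = Schema(
    NestedField(1, "event_time", TimestampType(), required=True),
    NestedField(2, "country", StringType(), required=True),
)

# Partition by day + country
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"),
    PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="country"),
)
table = catalog.create_table(("events", "user_events"), schema=schema, partition_spec=partition_spec)

# Queries prune partitions automatically when they filter on the source columns
scan = table.scan(
    row_filter="country = 'US' AND event_time >= '2026-01-27T00:00:00' AND event_time < '2026-01-28T00:00:00'"
)
```
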
## Pattern 5: Table Maintenance

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Compact → expire → cleanup (in order)
# NOTE: method names follow api.md; confirm that your installed PyIceberg
# version provides them before relying on this sequence.
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

See [api.md](api.md#table-maintenance) for detailed parameters.

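The cutoffs above are epoch milliseconds computed inline. A sketch of a small helper that makes the calculation explicit and testable; `cutoff_ms` and the fixed `now` value are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone


def cutoff_ms(days, now=None):
    """Epoch milliseconds for the instant `days` days before `now` (default: current UTC time)."""
    now = datetime.now(timezone.utc) if now is None else now
    return int((now - timedelta(days=days)).timestamp() * 1000)


# A fixed reference time keeps the example deterministic.
fixed_now = datetime(2026, 1, 27, tzinfo=timezone.utc)
expire_before = cutoff_ms(7, fixed_now)    # snapshot expiry cutoff
orphans_before = cutoff_ms(3, fixed_now)   # orphan cleanup cutoff
assert expire_before < orphans_before
```
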
## Pattern 6: Concurrent Writes with Retry

```python
from pyiceberg.exceptions import CommitFailedException
import time

def append_with_retry(table, data, max_retries=3):
    for attempt in range(max_retries):
        try:
            table.append(data)
            return
        except CommitFailedException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1s, 2s, ...
            table.refresh()           # reload metadata committed by the other writer
```

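The retry loop generalizes to any commit operation. A sketch with exponential backoff plus random jitter, which helps concurrent writers avoid retrying in lockstep; `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only to make the helper testable:

```python
import random
import time


def retry_with_backoff(operation, retryable, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a retryable exception wait and try again, up to max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return operation()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Jitter: wait between delay and 2 * delay.
            sleep(delay + random.uniform(0, delay))
```

With PyIceberg, a call might look like `retry_with_backoff(lambda: table.append(data), CommitFailedException)`; refreshing the table inside the callable before each attempt is left to the caller.
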
## Pattern 7: Upsert Simulation

```python
import pandas as pd
import pyarrow as pa

# Read → merge → overwrite (not atomic, use Spark MERGE INTO for production)
existing = table.scan().to_pandas()
new_data = pd.DataFrame({"id": [1, 3], "value": [100, 300]})
merged = pd.concat([existing, new_data]).drop_duplicates(subset=["id"], keep="last")
# preserve_index=False keeps the pandas index out of the Arrow table
table.overwrite(pa.Table.from_pandas(merged, preserve_index=False))
```

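The merge step keeps the newest row per key. The same rule in plain Python, as an illustration of the semantics only (row order may differ from the pandas result); `merge_by_key` is a hypothetical helper:

```python
def merge_by_key(existing, incoming, key="id"):
    """Return rows from both inputs, with incoming rows replacing existing rows that share a key."""
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row  # update a matching key or add a new one
    return list(merged.values())


existing = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
incoming = [{"id": 1, "value": 100}, {"id": 3, "value": 300}]
print(merge_by_key(existing, incoming))
# [{'id': 1, 'value': 100}, {'id': 2, 'value': 20}, {'id': 3, 'value': 300}]
```
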
## Pattern 8: DuckDB Integration

```python
import duckdb

arrow_table = table.scan().to_arrow()
con = duckdb.connect()
con.register("logs", arrow_table)
result = con.execute("SELECT level, COUNT(*) FROM logs GROUP BY level").fetchdf()
```

## Pattern 9: Monitor Table Health

```python
# plan_files() yields scan tasks; each task's data file carries its size
tasks = list(table.scan().plan_files())
sizes = [task.file.file_size_in_bytes for task in tasks]
avg_mb = sum(sizes) / len(sizes) / (1024**2) if sizes else 0.0
print(f"Files: {len(sizes)}, Avg: {avg_mb:.1f}MB, Snapshots: {len(table.snapshots())}")

if sizes and (avg_mb < 10 or len(sizes) > 1000):
    print("⚠️ Needs compaction")
```

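The health check can be factored into a pure function so the thresholds are easy to test and tune. A sketch, assuming file sizes in bytes have already been collected from the scan plan; `compaction_report` is a hypothetical helper whose default thresholds mirror the check above:

```python
def compaction_report(file_sizes_bytes, min_avg_mb=10, max_files=1000):
    """Summarize data file sizes and flag tables with many or small files."""
    count = len(file_sizes_bytes)
    avg_mb = sum(file_sizes_bytes) / count / 1024**2 if count else 0.0
    # An empty table never needs compaction.
    needs = count > 0 and (avg_mb < min_avg_mb or count > max_files)
    return {"files": count, "avg_mb": avg_mb, "needs_compaction": needs}


MB = 1024**2
print(compaction_report([1 * MB, 1 * MB, 1 * MB])["needs_compaction"])  # True
print(compaction_report([128 * MB, 128 * MB])["needs_compaction"])      # False
```
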
## Best Practices

| Area | Guideline |
|------|-----------|
| **Partitioning** | Use day/hour for time-series; aim for 100-1000 partitions; avoid high-cardinality columns |
| **File sizes** | Target 128-512MB; compact when avg <10MB or >10k files |
| **Schema** | Add columns as nullable (`required=False`); batch changes |
| **Maintenance** | Compact high-write tables daily/weekly; expire snapshots after 7-30d; clean up orphans afterwards |
| **Concurrency** | Concurrent reads need no coordination; writes to different partitions are safe; retry on conflicts within the same partition |
| **Performance** | Filter on partition source columns; select only needed columns; batch appends to 100MB+ |