mirror of
https://github.com/ksyasuda/dotfiles.git
synced 2026-03-21 18:11:27 -07:00

# Cloudflare R2 Data Catalog Skill Reference

Expert guidance for Cloudflare R2 Data Catalog, an Apache Iceberg catalog built into R2 buckets.

## Reading Order

**New to R2 Data Catalog?** Start here:

1. Read "What is R2 Data Catalog?" and "When to Use" below
2. [configuration.md](configuration.md) - Enable catalog, create tokens
3. [patterns.md](patterns.md) - PyIceberg setup and common patterns
4. [api.md](api.md) - REST API reference as needed
5. [gotchas.md](gotchas.md) - Troubleshooting when issues arise

**Quick reference?** Jump to:

- [Enable catalog on bucket](configuration.md#enable-catalog-on-bucket)
- [PyIceberg connection pattern](patterns.md#pyiceberg-connection-pattern)
- [Permission errors](gotchas.md#permission-errors)

## What is R2 Data Catalog?

R2 Data Catalog is a **managed Apache Iceberg REST catalog** built directly into R2 buckets. It provides:

- **Apache Iceberg tables** - ACID transactions, schema evolution, time-travel queries
- **Zero-egress costs** - Query from any cloud/region without data transfer fees
- **Standard REST API** - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
- **No infrastructure** - Fully managed, no catalog servers to run
- **Public beta** - Available to all R2 subscribers, no extra cost beyond R2 storage

### What is Apache Iceberg?

An open table format for analytics datasets in object storage. Features:

- **ACID transactions** - Safe concurrent reads/writes
- **Metadata optimization** - Fast queries without full scans
- **Schema evolution** - Add/rename/delete columns without rewrites
- **Time-travel** - Query historical snapshots
- **Partitioning** - Organize data for efficient queries

## When to Use

**Use R2 Data Catalog for:**

- **Log analytics** - Store and query application/system logs
- **Data lakes/warehouses** - Analytical datasets queried by multiple engines
- **BI pipelines** - Aggregate data for dashboards and reports
- **Multi-cloud analytics** - Share data across clouds without egress fees
- **Time-series data** - Event streams, metrics, sensor data

**Don't use it for:**

- **Transactional workloads** - Use D1 or an external database instead
- **Sub-second latency** - Iceberg is optimized for batch/analytical queries
- **Small datasets (<1GB)** - Setup overhead is not worth it
- **Unstructured data** - Store files directly in R2, not as Iceberg tables

## Architecture

```
┌─────────────────────────────────────────────────┐
│              Query Engines                      │
│  (PyIceberg, Spark, Trino, Snowflake, DuckDB)   │
└────────────────┬────────────────────────────────┘
                 │
                 │ REST API (OAuth2 token)
                 ▼
┌─────────────────────────────────────────────────┐
│ R2 Data Catalog (Managed Iceberg REST Catalog)  │
│  • Namespace/table metadata                     │
│  • Transaction coordination                     │
│  • Snapshot management                          │
└────────────────┬────────────────────────────────┘
                 │
                 │ Vended credentials
                 ▼
┌─────────────────────────────────────────────────┐
│              R2 Bucket Storage                  │
│  • Parquet data files                           │
│  • Metadata files                               │
│  • Manifest files                               │
└─────────────────────────────────────────────────┘
```

**Key concepts:**

- **Catalog URI** - REST endpoint for catalog operations (e.g., `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>`)
- **Warehouse** - Logical grouping of tables (typically same as bucket name)
- **Namespace** - Schema/database containing tables (e.g., `logs`, `analytics`)
- **Table** - Iceberg table with schema, data files, snapshots
- **Vended credentials** - Temporary S3 credentials catalog provides for data access
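
As an illustration of how these values fit together, the connection settings can be derived from just the account ID and bucket name (a minimal sketch; `abc123` and `my-bucket` are placeholders, not real identifiers):

```python
# Sketch: derive catalog connection settings from account ID + bucket.
# "abc123" and "my-bucket" are placeholder values.
def catalog_settings(account_id: str, bucket: str) -> dict:
    return {
        "uri": f"https://{account_id}.r2.cloudflarestorage.com/iceberg/{bucket}",
        "warehouse": bucket,  # warehouse typically matches the bucket name
    }

print(catalog_settings("abc123", "my-bucket")["uri"])
# → https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket
```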

## Limits

| Resource | Limit | Notes |
|----------|-------|-------|
| Namespaces per catalog | No hard limit | Organize tables logically |
| Tables per namespace | <10,000 recommended | Performance degrades beyond this |
| Files per table | <100,000 recommended | Run compaction regularly |
| Snapshots per table | Configurable retention | Expire snapshots older than 7 days |
| Partitions per table | 100-1,000 optimal | Too many = slow metadata ops |
| Table size | Same as R2 bucket | 10GB-10TB+ common |
| API rate limits | Standard R2 API limits | Shared with R2 storage operations |
| Target file size | 128-512 MB | After compaction |

## Current Status

**Public Beta** (as of Jan 2026)

- Available to all R2 subscribers
- No extra cost beyond standard R2 storage/operations
- Production-ready, but breaking changes possible
- Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance

## Decision Tree: Is R2 Data Catalog Right For You?

```
Start → Need analytics on object storage data?
          │
          ├─ No → Use R2 directly for object storage
          │
          └─ Yes → Dataset >1GB with structured schema?
                     │
                     ├─ No → Too small, use R2 + ad-hoc queries
                     │
                     └─ Yes → Need ACID transactions or schema evolution?
                                │
                                ├─ No → Consider simpler solutions (Parquet on R2)
                                │
                                └─ Yes → Need multi-cloud/multi-tool access?
                                           │
                                           ├─ No → D1 or external DB may be simpler
                                           │
                                           └─ Yes → ✅ Use R2 Data Catalog
```

**Quick check:** If you answer "yes" to all:

- Dataset >1GB and growing
- Structured/tabular data (logs, events, metrics)
- Multiple query tools or cloud environments
- Need versioning, schema changes, or concurrent access

→ R2 Data Catalog is a good fit.
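
For illustration only, the quick check above can be reduced to a tiny predicate (the parameter names are made up here, not part of any API):

```python
# All four checklist answers must be "yes" for a good fit.
def good_fit(over_1gb: bool, structured: bool,
             multi_tool: bool, needs_versioning: bool) -> bool:
    return all([over_1gb, structured, multi_tool, needs_versioning])

print(good_fit(True, True, True, True))   # → True
print(good_fit(True, True, False, True))  # single tool/cloud → False
```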

## In This Reference

- **[configuration.md](configuration.md)** - Enable catalog, create API tokens, connect clients
- **[api.md](api.md)** - REST endpoints, operations, maintenance
- **[patterns.md](patterns.md)** - PyIceberg examples, common use cases
- **[gotchas.md](gotchas.md)** - Troubleshooting, best practices, limitations

## See Also

- [Cloudflare R2 Data Catalog Docs](https://developers.cloudflare.com/r2/data-catalog/)
- [Apache Iceberg Docs](https://iceberg.apache.org/)
- [PyIceberg Docs](https://py.iceberg.apache.org/)

# API Reference

R2 Data Catalog exposes the standard [Apache Iceberg REST Catalog API](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml).

## Quick Reference

**Most common operations:**

| Task | PyIceberg Code |
|------|----------------|
| Connect | `RestCatalog(name="r2", warehouse=bucket, uri=uri, token=token)` |
| List namespaces | `catalog.list_namespaces()` |
| Create namespace | `catalog.create_namespace("logs")` |
| Create table | `catalog.create_table(("ns", "table"), schema=schema)` |
| Load table | `catalog.load_table(("ns", "table"))` |
| Append data | `table.append(pyarrow_table)` |
| Query data | `table.scan().to_pandas()` |
| Compact files | `table.rewrite_data_files(target_file_size_bytes=128*1024*1024)` |
| Expire snapshots | `table.expire_snapshots(older_than=timestamp_ms, retain_last=10)` |

## REST Endpoints

Base: `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>`

| Operation | Method | Path |
|-----------|--------|------|
| Catalog config | GET | `/v1/config` |
| List namespaces | GET | `/v1/namespaces` |
| Create namespace | POST | `/v1/namespaces` |
| Delete namespace | DELETE | `/v1/namespaces/{ns}` |
| List tables | GET | `/v1/namespaces/{ns}/tables` |
| Create table | POST | `/v1/namespaces/{ns}/tables` |
| Load table | GET | `/v1/namespaces/{ns}/tables/{table}` |
| Update table | POST | `/v1/namespaces/{ns}/tables/{table}` |
| Delete table | DELETE | `/v1/namespaces/{ns}/tables/{table}` |
| Rename table | POST | `/v1/tables/rename` |

**Authentication:** Bearer token in header: `Authorization: Bearer <token>`
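
For example, listing namespaces over raw REST composes the request like this (a sketch using Python's standard library; the host and token are placeholders, and the request is only built here, not sent):

```python
import urllib.request

# Placeholder base URL; substitute your real account ID and bucket name.
BASE = "https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>"

# Build (but do not send) a GET /v1/namespaces request with the Bearer token.
req = urllib.request.Request(
    BASE + "/v1/namespaces",
    headers={"Authorization": "Bearer <token>"},
    method="GET",
)
print(req.full_url)
print(req.get_method())  # → GET
```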

## PyIceberg Client API

Most users interact through PyIceberg rather than the raw REST API.

### Connection

```python
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog",
    warehouse="<bucket-name>",
    uri="<catalog-uri>",
    token="<api-token>",
)
```

### Namespace Operations

```python
from pyiceberg.exceptions import NamespaceAlreadyExistsError

namespaces = catalog.list_namespaces()  # [('default',), ('logs',)]
catalog.create_namespace("logs", properties={"owner": "team"})
catalog.drop_namespace("logs")  # Must be empty
```

### Table Operations

```python
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, IntegerType

schema = Schema(
    NestedField(1, "id", IntegerType(), required=True),
    NestedField(2, "name", StringType(), required=False),
)
table = catalog.create_table(("logs", "app_logs"), schema=schema)
tables = catalog.list_tables("logs")
table = catalog.load_table(("logs", "app_logs"))
catalog.rename_table(("logs", "old"), ("logs", "new"))
```

### Data Operations

```python
import pyarrow as pa

data = pa.table({"id": [1, 2], "name": ["Alice", "Bob"]})
table.append(data)
table.overwrite(data)

# Read with filters
scan = table.scan(row_filter="id > 100", selected_fields=["id", "name"])
df = scan.to_pandas()
```

### Schema Evolution

```python
from pyiceberg.types import IntegerType, LongType

with table.update_schema() as update:
    update.add_column("user_id", IntegerType(), doc="User ID")
    update.rename_column("msg", "message")
    update.delete_column("old_field")
    update.update_column("id", field_type=LongType())  # int→long only
```

### Time-Travel

```python
from datetime import datetime, timedelta

# Query specific snapshot or timestamp
scan = table.scan(snapshot_id=table.snapshots()[-2].snapshot_id)
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
scan = table.scan(as_of_timestamp=yesterday_ms)
```

### Partitioning

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform
from pyiceberg.types import TimestampType

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)
table = catalog.create_table(("events", "actions"), schema=schema, partition_spec=partition_spec)
scan = table.scan(row_filter="day = '2026-01-27'")  # Prunes partitions
```

## Table Maintenance

### Compaction

```python
files = table.scan().plan_files()
avg_mb = sum(f.file_size_in_bytes for f in files) / len(files) / (1024**2)
print(f"Files: {len(files)}, Avg: {avg_mb:.1f} MB")

table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
```

**When:** Avg <10MB or >1000 files. **Frequency:** High-write daily, medium weekly.
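
That rule of thumb can be written down as a tiny helper (illustrative only, not a PyIceberg API; feed it the file stats computed as above):

```python
def needs_compaction(avg_mb: float, n_files: int) -> bool:
    """True when average file size < 10 MB or file count > 1000."""
    return avg_mb < 10 or n_files > 1000

print(needs_compaction(avg_mb=5.0, n_files=200))   # many tiny files → True
print(needs_compaction(avg_mb=150.0, n_files=40))  # healthy table → False
```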

### Snapshot Expiration

```python
from datetime import datetime, timedelta

seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
```

**Retention:** Production 7-30d, dev 1-7d, audit 90+d.

### Orphan Cleanup

```python
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

⚠️ Always expire snapshots first, use a 3+ day threshold, and run during low traffic.

### Full Maintenance

```python
# Compact → Expire → Cleanup (in order)
if len(table.scan().plan_files()) > 1000:
    table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

## Metadata Inspection

```python
table = catalog.load_table(("logs", "app_logs"))
print(table.schema())
print(table.current_snapshot())
print(table.properties)
print(f"Files: {len(table.scan().plan_files())}")
```

## Error Codes

| Code | Meaning | Common Causes |
|------|---------|---------------|
| 401 | Unauthorized | Invalid/missing token |
| 404 | Not Found | Catalog not enabled, namespace/table missing |
| 409 | Conflict | Already exists, concurrent update |
| 422 | Validation | Invalid schema, incompatible type |

See [gotchas.md](gotchas.md) for detailed troubleshooting.

# Configuration

How to enable R2 Data Catalog and configure authentication.

## Prerequisites

- Cloudflare account with an [R2 subscription](https://developers.cloudflare.com/r2/pricing/)
- R2 bucket created
- Access to the Cloudflare dashboard or Wrangler CLI

## Enable Catalog on Bucket

Choose one method:

### Via Wrangler (Recommended)

```bash
npx wrangler r2 bucket catalog enable <BUCKET_NAME>
```

**Output:**

```
✅ Data Catalog enabled for bucket 'my-bucket'
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket
Warehouse: my-bucket
```

### Via Dashboard

1. Navigate to **R2** → Select your bucket → **Settings** tab
2. Scroll to the "R2 Data Catalog" section → Click **Enable**
3. Note the **Catalog URI** and **Warehouse name** shown

**Result:**

- Catalog URI: `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>`
- Warehouse: `<bucket-name>` (same as bucket name)

### Via API (Programmatic)

```bash
curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/<account-id>/r2/buckets/<bucket>/catalog" \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json"
```

**Response:**

```json
{
  "result": {
    "catalog_uri": "https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>",
    "warehouse": "<bucket>"
  },
  "success": true
}
```

## Check Catalog Status

```bash
npx wrangler r2 bucket catalog status <BUCKET_NAME>
```

**Output:**

```
Catalog Status: enabled
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket
Warehouse: my-bucket
```

## Disable Catalog (If Needed)

```bash
npx wrangler r2 bucket catalog disable <BUCKET_NAME>
```

⚠️ **Warning:** Disabling does NOT delete tables/data. Files remain in the bucket. Metadata becomes inaccessible until re-enabled.

## API Token Creation

R2 Data Catalog requires an API token with **both** R2 Storage and R2 Data Catalog permissions.

### Dashboard Method (Recommended)

1. Go to **R2** → **Manage R2 API Tokens** → **Create API Token**
2. Select a permission level:
   - **Admin Read & Write** - Full catalog + storage access (read/write)
   - **Admin Read only** - Read-only access (for query engines)
3. Copy the token value immediately (shown only once)

**Permission groups included:**

- `Workers R2 Data Catalog Write` (or Read)
- `Workers R2 Storage Bucket Item Write` (or Read)

### API Method (Programmatic)

Use the Cloudflare API to create tokens programmatically. Required permissions:

- `Workers R2 Data Catalog Write` (or Read)
- `Workers R2 Storage Bucket Item Write` (or Read)

## Client Configuration

### PyIceberg

```python
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog",
    warehouse="<bucket-name>",  # Same as bucket name
    uri="<catalog-uri>",        # From enable command
    token="<api-token>",        # From token creation
)
```

**Full example with credentials:**

```python
import os
from pyiceberg.catalog.rest import RestCatalog

# Store credentials in environment variables
WAREHOUSE = os.getenv("R2_WAREHOUSE")      # e.g., "my-bucket"
CATALOG_URI = os.getenv("R2_CATALOG_URI")  # e.g., "https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket"
TOKEN = os.getenv("R2_TOKEN")              # API token

catalog = RestCatalog(
    name="r2_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Test connection
print(catalog.list_namespaces())
```

### Spark / Trino / DuckDB

See [patterns.md](patterns.md) for integration examples with other query engines.

## Connection String Format

For quick reference:

```
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>
Warehouse:   <bucket-name>
Token:       <r2-api-token>
```

**Where to find values:**

| Value | Source |
|-------|--------|
| `<account-id>` | Dashboard URL or `wrangler whoami` |
| `<bucket>` | R2 bucket name |
| Catalog URI | Output from `wrangler r2 bucket catalog enable` |
| Token | R2 API Token creation page |

## Security Best Practices

1. **Store tokens securely** - Use environment variables or secret managers; never hardcode
2. **Use least privilege** - Read-only tokens for query engines, write tokens only where needed
3. **Rotate tokens regularly** - Create new tokens, test, then revoke old ones
4. **One token per application** - Easier to track and revoke if compromised
5. **Monitor token usage** - Check R2 analytics for unexpected patterns
6. **Bucket-scoped tokens** - Create tokens per bucket, not account-wide

## Environment Variables Pattern

```bash
# .env (never commit)
R2_CATALOG_URI=https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>
R2_WAREHOUSE=<bucket-name>
R2_TOKEN=<api-token>
```

```python
import os
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="r2",
    uri=os.getenv("R2_CATALOG_URI"),
    warehouse=os.getenv("R2_WAREHOUSE"),
    token=os.getenv("R2_TOKEN"),
)
```

## Troubleshooting

| Problem | Solution |
|---------|----------|
| 404 "catalog not found" | Run `wrangler r2 bucket catalog enable <bucket>` |
| 401 "unauthorized" | Check token has both Catalog + Storage permissions |
| 403 on data files | Token needs both permission groups |

See [gotchas.md](gotchas.md) for detailed troubleshooting.

# Gotchas & Troubleshooting

Common problems → causes → solutions.

## Permission Errors

### 401 Unauthorized

**Error:** `"401 Unauthorized"`
**Cause:** Token missing R2 Data Catalog permissions.
**Solution:** Use an "Admin Read & Write" token (includes catalog + storage permissions). Test with `catalog.list_namespaces()`.

### 403 Forbidden

**Error:** `"403 Forbidden"` on data files
**Cause:** Token lacks storage permissions.
**Solution:** The token needs both R2 Data Catalog and R2 Storage Bucket Item permissions.

### Token Rotation Issues

**Error:** New token fails after rotation.
**Solution:** Create new token → test in staging → update prod → monitor 24h → revoke old.

## Catalog URI Issues

### 404 Not Found

**Error:** `"404 Catalog not found"`
**Cause:** Catalog not enabled or wrong URI.
**Solution:** Run `wrangler r2 bucket catalog enable <bucket>`. The URI must be HTTPS, include `/iceberg/`, and use the case-sensitive bucket name.

### Wrong Warehouse

**Error:** Cannot create/load tables.
**Cause:** Warehouse ≠ bucket name.
**Solution:** Set `warehouse="bucket-name"` to match the bucket exactly.

## Table and Schema Issues

### Table/Namespace Already Exists

**Error:** `"TableAlreadyExistsError"`
**Solution:** Use try/except to load the existing table, or check first.

### Namespace Not Found

**Error:** Cannot create table.
**Solution:** Create the namespace first: `catalog.create_namespace("ns")`

### Schema Evolution Errors

**Error:** `"422 Validation"` on schema update.
**Cause:** Incompatible change (required field, type shrink).
**Solution:** Only add nullable columns, compatible type widening (int→long, float→double).
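
That rule can be captured as a small compatibility check (a sketch with type names as plain strings, for illustration only):

```python
# Widenings allowed per the rule above; anything else is rejected.
SAFE_WIDENINGS = {("int", "long"), ("float", "double")}

def is_safe_type_change(old: str, new: str) -> bool:
    return old == new or (old, new) in SAFE_WIDENINGS

print(is_safe_type_change("int", "long"))   # → True
print(is_safe_type_change("long", "int"))   # narrowing → False
```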

## Data and Query Issues

### Empty Scan Results

**Error:** Scan returns no data.
**Cause:** Incorrect filter or partition column.
**Solution:** Test without a filter first: `table.scan().to_pandas()`. Verify partition column names.

### Slow Queries

**Error:** Performance degrades over time.
**Cause:** Too many small files.
**Solution:** Check the file count; compact if >1000 files or avg <10MB. See [api.md](api.md#compaction).

### Type Mismatch

**Error:** `"Cannot cast"` on append.
**Cause:** PyArrow types don't match the Iceberg schema.
**Solution:** Cast to int64 (Iceberg default), not int32. Check `table.schema()`.

## Compaction Issues

**Problem:** File count unchanged or compaction takes hours.
**Cause:** Target size too large, or table too big for PyIceberg.
**Solution:** Only compact if avg <50MB. For >1TB tables, use Spark. Run during low-traffic periods.

## Maintenance Issues

### Snapshot/Orphan Issues

**Problem:** Expiration fails or orphan cleanup deletes active data.
**Cause:** Too-aggressive retention or wrong order.
**Solution:** Always expire snapshots first with `retain_last=10`, then clean up orphans with a 3+ day threshold.

## Concurrency Issues

### Concurrent Write Conflicts

**Problem:** `CommitFailedException` with multiple writers.
**Cause:** Optimistic locking - simultaneous commits.
**Solution:** Add retry with exponential backoff (see [patterns.md](patterns.md#pattern-6-concurrent-writes-with-retry)).

### Stale Metadata

**Problem:** Old schema/data after an external update.
**Cause:** Cached metadata.
**Solution:** Reload the table: `table = catalog.load_table(("ns", "table"))`

## Performance Optimization

**Scans:** Use `row_filter` and `selected_fields` to reduce data scanned.
**Partitions:** 100-1000 optimal. Avoid high cardinality (millions) or low (<10).
**Files:** Keep 100-500MB avg. Compact if <10MB or >10k files.

## Limits

| Resource | Recommended | Impact if Exceeded |
|----------|-------------|-------------------|
| Tables/namespace | <10k | Slow list ops |
| Files/table | <100k | Slow query planning |
| Partitions/table | 100-1k | Metadata overhead |
| Snapshots/table | Expire >7d | Metadata bloat |

## Common Error Messages Reference

| Error Message | Likely Cause | Fix |
|---------------|--------------|-----|
| `401 Unauthorized` | Missing/invalid token | Check token has catalog+storage permissions |
| `403 Forbidden` | Token lacks storage permissions | Add R2 Storage Bucket Item permission |
| `404 Not Found` | Catalog not enabled or wrong URI | Run `wrangler r2 bucket catalog enable` |
| `409 Conflict` | Table/namespace already exists | Use try/except or load existing |
| `422 Unprocessable Entity` | Schema validation failed | Check type compatibility, required fields |
| `CommitFailedException` | Concurrent write conflict | Add retry logic with backoff |
| `NamespaceAlreadyExistsError` | Namespace exists | Use try/except or load existing |
| `NoSuchTableError` | Table doesn't exist | Check namespace+table name, create first |
| `TypeError: Cannot cast` | PyArrow type mismatch | Cast data to match Iceberg schema |
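
The table above lends itself to a quick lookup helper (illustrative only, not part of any library):

```python
# Map HTTP status codes from the table above to a one-line fix hint.
FIXES = {
    401: "Check token has catalog + storage permissions",
    403: "Add R2 Storage Bucket Item permission",
    404: "Run `wrangler r2 bucket catalog enable`",
    409: "Use try/except or load existing",
    422: "Check type compatibility and required fields",
}

def advice(status: int) -> str:
    return FIXES.get(status, "See the debugging checklist below")

print(advice(404))  # → Run `wrangler r2 bucket catalog enable`
```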

## Debugging Checklist

When things go wrong, check in order:

1. ✅ **Catalog enabled:** `npx wrangler r2 bucket catalog status <bucket>`
2. ✅ **Token permissions:** Both R2 Data Catalog + R2 Storage in dashboard
3. ✅ **Connection test:** `catalog.list_namespaces()` succeeds
4. ✅ **URI format:** HTTPS, includes `/iceberg/`, correct bucket name
5. ✅ **Warehouse name:** Matches bucket name exactly
6. ✅ **Namespace exists:** Create before `create_table()`
7. ✅ **Enable debug logging:** `logging.basicConfig(level=logging.DEBUG)`
8. ✅ **PyIceberg version:** `pip install --upgrade pyiceberg` (≥0.5.0)
9. ✅ **File health:** Compact if >1000 files or avg <10MB
10. ✅ **Snapshot count:** Expire if >100 snapshots

## Enable Debug Logging

```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Now operations show HTTP requests/responses
```

## Resources

- [Cloudflare Community](https://community.cloudflare.com/c/developers/workers/40)
- [Cloudflare Discord](https://discord.cloudflare.com) - #r2 channel
- [PyIceberg GitHub](https://github.com/apache/iceberg-python/issues)
- [Apache Iceberg Slack](https://iceberg.apache.org/community/)

## Next Steps

- [patterns.md](patterns.md) - Working examples
- [api.md](api.md) - API reference

# Common Patterns

Practical patterns for R2 Data Catalog with PyIceberg.

## PyIceberg Connection

```python
import os
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

catalog = RestCatalog(
    name="r2_catalog",
    warehouse=os.getenv("R2_WAREHOUSE"),  # bucket name
    uri=os.getenv("R2_CATALOG_URI"),      # catalog endpoint
    token=os.getenv("R2_TOKEN"),          # API token
)

# Create namespace (idempotent)
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass
```

## Pattern 1: Log Analytics Pipeline

Ingest logs incrementally, query by time/level.

```python
import pyarrow as pa
from datetime import datetime
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestampType, StringType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

# Create partitioned table (once)
schema = Schema(
    NestedField(1, "timestamp", TimestampType(), required=True),
    NestedField(2, "level", StringType(), required=True),
    NestedField(3, "service", StringType(), required=True),
    NestedField(4, "message", StringType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)

catalog.create_namespace("logs")
table = catalog.create_table(("logs", "app_logs"), schema=schema, partition_spec=partition_spec)

# Append logs (incremental)
data = pa.table({
    "timestamp": [datetime(2026, 1, 27, 10, 30, 0)],
    "level": ["ERROR"],
    "service": ["auth-service"],
    "message": ["Failed login"],
})
table.append(data)

# Query by time + level (leverages partitioning)
scan = table.scan(row_filter="level = 'ERROR' AND day = '2026-01-27'")
errors = scan.to_pandas()
```

## Pattern 2: Time-Travel Queries

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Query specific snapshot
snapshot_id = table.current_snapshot().snapshot_id
data = table.scan(snapshot_id=snapshot_id).to_pandas()

# Query as of timestamp (yesterday)
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
data = table.scan(as_of_timestamp=yesterday_ms).to_pandas()
```

## Pattern 3: Schema Evolution

```python
from pyiceberg.types import StringType

table = catalog.load_table(("users", "profiles"))

with table.update_schema() as update:
    update.add_column("email", StringType(), required=False)
    update.rename_column("name", "full_name")
    # Old readers ignore new columns, new readers see nulls for old data
```

## Pattern 4: Partitioned Tables

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform, IdentityTransform

# Partition by day + country
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"),
    PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="country"),
)
table = catalog.create_table(("events", "user_events"), schema=schema, partition_spec=partition_spec)

# Queries prune partitions automatically
scan = table.scan(row_filter="country = 'US' AND day = '2026-01-27'")
```

## Pattern 5: Table Maintenance

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Compact → expire → cleanup (in order)
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

See [api.md](api.md#table-maintenance) for detailed parameters.

## Pattern 6: Concurrent Writes with Retry

```python
import time

from pyiceberg.exceptions import CommitFailedException

def append_with_retry(table, data, max_retries=3):
    for attempt in range(max_retries):
        try:
            table.append(data)
            return
        except CommitFailedException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
```

## Pattern 7: Upsert Simulation

```python
import pandas as pd
import pyarrow as pa

# Read → merge → overwrite (not atomic; use Spark MERGE INTO for production)
existing = table.scan().to_pandas()
new_data = pd.DataFrame({"id": [1, 3], "value": [100, 300]})
merged = pd.concat([existing, new_data]).drop_duplicates(subset=["id"], keep="last")
table.overwrite(pa.Table.from_pandas(merged))
```

## Pattern 8: DuckDB Integration

```python
import duckdb

arrow_table = table.scan().to_arrow()
con = duckdb.connect()
con.register("logs", arrow_table)
result = con.execute("SELECT level, COUNT(*) FROM logs GROUP BY level").fetchdf()
```

## Pattern 9: Monitor Table Health

```python
files = table.scan().plan_files()
avg_mb = sum(f.file_size_in_bytes for f in files) / len(files) / (1024**2)
print(f"Files: {len(files)}, Avg: {avg_mb:.1f}MB, Snapshots: {len(table.snapshots())}")

if avg_mb < 10 or len(files) > 1000:
    print("⚠️ Needs compaction")
```

## Best Practices

| Area | Guideline |
|------|-----------|
| **Partitioning** | Use day/hour for time-series; 100-1000 partitions; avoid high cardinality |
| **File sizes** | Target 128-512MB; compact when avg <10MB or >10k files |
| **Schema** | Add columns as nullable (`required=False`); batch changes |
| **Maintenance** | Compact high-write daily/weekly; expire snapshots 7-30d; clean up orphans after |
| **Concurrency** | Reads automatic; writes to different partitions safe; retry same partition |
| **Performance** | Filter on partitions; select only needed columns; batch appends 100MB+ |