mirror of
https://github.com/ksyasuda/dotfiles.git
synced 2026-03-21 18:11:27 -07:00

# Cloudflare R2 Data Catalog Skill Reference

Expert guidance for Cloudflare R2 Data Catalog, an Apache Iceberg catalog built into R2 buckets.

## Reading Order

**New to R2 Data Catalog?** Start here:

1. Read "What is R2 Data Catalog?" and "When to Use" below
2. [configuration.md](configuration.md) - Enable catalog, create tokens
3. [patterns.md](patterns.md) - PyIceberg setup and common patterns
4. [api.md](api.md) - REST API reference as needed
5. [gotchas.md](gotchas.md) - Troubleshooting when issues arise

**Quick reference?** Jump to:

- [Enable catalog on bucket](configuration.md#enable-catalog-on-bucket)
- [PyIceberg connection pattern](patterns.md#pyiceberg-connection-pattern)
- [Permission errors](gotchas.md#permission-errors)

## What is R2 Data Catalog?

R2 Data Catalog is a **managed Apache Iceberg REST catalog** built directly into R2 buckets. It provides:

- **Apache Iceberg tables** - ACID transactions, schema evolution, time-travel queries
- **Zero-egress costs** - Query from any cloud/region without data transfer fees
- **Standard REST API** - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
- **No infrastructure** - Fully managed, no catalog servers to run
- **Public beta** - Available to all R2 subscribers, no extra cost beyond R2 storage

### What is Apache Iceberg?

An open table format for analytics datasets in object storage. Features:

- **ACID transactions** - Safe concurrent reads/writes
- **Metadata optimization** - Fast queries without full scans
- **Schema evolution** - Add/rename/delete columns without rewrites
- **Time-travel** - Query historical snapshots
- **Partitioning** - Organize data for efficient queries

## When to Use

**Use R2 Data Catalog for:**

- **Log analytics** - Store and query application/system logs
- **Data lakes/warehouses** - Analytical datasets queried by multiple engines
- **BI pipelines** - Aggregate data for dashboards and reports
- **Multi-cloud analytics** - Share data across clouds without egress fees
- **Time-series data** - Event streams, metrics, sensor data

**Don't use it for:**

- **Transactional workloads** - Use D1 or an external database instead
- **Sub-second latency** - Iceberg is optimized for batch/analytical queries
- **Small datasets (<1GB)** - Setup overhead is not worth it
- **Unstructured data** - Store files directly in R2, not as Iceberg tables

## Architecture

```
┌─────────────────────────────────────────────────┐
│              Query Engines                      │
│  (PyIceberg, Spark, Trino, Snowflake, DuckDB)   │
└────────────────┬────────────────────────────────┘
                 │
                 │ REST API (OAuth2 token)
                 ▼
┌─────────────────────────────────────────────────┐
│ R2 Data Catalog (Managed Iceberg REST Catalog)  │
│  • Namespace/table metadata                     │
│  • Transaction coordination                     │
│  • Snapshot management                          │
└────────────────┬────────────────────────────────┘
                 │
                 │ Vended credentials
                 ▼
┌─────────────────────────────────────────────────┐
│              R2 Bucket Storage                  │
│  • Parquet data files                           │
│  • Metadata files                               │
│  • Manifest files                               │
└─────────────────────────────────────────────────┘
```

**Key concepts:**

- **Catalog URI** - REST endpoint for catalog operations (e.g., `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>`)
- **Warehouse** - Logical grouping of tables (typically same as bucket name)
- **Namespace** - Schema/database containing tables (e.g., `logs`, `analytics`)
- **Table** - Iceberg table with schema, data files, snapshots
- **Vended credentials** - Temporary S3 credentials catalog provides for data access
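
As an illustration of how these values fit together, the connection settings can be derived from just the account ID and bucket name (a minimal sketch; `abc123` and `my-bucket` are placeholders, not real identifiers):

```python
# Sketch: derive catalog connection settings from account ID + bucket.
# "abc123" and "my-bucket" are placeholder values.
def catalog_settings(account_id: str, bucket: str) -> dict:
    return {
        "uri": f"https://{account_id}.r2.cloudflarestorage.com/iceberg/{bucket}",
        "warehouse": bucket,  # warehouse typically matches the bucket name
    }

print(catalog_settings("abc123", "my-bucket")["uri"])
# → https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket
```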

## Limits

| Resource | Limit | Notes |
|----------|-------|-------|
| Namespaces per catalog | No hard limit | Organize tables logically |
| Tables per namespace | <10,000 recommended | Performance degrades beyond this |
| Files per table | <100,000 recommended | Run compaction regularly |
| Snapshots per table | Configurable retention | Expire snapshots older than 7 days |
| Partitions per table | 100-1,000 optimal | Too many = slow metadata ops |
| Table size | Same as R2 bucket | 10GB-10TB+ common |
| API rate limits | Standard R2 API limits | Shared with R2 storage operations |
| Target file size | 128-512 MB | After compaction |

## Current Status

**Public Beta** (as of Jan 2026)

- Available to all R2 subscribers
- No extra cost beyond standard R2 storage/operations
- Production-ready, but breaking changes possible
- Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance

## Decision Tree: Is R2 Data Catalog Right For You?

```
Start → Need analytics on object storage data?
          │
          ├─ No → Use R2 directly for object storage
          │
          └─ Yes → Dataset >1GB with structured schema?
                     │
                     ├─ No → Too small, use R2 + ad-hoc queries
                     │
                     └─ Yes → Need ACID transactions or schema evolution?
                                │
                                ├─ No → Consider simpler solutions (Parquet on R2)
                                │
                                └─ Yes → Need multi-cloud/multi-tool access?
                                           │
                                           ├─ No → D1 or external DB may be simpler
                                           │
                                           └─ Yes → ✅ Use R2 Data Catalog
```

**Quick check:** If you answer "yes" to all:

- Dataset >1GB and growing
- Structured/tabular data (logs, events, metrics)
- Multiple query tools or cloud environments
- Need versioning, schema changes, or concurrent access

→ R2 Data Catalog is a good fit.
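
For illustration only, the quick check above can be reduced to a tiny predicate (the parameter names are made up here, not part of any API):

```python
# All four checklist answers must be "yes" for a good fit.
def good_fit(over_1gb: bool, structured: bool,
             multi_tool: bool, needs_versioning: bool) -> bool:
    return all([over_1gb, structured, multi_tool, needs_versioning])

print(good_fit(True, True, True, True))   # → True
print(good_fit(True, True, False, True))  # single tool/cloud → False
```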

## In This Reference

- **[configuration.md](configuration.md)** - Enable catalog, create API tokens, connect clients
- **[api.md](api.md)** - REST endpoints, operations, maintenance
- **[patterns.md](patterns.md)** - PyIceberg examples, common use cases
- **[gotchas.md](gotchas.md)** - Troubleshooting, best practices, limitations

## See Also

- [Cloudflare R2 Data Catalog Docs](https://developers.cloudflare.com/r2/data-catalog/)
- [Apache Iceberg Docs](https://iceberg.apache.org/)
- [PyIceberg Docs](https://py.iceberg.apache.org/)

# API Reference

R2 Data Catalog exposes the standard [Apache Iceberg REST Catalog API](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml).

## Quick Reference

**Most common operations:**

| Task | PyIceberg Code |
|------|----------------|
| Connect | `RestCatalog(name="r2", warehouse=bucket, uri=uri, token=token)` |
| List namespaces | `catalog.list_namespaces()` |
| Create namespace | `catalog.create_namespace("logs")` |
| Create table | `catalog.create_table(("ns", "table"), schema=schema)` |
| Load table | `catalog.load_table(("ns", "table"))` |
| Append data | `table.append(pyarrow_table)` |
| Query data | `table.scan().to_pandas()` |
| Compact files | `table.rewrite_data_files(target_file_size_bytes=128*1024*1024)` |
| Expire snapshots | `table.expire_snapshots(older_than=timestamp_ms, retain_last=10)` |

## REST Endpoints

Base: `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>`

| Operation | Method | Path |
|-----------|--------|------|
| Catalog config | GET | `/v1/config` |
| List namespaces | GET | `/v1/namespaces` |
| Create namespace | POST | `/v1/namespaces` |
| Delete namespace | DELETE | `/v1/namespaces/{ns}` |
| List tables | GET | `/v1/namespaces/{ns}/tables` |
| Create table | POST | `/v1/namespaces/{ns}/tables` |
| Load table | GET | `/v1/namespaces/{ns}/tables/{table}` |
| Update table | POST | `/v1/namespaces/{ns}/tables/{table}` |
| Delete table | DELETE | `/v1/namespaces/{ns}/tables/{table}` |
| Rename table | POST | `/v1/tables/rename` |

**Authentication:** Bearer token in header: `Authorization: Bearer <token>`
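
For example, listing namespaces over raw REST composes the request like this (a sketch using Python's standard library; the host and token are placeholders, and the request is only built here, not sent):

```python
import urllib.request

# Placeholder base URL; substitute your real account ID and bucket name.
BASE = "https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>"

# Build (but do not send) a GET /v1/namespaces request with the Bearer token.
req = urllib.request.Request(
    BASE + "/v1/namespaces",
    headers={"Authorization": "Bearer <token>"},
    method="GET",
)
print(req.full_url)
print(req.get_method())  # → GET
```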

## PyIceberg Client API

Most users interact through PyIceberg rather than the raw REST API.

### Connection

```python
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog",
    warehouse="<bucket-name>",
    uri="<catalog-uri>",
    token="<api-token>",
)
```

### Namespace Operations

```python
from pyiceberg.exceptions import NamespaceAlreadyExistsError

namespaces = catalog.list_namespaces()  # [('default',), ('logs',)]
catalog.create_namespace("logs", properties={"owner": "team"})
catalog.drop_namespace("logs")  # Must be empty
```

### Table Operations

```python
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, IntegerType

schema = Schema(
    NestedField(1, "id", IntegerType(), required=True),
    NestedField(2, "name", StringType(), required=False),
)
table = catalog.create_table(("logs", "app_logs"), schema=schema)
tables = catalog.list_tables("logs")
table = catalog.load_table(("logs", "app_logs"))
catalog.rename_table(("logs", "old"), ("logs", "new"))
```

### Data Operations

```python
import pyarrow as pa

data = pa.table({"id": [1, 2], "name": ["Alice", "Bob"]})
table.append(data)
table.overwrite(data)

# Read with filters
scan = table.scan(row_filter="id > 100", selected_fields=["id", "name"])
df = scan.to_pandas()
```

### Schema Evolution

```python
from pyiceberg.types import IntegerType, LongType

with table.update_schema() as update:
    update.add_column("user_id", IntegerType(), doc="User ID")
    update.rename_column("msg", "message")
    update.delete_column("old_field")
    update.update_column("id", field_type=LongType())  # int→long only
```

### Time-Travel

```python
from datetime import datetime, timedelta

# Query specific snapshot or timestamp
scan = table.scan(snapshot_id=table.snapshots()[-2].snapshot_id)
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
scan = table.scan(as_of_timestamp=yesterday_ms)
```

### Partitioning

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform
from pyiceberg.types import TimestampType

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)
table = catalog.create_table(("events", "actions"), schema=schema, partition_spec=partition_spec)
scan = table.scan(row_filter="day = '2026-01-27'")  # Prunes partitions
```

## Table Maintenance

### Compaction

```python
files = table.scan().plan_files()
avg_mb = sum(f.file_size_in_bytes for f in files) / len(files) / (1024**2)
print(f"Files: {len(files)}, Avg: {avg_mb:.1f} MB")

table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
```

**When:** Avg <10MB or >1000 files. **Frequency:** High-write daily, medium weekly.
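
That rule of thumb can be written down as a tiny helper (illustrative only, not a PyIceberg API; feed it the file stats computed as above):

```python
def needs_compaction(avg_mb: float, n_files: int) -> bool:
    """True when average file size < 10 MB or file count > 1000."""
    return avg_mb < 10 or n_files > 1000

print(needs_compaction(avg_mb=5.0, n_files=200))   # many tiny files → True
print(needs_compaction(avg_mb=150.0, n_files=40))  # healthy table → False
```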

### Snapshot Expiration

```python
from datetime import datetime, timedelta

seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
```

**Retention:** Production 7-30d, dev 1-7d, audit 90+d.

### Orphan Cleanup

```python
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

⚠️ Always expire snapshots first, use a 3+ day threshold, and run during low traffic.

### Full Maintenance

```python
# Compact → Expire → Cleanup (in order)
if len(table.scan().plan_files()) > 1000:
    table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

## Metadata Inspection

```python
table = catalog.load_table(("logs", "app_logs"))
print(table.schema())
print(table.current_snapshot())
print(table.properties)
print(f"Files: {len(table.scan().plan_files())}")
```

## Error Codes

| Code | Meaning | Common Causes |
|------|---------|---------------|
| 401 | Unauthorized | Invalid/missing token |
| 404 | Not Found | Catalog not enabled, namespace/table missing |
| 409 | Conflict | Already exists, concurrent update |
| 422 | Validation | Invalid schema, incompatible type |

See [gotchas.md](gotchas.md) for detailed troubleshooting.

# Configuration

How to enable R2 Data Catalog and configure authentication.

## Prerequisites

- Cloudflare account with an [R2 subscription](https://developers.cloudflare.com/r2/pricing/)
- R2 bucket created
- Access to the Cloudflare dashboard or Wrangler CLI

## Enable Catalog on Bucket

Choose one method:

### Via Wrangler (Recommended)

```bash
npx wrangler r2 bucket catalog enable <BUCKET_NAME>
```

**Output:**

```
✅ Data Catalog enabled for bucket 'my-bucket'
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket
Warehouse: my-bucket
```

### Via Dashboard

1. Navigate to **R2** → Select your bucket → **Settings** tab
2. Scroll to the "R2 Data Catalog" section → Click **Enable**
3. Note the **Catalog URI** and **Warehouse name** shown

**Result:**

- Catalog URI: `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>`
- Warehouse: `<bucket-name>` (same as bucket name)

### Via API (Programmatic)

```bash
curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/<account-id>/r2/buckets/<bucket>/catalog" \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json"
```

**Response:**

```json
{
  "result": {
    "catalog_uri": "https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>",
    "warehouse": "<bucket>"
  },
  "success": true
}
```

## Check Catalog Status

```bash
npx wrangler r2 bucket catalog status <BUCKET_NAME>
```

**Output:**

```
Catalog Status: enabled
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket
Warehouse: my-bucket
```

## Disable Catalog (If Needed)

```bash
npx wrangler r2 bucket catalog disable <BUCKET_NAME>
```

⚠️ **Warning:** Disabling does NOT delete tables/data. Files remain in the bucket. Metadata becomes inaccessible until re-enabled.

## API Token Creation

R2 Data Catalog requires an API token with **both** R2 Storage and R2 Data Catalog permissions.

### Dashboard Method (Recommended)

1. Go to **R2** → **Manage R2 API Tokens** → **Create API Token**
2. Select a permission level:
   - **Admin Read & Write** - Full catalog + storage access (read/write)
   - **Admin Read only** - Read-only access (for query engines)
3. Copy the token value immediately (shown only once)

**Permission groups included:**

- `Workers R2 Data Catalog Write` (or Read)
- `Workers R2 Storage Bucket Item Write` (or Read)

### API Method (Programmatic)

Use the Cloudflare API to create tokens programmatically. Required permissions:

- `Workers R2 Data Catalog Write` (or Read)
- `Workers R2 Storage Bucket Item Write` (or Read)

## Client Configuration

### PyIceberg

```python
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog",
    warehouse="<bucket-name>",  # Same as bucket name
    uri="<catalog-uri>",        # From enable command
    token="<api-token>",        # From token creation
)
```

**Full example with credentials:**

```python
import os
from pyiceberg.catalog.rest import RestCatalog

# Store credentials in environment variables
WAREHOUSE = os.getenv("R2_WAREHOUSE")      # e.g., "my-bucket"
CATALOG_URI = os.getenv("R2_CATALOG_URI")  # e.g., "https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket"
TOKEN = os.getenv("R2_TOKEN")              # API token

catalog = RestCatalog(
    name="r2_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Test connection
print(catalog.list_namespaces())
```

### Spark / Trino / DuckDB

See [patterns.md](patterns.md) for integration examples with other query engines.

## Connection String Format

For quick reference:

```
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>
Warehouse:   <bucket-name>
Token:       <r2-api-token>
```

**Where to find values:**

| Value | Source |
|-------|--------|
| `<account-id>` | Dashboard URL or `wrangler whoami` |
| `<bucket>` | R2 bucket name |
| Catalog URI | Output from `wrangler r2 bucket catalog enable` |
| Token | R2 API Token creation page |

## Security Best Practices

1. **Store tokens securely** - Use environment variables or secret managers; never hardcode
2. **Use least privilege** - Read-only tokens for query engines, write tokens only where needed
3. **Rotate tokens regularly** - Create new tokens, test, then revoke old ones
4. **One token per application** - Easier to track and revoke if compromised
5. **Monitor token usage** - Check R2 analytics for unexpected patterns
6. **Bucket-scoped tokens** - Create tokens per bucket, not account-wide

## Environment Variables Pattern

```bash
# .env (never commit)
R2_CATALOG_URI=https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>
R2_WAREHOUSE=<bucket-name>
R2_TOKEN=<api-token>
```

```python
import os
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="r2",
    uri=os.getenv("R2_CATALOG_URI"),
    warehouse=os.getenv("R2_WAREHOUSE"),
    token=os.getenv("R2_TOKEN"),
)
```

## Troubleshooting

| Problem | Solution |
|---------|----------|
| 404 "catalog not found" | Run `wrangler r2 bucket catalog enable <bucket>` |
| 401 "unauthorized" | Check token has both Catalog + Storage permissions |
| 403 on data files | Token needs both permission groups |

See [gotchas.md](gotchas.md) for detailed troubleshooting.

# Gotchas & Troubleshooting

Common problems → causes → solutions.

## Permission Errors

### 401 Unauthorized

**Error:** `"401 Unauthorized"`
**Cause:** Token missing R2 Data Catalog permissions.
**Solution:** Use an "Admin Read & Write" token (includes catalog + storage permissions). Test with `catalog.list_namespaces()`.

### 403 Forbidden

**Error:** `"403 Forbidden"` on data files
**Cause:** Token lacks storage permissions.
**Solution:** The token needs both R2 Data Catalog and R2 Storage Bucket Item permissions.

### Token Rotation Issues

**Error:** New token fails after rotation.
**Solution:** Create new token → test in staging → update prod → monitor 24h → revoke old.

## Catalog URI Issues

### 404 Not Found

**Error:** `"404 Catalog not found"`
**Cause:** Catalog not enabled or wrong URI.
**Solution:** Run `wrangler r2 bucket catalog enable <bucket>`. The URI must be HTTPS, include `/iceberg/`, and use the case-sensitive bucket name.

### Wrong Warehouse

**Error:** Cannot create/load tables.
**Cause:** Warehouse ≠ bucket name.
**Solution:** Set `warehouse="bucket-name"` to match the bucket exactly.

## Table and Schema Issues

### Table/Namespace Already Exists

**Error:** `"TableAlreadyExistsError"`
**Solution:** Use try/except to load the existing table, or check first.

### Namespace Not Found

**Error:** Cannot create table.
**Solution:** Create the namespace first: `catalog.create_namespace("ns")`

### Schema Evolution Errors

**Error:** `"422 Validation"` on schema update.
**Cause:** Incompatible change (required field, type shrink).
**Solution:** Only add nullable columns, compatible type widening (int→long, float→double).
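
That rule can be captured as a small compatibility check (a sketch with type names as plain strings, for illustration only):

```python
# Widenings allowed per the rule above; anything else is rejected.
SAFE_WIDENINGS = {("int", "long"), ("float", "double")}

def is_safe_type_change(old: str, new: str) -> bool:
    return old == new or (old, new) in SAFE_WIDENINGS

print(is_safe_type_change("int", "long"))   # → True
print(is_safe_type_change("long", "int"))   # narrowing → False
```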

## Data and Query Issues

### Empty Scan Results

**Error:** Scan returns no data.
**Cause:** Incorrect filter or partition column.
**Solution:** Test without a filter first: `table.scan().to_pandas()`. Verify partition column names.

### Slow Queries

**Error:** Performance degrades over time.
**Cause:** Too many small files.
**Solution:** Check the file count; compact if >1000 files or avg <10MB. See [api.md](api.md#compaction).

### Type Mismatch

**Error:** `"Cannot cast"` on append.
**Cause:** PyArrow types don't match the Iceberg schema.
**Solution:** Cast to int64 (Iceberg default), not int32. Check `table.schema()`.

## Compaction Issues

**Problem:** File count unchanged or compaction takes hours.
**Cause:** Target size too large, or table too big for PyIceberg.
**Solution:** Only compact if avg <50MB. For >1TB tables, use Spark. Run during low-traffic periods.

## Maintenance Issues

### Snapshot/Orphan Issues

**Problem:** Expiration fails or orphan cleanup deletes active data.
**Cause:** Too-aggressive retention or wrong order.
**Solution:** Always expire snapshots first with `retain_last=10`, then clean up orphans with a 3+ day threshold.

## Concurrency Issues

### Concurrent Write Conflicts

**Problem:** `CommitFailedException` with multiple writers.
**Cause:** Optimistic locking - simultaneous commits.
**Solution:** Add retry with exponential backoff (see [patterns.md](patterns.md#pattern-6-concurrent-writes-with-retry)).

### Stale Metadata

**Problem:** Old schema/data after an external update.
**Cause:** Cached metadata.
**Solution:** Reload the table: `table = catalog.load_table(("ns", "table"))`

## Performance Optimization

**Scans:** Use `row_filter` and `selected_fields` to reduce data scanned.
**Partitions:** 100-1000 optimal. Avoid high cardinality (millions) or low (<10).
**Files:** Keep 100-500MB avg. Compact if <10MB or >10k files.

## Limits

| Resource | Recommended | Impact if Exceeded |
|----------|-------------|-------------------|
| Tables/namespace | <10k | Slow list ops |
| Files/table | <100k | Slow query planning |
| Partitions/table | 100-1k | Metadata overhead |
| Snapshots/table | Expire >7d | Metadata bloat |

## Common Error Messages Reference

| Error Message | Likely Cause | Fix |
|---------------|--------------|-----|
| `401 Unauthorized` | Missing/invalid token | Check token has catalog+storage permissions |
| `403 Forbidden` | Token lacks storage permissions | Add R2 Storage Bucket Item permission |
| `404 Not Found` | Catalog not enabled or wrong URI | Run `wrangler r2 bucket catalog enable` |
| `409 Conflict` | Table/namespace already exists | Use try/except or load existing |
| `422 Unprocessable Entity` | Schema validation failed | Check type compatibility, required fields |
| `CommitFailedException` | Concurrent write conflict | Add retry logic with backoff |
| `NamespaceAlreadyExistsError` | Namespace exists | Use try/except or load existing |
| `NoSuchTableError` | Table doesn't exist | Check namespace+table name, create first |
| `TypeError: Cannot cast` | PyArrow type mismatch | Cast data to match Iceberg schema |
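
The table above lends itself to a quick lookup helper (illustrative only, not part of any library):

```python
# Map HTTP status codes from the table above to a one-line fix hint.
FIXES = {
    401: "Check token has catalog + storage permissions",
    403: "Add R2 Storage Bucket Item permission",
    404: "Run `wrangler r2 bucket catalog enable`",
    409: "Use try/except or load existing",
    422: "Check type compatibility and required fields",
}

def advice(status: int) -> str:
    return FIXES.get(status, "See the debugging checklist below")

print(advice(404))  # → Run `wrangler r2 bucket catalog enable`
```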

## Debugging Checklist

When things go wrong, check in order:

1. ✅ **Catalog enabled:** `npx wrangler r2 bucket catalog status <bucket>`
2. ✅ **Token permissions:** Both R2 Data Catalog + R2 Storage in dashboard
3. ✅ **Connection test:** `catalog.list_namespaces()` succeeds
4. ✅ **URI format:** HTTPS, includes `/iceberg/`, correct bucket name
5. ✅ **Warehouse name:** Matches bucket name exactly
6. ✅ **Namespace exists:** Create before `create_table()`
7. ✅ **Enable debug logging:** `logging.basicConfig(level=logging.DEBUG)`
8. ✅ **PyIceberg version:** `pip install --upgrade pyiceberg` (≥0.5.0)
9. ✅ **File health:** Compact if >1000 files or avg <10MB
10. ✅ **Snapshot count:** Expire if >100 snapshots

## Enable Debug Logging

```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Now operations show HTTP requests/responses
```

## Resources

- [Cloudflare Community](https://community.cloudflare.com/c/developers/workers/40)
- [Cloudflare Discord](https://discord.cloudflare.com) - #r2 channel
- [PyIceberg GitHub](https://github.com/apache/iceberg-python/issues)
- [Apache Iceberg Slack](https://iceberg.apache.org/community/)

## Next Steps

- [patterns.md](patterns.md) - Working examples
- [api.md](api.md) - API reference

# Common Patterns

Practical patterns for R2 Data Catalog with PyIceberg.

## PyIceberg Connection

```python
import os
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

catalog = RestCatalog(
    name="r2_catalog",
    warehouse=os.getenv("R2_WAREHOUSE"),  # bucket name
    uri=os.getenv("R2_CATALOG_URI"),      # catalog endpoint
    token=os.getenv("R2_TOKEN"),          # API token
)

# Create namespace (idempotent)
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass
```

## Pattern 1: Log Analytics Pipeline

Ingest logs incrementally, query by time/level.

```python
import pyarrow as pa
from datetime import datetime
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestampType, StringType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

# Create partitioned table (once)
schema = Schema(
    NestedField(1, "timestamp", TimestampType(), required=True),
    NestedField(2, "level", StringType(), required=True),
    NestedField(3, "service", StringType(), required=True),
    NestedField(4, "message", StringType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)

catalog.create_namespace("logs")
table = catalog.create_table(("logs", "app_logs"), schema=schema, partition_spec=partition_spec)

# Append logs (incremental)
data = pa.table({
    "timestamp": [datetime(2026, 1, 27, 10, 30, 0)],
    "level": ["ERROR"],
    "service": ["auth-service"],
    "message": ["Failed login"],
})
table.append(data)

# Query by time + level (leverages partitioning)
scan = table.scan(row_filter="level = 'ERROR' AND day = '2026-01-27'")
errors = scan.to_pandas()
```

## Pattern 2: Time-Travel Queries

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Query specific snapshot
snapshot_id = table.current_snapshot().snapshot_id
data = table.scan(snapshot_id=snapshot_id).to_pandas()

# Query as of timestamp (yesterday)
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
data = table.scan(as_of_timestamp=yesterday_ms).to_pandas()
```

## Pattern 3: Schema Evolution

```python
from pyiceberg.types import StringType

table = catalog.load_table(("users", "profiles"))

with table.update_schema() as update:
    update.add_column("email", StringType(), required=False)
    update.rename_column("name", "full_name")
    # Old readers ignore new columns, new readers see nulls for old data
```

## Pattern 4: Partitioned Tables

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform, IdentityTransform

# Partition by day + country
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"),
    PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="country"),
)
table = catalog.create_table(("events", "user_events"), schema=schema, partition_spec=partition_spec)

# Queries prune partitions automatically
scan = table.scan(row_filter="country = 'US' AND day = '2026-01-27'")
```

## Pattern 5: Table Maintenance

```python
from datetime import datetime, timedelta

table = catalog.load_table(("logs", "app_logs"))

# Compact → expire → cleanup (in order)
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```

See [api.md](api.md#table-maintenance) for detailed parameters.

## Pattern 6: Concurrent Writes with Retry

```python
import time

from pyiceberg.exceptions import CommitFailedException

def append_with_retry(table, data, max_retries=3):
    for attempt in range(max_retries):
        try:
            table.append(data)
            return
        except CommitFailedException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
```

## Pattern 7: Upsert Simulation

```python
import pandas as pd
import pyarrow as pa

# Read → merge → overwrite (not atomic; use Spark MERGE INTO for production)
existing = table.scan().to_pandas()
new_data = pd.DataFrame({"id": [1, 3], "value": [100, 300]})
merged = pd.concat([existing, new_data]).drop_duplicates(subset=["id"], keep="last")
table.overwrite(pa.Table.from_pandas(merged))
```

## Pattern 8: DuckDB Integration

```python
import duckdb

arrow_table = table.scan().to_arrow()
con = duckdb.connect()
con.register("logs", arrow_table)
result = con.execute("SELECT level, COUNT(*) FROM logs GROUP BY level").fetchdf()
```

## Pattern 9: Monitor Table Health

```python
files = table.scan().plan_files()
avg_mb = sum(f.file_size_in_bytes for f in files) / len(files) / (1024**2)
print(f"Files: {len(files)}, Avg: {avg_mb:.1f}MB, Snapshots: {len(table.snapshots())}")

if avg_mb < 10 or len(files) > 1000:
    print("⚠️ Needs compaction")
```

## Best Practices

| Area | Guideline |
|------|-----------|
| **Partitioning** | Use day/hour for time-series; 100-1000 partitions; avoid high cardinality |
| **File sizes** | Target 128-512MB; compact when avg <10MB or >10k files |
| **Schema** | Add columns as nullable (`required=False`); batch changes |
| **Maintenance** | Compact high-write daily/weekly; expire snapshots 7-30d; clean up orphans after |
| **Concurrency** | Reads automatic; writes to different partitions safe; retry same partition |
| **Performance** | Filter on partitions; select only needed columns; batch appends 100MB+ |