update skills

2026-03-17 16:53:22 -07:00
parent 0b0783ef8e
commit f9a530667e
389 changed files with 54512 additions and 1 deletion


@@ -0,0 +1,149 @@
# Cloudflare R2 Data Catalog Skill Reference
Expert guidance for Cloudflare R2 Data Catalog, a managed Apache Iceberg catalog built into R2 buckets.
## Reading Order
**New to R2 Data Catalog?** Start here:
1. Read "What is R2 Data Catalog?" and "When to Use" below
2. [configuration.md](configuration.md) - Enable catalog, create tokens
3. [patterns.md](patterns.md) - PyIceberg setup and common patterns
4. [api.md](api.md) - REST API reference as needed
5. [gotchas.md](gotchas.md) - Troubleshooting when issues arise
**Quick reference?** Jump to:
- [Enable catalog on bucket](configuration.md#enable-catalog-on-bucket)
- [PyIceberg connection pattern](patterns.md#pyiceberg-connection-pattern)
- [Permission errors](gotchas.md#permission-errors)
## What is R2 Data Catalog?
R2 Data Catalog is a **managed Apache Iceberg REST catalog** built directly into R2 buckets. It provides:
- **Apache Iceberg tables** - ACID transactions, schema evolution, time-travel queries
- **Zero-egress costs** - Query from any cloud/region without data transfer fees
- **Standard REST API** - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
- **No infrastructure** - Fully managed, no catalog servers to run
- **Public beta** - Available to all R2 subscribers, no extra cost beyond R2 storage
### What is Apache Iceberg?
Open table format for analytics datasets in object storage. Features:
- **ACID transactions** - Safe concurrent reads/writes
- **Metadata optimization** - Fast queries without full scans
- **Schema evolution** - Add/rename/delete columns without rewrites
- **Time-travel** - Query historical snapshots
- **Partitioning** - Organize data for efficient queries
## When to Use
**Use R2 Data Catalog for:**
- **Log analytics** - Store and query application/system logs
- **Data lakes/warehouses** - Analytical datasets queried by multiple engines
- **BI pipelines** - Aggregate data for dashboards and reports
- **Multi-cloud analytics** - Share data across clouds without egress fees
- **Time-series data** - Event streams, metrics, sensor data
**Don't use for:**
- **Transactional workloads** - Use D1 or external database instead
- **Sub-second latency** - Iceberg optimized for batch/analytical queries
- **Small datasets (<1GB)** - Setup overhead not worth it
- **Unstructured data** - Store files directly in R2, not as Iceberg tables
## Architecture
```
┌─────────────────────────────────────────────────┐
│ Query Engines │
│ (PyIceberg, Spark, Trino, Snowflake, DuckDB) │
└────────────────┬────────────────────────────────┘
│ REST API (OAuth2 token)
┌─────────────────────────────────────────────────┐
│ R2 Data Catalog (Managed Iceberg REST Catalog)│
│ • Namespace/table metadata │
│ • Transaction coordination │
│ • Snapshot management │
└────────────────┬────────────────────────────────┘
│ Vended credentials
┌─────────────────────────────────────────────────┐
│ R2 Bucket Storage │
│ • Parquet data files │
│ • Metadata files │
│ • Manifest files │
└─────────────────────────────────────────────────┘
```
**Key concepts:**
- **Catalog URI** - REST endpoint for catalog operations (e.g., `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>`)
- **Warehouse** - Logical grouping of tables (typically same as bucket name)
- **Namespace** - Schema/database containing tables (e.g., `logs`, `analytics`)
- **Table** - Iceberg table with schema, data files, snapshots
- **Vended credentials** - Temporary S3 credentials catalog provides for data access
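The URI and warehouse conventions above are mechanical enough to compute. A minimal sketch (plain Python; the account ID and bucket name are illustrative placeholders, substitute your own):

```python
def catalog_uri(account_id: str, bucket: str) -> str:
    """Build the R2 Data Catalog REST endpoint following the
    URI shape described above."""
    return f"https://{account_id}.r2.cloudflarestorage.com/iceberg/{bucket}"

# The warehouse name is typically the bucket name itself.
uri = catalog_uri("abc123", "my-bucket")
warehouse = "my-bucket"
print(uri)  # https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket
```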
## Limits
| Resource | Limit | Notes |
|----------|-------|-------|
| Namespaces per catalog | No hard limit | Organize tables logically |
| Tables per namespace | <10,000 recommended | Performance degrades beyond this |
| Files per table | <100,000 recommended | Run compaction regularly |
| Snapshots per table | Configurable retention | Expire >7 days old |
| Partitions per table | 100-1,000 optimal | Too many = slow metadata ops |
| Table size | Same as R2 bucket | 10GB-10TB+ common |
| API rate limits | Standard R2 API limits | Shared with R2 storage operations |
| Target file size | 128-512 MB | After compaction |
## Current Status
**Public Beta** (as of Jan 2026)
- Available to all R2 subscribers
- No extra cost beyond standard R2 storage/operations
- Production-ready, but breaking changes possible
- Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance
## Decision Tree: Is R2 Data Catalog Right For You?
```
Start → Need analytics on object storage data?
├─ No → Use R2 directly for object storage
└─ Yes → Dataset >1GB with structured schema?
├─ No → Too small, use R2 + ad-hoc queries
└─ Yes → Need ACID transactions or schema evolution?
├─ No → Consider simpler solutions (Parquet on R2)
└─ Yes → Need multi-cloud/multi-tool access?
├─ No → D1 or external DB may be simpler
└─ Yes → ✅ Use R2 Data Catalog
```
**Quick check:** If you answer "yes" to all:
- Dataset >1GB and growing
- Structured/tabular data (logs, events, metrics)
- Multiple query tools or cloud environments
- Need versioning, schema changes, or concurrent access
→ R2 Data Catalog is a good fit.
## In This Reference
- **[configuration.md](configuration.md)** - Enable catalog, create API tokens, connect clients
- **[api.md](api.md)** - REST endpoints, operations, maintenance
- **[patterns.md](patterns.md)** - PyIceberg examples, common use cases
- **[gotchas.md](gotchas.md)** - Troubleshooting, best practices, limitations
## See Also
- [Cloudflare R2 Data Catalog Docs](https://developers.cloudflare.com/r2/data-catalog/)
- [Apache Iceberg Docs](https://iceberg.apache.org/)
- [PyIceberg Docs](https://py.iceberg.apache.org/)


@@ -0,0 +1,199 @@
# API Reference
R2 Data Catalog exposes standard [Apache Iceberg REST Catalog API](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml).
## Quick Reference
**Most common operations:**
| Task | PyIceberg Code |
|------|----------------|
| Connect | `RestCatalog(name="r2", warehouse=bucket, uri=uri, token=token)` |
| List namespaces | `catalog.list_namespaces()` |
| Create namespace | `catalog.create_namespace("logs")` |
| Create table | `catalog.create_table(("ns", "table"), schema=schema)` |
| Load table | `catalog.load_table(("ns", "table"))` |
| Append data | `table.append(pyarrow_table)` |
| Query data | `table.scan().to_pandas()` |
| Compact files | `table.rewrite_data_files(target_file_size_bytes=128*1024*1024)` |
| Expire snapshots | `table.expire_snapshots(older_than=timestamp_ms, retain_last=10)` |
## REST Endpoints
Base: `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>`
| Operation | Method | Path |
|-----------|--------|------|
| Catalog config | GET | `/v1/config` |
| List namespaces | GET | `/v1/namespaces` |
| Create namespace | POST | `/v1/namespaces` |
| Delete namespace | DELETE | `/v1/namespaces/{ns}` |
| List tables | GET | `/v1/namespaces/{ns}/tables` |
| Create table | POST | `/v1/namespaces/{ns}/tables` |
| Load table | GET | `/v1/namespaces/{ns}/tables/{table}` |
| Update table | POST | `/v1/namespaces/{ns}/tables/{table}` |
| Delete table | DELETE | `/v1/namespaces/{ns}/tables/{table}` |
| Rename table | POST | `/v1/tables/rename` |
**Authentication:** Bearer token in header: `Authorization: Bearer <token>`
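As a sketch of raw REST access using only the standard library (placeholder values; the request is constructed but not sent here):

```python
import urllib.request

BASE = "https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>"
TOKEN = "<api-token>"

# Every catalog call carries the bearer token described above.
req = urllib.request.Request(
    f"{BASE}/v1/config",
    headers={"Authorization": f"Bearer {TOKEN}"},
)

# With real values substituted:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```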
## PyIceberg Client API
Most clients use PyIceberg rather than the raw REST API.
### Connection
```python
from pyiceberg.catalog.rest import RestCatalog
catalog = RestCatalog(
name="my_catalog",
warehouse="<bucket-name>",
uri="<catalog-uri>",
token="<api-token>",
)
```
### Namespace Operations
```python
from pyiceberg.exceptions import NamespaceAlreadyExistsError
namespaces = catalog.list_namespaces() # [('default',), ('logs',)]
try:
    catalog.create_namespace("logs", properties={"owner": "team"})
except NamespaceAlreadyExistsError:
    pass # already there, safe to continue
catalog.drop_namespace("logs") # Must be empty
```
### Table Operations
```python
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, IntegerType
schema = Schema(
NestedField(1, "id", IntegerType(), required=True),
NestedField(2, "name", StringType(), required=False),
)
table = catalog.create_table(("logs", "app_logs"), schema=schema)
tables = catalog.list_tables("logs")
table = catalog.load_table(("logs", "app_logs"))
catalog.rename_table(("logs", "old"), ("logs", "new"))
```
### Data Operations
```python
import pyarrow as pa
data = pa.table({"id": [1, 2], "name": ["Alice", "Bob"]})
table.append(data)
table.overwrite(data)
# Read with filters
scan = table.scan(row_filter="id > 100", selected_fields=["id", "name"])
df = scan.to_pandas()
```
### Schema Evolution
```python
from pyiceberg.types import IntegerType, LongType
with table.update_schema() as update:
update.add_column("user_id", IntegerType(), doc="User ID")
update.rename_column("msg", "message")
update.delete_column("old_field")
update.update_column("id", field_type=LongType()) # int→long only
```
### Time-Travel
```python
from datetime import datetime, timedelta
# Query specific snapshot or timestamp
scan = table.scan(snapshot_id=table.snapshots()[-2].snapshot_id)
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
scan = table.scan(as_of_timestamp=yesterday_ms)
```
### Partitioning
```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform
from pyiceberg.types import TimestampType
partition_spec = PartitionSpec(
PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)
table = catalog.create_table(("events", "actions"), schema=schema, partition_spec=partition_spec)
scan = table.scan(row_filter="day = '2026-01-27'") # Prunes partitions
```
## Table Maintenance
### Compaction
```python
files = table.scan().plan_files()
avg_mb = sum(f.file_size_in_bytes for f in files) / len(files) / (1024**2)
print(f"Files: {len(files)}, Avg: {avg_mb:.1f} MB")
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
```
**When:** Avg <10MB or >1000 files. **Frequency:** High-write daily, medium weekly.
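The rule of thumb above can be expressed as a small helper (a sketch; `needs_compaction` is an illustrative name, and you would feed it sizes gathered from `plan_files()` as in the snippet above):

```python
def needs_compaction(file_sizes_bytes):
    """Compact when the average data file is under 10 MB
    or the table has more than 1,000 files."""
    if not file_sizes_bytes:
        return False
    avg_mb = sum(file_sizes_bytes) / len(file_sizes_bytes) / (1024 ** 2)
    return avg_mb < 10 or len(file_sizes_bytes) > 1000

print(needs_compaction([5 * 1024 ** 2] * 200))   # True: 5 MB average
print(needs_compaction([200 * 1024 ** 2] * 50))  # False: healthy sizes
```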
### Snapshot Expiration
```python
from datetime import datetime, timedelta
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
```
**Retention:** Production 7-30d, dev 1-7d, audit 90+d.
### Orphan Cleanup
```python
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```
⚠️ Always expire snapshots first, use 3+ day threshold, run during low traffic.
### Full Maintenance
```python
# Compact → Expire → Cleanup (in order)
if len(table.scan().plan_files()) > 1000:
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```
## Metadata Inspection
```python
table = catalog.load_table(("logs", "app_logs"))
print(table.schema())
print(table.current_snapshot())
print(table.properties)
print(f"Files: {len(table.scan().plan_files())}")
```
## Error Codes
| Code | Meaning | Common Causes |
|------|---------|---------------|
| 401 | Unauthorized | Invalid/missing token |
| 404 | Not Found | Catalog not enabled, namespace/table missing |
| 409 | Conflict | Already exists, concurrent update |
| 422 | Validation | Invalid schema, incompatible type |
See [gotchas.md](gotchas.md) for detailed troubleshooting.


@@ -0,0 +1,198 @@
# Configuration
How to enable R2 Data Catalog and configure authentication.
## Prerequisites
- Cloudflare account with [R2 subscription](https://developers.cloudflare.com/r2/pricing/)
- R2 bucket created
- Access to Cloudflare dashboard or Wrangler CLI
## Enable Catalog on Bucket
Choose one method:
### Via Wrangler (Recommended)
```bash
npx wrangler r2 bucket catalog enable <BUCKET_NAME>
```
**Output:**
```
✅ Data Catalog enabled for bucket 'my-bucket'
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket
Warehouse: my-bucket
```
### Via Dashboard
1. Navigate to **R2** → Select your bucket → **Settings** tab
2. Scroll to "R2 Data Catalog" section → Click **Enable**
3. Note the **Catalog URI** and **Warehouse name** shown
**Result:**
- Catalog URI: `https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket-name>`
- Warehouse: `<bucket-name>` (same as bucket name)
### Via API (Programmatic)
```bash
curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/<account-id>/r2/buckets/<bucket>/catalog" \
-H "Authorization: Bearer <api-token>" \
-H "Content-Type: application/json"
```
**Response:**
```json
{
"result": {
"catalog_uri": "https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>",
"warehouse": "<bucket>"
},
"success": true
}
```
## Check Catalog Status
```bash
npx wrangler r2 bucket catalog status <BUCKET_NAME>
```
**Output:**
```
Catalog Status: enabled
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket
Warehouse: my-bucket
```
## Disable Catalog (If Needed)
```bash
npx wrangler r2 bucket catalog disable <BUCKET_NAME>
```
⚠️ **Warning:** Disabling does NOT delete tables/data. Files remain in bucket. Metadata becomes inaccessible until re-enabled.
## API Token Creation
R2 Data Catalog requires API token with **both** R2 Storage + R2 Data Catalog permissions.
### Dashboard Method (Recommended)
1. Go to **R2** → **Manage R2 API Tokens** → **Create API Token**
2. Select permission level:
- **Admin Read & Write** - Full catalog + storage access (read/write)
- **Admin Read only** - Read-only access (for query engines)
3. Copy token value immediately (shown only once)
**Permission groups included:**
- `Workers R2 Data Catalog Write` (or Read)
- `Workers R2 Storage Bucket Item Write` (or Read)
### API Method (Programmatic)
Use Cloudflare API to create tokens programmatically. Required permissions:
- `Workers R2 Data Catalog Write` (or Read)
- `Workers R2 Storage Bucket Item Write` (or Read)
## Client Configuration
### PyIceberg
```python
from pyiceberg.catalog.rest import RestCatalog
catalog = RestCatalog(
name="my_catalog",
warehouse="<bucket-name>", # Same as bucket name
uri="<catalog-uri>", # From enable command
token="<api-token>", # From token creation
)
```
**Full example with credentials:**
```python
import os
from pyiceberg.catalog.rest import RestCatalog
# Store credentials in environment variables
WAREHOUSE = os.getenv("R2_WAREHOUSE") # e.g., "my-bucket"
CATALOG_URI = os.getenv("R2_CATALOG_URI") # e.g., "https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket"
TOKEN = os.getenv("R2_TOKEN") # API token
catalog = RestCatalog(
name="r2_catalog",
warehouse=WAREHOUSE,
uri=CATALOG_URI,
token=TOKEN,
)
# Test connection
print(catalog.list_namespaces())
```
### Spark / Trino / DuckDB
See [patterns.md](patterns.md) for integration examples with other query engines.
## Connection String Format
For quick reference:
```
Catalog URI: https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>
Warehouse: <bucket-name>
Token: <r2-api-token>
```
**Where to find values:**
| Value | Source |
|-------|--------|
| `<account-id>` | Dashboard URL or `wrangler whoami` |
| `<bucket>` | R2 bucket name |
| Catalog URI | Output from `wrangler r2 bucket catalog enable` |
| Token | R2 API Token creation page |
## Security Best Practices
1. **Store tokens securely** - Use environment variables or secret managers, never hardcode
2. **Use least privilege** - Read-only tokens for query engines, write tokens only where needed
3. **Rotate tokens regularly** - Create new tokens, test, then revoke old ones
4. **One token per application** - Easier to track and revoke if compromised
5. **Monitor token usage** - Check R2 analytics for unexpected patterns
6. **Bucket-scoped tokens** - Create tokens per bucket, not account-wide
## Environment Variables Pattern
```bash
# .env (never commit)
R2_CATALOG_URI=https://<account-id>.r2.cloudflarestorage.com/iceberg/<bucket>
R2_WAREHOUSE=<bucket-name>
R2_TOKEN=<api-token>
```
```python
import os
from pyiceberg.catalog.rest import RestCatalog
catalog = RestCatalog(
name="r2",
uri=os.getenv("R2_CATALOG_URI"),
warehouse=os.getenv("R2_WAREHOUSE"),
token=os.getenv("R2_TOKEN"),
)
```
## Troubleshooting
| Problem | Solution |
|---------|----------|
| 404 "catalog not found" | Run `wrangler r2 bucket catalog enable <bucket>` |
| 401 "unauthorized" | Check token has both Catalog + Storage permissions |
| 403 on data files | Token needs both permission groups |
See [gotchas.md](gotchas.md) for detailed troubleshooting.


@@ -0,0 +1,170 @@
# Gotchas & Troubleshooting
Common problems → causes → solutions.
## Permission Errors
### 401 Unauthorized
**Error:** `"401 Unauthorized"`
**Cause:** Token missing R2 Data Catalog permissions.
**Solution:** Use "Admin Read & Write" token (includes catalog + storage permissions). Test with `catalog.list_namespaces()`.
### 403 Forbidden
**Error:** `"403 Forbidden"` on data files
**Cause:** Token lacks storage permissions.
**Solution:** Token needs both R2 Data Catalog + R2 Storage Bucket Item permissions.
### Token Rotation Issues
**Error:** New token fails after rotation.
**Solution:** Create new token → test in staging → update prod → monitor 24h → revoke old.
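The "test" step can be a cheap connection probe. A sketch (`token_works` is a hypothetical helper; any object exposing `list_namespaces()`, such as a PyIceberg `RestCatalog`, fits):

```python
def token_works(catalog):
    """Return True when the catalog answers a trivial read,
    i.e. the token has at least catalog-read permission."""
    try:
        catalog.list_namespaces()
        return True
    except Exception:
        return False

# During rotation: build a RestCatalog with the NEW token and
# require token_works(new_catalog) before updating production.
```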
## Catalog URI Issues
### 404 Not Found
**Error:** `"404 Catalog not found"`
**Cause:** Catalog not enabled or wrong URI.
**Solution:** Run `wrangler r2 bucket catalog enable <bucket>`. URI must be HTTPS with `/iceberg/` and case-sensitive bucket name.
### Wrong Warehouse
**Error:** Cannot create/load tables.
**Cause:** Warehouse ≠ bucket name.
**Solution:** Set `warehouse="bucket-name"` to match bucket exactly.
## Table and Schema Issues
### Table/Namespace Already Exists
**Error:** `"TableAlreadyExistsError"`
**Solution:** Use try/except to load existing or check first.
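One way to make creation idempotent is a create-or-load helper. A sketch — `exists_errors` is the "already exists" exception type to catch, e.g. `TableAlreadyExistsError` from `pyiceberg.exceptions`:

```python
def create_or_load(catalog, identifier, schema, exists_errors):
    """Create the table, falling back to loading it when it
    already exists (exists_errors: exception type(s) meaning
    'already there')."""
    try:
        return catalog.create_table(identifier, schema=schema)
    except exists_errors:
        return catalog.load_table(identifier)
```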
### Namespace Not Found
**Error:** Cannot create table.
**Solution:** Create namespace first: `catalog.create_namespace("ns")`
### Schema Evolution Errors
**Error:** `"422 Validation"` on schema update.
**Cause:** Incompatible change (required field, type shrink).
**Solution:** Only add nullable columns, compatible type widening (int→long, float→double).
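The widening rule can be written down explicitly. A sketch encoding the compatible primitive promotions named above (Iceberg also permits widening a decimal's precision, omitted here for brevity):

```python
# Compatible (widening) primitive promotions.
SAFE_PROMOTIONS = {
    ("int", "long"),
    ("float", "double"),
}

def is_safe_type_change(old, new):
    """True when changing a column from `old` to `new` is a
    compatible widening (or a no-op)."""
    return old == new or (old, new) in SAFE_PROMOTIONS

print(is_safe_type_change("int", "long"))  # True
print(is_safe_type_change("long", "int"))  # False: narrowing
```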
## Data and Query Issues
### Empty Scan Results
**Error:** Scan returns no data.
**Cause:** Incorrect filter or partition column.
**Solution:** Test without filter first: `table.scan().to_pandas()`. Verify partition column names.
### Slow Queries
**Error:** Performance degrades over time.
**Cause:** Too many small files.
**Solution:** Check file count, compact if >1000 or avg <10MB. See [api.md](api.md#compaction).
### Type Mismatch
**Error:** `"Cannot cast"` on append.
**Cause:** PyArrow types don't match Iceberg schema.
**Solution:** Cast to int64 (Iceberg default), not int32. Check `table.schema()`.
## Compaction Issues
### Compaction Slow or Ineffective
**Problem:** File count unchanged or compaction takes hours.
**Cause:** Target size too large, or table too big for PyIceberg.
**Solution:** Only compact if avg <50MB. For >1TB tables, use Spark. Run during low-traffic periods.
## Maintenance Issues
### Snapshot/Orphan Issues
**Problem:** Expiration fails or orphan cleanup deletes active data.
**Cause:** Too aggressive retention or wrong order.
**Solution:** Always expire snapshots first with `retain_last=10`, then cleanup orphans with 3+ day threshold.
## Concurrency Issues
### Concurrent Write Conflicts
**Problem:** `CommitFailedException` with multiple writers.
**Cause:** Optimistic locking - simultaneous commits.
**Solution:** Add retry with exponential backoff (see [patterns.md](patterns.md#pattern-6-concurrent-writes-with-retry)).
### Stale Metadata
**Problem:** Old schema/data after external update.
**Cause:** Cached metadata.
**Solution:** Reload table: `table = catalog.load_table(("ns", "table"))`
## Performance Optimization
### Performance Tips
**Scans:** Use `row_filter` and `selected_fields` to reduce data scanned.
**Partitions:** 100-1000 optimal. Avoid high cardinality (millions) or low (<10).
**Files:** Keep 100-500MB avg. Compact if <10MB or >10k files.
## Limits
| Resource | Recommended | Impact if Exceeded |
|----------|-------------|-------------------|
| Tables/namespace | <10k | Slow list ops |
| Files/table | <100k | Slow query planning |
| Partitions/table | 100-1k | Metadata overhead |
| Snapshots/table | Expire >7d | Metadata bloat |
## Common Error Messages Reference
| Error Message | Likely Cause | Fix |
|---------------|--------------|-----|
| `401 Unauthorized` | Missing/invalid token | Check token has catalog+storage permissions |
| `403 Forbidden` | Token lacks storage permissions | Add R2 Storage Bucket Item permission |
| `404 Not Found` | Catalog not enabled or wrong URI | Run `wrangler r2 bucket catalog enable` |
| `409 Conflict` | Table/namespace already exists | Use try/except or load existing |
| `422 Unprocessable Entity` | Schema validation failed | Check type compatibility, required fields |
| `CommitFailedException` | Concurrent write conflict | Add retry logic with backoff |
| `NamespaceAlreadyExistsError` | Namespace exists | Use try/except or load existing |
| `NoSuchTableError` | Table doesn't exist | Check namespace+table name, create first |
| `TypeError: Cannot cast` | PyArrow type mismatch | Cast data to match Iceberg schema |
## Debugging Checklist
When things go wrong, check in order:
1. **Catalog enabled:** `npx wrangler r2 bucket catalog status <bucket>`
2. **Token permissions:** Both R2 Data Catalog + R2 Storage in dashboard
3. **Connection test:** `catalog.list_namespaces()` succeeds
4. **URI format:** HTTPS, includes `/iceberg/`, correct bucket name
5. **Warehouse name:** Matches bucket name exactly
6. **Namespace exists:** Create before `create_table()`
7. **Enable debug logging:** `logging.basicConfig(level=logging.DEBUG)`
8. **PyIceberg version:** `pip install --upgrade pyiceberg` (≥0.5.0)
9. **File health:** Compact if >1000 files or avg <10MB
10. **Snapshot count:** Expire if >100 snapshots
## Enable Debug Logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Now operations show HTTP requests/responses
```
## Resources
- [Cloudflare Community](https://community.cloudflare.com/c/developers/workers/40)
- [Cloudflare Discord](https://discord.cloudflare.com) - #r2 channel
- [PyIceberg GitHub](https://github.com/apache/iceberg-python/issues)
- [Apache Iceberg Slack](https://iceberg.apache.org/community/)
## Next Steps
- [patterns.md](patterns.md) - Working examples
- [api.md](api.md) - API reference


@@ -0,0 +1,191 @@
# Common Patterns
Practical patterns for R2 Data Catalog with PyIceberg.
## PyIceberg Connection
```python
import os
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError
catalog = RestCatalog(
name="r2_catalog",
warehouse=os.getenv("R2_WAREHOUSE"), # bucket name
uri=os.getenv("R2_CATALOG_URI"), # catalog endpoint
token=os.getenv("R2_TOKEN"), # API token
)
# Create namespace (idempotent)
try:
catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
pass
```
## Pattern 1: Log Analytics Pipeline
Ingest logs incrementally, query by time/level.
```python
import pyarrow as pa
from datetime import datetime
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestampType, StringType, IntegerType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform
# Create partitioned table (once)
schema = Schema(
NestedField(1, "timestamp", TimestampType(), required=True),
NestedField(2, "level", StringType(), required=True),
NestedField(3, "service", StringType(), required=True),
NestedField(4, "message", StringType(), required=False),
)
partition_spec = PartitionSpec(
PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)
catalog.create_namespace("logs")
table = catalog.create_table(("logs", "app_logs"), schema=schema, partition_spec=partition_spec)
# Append logs (incremental)
data = pa.table({
"timestamp": [datetime(2026, 1, 27, 10, 30, 0)],
"level": ["ERROR"],
"service": ["auth-service"],
"message": ["Failed login"],
})
table.append(data)
# Query by time + level (leverages partitioning)
scan = table.scan(row_filter="level = 'ERROR' AND day = '2026-01-27'")
errors = scan.to_pandas()
```
## Pattern 2: Time-Travel Queries
```python
from datetime import datetime, timedelta
table = catalog.load_table(("logs", "app_logs"))
# Query specific snapshot
snapshot_id = table.current_snapshot().snapshot_id
data = table.scan(snapshot_id=snapshot_id).to_pandas()
# Query as of timestamp (yesterday)
yesterday_ms = int((datetime.now() - timedelta(days=1)).timestamp() * 1000)
data = table.scan(as_of_timestamp=yesterday_ms).to_pandas()
```
## Pattern 3: Schema Evolution
```python
from pyiceberg.types import StringType
table = catalog.load_table(("users", "profiles"))
with table.update_schema() as update:
update.add_column("email", StringType(), required=False)
update.rename_column("name", "full_name")
# Old readers ignore new columns, new readers see nulls for old data
```
## Pattern 4: Partitioned Tables
```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform, IdentityTransform
# Partition by day + country
partition_spec = PartitionSpec(
PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"),
PartitionField(source_id=2, field_id=1001, transform=IdentityTransform(), name="country"),
)
table = catalog.create_table(("events", "user_events"), schema=schema, partition_spec=partition_spec)
# Queries prune partitions automatically
scan = table.scan(row_filter="country = 'US' AND day = '2026-01-27'")
```
## Pattern 5: Table Maintenance
```python
from datetime import datetime, timedelta
table = catalog.load_table(("logs", "app_logs"))
# Compact → expire → cleanup (in order)
table.rewrite_data_files(target_file_size_bytes=128 * 1024 * 1024)
seven_days_ms = int((datetime.now() - timedelta(days=7)).timestamp() * 1000)
table.expire_snapshots(older_than=seven_days_ms, retain_last=10)
three_days_ms = int((datetime.now() - timedelta(days=3)).timestamp() * 1000)
table.delete_orphan_files(older_than=three_days_ms)
```
See [api.md](api.md#table-maintenance) for detailed parameters.
## Pattern 6: Concurrent Writes with Retry
```python
from pyiceberg.exceptions import CommitFailedException
import time
def append_with_retry(table, data, max_retries=3):
for attempt in range(max_retries):
try:
table.append(data)
return
except CommitFailedException:
if attempt == max_retries - 1:
raise
time.sleep(2 ** attempt)
```
## Pattern 7: Upsert Simulation
```python
import pandas as pd
import pyarrow as pa
# Read → merge → overwrite (not atomic, use Spark MERGE INTO for production)
existing = table.scan().to_pandas()
new_data = pd.DataFrame({"id": [1, 3], "value": [100, 300]})
merged = pd.concat([existing, new_data]).drop_duplicates(subset=["id"], keep="last")
table.overwrite(pa.Table.from_pandas(merged))
```
## Pattern 8: DuckDB Integration
```python
import duckdb
arrow_table = table.scan().to_arrow()
con = duckdb.connect()
con.register("logs", arrow_table)
result = con.execute("SELECT level, COUNT(*) FROM logs GROUP BY level").fetchdf()
```
## Pattern 9: Monitor Table Health
```python
files = table.scan().plan_files()
avg_mb = sum(f.file_size_in_bytes for f in files) / len(files) / (1024**2)
print(f"Files: {len(files)}, Avg: {avg_mb:.1f}MB, Snapshots: {len(table.snapshots())}")
if avg_mb < 10 or len(files) > 1000:
print("⚠️ Needs compaction")
```
## Best Practices
| Area | Guideline |
|------|-----------|
| **Partitioning** | Use day/hour for time-series; 100-1000 partitions; avoid high cardinality |
| **File sizes** | Target 128-512MB; compact when avg <10MB or >10k files |
| **Schema** | Add columns as nullable (`required=False`); batch changes |
| **Maintenance** | Compact high-write daily/weekly; expire snapshots 7-30d; cleanup orphans after |
| **Concurrency** | Reads automatic; writes to different partitions safe; retry same partition |
| **Performance** | Filter on partitions; select only needed columns; batch appends 100MB+ |