update skills

2026-06-12 21:13:31 -07:00 · 2026-03-17 16:53:22 -07:00
parent 0b0783ef8e
commit f9a530667e
389 changed files with 54512 additions and 1 deletions
@@ -0,0 +1,149 @@
+# Cloudflare R2 Data Catalog Skill Reference
+
+Expert guidance for Cloudflare R2 Data Catalog, a managed Apache Iceberg catalog built into R2 buckets.
+
+## Reading Order
+
+**New to R2 Data Catalog?** Start here:
+1. Read "What is R2 Data Catalog?" and "When to Use" below
+2. [configuration.md](configuration.md) - Enable catalog, create tokens
+3. [patterns.md](patterns.md) - PyIceberg setup and common patterns
+4. [api.md](api.md) - REST API reference as needed
+5. [gotchas.md](gotchas.md) - Troubleshooting when issues arise
+
+**Quick reference?** Jump to:
+- [Enable catalog on bucket](configuration.md#enable-catalog-on-bucket)
+- [PyIceberg connection pattern](patterns.md#pyiceberg-connection-pattern)
+- [Permission errors](gotchas.md#permission-errors)
+
+## What is R2 Data Catalog?
+
+R2 Data Catalog is a **managed Apache Iceberg REST catalog** built directly into R2 buckets. It provides:
+
+- **Apache Iceberg tables** - ACID transactions, schema evolution, time-travel queries
+- **Zero-egress costs** - Query from any cloud/region without data transfer fees
+- **Standard REST API** - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
+- **No infrastructure** - Fully managed, no catalog servers to run
+- **Public beta** - Available to all R2 subscribers, no extra cost beyond R2 storage
+
+### What is Apache Iceberg?
+
+Apache Iceberg is an open table format for analytics datasets in object storage. Key features:
+- **ACID transactions** - Safe concurrent reads/writes
+- **Metadata optimization** - Fast queries without full scans
+- **Schema evolution** - Add/rename/delete columns without rewrites
+- **Time-travel** - Query historical snapshots
+- **Partitioning** - Organize data for efficient queries
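The time-travel feature can be pictured with a toy model. The sketch below is not Iceberg's implementation or API; `Snapshot` and `snapshot_as_of` are hypothetical names introduced only for illustration. It assumes each commit produces an immutable snapshot listing the data files visible at that point, so a historical query amounts to picking the latest snapshot committed at or before the requested time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Toy stand-in for a table snapshot: commit time plus visible data files."""
    committed_at_ms: int
    data_files: Tuple[str, ...]


def snapshot_as_of(snapshots: List[Snapshot], timestamp_ms: int) -> Optional[Snapshot]:
    """Return the latest snapshot committed at or before timestamp_ms, or None.

    Assumes `snapshots` is sorted by commit time, oldest first.
    """
    commit_times = [s.committed_at_ms for s in snapshots]
    idx = bisect_right(commit_times, timestamp_ms)
    return snapshots[idx - 1] if idx > 0 else None


# Hypothetical history: two appends, then a rewrite that replaced both files.
history = [
    Snapshot(1_000, ("a.parquet",)),
    Snapshot(2_000, ("a.parquet", "b.parquet")),
    Snapshot(3_000, ("c.parquet",)),
]

as_of = snapshot_as_of(history, 2_500)
print(as_of.data_files if as_of else None)
```

Because old snapshots keep pointing at their original files, a query "as of" time 2,500 still sees `a.parquet` and `b.parquet` even though a later commit replaced them. This is also why snapshot expiry matters: files are only reclaimable once no retained snapshot references them.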
+
+## When to Use
+
+**Use R2 Data Catalog for:**
+- **Log analytics** - Store and query application/system logs
+- **Data lakes/warehouses** - Analytical datasets queried by multiple engines
+- **BI pipelines** - Aggregate data for dashboards and reports
+- **Multi-cloud analytics** - Share data across clouds without egress fees
+- **Time-series data** - Event streams, metrics, sensor data
+
+**Don't use for:**
+- **Transactional workloads** - Use D1 or external database instead
+- **Sub-second latency** - Iceberg is optimized for batch/analytical queries
+- **Small datasets (<1GB)** - Setup overhead outweighs the benefit
+- **Unstructured data** - Store files directly in R2, not as Iceberg tables
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────┐
+│  Query Engines                                  │
+│  (PyIceberg, Spark, Trino, Snowflake, DuckDB)  │
+└────────────────┬────────────────────────────────┘
+                 │
+                 │ REST API (OAuth2 token)
+                 ▼
+┌─────────────────────────────────────────────────┐
+│  R2 Data Catalog (Managed Iceberg REST Catalog)│
+│  • Namespace/table metadata                     │
+│  • Transaction coordination                     │
+│  • Snapshot management                          │
+└────────────────┬────────────────────────────────┘
+                 │
+                 │ Vended credentials
+                 ▼
+┌─────────────────────────────────────────────────┐
+│  R2 Bucket Storage                              │
+│  • Parquet data files                           │
+│  • Metadata files                               │
+│  • Manifest files                               │
+└─────────────────────────────────────────────────┘
+```
+
+**Key concepts:**
+- **Catalog URI** - REST endpoint for catalog operations (e.g., `https://catalog.cloudflarestorage.com/<account-id>/<bucket>`)
+- **Warehouse** - Logical grouping of tables, identified by the warehouse name shown alongside the catalog URI when the catalog is enabled (one warehouse per bucket)
+- **Namespace** - Schema/database containing tables (e.g., `logs`, `analytics`)
+- **Table** - Iceberg table with schema, data files, snapshots
+- **Vended credentials** - Temporary S3-compatible credentials that the catalog issues to clients for reading and writing data files
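To show how the concepts above map onto a client, here is a minimal sketch using PyIceberg's generic REST catalog properties (`type`, `uri`, `warehouse`, `token`). The helper names `rest_catalog_properties` and `connect` are hypothetical, and the catalog URI, warehouse name, and API token are assumed to be copied from the bucket's catalog settings rather than constructed by hand.

```python
from typing import Dict


def rest_catalog_properties(catalog_uri: str, warehouse: str, token: str) -> Dict[str, str]:
    """Build the property map PyIceberg uses for an Iceberg REST catalog."""
    if not catalog_uri.startswith("https://"):
        raise ValueError("catalog_uri should be an https:// endpoint")
    return {
        "type": "rest",          # select PyIceberg's REST catalog implementation
        "uri": catalog_uri,      # catalog URI of the bucket's catalog
        "warehouse": warehouse,  # warehouse name of the bucket's catalog
        "token": token,          # API token, sent as a bearer token
    }


def connect(catalog_uri: str, warehouse: str, token: str):
    """Open a catalog connection. Requires the `pyiceberg` package."""
    # Imported lazily so the helper above stays usable without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog("r2", **rest_catalog_properties(catalog_uri, warehouse, token))
```

With real values in place, `catalog = connect(uri, warehouse, token)` followed by `catalog.create_namespace("logs")` and `catalog.list_tables("logs")` would exercise the namespace and table concepts. Reads and writes of data files then go straight to the bucket using the credentials the catalog hands out.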
+
+## Limits
+
+| Resource | Limit | Notes |
+|----------|-------|-------|
+| Namespaces per catalog | No hard limit | Organize tables logically |
+| Tables per namespace | <10,000 recommended | Performance degrades beyond this |
+| Files per table | <100,000 recommended | Run compaction regularly |
+| Snapshots per table | Configurable retention | Expire snapshots older than 7 days |
+| Partitions per table | 100-1,000 optimal | Too many = slow metadata ops |
+| Table size | Same as R2 bucket | 10GB-10TB+ common |
+| API rate limits | Standard R2 API limits | Shared with R2 storage operations |
+| Target file size | 128-512 MB | After compaction |
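The file-count and file-size rows interact: the average file size determines how many files a table of a given size needs. The following back-of-the-envelope sketch works through that arithmetic. The threshold is taken from the table above, while the helper names and the 8 MB "uncompacted" file size are assumptions chosen for illustration.

```python
MB = 1024 ** 2
TB = 1024 ** 4

RECOMMENDED_MAX_FILES = 100_000  # "Files per table" row in the table above


def estimated_file_count(table_bytes: int, avg_file_bytes: int) -> int:
    """Number of files needed to hold table_bytes, rounded up."""
    if avg_file_bytes <= 0:
        raise ValueError("avg_file_bytes must be positive")
    return (table_bytes + avg_file_bytes - 1) // avg_file_bytes


def within_file_budget(table_bytes: int, avg_file_bytes: int) -> bool:
    """True if the estimated file count stays under the recommended maximum."""
    return estimated_file_count(table_bytes, avg_file_bytes) < RECOMMENDED_MAX_FILES


# A 10 TB table at both ends of the target range, and with small uncompacted files.
print(estimated_file_count(10 * TB, 128 * MB))  # 81920
print(estimated_file_count(10 * TB, 512 * MB))  # 20480
print(estimated_file_count(10 * TB, 8 * MB))    # 1310720
```

At 128 MB per file, a 10 TB table sits just under the recommended file count. With many small files it exceeds the recommendation by more than an order of magnitude, which is the case regular compaction is meant to prevent.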
+
+## Current Status
+
+**Public Beta** (as of Jan 2026)
+- Available to all R2 subscribers
+- No extra cost beyond standard R2 storage/operations
+- Usable for production workloads, but breaking changes are possible during the beta
+- Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance
+
+## Decision Tree: Is R2 Data Catalog Right For You?
+
+```
+Start → Need analytics on object storage data?
+         │
+         ├─ No → Use R2 directly for object storage
+         │
+         └─ Yes → Dataset >1GB with structured schema?
+                  │
+                  ├─ No → Too small, use R2 + ad-hoc queries
+                  │
+                  └─ Yes → Need ACID transactions or schema evolution?
+                           │
+                           ├─ No → Consider simpler solutions (Parquet on R2)
+                           │
+                           └─ Yes → Need multi-cloud/multi-tool access?
+                                    │
+                                    ├─ No → D1 or external DB may be simpler
+                                    │
+                                    └─ Yes → ✅ Use R2 Data Catalog
+```
+
+**Quick check:** If you answer "yes" to all:
+- Dataset >1GB and growing
+- Structured/tabular data (logs, events, metrics)
+- Multiple query tools or cloud environments
+- Need versioning, schema changes, or concurrent access
+
+→ R2 Data Catalog is a good fit.
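For readers who prefer code to diagrams, the decision tree can be restated as a small function. This is only a transcription of the branches above into Python; the `Workload` fields and the `recommend` function are hypothetical names, and the returned strings are the tree's own leaf labels.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_analytics: bool                   # analytics on object storage data?
    size_gb: float                          # approximate dataset size
    structured: bool                        # structured/tabular schema?
    needs_acid_or_schema_evolution: bool
    needs_multi_cloud_or_multi_tool: bool


def recommend(w: Workload) -> str:
    """Walk the decision tree top to bottom and return the matching leaf."""
    if not w.needs_analytics:
        return "Use R2 directly for object storage"
    if w.size_gb <= 1 or not w.structured:
        return "Too small, use R2 + ad-hoc queries"
    if not w.needs_acid_or_schema_evolution:
        return "Consider simpler solutions (Parquet on R2)"
    if not w.needs_multi_cloud_or_multi_tool:
        return "D1 or external DB may be simpler"
    return "Use R2 Data Catalog"


# Example: 500 GB of application logs queried from several engines.
logs = Workload(True, 500, True, True, True)
print(recommend(logs))  # Use R2 Data Catalog
```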
+
+## In This Reference
+
+- **[configuration.md](configuration.md)** - Enable catalog, create API tokens, connect clients
+- **[api.md](api.md)** - REST endpoints, operations, maintenance
+- **[patterns.md](patterns.md)** - PyIceberg examples, common use cases
+- **[gotchas.md](gotchas.md)** - Troubleshooting, best practices, limitations
+
+## See Also
+
+- [Cloudflare R2 Data Catalog Docs](https://developers.cloudflare.com/r2/data-catalog/)
+- [Apache Iceberg Docs](https://iceberg.apache.org/)
+- [PyIceberg Docs](https://py.iceberg.apache.org/)