# Cloudflare R2 Data Catalog Skill Reference Expert guidance for Cloudflare R2 Data Catalog - Apache Iceberg catalog built into R2 buckets. ## Reading Order **New to R2 Data Catalog?** Start here: 1. Read "What is R2 Data Catalog?" and "When to Use" below 2. [configuration.md](configuration.md) - Enable catalog, create tokens 3. [patterns.md](patterns.md) - PyIceberg setup and common patterns 4. [api.md](api.md) - REST API reference as needed 5. [gotchas.md](gotchas.md) - Troubleshooting when issues arise **Quick reference?** Jump to: - [Enable catalog on bucket](configuration.md#enable-catalog-on-bucket) - [PyIceberg connection pattern](patterns.md#pyiceberg-connection-pattern) - [Permission errors](gotchas.md#permission-errors) ## What is R2 Data Catalog? R2 Data Catalog is a **managed Apache Iceberg REST catalog** built directly into R2 buckets. It provides: - **Apache Iceberg tables** - ACID transactions, schema evolution, time-travel queries - **Zero-egress costs** - Query from any cloud/region without data transfer fees - **Standard REST API** - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB - **No infrastructure** - Fully managed, no catalog servers to run - **Public beta** - Available to all R2 subscribers, no extra cost beyond R2 storage ### What is Apache Iceberg? Open table format for analytics datasets in object storage. Features: - **ACID transactions** - Safe concurrent reads/writes - **Metadata optimization** - Fast queries without full scans - **Schema evolution** - Add/rename/delete columns without rewrites - **Time-travel** - Query historical snapshots - **Partitioning** - Organize data for efficient queries ## When to Use **Use R2 Data Catalog for:** - **Log analytics** - Store and query application/system logs - **Data lakes/warehouses** - Analytical datasets queried by multiple engines - **BI pipelines** - Aggregate data for dashboards and reports - **Multi-cloud analytics** - Share data across clouds without egress fees - **Time-series data** - Event streams, metrics, sensor data **Don't use for:** - **Transactional workloads** - Use D1 or external database instead - **Sub-second latency** - Iceberg optimized for batch/analytical queries - **Small datasets (<1GB)** - Setup overhead not worth it - **Unstructured data** - Store files directly in R2, not as Iceberg tables ## Architecture ``` ┌─────────────────────────────────────────────────┐ │ Query Engines │ │ (PyIceberg, Spark, Trino, Snowflake, DuckDB) │ └────────────────┬────────────────────────────────┘ │ │ REST API (OAuth2 token) ▼ ┌─────────────────────────────────────────────────┐ │ R2 Data Catalog (Managed Iceberg REST Catalog)│ │ • Namespace/table metadata │ │ • Transaction coordination │ │ • Snapshot management │ └────────────────┬────────────────────────────────┘ │ │ Vended credentials ▼ ┌─────────────────────────────────────────────────┐ │ R2 Bucket Storage │ │ • Parquet data files │ │ • Metadata files │ │ • Manifest files │ └─────────────────────────────────────────────────┘ ``` **Key concepts:** - **Catalog URI** - REST endpoint for catalog operations (e.g., `https://.r2.cloudflarestorage.com/iceberg/`) - **Warehouse** - Logical grouping of tables (typically same as bucket name) - **Namespace** - Schema/database containing tables (e.g., `logs`, `analytics`) - **Table** - Iceberg table with schema, data files, snapshots - **Vended credentials** - Temporary S3 credentials catalog provides for data access ## Limits | Resource | Limit | Notes | |----------|-------|-------| | Namespaces per catalog | No hard limit | Organize tables logically | | Tables per namespace | <10,000 recommended | Performance degrades beyond this | | Files per table | <100,000 recommended | Run compaction regularly | | Snapshots per table | Configurable retention | Expire >7 days old | | Partitions per table | 100-1,000 optimal | Too many = slow metadata ops | | Table size | Same as R2 bucket | 10GB-10TB+ common | | API rate limits | Standard R2 API limits | Shared with R2 storage operations | | Target file size | 128-512 MB | After compaction | ## Current Status **Public Beta** (as of Jan 2026) - Available to all R2 subscribers - No extra cost beyond standard R2 storage/operations - Production-ready, but breaking changes possible - Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance ## Decision Tree: Is R2 Data Catalog Right For You? ``` Start → Need analytics on object storage data? │ ├─ No → Use R2 directly for object storage │ └─ Yes → Dataset >1GB with structured schema? │ ├─ No → Too small, use R2 + ad-hoc queries │ └─ Yes → Need ACID transactions or schema evolution? │ ├─ No → Consider simpler solutions (Parquet on R2) │ └─ Yes → Need multi-cloud/multi-tool access? │ ├─ No → D1 or external DB may be simpler │ └─ Yes → ✅ Use R2 Data Catalog ``` **Quick check:** If you answer "yes" to all: - Dataset >1GB and growing - Structured/tabular data (logs, events, metrics) - Multiple query tools or cloud environments - Need versioning, schema changes, or concurrent access → R2 Data Catalog is a good fit. ## In This Reference - **[configuration.md](configuration.md)** - Enable catalog, create API tokens, connect clients - **[api.md](api.md)** - REST endpoints, operations, maintenance - **[patterns.md](patterns.md)** - PyIceberg examples, common use cases - **[gotchas.md](gotchas.md)** - Troubleshooting, best practices, limitations ## See Also - [Cloudflare R2 Data Catalog Docs](https://developers.cloudflare.com/r2/data-catalog/) - [Apache Iceberg Docs](https://iceberg.apache.org/) - [PyIceberg Docs](https://py.iceberg.apache.org/)