update skills

2026-03-17 16:53:22 -07:00
parent 0b0783ef8e
commit f9a530667e
389 changed files with 54512 additions and 1 deletions

@@ -0,0 +1,149 @@
# Cloudflare R2 Data Catalog Skill Reference
Expert guidance for Cloudflare R2 Data Catalog, a managed Apache Iceberg catalog built into R2 buckets.
## Reading Order
**New to R2 Data Catalog?** Start here:
1. Read "What is R2 Data Catalog?" and "When to Use" below
2. [configuration.md](configuration.md) - Enable catalog, create tokens
3. [patterns.md](patterns.md) - PyIceberg setup and common patterns
4. [api.md](api.md) - REST API reference as needed
5. [gotchas.md](gotchas.md) - Troubleshooting when issues arise
**Quick reference?** Jump to:
- [Enable catalog on bucket](configuration.md#enable-catalog-on-bucket)
- [PyIceberg connection pattern](patterns.md#pyiceberg-connection-pattern)
- [Permission errors](gotchas.md#permission-errors)
## What is R2 Data Catalog?
R2 Data Catalog is a **managed Apache Iceberg REST catalog** built directly into R2 buckets. It provides:
- **Apache Iceberg tables** - ACID transactions, schema evolution, time-travel queries
- **Zero-egress costs** - Query from any cloud/region without data transfer fees
- **Standard REST API** - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
- **No infrastructure** - Fully managed, no catalog servers to run
- **Public beta** - Available to all R2 subscribers, no extra cost beyond R2 storage
### What is Apache Iceberg?
Apache Iceberg is an open table format for analytical datasets stored in object storage. Key features:
- **ACID transactions** - Safe concurrent reads/writes
- **Metadata optimization** - Fast queries without full scans
- **Schema evolution** - Add/rename/delete columns without rewrites
- **Time-travel** - Query historical snapshots
- **Partitioning** - Organize data for efficient queries
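To make the snapshot and time-travel ideas concrete, here is a toy model in Python. It is a simplified illustration, not how Iceberg stores metadata: the `Snapshot` and `ToyTable` names are hypothetical, and real tables track manifest and data files rather than in-memory rows.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table at one commit (hypothetical toy type)."""
    snapshot_id: int
    timestamp_ms: int
    rows: tuple


@dataclass
class ToyTable:
    """Append-only snapshot log: each commit adds a snapshot, none are mutated."""
    snapshots: list = field(default_factory=list)

    def commit(self, timestamp_ms, new_rows):
        # Assumes commits arrive with increasing timestamps.
        previous = self.snapshots[-1].rows if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, previous + tuple(new_rows))
        self.snapshots.append(snap)
        return snap

    def as_of(self, timestamp_ms):
        # Time travel: latest snapshot committed at or before the given time.
        times = [s.timestamp_ms for s in self.snapshots]
        index = bisect_right(times, timestamp_ms)
        if index == 0:
            raise LookupError("no snapshot exists at or before that time")
        return self.snapshots[index - 1]


table = ToyTable()
table.commit(1_000, [("GET", 200)])
table.commit(2_000, [("POST", 500)])
```

Calling `table.as_of(1_500)` returns the first snapshot, which still contains only the first row even though a later commit exists. Readers pinned to an older snapshot are unaffected by newer writes, which is the property that makes concurrent reads safe.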
## When to Use
**Use R2 Data Catalog for:**
- **Log analytics** - Store and query application/system logs
- **Data lakes/warehouses** - Analytical datasets queried by multiple engines
- **BI pipelines** - Aggregate data for dashboards and reports
- **Multi-cloud analytics** - Share data across clouds without egress fees
- **Time-series data** - Event streams, metrics, sensor data
**Don't use for:**
- **Transactional workloads** - Use D1 or external database instead
- **Sub-second latency** - Iceberg is optimized for batch/analytical queries
- **Small datasets (<1GB)** - Setup overhead outweighs the benefit
- **Unstructured data** - Store files directly in R2, not as Iceberg tables
## Architecture
```
┌─────────────────────────────────────────────────┐
│                  Query Engines                  │
│  (PyIceberg, Spark, Trino, Snowflake, DuckDB)   │
└────────────────┬────────────────────────────────┘
                 │ REST API (OAuth2 token)
┌─────────────────────────────────────────────────┐
│ R2 Data Catalog (Managed Iceberg REST Catalog)  │
│   • Namespace/table metadata                    │
│   • Transaction coordination                    │
│   • Snapshot management                         │
└────────────────┬────────────────────────────────┘
                 │ Vended credentials
┌─────────────────────────────────────────────────┐
│                R2 Bucket Storage                │
│   • Parquet data files                          │
│   • Metadata files                              │
│   • Manifest files                              │
└─────────────────────────────────────────────────┘
```
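The two arrows in the diagram can be sketched as plain HTTP request descriptions. This is a minimal sketch, assuming the path layout of the Apache Iceberg REST catalog specification (`/v1/config`, `/v1/{prefix}/namespaces/{namespace}/tables/{table}`) and a single-level namespace. The function names are hypothetical, the URL is a placeholder, and the `X-Iceberg-Access-Delegation` header is how the specification lets a client ask for vended credentials; whether it is required here is an assumption.

```python
from urllib.parse import quote, urlencode


def config_url(catalog_uri, warehouse):
    """URL a client typically calls first to fetch catalog configuration."""
    return f"{catalog_uri.rstrip('/')}/v1/config?{urlencode({'warehouse': warehouse})}"


def load_table_request(catalog_uri, prefix, namespace, table, token):
    """Describe (without sending) a request that loads table metadata."""
    path = "/v1/{}/namespaces/{}/tables/{}".format(
        quote(prefix, safe=""), quote(namespace, safe=""), quote(table, safe="")
    )
    headers = {
        # First arrow: the engine authenticates to the catalog with a token.
        "Authorization": f"Bearer {token}",
        # Second arrow: ask the catalog to return temporary storage credentials.
        "X-Iceberg-Access-Delegation": "vended-credentials",
    }
    return {"method": "GET", "url": catalog_uri.rstrip("/") + path, "headers": headers}


request = load_table_request(
    "https://example.invalid/catalog",  # placeholder catalog URI
    "example-prefix",                   # placeholder; normally taken from the config response
    "logs",
    "requests",
    "example-token",
)
```

Query engines perform these calls internally, so the sketch is mainly useful for reading request logs or debugging with a generic HTTP client.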
**Key concepts:**
- **Catalog URI** - REST endpoint for catalog operations (e.g., `https://catalog.cloudflarestorage.com/<account-id>/<bucket>`)
- **Warehouse** - Logical grouping of tables (typically same as bucket name)
- **Namespace** - Schema/database containing tables (e.g., `logs`, `analytics`)
- **Table** - Iceberg table with schema, data files, snapshots
- **Vended credentials** - Temporary S3 credentials that the catalog issues to clients for data access
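The concepts above map onto a handful of client settings. The following sketch assembles them using PyIceberg's REST catalog property names (`type`, `uri`, `warehouse`, `token`). The helper functions and all values are hypothetical placeholders, and the PyIceberg calls appear only in comments so the block runs without the library installed.

```python
def catalog_properties(catalog_uri, warehouse, token):
    """Build REST catalog properties for a client such as PyIceberg."""
    for name, value in (("catalog_uri", catalog_uri), ("warehouse", warehouse), ("token", token)):
        if not value:
            raise ValueError(f"{name} must not be empty")
    return {
        "type": "rest",
        "uri": catalog_uri.rstrip("/"),
        "warehouse": warehouse,
        "token": token,
    }


def table_identifier(namespace, table):
    """Join a namespace and table name into a dotted identifier."""
    return f"{namespace}.{table}"


props = catalog_properties(
    catalog_uri="https://example.invalid/catalog",  # placeholder; use your catalog URI
    warehouse="example-warehouse",                  # placeholder
    token="example-token",                          # placeholder
)

# With PyIceberg installed, the properties would typically be used like this:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("r2", **props)
#   table = catalog.load_table(table_identifier("logs", "requests"))
```

Keeping the settings in one dictionary makes it easy to load the token from an environment variable or secret store instead of hard-coding it.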
## Limits
| Resource | Limit | Notes |
|----------|-------|-------|
| Namespaces per catalog | No hard limit | Organize tables logically |
| Tables per namespace | <10,000 recommended | Performance degrades beyond this |
| Files per table | <100,000 recommended | Run compaction regularly |
| Snapshots per table | Configurable retention | Expire snapshots older than 7 days |
| Partitions per table | 100-1,000 optimal | Too many = slow metadata ops |
| Table size | Same as R2 bucket | 10GB-10TB+ common |
| API rate limits | Standard R2 API limits | Shared with R2 storage operations |
| Target file size | 128-512 MB | After compaction |
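As a rough illustration, the guideline values in the table can be turned into a health check. This is a sketch only: the function names and the summary statistics passed in are hypothetical, and gathering those statistics from a real table is left out.

```python
import math

# Guideline values copied from the Limits table.
MAX_FILES_PER_TABLE = 100_000
MAX_PARTITIONS = 1_000
MIN_TARGET_FILE_MB = 128


def check_table(file_count, partition_count, avg_file_mb):
    """Return warnings for a table's summary statistics (hypothetical inputs)."""
    warnings = []
    if file_count >= MAX_FILES_PER_TABLE:
        warnings.append("too many files: run compaction")
    if partition_count > MAX_PARTITIONS:
        warnings.append("too many partitions: metadata operations may slow down")
    if avg_file_mb < MIN_TARGET_FILE_MB:
        warnings.append("files smaller than target: run compaction")
    return warnings


def files_after_compaction(total_mb, target_mb=256):
    """Estimate the file count if data were rewritten at about target_mb per file."""
    return max(1, math.ceil(total_mb / target_mb))


# Example: 250,000 files averaging 8 MB each, spread over 400 partitions.
problems = check_table(file_count=250_000, partition_count=400, avg_file_mb=8)
estimate = files_after_compaction(total_mb=250_000 * 8)
```

For this example, both compaction warnings fire, and rewriting the same 2,000,000 MB at 256 MB per file would leave roughly 7,813 files, well under the recommended ceiling.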
## Current Status
**Public Beta** (as of Jan 2026)
- Available to all R2 subscribers
- No extra cost beyond standard R2 storage/operations
- Usable for production workloads, but breaking changes are possible while in beta
- Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance
## Decision Tree: Is R2 Data Catalog Right For You?
```
Start → Need analytics on object storage data?
├─ No → Use R2 directly for object storage
└─ Yes → Dataset >1GB with structured schema?
    ├─ No → Too small, use R2 + ad-hoc queries
    └─ Yes → Need ACID transactions or schema evolution?
        ├─ No → Consider simpler solutions (Parquet on R2)
        └─ Yes → Need multi-cloud/multi-tool access?
            ├─ No → D1 or external DB may be simpler
            └─ Yes → ✅ Use R2 Data Catalog
```
**Quick check:** If you answer "yes" to all:
- Dataset >1GB and growing
- Structured/tabular data (logs, events, metrics)
- Multiple query tools or cloud environments
- Need versioning, schema changes, or concurrent access
→ R2 Data Catalog is a good fit.
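The decision tree reads naturally as a chain of early returns. The sketch below encodes it as a function; the function and parameter names are hypothetical, and the returned strings are the tree's own leaf labels.

```python
def recommend(needs_analytics, large_structured, needs_acid_or_schema_evolution, needs_multi_tool):
    """Walk the decision tree top to bottom and return the suggested option."""
    if not needs_analytics:
        return "Use R2 directly for object storage"
    if not large_structured:
        return "Too small, use R2 + ad-hoc queries"
    if not needs_acid_or_schema_evolution:
        return "Consider simpler solutions (Parquet on R2)"
    if not needs_multi_tool:
        return "D1 or external DB may be simpler"
    return "Use R2 Data Catalog"


# A growing, structured log dataset queried from several engines.
choice = recommend(
    needs_analytics=True,
    large_structured=True,
    needs_acid_or_schema_evolution=True,
    needs_multi_tool=True,
)
```

Only the all-yes path ends at R2 Data Catalog, which matches the quick check: a single "no" anywhere routes to a simpler alternative.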
## In This Reference
- **[configuration.md](configuration.md)** - Enable catalog, create API tokens, connect clients
- **[api.md](api.md)** - REST endpoints, operations, maintenance
- **[patterns.md](patterns.md)** - PyIceberg examples, common use cases
- **[gotchas.md](gotchas.md)** - Troubleshooting, best practices, limitations
## See Also
- [Cloudflare R2 Data Catalog Docs](https://developers.cloudflare.com/r2/data-catalog/)
- [Apache Iceberg Docs](https://iceberg.apache.org/)
- [PyIceberg Docs](https://py.iceberg.apache.org/)