mirror of
https://github.com/ksyasuda/dotfiles.git
synced 2026-03-21 18:11:27 -07:00
update skills
# Cloudflare R2 Data Catalog Skill Reference

Expert guidance for Cloudflare R2 Data Catalog - Apache Iceberg catalog built into R2 buckets.

## Reading Order

**New to R2 Data Catalog?** Start here:
1. Read "What is R2 Data Catalog?" and "When to Use" below
2. [configuration.md](configuration.md) - Enable catalog, create tokens
3. [patterns.md](patterns.md) - PyIceberg setup and common patterns
4. [api.md](api.md) - REST API reference as needed
5. [gotchas.md](gotchas.md) - Troubleshooting when issues arise

**Quick reference?** Jump to:
- [Enable catalog on bucket](configuration.md#enable-catalog-on-bucket)
- [PyIceberg connection pattern](patterns.md#pyiceberg-connection-pattern)
- [Permission errors](gotchas.md#permission-errors)

## What is R2 Data Catalog?

R2 Data Catalog is a **managed Apache Iceberg REST catalog** built directly into R2 buckets. It provides:

- **Apache Iceberg tables** - ACID transactions, schema evolution, time-travel queries
- **Zero-egress costs** - Query from any cloud/region without data transfer fees
- **Standard REST API** - Works with Spark, PyIceberg, Snowflake, Trino, DuckDB
- **No infrastructure** - Fully managed, no catalog servers to run
- **Public beta** - Available to all R2 subscribers, no extra cost beyond R2 storage

### What is Apache Iceberg?

Apache Iceberg is an open table format for analytics datasets in object storage. Features:
- **ACID transactions** - Safe concurrent reads/writes
- **Metadata optimization** - Fast queries without full scans
- **Schema evolution** - Add/rename/delete columns without rewrites
- **Time-travel** - Query historical snapshots
- **Partitioning** - Organize data for efficient queries
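
The snapshot idea behind time-travel can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not Iceberg's implementation or the PyIceberg API: `Snapshot`, `ToyTable`, `append`, and `scan` are hypothetical names invented for this example, and real snapshots point at manifest files rather than holding file lists directly.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: the data files visible after one commit."""
    snapshot_id: int
    files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table; every commit adds a new snapshot."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def append(self, *new_files: str) -> int:
        """Commit new files as a new snapshot; earlier snapshots stay untouched."""
        current = self.snapshots[-1].files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, current + tuple(new_files))
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def scan(self, snapshot_id: int | None = None) -> tuple[str, ...]:
        """Return the files visible at a snapshot (latest when no id is given)."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.files
        raise KeyError(f"unknown snapshot {snapshot_id}")
```

Because a commit only adds a snapshot and never rewrites an old one, reading an earlier `snapshot_id` returns the table exactly as it was at that commit, which is the property time-travel queries rely on.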

## When to Use

**Use R2 Data Catalog for:**
- **Log analytics** - Store and query application/system logs
- **Data lakes/warehouses** - Analytical datasets queried by multiple engines
- **BI pipelines** - Aggregate data for dashboards and reports
- **Multi-cloud analytics** - Share data across clouds without egress fees
- **Time-series data** - Event streams, metrics, sensor data

**Don't use for:**
- **Transactional workloads** - Use D1 or an external database instead
- **Sub-second latency** - Iceberg is optimized for batch/analytical queries
- **Small datasets (<1GB)** - Setup overhead is not worth it
- **Unstructured data** - Store files directly in R2, not as Iceberg tables

## Architecture

```
┌─────────────────────────────────────────────────┐
│                  Query Engines                  │
│  (PyIceberg, Spark, Trino, Snowflake, DuckDB)   │
└────────────────┬────────────────────────────────┘
                 │
                 │ REST API (OAuth2 token)
                 ▼
┌─────────────────────────────────────────────────┐
│  R2 Data Catalog (Managed Iceberg REST Catalog) │
│  • Namespace/table metadata                     │
│  • Transaction coordination                     │
│  • Snapshot management                          │
└────────────────┬────────────────────────────────┘
                 │
                 │ Vended credentials
                 ▼
┌─────────────────────────────────────────────────┐
│                R2 Bucket Storage                │
│  • Parquet data files                           │
│  • Metadata files                               │
│  • Manifest files                               │
└─────────────────────────────────────────────────┘
```

**Key concepts:**
- **Catalog URI** - REST endpoint for catalog operations, reported when the catalog is enabled on a bucket (e.g., `https://catalog.cloudflarestorage.com/<account-id>/<bucket>`)
- **Warehouse** - Logical grouping of tables; clients pass the warehouse name reported alongside the Catalog URI
- **Namespace** - Schema/database containing tables (e.g., `logs`, `analytics`)
- **Table** - Iceberg table with schema, data files, snapshots
- **Vended credentials** - Temporary S3 credentials that the catalog provides for data access
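
As a rough sketch of how the first two concepts reach a client, the helper below bundles them into the property dictionary that a PyIceberg REST catalog is configured with. `rest_catalog_properties` is a hypothetical helper name, and the placeholder values stand in for the Catalog URI, warehouse name, and API token from your own account; see [patterns.md](patterns.md) for the maintained connection pattern.

```python
from __future__ import annotations


def rest_catalog_properties(catalog_uri: str, warehouse: str, token: str) -> dict[str, str]:
    """Build REST catalog properties (hypothetical helper, not part of PyIceberg)."""
    values = {"uri": catalog_uri, "warehouse": warehouse, "token": token}
    missing = [key for key, value in values.items() if not value]
    if missing:
        raise ValueError(f"missing catalog settings: {', '.join(missing)}")
    # "type": "rest" selects the REST catalog implementation in PyIceberg.
    return {"type": "rest", **values}


# Placeholders only; substitute the values reported for your bucket.
properties = rest_catalog_properties(
    catalog_uri="<catalog-uri>",
    warehouse="<warehouse-name>",
    token="<api-token>",
)
```

With PyIceberg installed, these properties would typically be passed as `load_catalog("r2", **properties)` (from `pyiceberg.catalog`); the catalog name `"r2"` is an arbitrary label chosen for this example.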

## Limits

| Resource | Limit | Notes |
|----------|-------|-------|
| Namespaces per catalog | No hard limit | Organize tables logically |
| Tables per namespace | <10,000 recommended | Performance degrades beyond this |
| Files per table | <100,000 recommended | Run compaction regularly |
| Snapshots per table | Configurable retention | Expire >7 days old |
| Partitions per table | 100-1,000 optimal | Too many = slow metadata ops |
| Table size | Same as R2 bucket | 10GB-10TB+ common |
| API rate limits | Standard R2 API limits | Shared with R2 storage operations |
| Target file size | 128-512 MB | After compaction |
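
The recommendations in this table can be turned into a simple health check. The sketch below assumes you already have a table's file count, partition count, and average data file size from your own tooling; `table_warnings` and the constant names are hypothetical, and the thresholds are the guideline values from the table, not enforced quotas.

```python
from __future__ import annotations

# Guideline thresholds taken from the limits table (not hard limits).
MAX_FILES_PER_TABLE = 100_000
MAX_PARTITIONS_PER_TABLE = 1_000
MIN_TARGET_FILE_MB = 128


def table_warnings(file_count: int, partition_count: int, avg_file_mb: float) -> list[str]:
    """Return human-readable warnings for a table that drifts past the guidelines."""
    warnings: list[str] = []
    if file_count >= MAX_FILES_PER_TABLE:
        warnings.append(f"{file_count} files: run compaction to get below {MAX_FILES_PER_TABLE}")
    if partition_count > MAX_PARTITIONS_PER_TABLE:
        warnings.append(f"{partition_count} partitions: metadata operations may slow down")
    if avg_file_mb < MIN_TARGET_FILE_MB:
        warnings.append(f"average file is {avg_file_mb} MB: compact toward 128-512 MB files")
    return warnings
```

An empty list means the table is inside every guideline checked here; tables per namespace and snapshot age are not covered by this sketch.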

## Current Status

**Public Beta** (as of Jan 2026)
- Available to all R2 subscribers
- No extra cost beyond standard R2 storage/operations
- Production-ready, but breaking changes possible
- Supports: namespaces, tables, snapshots, compaction, time-travel, table maintenance

## Decision Tree: Is R2 Data Catalog Right For You?

```
Start → Need analytics on object storage data?
  │
  ├─ No → Use R2 directly for object storage
  │
  └─ Yes → Dataset >1GB with structured schema?
           │
           ├─ No → Too small, use R2 + ad-hoc queries
           │
           └─ Yes → Need ACID transactions or schema evolution?
                    │
                    ├─ No → Consider simpler solutions (Parquet on R2)
                    │
                    └─ Yes → Need multi-cloud/multi-tool access?
                             │
                             ├─ No → D1 or external DB may be simpler
                             │
                             └─ Yes → ✅ Use R2 Data Catalog
```

**Quick check:** If you answer "yes" to all:
- Dataset >1GB and growing
- Structured/tabular data (logs, events, metrics)
- Multiple query tools or cloud environments
- Need versioning, schema changes, or concurrent access

→ R2 Data Catalog is a good fit.
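
For readers who prefer code to diagrams, the same decision tree can be written as a small function. This is only a restatement of the tree above; `recommend` and its parameter names are hypothetical, and the returned strings are the tree's own leaf labels.

```python
def recommend(
    needs_analytics: bool,
    large_structured_dataset: bool,
    needs_acid_or_schema_evolution: bool,
    needs_multi_cloud_or_multi_tool: bool,
) -> str:
    """Walk the decision tree top to bottom and return the first matching leaf."""
    if not needs_analytics:
        return "Use R2 directly for object storage"
    if not large_structured_dataset:  # dataset >1GB with a structured schema
        return "Too small, use R2 + ad-hoc queries"
    if not needs_acid_or_schema_evolution:
        return "Consider simpler solutions (Parquet on R2)"
    if not needs_multi_cloud_or_multi_tool:
        return "D1 or external DB may be simpler"
    return "Use R2 Data Catalog"
```

The early returns mirror the tree's ordering: a "no" at any level ends the walk, so only four "yes" answers reach the R2 Data Catalog recommendation.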

## In This Reference

- **[configuration.md](configuration.md)** - Enable catalog, create API tokens, connect clients
- **[api.md](api.md)** - REST endpoints, operations, maintenance
- **[patterns.md](patterns.md)** - PyIceberg examples, common use cases
- **[gotchas.md](gotchas.md)** - Troubleshooting, best practices, limitations

## See Also

- [Cloudflare R2 Data Catalog Docs](https://developers.cloudflare.com/r2/data-catalog/)
- [Apache Iceberg Docs](https://iceberg.apache.org/)
- [PyIceberg Docs](https://py.iceberg.apache.org/)