mirror of
https://github.com/ksyasuda/dotfiles.git
synced 2026-03-20 18:11:27 -07:00
129 lines
5.1 KiB
Markdown
129 lines
5.1 KiB
Markdown
# Cloudflare R2 SQL Skill Reference
|
|
|
|
Expert guidance for Cloudflare R2 SQL - serverless distributed query engine for Apache Iceberg tables.
|
|
|
|
## Reading Order
|
|
|
|
**New to R2 SQL?** Start here:
|
|
1. Read "What is R2 SQL?" and "When to Use" below
|
|
2. [configuration.md](configuration.md) - Enable catalog, create tokens
|
|
3. [patterns.md](patterns.md) - Wrangler CLI and integration examples
|
|
4. [api.md](api.md) - SQL syntax and query reference
|
|
5. [gotchas.md](gotchas.md) - Limitations and troubleshooting
|
|
|
|
**Quick reference?** Jump to:
|
|
- [Run a query via Wrangler](patterns.md#wrangler-cli-query)
|
|
- [SQL syntax reference](api.md#sql-syntax)
|
|
- [ORDER BY limitations](gotchas.md#order-by-limitations)
|
|
|
|
## What is R2 SQL?
|
|
|
|
R2 SQL is Cloudflare's **serverless distributed analytics query engine** for querying Apache Iceberg tables in R2 Data Catalog. Features:
|
|
|
|
- **Serverless** - No clusters to manage, no infrastructure
|
|
- **Distributed** - Leverages Cloudflare's global network for parallel execution
|
|
- **SQL interface** - Familiar SQL syntax for analytics queries
|
|
- **Zero egress fees** - Query from any cloud/region without data transfer costs
|
|
- **Open beta** - Free during beta (standard R2 storage costs apply)
|
|
|
|
### What is Apache Iceberg?
|
|
|
|
Open table format for large-scale analytics datasets in object storage:
|
|
- **ACID transactions** - Safe concurrent reads/writes
|
|
- **Metadata optimization** - Fast queries without full table scans
|
|
- **Schema evolution** - Add/rename/drop columns without rewrites
|
|
- **Partitioning** - Organize data for efficient pruning
|
|
|
|
## When to Use
|
|
|
|
**Use R2 SQL for:**
|
|
- **Log analytics** - Query application/system logs with WHERE filters and aggregations
|
|
- **BI dashboards** - Generate reports from large analytical datasets
|
|
- **Fraud detection** - Analyze transaction patterns with GROUP BY/HAVING
|
|
- **Multi-cloud analytics** - Query data from any cloud without egress fees
|
|
- **Ad-hoc exploration** - Run SQL queries on Iceberg tables via Wrangler CLI
|
|
|
|
**Don't use R2 SQL for:**
|
|
- **Workers/Pages runtime** - R2 SQL has no Workers binding, use HTTP API from external systems
|
|
- **Real-time queries (<100ms)** - Optimized for analytical batch queries, not OLTP
|
|
- **Complex joins/CTEs** - Limited SQL feature set (no JOINs, subqueries, CTEs currently)
|
|
- **Small datasets (<1GB)** - Setup overhead not justified
|
|
|
|
## Decision Tree: Need to Query R2 Data?
|
|
|
|
```
|
|
Do you need to query structured data in R2?
|
|
├─ YES, data is in Iceberg tables
|
|
│ ├─ Need SQL interface? → Use R2 SQL (this reference)
|
|
│ ├─ Need Python API? → See r2-data-catalog reference (PyIceberg)
|
|
│ └─ Need other engine? → See r2-data-catalog reference (Spark, Trino, etc.)
|
|
│
|
|
├─ YES, but not in Iceberg format
|
|
│ ├─ Streaming data? → Use Pipelines to write to Data Catalog, then R2 SQL
|
|
│ └─ Static files? → Use PyIceberg to create Iceberg tables, then R2 SQL
|
|
│
|
|
└─ NO, just need object storage → Use R2 reference (not R2 SQL)
|
|
```
|
|
|
|
## Architecture Overview
|
|
|
|
**Query Planner:**
|
|
- Top-down metadata investigation with multi-layer pruning
|
|
- Partition-level, column-level, and row-group pruning
|
|
- Streaming pipeline - execution starts before planning completes
|
|
- Early termination with LIMIT - stops when result complete
|
|
|
|
**Query Execution:**
|
|
- Coordinator distributes work to workers across Cloudflare network
|
|
- Workers run Apache DataFusion for parallel query execution
|
|
- Parquet column pruning - reads only required columns
|
|
- Ranged reads from R2 for efficiency
|
|
|
|
**Aggregation Strategies:**
|
|
- Scatter-gather - simple aggregations (SUM, COUNT, AVG)
|
|
- Shuffling - ORDER BY/HAVING on aggregates via hash partitioning
|
|
|
|
## Quick Start
|
|
|
|
```bash
|
|
# 1. Enable R2 Data Catalog on bucket
|
|
npx wrangler r2 bucket catalog enable my-bucket
|
|
|
|
# 2. Create API token (Admin Read & Write)
|
|
# Dashboard: R2 → Manage API tokens → Create API token
|
|
|
|
# 3. Set environment variable
|
|
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
|
|
|
|
# 4. Run query
|
|
npx wrangler r2 sql query "my-bucket" "SELECT * FROM default.my_table LIMIT 10"
|
|
```
|
|
|
|
## Important Limitations
|
|
|
|
**CRITICAL: No Workers Binding**
|
|
- R2 SQL cannot be called directly from Workers/Pages code
|
|
- For programmatic access, use HTTP API from external systems
|
|
- Or query via PyIceberg, Spark, etc. (see r2-data-catalog reference)
|
|
|
|
**SQL Feature Set:**
|
|
- No JOINs, CTEs, subqueries, window functions
|
|
- ORDER BY supports aggregation columns (not just partition keys)
|
|
- LIMIT max 10,000 (default 500)
|
|
- See [gotchas.md](gotchas.md) for complete limitations
|
|
|
|
## In This Reference
|
|
|
|
- **[configuration.md](configuration.md)** - Enable catalog, create API tokens
|
|
- **[api.md](api.md)** - SQL syntax, functions, operators, data types
|
|
- **[patterns.md](patterns.md)** - Wrangler CLI, HTTP API, Pipelines, PyIceberg
|
|
- **[gotchas.md](gotchas.md)** - Limitations, troubleshooting, performance tips
|
|
|
|
## See Also
|
|
|
|
- [r2-data-catalog](../r2-data-catalog/) - PyIceberg, REST API, external engines
|
|
- [pipelines](../pipelines/) - Streaming ingestion to Iceberg tables
|
|
- [r2](../r2/) - R2 object storage fundamentals
|
|
- [Cloudflare R2 SQL Docs](https://developers.cloudflare.com/r2-sql/)
|
|
- [R2 SQL Deep Dive Blog](https://blog.cloudflare.com/r2-sql-deep-dive/)
|