mirror of https://github.com/ksyasuda/dotfiles.git
synced 2026-03-21 06:11:27 -07:00

Commit: update skills
.agents/skills/cloudflare-deploy/references/r2-sql/README.md (new file, 128 lines)
@@ -0,0 +1,128 @@
# Cloudflare R2 SQL Skill Reference

Expert guidance for Cloudflare R2 SQL - a serverless, distributed query engine for Apache Iceberg tables.

## Reading Order

**New to R2 SQL?** Start here:

1. Read "What is R2 SQL?" and "When to Use" below
2. [configuration.md](configuration.md) - Enable catalog, create tokens
3. [patterns.md](patterns.md) - Wrangler CLI and integration examples
4. [api.md](api.md) - SQL syntax and query reference
5. [gotchas.md](gotchas.md) - Limitations and troubleshooting

**Quick reference?** Jump to:

- [Run a query via Wrangler](patterns.md#wrangler-cli-query)
- [SQL syntax reference](api.md#sql-syntax)
- [ORDER BY limitations](gotchas.md#order-by-limitations)

## What is R2 SQL?

R2 SQL is Cloudflare's **serverless distributed analytics query engine** for querying Apache Iceberg tables in R2 Data Catalog. Features:

- **Serverless** - No clusters to manage, no infrastructure
- **Distributed** - Leverages Cloudflare's global network for parallel execution
- **SQL interface** - Familiar SQL syntax for analytics queries
- **Zero egress fees** - Query from any cloud/region without data transfer costs
- **Open beta** - Free during beta (standard R2 storage costs apply)

### What is Apache Iceberg?

An open table format for large-scale analytics datasets in object storage:

- **ACID transactions** - Safe concurrent reads/writes
- **Metadata optimization** - Fast queries without full table scans
- **Schema evolution** - Add/rename/drop columns without rewrites
- **Partitioning** - Organize data for efficient pruning

## When to Use

**Use R2 SQL for:**

- **Log analytics** - Query application/system logs with WHERE filters and aggregations
- **BI dashboards** - Generate reports from large analytical datasets
- **Fraud detection** - Analyze transaction patterns with GROUP BY/HAVING
- **Multi-cloud analytics** - Query data from any cloud without egress fees
- **Ad-hoc exploration** - Run SQL queries on Iceberg tables via Wrangler CLI

**Don't use R2 SQL for:**

- **Workers/Pages runtime** - R2 SQL has no Workers binding; use the HTTP API from external systems
- **Real-time queries (<100ms)** - Optimized for analytical batch queries, not OLTP
- **Complex joins/CTEs** - Limited SQL feature set (no JOINs, subqueries, or CTEs currently)
- **Small datasets (<1GB)** - Setup overhead not justified

## Decision Tree: Need to Query R2 Data?

```
Do you need to query structured data in R2?
├─ YES, data is in Iceberg tables
│  ├─ Need SQL interface? → Use R2 SQL (this reference)
│  ├─ Need Python API? → See r2-data-catalog reference (PyIceberg)
│  └─ Need other engine? → See r2-data-catalog reference (Spark, Trino, etc.)
│
├─ YES, but not in Iceberg format
│  ├─ Streaming data? → Use Pipelines to write to Data Catalog, then R2 SQL
│  └─ Static files? → Use PyIceberg to create Iceberg tables, then R2 SQL
│
└─ NO, just need object storage → Use R2 reference (not R2 SQL)
```

## Architecture Overview

**Query Planner:**
- Top-down metadata investigation with multi-layer pruning
- Partition-level, column-level, and row-group pruning
- Streaming pipeline - execution starts before planning completes
- Early termination with LIMIT - stops when the result is complete

**Query Execution:**
- Coordinator distributes work to workers across the Cloudflare network
- Workers run Apache DataFusion for parallel query execution
- Parquet column pruning - reads only required columns
- Ranged reads from R2 for efficiency

**Aggregation Strategies:**
- Scatter-gather - simple aggregations (SUM, COUNT, AVG)
- Shuffling - ORDER BY/HAVING on aggregates via hash partitioning

## Quick Start

```bash
# 1. Enable R2 Data Catalog on bucket
npx wrangler r2 bucket catalog enable my-bucket

# 2. Create API token (Admin Read & Write)
# Dashboard: R2 → Manage API tokens → Create API token

# 3. Set environment variable
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>

# 4. Run query
npx wrangler r2 sql query "my-bucket" "SELECT * FROM default.my_table LIMIT 10"
```

## Important Limitations

**CRITICAL: No Workers Binding**
- R2 SQL cannot be called directly from Workers/Pages code
- For programmatic access, use the HTTP API from external systems
- Or query via PyIceberg, Spark, etc. (see r2-data-catalog reference)

**SQL Feature Set:**
- No JOINs, CTEs, subqueries, or window functions
- ORDER BY limited to partition key columns and aggregation functions
- LIMIT max 10,000 (default 500)
- See [gotchas.md](gotchas.md) for complete limitations

## In This Reference

- **[configuration.md](configuration.md)** - Enable catalog, create API tokens
- **[api.md](api.md)** - SQL syntax, functions, operators, data types
- **[patterns.md](patterns.md)** - Wrangler CLI, HTTP API, Pipelines, PyIceberg
- **[gotchas.md](gotchas.md)** - Limitations, troubleshooting, performance tips

## See Also

- [r2-data-catalog](../r2-data-catalog/) - PyIceberg, REST API, external engines
- [pipelines](../pipelines/) - Streaming ingestion to Iceberg tables
- [r2](../r2/) - R2 object storage fundamentals
- [Cloudflare R2 SQL Docs](https://developers.cloudflare.com/r2-sql/)
- [R2 SQL Deep Dive Blog](https://blog.cloudflare.com/r2-sql-deep-dive/)
.agents/skills/cloudflare-deploy/references/r2-sql/api.md (new file, 158 lines)
@@ -0,0 +1,158 @@
# R2 SQL API Reference

SQL syntax, functions, operators, and data types for R2 SQL queries.

## SQL Syntax

```sql
SELECT column_list | aggregation_function
FROM [namespace.]table_name
WHERE conditions
[GROUP BY column_list]
[HAVING conditions]
[ORDER BY column | aggregation_function [DESC | ASC]]
[LIMIT number]
```

## Schema Discovery

```sql
SHOW DATABASES;              -- List namespaces
SHOW NAMESPACES;             -- Alias for SHOW DATABASES
SHOW SCHEMAS;                -- Alias for SHOW DATABASES
SHOW TABLES IN namespace;    -- List tables in namespace
DESCRIBE namespace.table;    -- Show table schema, partition keys
```

## SELECT Clause

```sql
-- All columns
SELECT * FROM logs.http_requests;

-- Specific columns
SELECT user_id, timestamp, status FROM logs.http_requests;
```

**Limitations:** No column aliases, expressions, or nested column access.

## WHERE Clause

### Operators

| Operator | Example |
|----------|---------|
| `=`, `!=`, `<`, `<=`, `>`, `>=` | `status = 200` |
| `LIKE` | `user_agent LIKE '%Chrome%'` |
| `BETWEEN` | `timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'` |
| `IS NULL`, `IS NOT NULL` | `email IS NOT NULL` |
| `AND`, `OR` | `status = 200 AND method = 'GET'` |

Use parentheses for precedence: `(status = 404 OR status = 500) AND method = 'POST'`

## Aggregation Functions

| Function | Description |
|----------|-------------|
| `COUNT(*)` | Count all rows |
| `COUNT(column)` | Count non-null values |
| `COUNT(DISTINCT column)` | Count unique values |
| `SUM(column)`, `AVG(column)` | Numeric aggregations |
| `MIN(column)`, `MAX(column)` | Min/max values |

```sql
-- Multiple aggregations with GROUP BY
SELECT region, COUNT(*), SUM(amount), AVG(amount)
FROM sales.transactions
WHERE sale_date >= '2024-01-01'
GROUP BY region;
```

## HAVING Clause

Filter aggregated results (after GROUP BY):

```sql
SELECT category, SUM(amount)
FROM sales.transactions
GROUP BY category
HAVING SUM(amount) > 10000;
```

## ORDER BY Clause

Sort results by:
- **Partition key columns** - Always supported
- **Aggregation functions** - Supported via the shuffle strategy

```sql
-- Order by partition key
SELECT * FROM logs.requests ORDER BY timestamp DESC LIMIT 100;

-- Order by aggregation (repeat the function; aliases are not supported)
SELECT region, SUM(amount)
FROM sales.transactions
GROUP BY region
ORDER BY SUM(amount) DESC;
```

**Limitations:** Cannot order by non-partition columns. See [gotchas.md](gotchas.md#order-by-limitations)

## LIMIT Clause

```sql
SELECT * FROM logs.requests LIMIT 100;
```

| Setting | Value |
|---------|-------|
| Min | 1 |
| Max | 10,000 |
| Default | 500 |

**Always use LIMIT** to enable the early-termination optimization.

## Data Types

| Type | SQL Literal | Example |
|------|-------------|---------|
| `integer` | Unquoted number | `42`, `-10` |
| `float` | Decimal number | `3.14`, `-0.5` |
| `string` | Single quotes | `'hello'`, `'GET'` |
| `boolean` | Keyword | `true`, `false` |
| `timestamp` | RFC3339 string | `'2025-01-01T00:00:00Z'` |
| `date` | ISO 8601 date | `'2025-01-01'` |

### Type Safety

- Quote strings with single quotes: `'value'`
- Timestamps must be RFC3339: `'2025-01-01T00:00:00Z'` (include timezone)
- Dates must be ISO 8601: `'2025-01-01'` (YYYY-MM-DD)
- No implicit conversions

```sql
-- ✅ Correct
WHERE status = 200 AND method = 'GET' AND timestamp > '2025-01-01T00:00:00Z'

-- ❌ Wrong
WHERE status = '200'              -- string instead of integer
WHERE timestamp > '2025-01-01'    -- missing time/timezone
WHERE method = GET                -- unquoted string
```
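
When queries are built programmatically, these type rules can be enforced by a small formatting helper. A minimal sketch (illustrative only, not an official client; escaping embedded quotes by doubling follows standard SQL and should be verified against R2 SQL's parser):

```python
from datetime import datetime, timezone

def sql_literal(value) -> str:
    """Format a Python value as an R2 SQL literal per the type rules above."""
    if isinstance(value, bool):  # check bool before int (bool subclasses int)
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, datetime):
        # RFC3339 with explicit UTC timezone; assumes an aware datetime
        return "'" + value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") + "'"
    if isinstance(value, str):
        # Single-quoted string; double embedded single quotes
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported literal type: {type(value).__name__}")

clause = f"WHERE status = {sql_literal(200)} AND method = {sql_literal('GET')}"
# → WHERE status = 200 AND method = 'GET'
```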

## Query Result Format

JSON array of objects:

```json
[
  {"user_id": "user_123", "timestamp": "2025-01-15T10:30:00Z", "status": 200},
  {"user_id": "user_456", "timestamp": "2025-01-15T10:31:00Z", "status": 404}
]
```

## See Also

- [patterns.md](patterns.md) - Query examples and use cases
- [gotchas.md](gotchas.md) - SQL limitations and error handling
- [configuration.md](configuration.md) - Setup and authentication
.agents/skills/cloudflare-deploy/references/r2-sql/configuration.md (new file, 147 lines)
@@ -0,0 +1,147 @@
# R2 SQL Configuration

Setup and configuration for R2 SQL queries.

## Prerequisites

- R2 bucket with Data Catalog enabled
- API token with R2 permissions
- Wrangler CLI installed (for CLI queries)

## Enable R2 Data Catalog

R2 SQL queries Apache Iceberg tables in R2 Data Catalog, so the catalog must be enabled on the bucket first.

### Via Wrangler CLI

```bash
npx wrangler r2 bucket catalog enable <bucket-name>
```

Output includes:
- **Warehouse name** - Typically the same as the bucket name
- **Catalog URI** - REST endpoint for catalog operations

Example output:
```
Catalog enabled successfully
Warehouse: my-bucket
Catalog URI: https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket
```

### Via Dashboard

1. Navigate to **R2 Object Storage** → Select your bucket
2. Click the **Settings** tab
3. Scroll to the **R2 Data Catalog** section
4. Click **Enable**
5. Note the **Catalog URI** and **Warehouse** name

**Important:** Enabling the catalog creates metadata directories in the bucket but does not modify existing objects.

## Create API Token

R2 SQL requires an API token with R2 permissions.

### Required Permission

**R2 Admin Read & Write** (includes the R2 SQL Read permission)

### Via Dashboard

1. Navigate to **R2 Object Storage**
2. Click **Manage API tokens** (top right)
3. Click **Create API token**
4. Select the **Admin Read & Write** permission
5. Click **Create API Token**
6. **Copy the token value** - it is shown only once

### Permission Scope

| Permission | Grants Access To |
|------------|------------------|
| R2 Admin Read & Write | R2 storage operations + R2 SQL queries + Data Catalog operations |
| R2 SQL Read | SQL queries only (no storage writes) |

**Note:** The R2 SQL Read permission is not yet available via the Dashboard - use Admin Read & Write.

## Configure Environment

### Wrangler CLI

Set an environment variable for Wrangler to use:

```bash
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
```

Or create a `.env` file in the project directory:

```
WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
```

Wrangler automatically loads the `.env` file when running commands.

### HTTP API

For programmatic access (non-Wrangler), pass the token in the Authorization header:

```bash
curl -X POST https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/sql/query \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "warehouse": "my-bucket",
    "query": "SELECT * FROM default.my_table LIMIT 10"
  }'
```

**Note:** The HTTP API endpoint URL may vary - see [patterns.md](patterns.md#http-api-query) for the current endpoint.

## Verify Setup

Test the configuration by querying system tables:

```bash
# List namespaces
npx wrangler r2 sql query "my-bucket" "SHOW DATABASES"

# List tables in namespace
npx wrangler r2 sql query "my-bucket" "SHOW TABLES IN default"
```

If successful, this returns a JSON array of results.
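
The returned JSON can be parsed directly in scripts. A sketch (the sample output and the `namespace` field name are illustrative stand-ins for a real query result):

```python
import json

# Stand-in for the JSON printed by:
#   npx wrangler r2 sql query "my-bucket" "SHOW DATABASES"
raw = '[{"namespace": "default"}, {"namespace": "analytics"}]'

rows = json.loads(raw)
namespaces = [row["namespace"] for row in rows]
# namespaces == ["default", "analytics"]
```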

## Troubleshooting

### "Token authentication failed"

**Cause:** Invalid or missing token

**Solution:**
- Verify the `WRANGLER_R2_SQL_AUTH_TOKEN` environment variable is set
- Check that the token has the Admin Read & Write permission
- Create a new token if expired

### "Catalog not enabled on bucket"

**Cause:** Data Catalog not enabled

**Solution:**
- Run `npx wrangler r2 bucket catalog enable <bucket-name>`
- Or enable via the Dashboard (R2 → bucket → Settings → R2 Data Catalog)

### "Permission denied"

**Cause:** Token lacks required permissions

**Solution:**
- Verify the token has the **Admin Read & Write** permission
- Create a new token with correct permissions

## See Also

- [r2-data-catalog/configuration.md](../r2-data-catalog/configuration.md) - Detailed token setup and PyIceberg connection
- [patterns.md](patterns.md) - Query examples using this configuration
- [gotchas.md](gotchas.md) - Common configuration errors
.agents/skills/cloudflare-deploy/references/r2-sql/gotchas.md (new file, 212 lines)
@@ -0,0 +1,212 @@
# R2 SQL Gotchas

Limitations, troubleshooting, and common pitfalls for R2 SQL.

## Critical Limitations

### No Workers Binding

**You cannot call R2 SQL from Workers/Pages code** - no binding exists.

```typescript
// ❌ This doesn't exist
export default {
  async fetch(request, env) {
    const result = await env.R2_SQL.query("SELECT * FROM table"); // Not possible
    return Response.json(result);
  }
};
```

**Solutions:**
- HTTP API from external systems (not Workers)
- PyIceberg/Spark via the r2-data-catalog REST API
- For Workers, use D1 or external databases

### ORDER BY Limitations

Can only order by:
1. **Partition key columns** - Always supported
2. **Aggregation functions** - Supported via the shuffle strategy

**Cannot order by** regular non-partition columns.

```sql
-- ✅ Valid: ORDER BY partition key
SELECT * FROM logs.requests ORDER BY timestamp DESC LIMIT 100;

-- ✅ Valid: ORDER BY aggregation
SELECT region, SUM(amount) FROM sales.transactions
GROUP BY region ORDER BY SUM(amount) DESC;

-- ❌ Invalid: ORDER BY non-partition column
SELECT * FROM logs.requests ORDER BY user_id;

-- ❌ Invalid: ORDER BY alias (must repeat the function)
SELECT region, SUM(amount) as total FROM sales.transactions
GROUP BY region ORDER BY total; -- Use ORDER BY SUM(amount)
```

Check the partition spec with `DESCRIBE namespace.table_name`.

## SQL Feature Limitations

| Feature | Supported | Notes |
|---------|-----------|-------|
| SELECT, WHERE, GROUP BY, HAVING | ✅ | Standard support |
| COUNT, SUM, AVG, MIN, MAX | ✅ | Standard aggregations |
| ORDER BY partition/aggregation | ✅ | See above |
| LIMIT | ✅ | Max 10,000 |
| Column aliases | ❌ | No `AS alias` |
| Expressions in SELECT | ❌ | No `col1 + col2` |
| ORDER BY non-partition | ❌ | Fails at runtime |
| JOINs, subqueries, CTEs | ❌ | Denormalize at write time |
| Window functions, UNION | ❌ | Use external engines |
| INSERT/UPDATE/DELETE | ❌ | Use PyIceberg/Pipelines |
| Nested columns, arrays, JSON | ❌ | Flatten at write time |

**Workarounds:**
- No JOINs: Denormalize data or use Spark/PyIceberg
- No subqueries: Split into multiple queries
- No aliases: Accept generated names, transform in the app
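
The subquery workaround can be sketched as a two-step flow: fetch the scalar in one query, then interpolate it into a follow-up query in application code. A sketch (the table, columns, and the generated result-column name `AVG(amount)` are illustrative assumptions):

```python
# Sketch: replace "WHERE amount > (SELECT AVG(amount) ...)" with two queries.
# run_query is a stand-in for whatever executes R2 SQL (Wrangler or HTTP API).

def build_followup(avg_amount: float) -> str:
    """Build the second query from the first query's scalar result."""
    return (
        "SELECT merchant_category, COUNT(*) FROM fraud.transactions "
        f"WHERE amount > {avg_amount} "
        "GROUP BY merchant_category LIMIT 100"
    )

# Step 1 (pseudo): result = run_query("SELECT AVG(amount) FROM fraud.transactions")
result = [{"AVG(amount)": 42.5}]  # stubbed response in the documented JSON shape

# Step 2: interpolate the scalar into the follow-up query
avg_amount = result[0]["AVG(amount)"]
query = build_followup(avg_amount)
```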

## Common Errors

### "Column not found"

**Cause:** Typo, column doesn't exist, or case mismatch

**Solution:** Run `DESCRIBE namespace.table_name` to check the schema

### "Type mismatch"

```sql
-- ❌ Wrong types
WHERE status = '200'              -- string instead of integer
WHERE timestamp > '2025-01-01'    -- missing time/timezone

-- ✅ Correct types
WHERE status = 200
WHERE timestamp > '2025-01-01T00:00:00Z'
```

### "ORDER BY column not in partition key"

**Cause:** Ordering by a non-partition column

**Solution:** Use a partition key or aggregation, or remove ORDER BY. Check with `DESCRIBE table`

### "Token authentication failed"

```bash
# Check/set token
echo $WRANGLER_R2_SQL_AUTH_TOKEN
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>

# Or .env file
echo "WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>" > .env
```

### "Table not found"

```sql
-- Verify catalog and tables
SHOW DATABASES;
SHOW TABLES IN namespace_name;
```

Enable the catalog: `npx wrangler r2 bucket catalog enable <bucket>`

### "LIMIT exceeds maximum"

Max LIMIT is 10,000. For pagination, use WHERE filters on partition keys.
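
One way to paginate is to generate one bounded query per time window on a partition key. A sketch, assuming a table partitioned on `timestamp`:

```python
from datetime import datetime, timedelta, timezone

def window_queries(table: str, start: datetime, end: datetime, step_hours: int = 24):
    """Yield one query per time window, each capped at the 10,000-row LIMIT."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC3339, UTC
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(hours=step_hours), end)
        yield (
            f"SELECT * FROM {table} "
            f"WHERE timestamp >= '{cursor.strftime(fmt)}' "
            f"AND timestamp < '{nxt.strftime(fmt)}' LIMIT 10000"
        )
        cursor = nxt

start = datetime(2025, 1, 1, tzinfo=timezone.utc)
end = datetime(2025, 1, 3, tzinfo=timezone.utc)
queries = list(window_queries("logs.requests", start, end))
# Two day-sized windows for a two-day range
```

Shrink `step_hours` if any single window can exceed 10,000 rows.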

### "No data returned" (unexpected)

**Debug steps:**
1. `SELECT COUNT(*) FROM table` - verify data exists
2. Remove WHERE filters incrementally
3. `SELECT * FROM table LIMIT 10` - inspect actual data/types

## Performance Issues

### Slow Queries

**Causes:** Too many partitions, large LIMIT, no filters, small files

```sql
-- ❌ Slow: No filters
SELECT * FROM logs.requests LIMIT 10000;

-- ✅ Fast: Filter on partition key
SELECT * FROM logs.requests
WHERE timestamp >= '2025-01-15T00:00:00Z' AND timestamp < '2025-01-16T00:00:00Z'
LIMIT 1000;

-- ✅ Faster: Multiple filters
SELECT * FROM logs.requests
WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET'
LIMIT 1000;
```

**File optimization:**
- Target Parquet size: 100-500MB compressed
- Pipelines roll interval: 300+ sec (prod), 10 sec (dev)
- Run compaction to merge small files

### Query Timeout

**Solution:** Add restrictive WHERE filters, reduce the time range, query smaller intervals

```sql
-- ❌ Times out: Year-long aggregation
SELECT status, COUNT(*) FROM logs.requests
WHERE timestamp >= '2024-01-01T00:00:00Z' GROUP BY status;

-- ✅ Faster: Month-long aggregation
SELECT status, COUNT(*) FROM logs.requests
WHERE timestamp >= '2025-01-01T00:00:00Z' AND timestamp < '2025-02-01T00:00:00Z'
GROUP BY status;
```

## Best Practices

### Partitioning
- **Time-series:** Partition by day/hour on the timestamp
- **Avoid:** High-cardinality keys (user_id), >10,000 partitions

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

PartitionSpec(PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"))
```

### Query Writing
- **Always use LIMIT** for early termination
- **Filter on partition keys first** for pruning
- **Combine filters with AND** for more pruning

```sql
-- Good
WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET' LIMIT 100
```

### Type Safety
- Quote strings: `'GET'` not `GET`
- RFC3339 timestamps: `'2025-01-01T00:00:00Z'` not `'2025-01-01'`
- ISO dates: `'2025-01-15'` not `'01/15/2025'`

### Data Organization
- **Pipelines:** Dev `roll_file_time: 10`, Prod `roll_file_time: 300+`
- **Compression:** Use `zstd`
- **Maintenance:** Compact small files, expire old snapshots

## Debugging Checklist

1. `npx wrangler r2 bucket catalog enable <bucket>` - Verify catalog
2. `echo $WRANGLER_R2_SQL_AUTH_TOKEN` - Check token
3. `SHOW DATABASES` - List namespaces
4. `SHOW TABLES IN namespace` - List tables
5. `DESCRIBE namespace.table` - Check schema
6. `SELECT COUNT(*) FROM namespace.table` - Verify data
7. `SELECT * FROM namespace.table LIMIT 10` - Test a simple query
8. Add filters incrementally

## See Also

- [api.md](api.md) - SQL syntax
- [patterns.md](patterns.md) - Query optimization
- [configuration.md](configuration.md) - Setup
- [Cloudflare R2 SQL Docs](https://developers.cloudflare.com/r2-sql/)
.agents/skills/cloudflare-deploy/references/r2-sql/patterns.md (new file, 222 lines)
@@ -0,0 +1,222 @@
# R2 SQL Patterns

Common patterns, use cases, and integration examples for R2 SQL.

## Wrangler CLI Query

```bash
# Basic query
npx wrangler r2 sql query "my-bucket" "SELECT * FROM default.logs LIMIT 10"

# Multi-line query
npx wrangler r2 sql query "my-bucket" "
SELECT status, COUNT(*), AVG(response_time)
FROM logs.http_requests
WHERE timestamp >= '2025-01-01T00:00:00Z'
GROUP BY status
ORDER BY COUNT(*) DESC
LIMIT 100
"

# Use an environment variable
export R2_SQL_WAREHOUSE="my-bucket"
npx wrangler r2 sql query "$R2_SQL_WAREHOUSE" "SELECT * FROM default.logs"
```

## HTTP API Query

For programmatic access from external systems (not Workers - see gotchas.md).

```bash
curl -X POST https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/sql/query \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "warehouse": "my-bucket",
    "query": "SELECT * FROM default.my_table WHERE status = 200 LIMIT 100"
  }'
```

Response:
```json
{
  "success": true,
  "result": [{"user_id": "user_123", "timestamp": "2025-01-15T10:30:00Z", "status": 200}],
  "errors": []
}
```
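
From scripting languages, the same call can be assembled without shelling out to curl. A sketch in Python (the endpoint mirrors the curl example above and may vary; `build_r2_sql_request` is an illustrative helper, not an official client):

```python
import json

API_BASE = "https://api.cloudflare.com/client/v4"  # endpoint may vary; see note above

def build_r2_sql_request(account_id: str, token: str, warehouse: str, query: str):
    """Assemble the URL, headers, and JSON body for an R2 SQL HTTP API query."""
    url = f"{API_BASE}/accounts/{account_id}/r2/sql/query"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"warehouse": warehouse, "query": query})
    return url, headers, body

url, headers, body = build_r2_sql_request(
    "abc123", "my-token", "my-bucket",
    "SELECT * FROM default.my_table LIMIT 10",
)
# Send with any HTTP client, e.g.:
#   requests.post(url, headers=headers, data=body).json()
```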

## Pipelines Integration

Stream data to Iceberg tables via Pipelines, then query it with R2 SQL.

```bash
# Set up a pipeline (select the Data Catalog Table destination)
npx wrangler pipelines setup

# Key settings:
# - Destination: Data Catalog Table
# - Compression: zstd (recommended)
# - Roll file time: 300+ sec (production), 10 sec (dev)

# Send data to the pipeline
curl -X POST https://{stream-id}.ingest.cloudflare.com \
  -H "Content-Type: application/json" \
  -d '[{"user_id": "user_123", "event_type": "purchase", "timestamp": "2025-01-15T10:30:00Z", "amount": 29.99}]'

# Query ingested data (wait for the roll interval)
npx wrangler r2 sql query "my-bucket" "
SELECT event_type, COUNT(*), SUM(amount)
FROM default.events
WHERE timestamp >= '2025-01-15T00:00:00Z'
GROUP BY event_type
"
```

See [pipelines/patterns.md](../pipelines/patterns.md) for detailed setup.

## PyIceberg Integration

Create and populate Iceberg tables with PyIceberg, then query them with R2 SQL.

```python
from pyiceberg.catalog.rest import RestCatalog
import pyarrow as pa
import pandas as pd

# Set up the catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse="my-bucket",
    uri="https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket",
    token="<your-token>",
)
catalog.create_namespace_if_not_exists("analytics")

# Create a table
schema = pa.schema([
    pa.field("user_id", pa.string(), nullable=False),
    pa.field("event_time", pa.timestamp("us", tz="UTC"), nullable=False),
    pa.field("page_views", pa.int64(), nullable=False),
])
table = catalog.create_table(("analytics", "user_metrics"), schema=schema)

# Append data
df = pd.DataFrame({
    "user_id": ["user_1", "user_2"],
    "event_time": pd.to_datetime(["2025-01-15 10:00:00", "2025-01-15 11:00:00"], utc=True),
    "page_views": [10, 25],
})
table.append(pa.Table.from_pandas(df, schema=schema))
```

Query with R2 SQL:
```bash
npx wrangler r2 sql query "my-bucket" "
SELECT user_id, SUM(page_views)
FROM analytics.user_metrics
WHERE event_time >= '2025-01-15T00:00:00Z'
GROUP BY user_id
"
```

See [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) for advanced PyIceberg patterns.

## Use Cases

### Log Analytics
```sql
-- Errors by endpoint (expressions/aliases are unsupported, so filter instead)
SELECT path, COUNT(*)
FROM logs.http_requests
WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'
  AND status >= 400
GROUP BY path ORDER BY COUNT(*) DESC LIMIT 20;

-- Response time stats
SELECT method, MIN(response_time_ms), AVG(response_time_ms), MAX(response_time_ms)
FROM logs.http_requests WHERE timestamp >= '2025-01-15T00:00:00Z' GROUP BY method;

-- Traffic by status
SELECT status, COUNT(*) FROM logs.http_requests
WHERE timestamp >= '2025-01-15T00:00:00Z' AND method = 'GET'
GROUP BY status ORDER BY COUNT(*) DESC;
```

### Fraud Detection
```sql
-- High-value transactions
SELECT location, COUNT(*), SUM(amount), AVG(amount)
FROM fraud.transactions
WHERE transaction_timestamp >= '2025-01-01T00:00:00Z' AND amount > 1000.0
GROUP BY location ORDER BY SUM(amount) DESC LIMIT 20;

-- Flagged transactions
SELECT merchant_category, COUNT(*), AVG(amount) FROM fraud.transactions
WHERE is_fraud_flag = true AND transaction_timestamp >= '2025-01-01T00:00:00Z'
GROUP BY merchant_category HAVING COUNT(*) > 10 ORDER BY COUNT(*) DESC;
```

### Business Intelligence
```sql
-- Sales by department
SELECT department, SUM(revenue), AVG(revenue), COUNT(*) FROM sales.transactions
WHERE sale_date >= '2024-01-01' GROUP BY department ORDER BY SUM(revenue) DESC LIMIT 10;

-- Product performance
SELECT category, COUNT(DISTINCT product_id), SUM(units_sold), SUM(revenue)
FROM sales.product_sales WHERE sale_date BETWEEN '2024-10-01' AND '2024-12-31'
GROUP BY category ORDER BY SUM(revenue) DESC;
```

## Connecting External Engines

R2 Data Catalog exposes the Iceberg REST API, so you can connect Spark, Snowflake, Trino, DuckDB, etc.

```scala
// Apache Spark example
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
  .config("spark.sql.catalog.my_catalog.uri", "https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket")
  .config("spark.sql.catalog.my_catalog.token", "<token>")
  .getOrCreate()

spark.sql("SELECT * FROM my_catalog.default.my_table LIMIT 10").show()
```

See [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) for more engines.

## Performance Optimization

### Partitioning
- **Time-series:** day/hour on timestamp
- **Geographic:** region/country
- **Avoid:** High-cardinality keys (user_id)

```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

PartitionSpec(PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"))
```

### Query Optimization
- **Always use LIMIT** for early termination
- **Filter on partition keys first**
- **Multiple filters** for better pruning

```sql
-- Better: Multiple filters including the partition key
SELECT * FROM logs.requests
WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET' LIMIT 100;
```

### File Organization
- **Pipelines roll:** Dev 10-30s, Prod 300+s
- **Target Parquet size:** 100-500MB compressed

## See Also

- [api.md](api.md) - SQL syntax reference
- [gotchas.md](gotchas.md) - Limitations and troubleshooting
- [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) - PyIceberg advanced patterns
- [pipelines/patterns.md](../pipelines/patterns.md) - Streaming ingestion patterns