update skills

2026-07-03 09:13:32 -07:00 · 2026-03-17 16:53:22 -07:00
parent 0b0783ef8e
commit f9a530667e
389 changed files with 54512 additions and 1 deletions
@@ -0,0 +1,128 @@
+# Cloudflare R2 SQL Skill Reference
+
+Expert guidance for Cloudflare R2 SQL, a serverless distributed query engine for Apache Iceberg tables.
+
+## Reading Order
+
+**New to R2 SQL?** Start here:
+1. Read "What is R2 SQL?" and "When to Use" below
+2. [configuration.md](configuration.md) - Enable catalog, create tokens
+3. [patterns.md](patterns.md) - Wrangler CLI and integration examples
+4. [api.md](api.md) - SQL syntax and query reference
+5. [gotchas.md](gotchas.md) - Limitations and troubleshooting
+
+**Quick reference?** Jump to:
+- [Run a query via Wrangler](patterns.md#wrangler-cli-query)
+- [SQL syntax reference](api.md#sql-syntax)
+- [ORDER BY limitations](gotchas.md#order-by-limitations)
+
+## What is R2 SQL?
+
+R2 SQL is Cloudflare's **serverless distributed analytics query engine** for querying Apache Iceberg tables in R2 Data Catalog. Features:
+
+- **Serverless** - No clusters to manage, no infrastructure
+- **Distributed** - Leverages Cloudflare's global network for parallel execution
+- **SQL interface** - Familiar SQL syntax for analytics queries
+- **Zero egress fees** - Query from any cloud/region without data transfer costs
+- **Open beta** - Free during beta (standard R2 storage costs apply)
+
+### What is Apache Iceberg?
+
+Open table format for large-scale analytics datasets in object storage:
+- **ACID transactions** - Safe concurrent reads/writes
+- **Metadata optimization** - Fast queries without full table scans
+- **Schema evolution** - Add/rename/drop columns without rewrites
+- **Partitioning** - Organize data for efficient pruning
+
+## When to Use
+
+**Use R2 SQL for:**
+- **Log analytics** - Query application/system logs with WHERE filters and aggregations
+- **BI dashboards** - Generate reports from large analytical datasets
+- **Fraud detection** - Analyze transaction patterns with GROUP BY/HAVING
+- **Multi-cloud analytics** - Query data from any cloud without egress fees
+- **Ad-hoc exploration** - Run SQL queries on Iceberg tables via Wrangler CLI
+
+**Don't use R2 SQL for:**
+- **Workers/Pages runtime** - R2 SQL has no Workers binding; use the HTTP API from external systems
+- **Real-time queries (<100ms)** - Optimized for analytical batch queries, not OLTP
+- **Complex joins/CTEs** - Limited SQL feature set (no JOINs, subqueries, CTEs currently)
+- **Small datasets (<1GB)** - Setup overhead not justified
+
+## Decision Tree: Need to Query R2 Data?
+
+```
+Do you need to query structured data in R2?
+├─ YES, data is in Iceberg tables
+│  ├─ Need SQL interface? → Use R2 SQL (this reference)
+│  ├─ Need Python API? → See r2-data-catalog reference (PyIceberg)
+│  └─ Need other engine? → See r2-data-catalog reference (Spark, Trino, etc.)
+│
+├─ YES, but not in Iceberg format
+│  ├─ Streaming data? → Use Pipelines to write to Data Catalog, then R2 SQL
+│  └─ Static files? → Use PyIceberg to create Iceberg tables, then R2 SQL
+│
+└─ NO, just need object storage → Use R2 reference (not R2 SQL)
+```
+
+## Architecture Overview
+
+**Query Planner:**
+- Top-down metadata investigation with multi-layer pruning
+- Partition-level, column-level, and row-group pruning
+- Streaming pipeline - execution starts before planning completes
+- Early termination with LIMIT - stops when result complete
+
+**Query Execution:**
+- Coordinator distributes work to workers across Cloudflare network
+- Workers run Apache DataFusion for parallel query execution
+- Parquet column pruning - reads only required columns
+- Ranged reads from R2 for efficiency
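+
+As a rough illustration of how the planner and execution steps above map onto a query, consider the sketch below. It assumes a hypothetical `logs.http_requests` table partitioned by day on `timestamp`; the table and column names are illustrative only, and the comments restate the behavior described above rather than a verified query plan.
+
+```sql
+-- Assumed: logs.http_requests is partitioned by day on timestamp (hypothetical table)
+SELECT user_id, status                      -- column pruning: only referenced columns are read
+FROM logs.http_requests
+WHERE timestamp >= '2025-01-15T00:00:00Z'
+  AND timestamp < '2025-01-16T00:00:00Z'    -- partition and row-group pruning on timestamp
+  AND status = 404                          -- further pruning where column statistics allow it
+LIMIT 100;                                  -- early termination once 100 rows are collected
+```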
+
+**Aggregation Strategies:**
+- Scatter-gather - simple aggregations (SUM, COUNT, AVG)
+- Shuffling - ORDER BY/HAVING on aggregates via hash partitioning
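+
+The two strategies can be told apart by the shape of the query. A minimal sketch, assuming a hypothetical `sales.transactions` table with `region`, `amount`, and `sale_date` columns; the mapping in the comments follows the description above and is not a guarantee about the planner's choice:
+
+```sql
+-- Likely scatter-gather: workers compute partial aggregates, the coordinator merges them
+SELECT region, COUNT(*), SUM(amount)
+FROM sales.transactions
+WHERE sale_date >= '2024-01-01'
+GROUP BY region;
+
+-- Likely shuffle: filtering and ordering on the aggregate needs complete per-group totals
+SELECT region, SUM(amount)
+FROM sales.transactions
+WHERE sale_date >= '2024-01-01'
+GROUP BY region
+HAVING SUM(amount) > 10000
+ORDER BY SUM(amount) DESC
+LIMIT 20;
+```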
+
+## Quick Start
+
+```bash
+# 1. Enable R2 Data Catalog on bucket
+npx wrangler r2 bucket catalog enable my-bucket
+
+# 2. Create API token (Admin Read & Write)
+# Dashboard: R2 → Manage API tokens → Create API token
+
+# 3. Set environment variable
+export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
+
+# 4. Run query
+npx wrangler r2 sql query "my-bucket" "SELECT * FROM default.my_table LIMIT 10"
+```
+
+## Important Limitations
+
+**CRITICAL: No Workers Binding**
+- R2 SQL cannot be called directly from Workers/Pages code
+- For programmatic access, use HTTP API from external systems
+- Or query via PyIceberg, Spark, etc. (see r2-data-catalog reference)
+
+**SQL Feature Set:**
+- No JOINs, CTEs, subqueries, window functions
+- ORDER BY supports only partition key columns and aggregation functions
+- LIMIT max 10,000 (default 500)
+- See [gotchas.md](gotchas.md) for complete limitations
+
+## In This Reference
+
+- **[configuration.md](configuration.md)** - Enable catalog, create API tokens
+- **[api.md](api.md)** - SQL syntax, functions, operators, data types
+- **[patterns.md](patterns.md)** - Wrangler CLI, HTTP API, Pipelines, PyIceberg
+- **[gotchas.md](gotchas.md)** - Limitations, troubleshooting, performance tips
+
+## See Also
+
+- [r2-data-catalog](../r2-data-catalog/) - PyIceberg, REST API, external engines
+- [pipelines](../pipelines/) - Streaming ingestion to Iceberg tables
+- [r2](../r2/) - R2 object storage fundamentals
+- [Cloudflare R2 SQL Docs](https://developers.cloudflare.com/r2-sql/)
+- [R2 SQL Deep Dive Blog](https://blog.cloudflare.com/r2-sql-deep-dive/)
@@ -0,0 +1,158 @@
+# R2 SQL API Reference
+
+SQL syntax, functions, operators, and data types for R2 SQL queries.
+
+## SQL Syntax
+
+```sql
+SELECT column_list | aggregation_function
+FROM [namespace.]table_name
+[WHERE conditions]
+[GROUP BY column_list]
+[HAVING conditions]
+[ORDER BY column | aggregation_function [DESC | ASC]]
+[LIMIT number]
+```
+
+## Schema Discovery
+
+```sql
+SHOW DATABASES;           -- List namespaces
+SHOW NAMESPACES;          -- Alias for SHOW DATABASES
+SHOW SCHEMAS;             -- Alias for SHOW DATABASES
+SHOW TABLES IN namespace; -- List tables in namespace
+DESCRIBE namespace.table; -- Show table schema, partition keys
+```
+
+## SELECT Clause
+
+```sql
+-- All columns
+SELECT * FROM logs.http_requests;
+
+-- Specific columns
+SELECT user_id, timestamp, status FROM logs.http_requests;
+```
+
+**Limitations:** No column aliases, expressions, or nested column access
+
+## WHERE Clause
+
+### Operators
+
+| Operator | Example |
+|----------|---------|
+| `=`, `!=`, `<`, `<=`, `>`, `>=` | `status = 200` |
+| `LIKE` | `user_agent LIKE '%Chrome%'` |
+| `BETWEEN` | `timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'` |
+| `IS NULL`, `IS NOT NULL` | `email IS NOT NULL` |
+| `AND`, `OR` | `status = 200 AND method = 'GET'` |
+
+Use parentheses for precedence: `(status = 404 OR status = 500) AND method = 'POST'`
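+
+A short sketch combining the operators above in one filter, assuming a hypothetical `logs.http_requests` table with `status`, `method`, `user_agent`, `email`, and `timestamp` columns:
+
+```sql
+SELECT user_id, timestamp, status
+FROM logs.http_requests
+WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'
+  AND (status = 404 OR status = 500)   -- parentheses keep the OR grouped inside the AND chain
+  AND method != 'GET'
+  AND user_agent LIKE '%Chrome%'
+  AND email IS NOT NULL
+LIMIT 100;
+```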
+
+## Aggregation Functions
+
+| Function | Description |
+|----------|-------------|
+| `COUNT(*)` | Count all rows |
+| `COUNT(column)` | Count non-null values |
+| `COUNT(DISTINCT column)` | Count unique values |
+| `SUM(column)`, `AVG(column)` | Numeric aggregations |
+| `MIN(column)`, `MAX(column)` | Min/max values |
+
+```sql
+-- Multiple aggregations with GROUP BY
+SELECT region, COUNT(*), SUM(amount), AVG(amount)
+FROM sales.transactions
+WHERE sale_date >= '2024-01-01'
+GROUP BY region;
+```
+
+## HAVING Clause
+
+Filter aggregated results (after GROUP BY):
+
+```sql
+SELECT category, SUM(amount)
+FROM sales.transactions
+GROUP BY category
+HAVING SUM(amount) > 10000;
+```
+
+## ORDER BY Clause
+
+Sort results by:
+- **Partition key columns** - Always supported
+- **Aggregation functions** - Supported via shuffle strategy
+
+```sql
+-- Order by partition key
+SELECT * FROM logs.requests ORDER BY timestamp DESC LIMIT 100;
+
+-- Order by aggregation (repeat function, aliases not supported)
+SELECT region, SUM(amount)
+FROM sales.transactions
+GROUP BY region
+ORDER BY SUM(amount) DESC;
+```
+
+**Limitations:** Cannot order by non-partition columns. See [gotchas.md](gotchas.md#order-by-limitations)
+
+## LIMIT Clause
+
+```sql
+SELECT * FROM logs.requests LIMIT 100;
+```
+
+| Setting | Value |
+|---------|-------|
+| Min | 1 |
+| Max | 10,000 |
+| Default | 500 |
+
+**Always use LIMIT** to enable early termination optimization.
+
+## Data Types
+
+| Type | SQL Literal | Example |
+|------|-------------|---------|
+| `integer` | Unquoted number | `42`, `-10` |
+| `float` | Decimal number | `3.14`, `-0.5` |
+| `string` | Single quotes | `'hello'`, `'GET'` |
+| `boolean` | Keyword | `true`, `false` |
+| `timestamp` | RFC3339 string | `'2025-01-01T00:00:00Z'` |
+| `date` | ISO 8601 date | `'2025-01-01'` |
+
+### Type Safety
+
+- Quote strings with single quotes: `'value'`
+- Timestamps must be RFC3339: `'2025-01-01T00:00:00Z'` (include timezone)
+- Dates must be ISO 8601: `'2025-01-01'` (YYYY-MM-DD)
+- No implicit conversions
+
+```sql
+-- ✅ Correct
+WHERE status = 200 AND method = 'GET' AND timestamp > '2025-01-01T00:00:00Z'
+
+-- ❌ Wrong
+WHERE status = '200'              -- string instead of integer
+WHERE timestamp > '2025-01-01'    -- missing time/timezone
+WHERE method = GET                -- unquoted string
+```
+
+## Query Result Format
+
+JSON array of objects:
+
+```json
+[
+  {"user_id": "user_123", "timestamp": "2025-01-15T10:30:00Z", "status": 200},
+  {"user_id": "user_456", "timestamp": "2025-01-15T10:31:00Z", "status": 404}
+]
+```
+
+## See Also
+
+- [patterns.md](patterns.md) - Query examples and use cases
+- [gotchas.md](gotchas.md) - SQL limitations and error handling
+- [configuration.md](configuration.md) - Setup and authentication
@@ -0,0 +1,147 @@
+# R2 SQL Configuration
+
+Setup and configuration for R2 SQL queries.
+
+## Prerequisites
+
+- R2 bucket with Data Catalog enabled
+- API token with R2 permissions
+- Wrangler CLI installed (for CLI queries)
+
+## Enable R2 Data Catalog
+
+R2 SQL queries Apache Iceberg tables in R2 Data Catalog, so the catalog must be enabled on the bucket first.
+
+### Via Wrangler CLI
+
+```bash
+npx wrangler r2 bucket catalog enable <bucket-name>
+```
+
+Output includes:
+- **Warehouse name** - Typically same as bucket name; copy the value from the output, since it may differ
+- **Catalog URI** - REST endpoint for catalog operations
+
+Example output:
+```
+Catalog enabled successfully
+Warehouse: my-bucket
+Catalog URI: https://abc123.r2.cloudflarestorage.com/iceberg/my-bucket
+```
+
+### Via Dashboard
+
+1. Navigate to **R2 Object Storage** → Select your bucket
+2. Click **Settings** tab
+3. Scroll to **R2 Data Catalog** section
+4. Click **Enable**
+5. Note the **Catalog URI** and **Warehouse** name
+
+**Important:** Enabling catalog creates metadata directories in bucket but does not modify existing objects.
+
+## Create API Token
+
+R2 SQL requires API token with R2 permissions.
+
+### Required Permission
+
+**R2 Admin Read & Write** (includes R2 SQL Read permission)
+
+### Via Dashboard
+
+1. Navigate to **R2 Object Storage**
+2. Click **Manage API tokens** (top right)
+3. Click **Create API token**
+4. Select **Admin Read & Write** permission
+5. Click **Create API Token**
+6. **Copy token value** - shown only once
+
+### Permission Scope
+
+| Permission | Grants Access To |
+|------------|------------------|
+| R2 Admin Read & Write | R2 storage operations + R2 SQL queries + Data Catalog operations |
+| R2 SQL Read | SQL queries only (no storage writes) |
+
+**Note:** R2 SQL Read permission not yet available via Dashboard - use Admin Read & Write.
+
+## Configure Environment
+
+### Wrangler CLI
+
+Set environment variable for Wrangler to use:
+
+```bash
+export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
+```
+
+Or create `.env` file in project directory:
+
+```
+WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
+```
+
+Wrangler automatically loads `.env` file when running commands.
+
+### HTTP API
+
+For programmatic access (non-Wrangler), pass token in Authorization header:
+
+```bash
+curl -X POST https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/sql/query \
+  -H "Authorization: Bearer <your-token>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "warehouse": "my-bucket",
+    "query": "SELECT * FROM default.my_table LIMIT 10"
+  }'
+```
+
+**Note:** HTTP API endpoint URL may vary - see [patterns.md](patterns.md#http-api-query) for current endpoint.
+
+## Verify Setup
+
+Test the configuration by running schema discovery commands:
+
+```bash
+# List namespaces
+npx wrangler r2 sql query "my-bucket" "SHOW DATABASES"
+
+# List tables in namespace
+npx wrangler r2 sql query "my-bucket" "SHOW TABLES IN default"
+```
+
+If successful, returns JSON array of results.
+
+## Troubleshooting
+
+### "Token authentication failed"
+
+**Cause:** Invalid or missing token
+
+**Solution:**
+- Verify `WRANGLER_R2_SQL_AUTH_TOKEN` environment variable set
+- Check token has Admin Read & Write permission
+- Create new token if expired
+
+### "Catalog not enabled on bucket"
+
+**Cause:** Data Catalog not enabled
+
+**Solution:**
+- Run `npx wrangler r2 bucket catalog enable <bucket-name>`
+- Or enable via Dashboard (R2 → bucket → Settings → R2 Data Catalog)
+
+### "Permission denied"
+
+**Cause:** Token lacks required permissions
+
+**Solution:**
+- Verify token has **Admin Read & Write** permission
+- Create new token with correct permissions
+
+## See Also
+
+- [r2-data-catalog/configuration.md](../r2-data-catalog/configuration.md) - Detailed token setup and PyIceberg connection
+- [patterns.md](patterns.md) - Query examples using configuration
+- [gotchas.md](gotchas.md) - Common configuration errors
@@ -0,0 +1,212 @@
+# R2 SQL Gotchas
+
+Limitations, troubleshooting, and common pitfalls for R2 SQL.
+
+## Critical Limitations
+
+### No Workers Binding
+
+**Cannot call R2 SQL from Workers/Pages code** - no binding exists.
+
+```typescript
+// ❌ This doesn't exist
+export default {
+  async fetch(request, env) {
+    const result = await env.R2_SQL.query("SELECT * FROM table");  // Not possible
+    return Response.json(result);
+  }
+};
+```
+
+**Solutions:**
+- HTTP API from external systems (not Workers)
+- PyIceberg/Spark via r2-data-catalog REST API
+- For Workers, use D1 or external databases
+
+### ORDER BY Limitations
+
+Can only order by:
+1. **Partition key columns** - Always supported
+2. **Aggregation functions** - Supported via shuffle strategy
+
+**Cannot order by** regular non-partition columns.
+
+```sql
+-- ✅ Valid: ORDER BY partition key
+SELECT * FROM logs.requests ORDER BY timestamp DESC LIMIT 100;
+
+-- ✅ Valid: ORDER BY aggregation
+SELECT region, SUM(amount) FROM sales.transactions
+GROUP BY region ORDER BY SUM(amount) DESC;
+
+-- ❌ Invalid: ORDER BY non-partition column
+SELECT * FROM logs.requests ORDER BY user_id;
+
+-- ❌ Invalid: ORDER BY alias (must repeat function)
+SELECT region, SUM(amount) as total FROM sales.transactions
+GROUP BY region ORDER BY total;  -- Use ORDER BY SUM(amount)
+```
+
+Check partition spec: `DESCRIBE namespace.table_name`
+
+## SQL Feature Limitations
+
+| Feature | Supported | Notes |
+|---------|-----------|-------|
+| SELECT, WHERE, GROUP BY, HAVING | ✅ | Standard support |
+| COUNT, SUM, AVG, MIN, MAX | ✅ | Standard aggregations |
+| ORDER BY partition/aggregation | ✅ | See above |
+| LIMIT | ✅ | Max 10,000 |
+| Column aliases | ❌ | No AS alias |
+| Expressions in SELECT | ❌ | No col1 + col2 |
+| ORDER BY non-partition | ❌ | Fails at runtime |
+| JOINs, subqueries, CTEs | ❌ | Denormalize at write time |
+| Window functions, UNION | ❌ | Use external engines |
+| INSERT/UPDATE/DELETE | ❌ | Use PyIceberg/Pipelines |
+| Nested columns, arrays, JSON | ❌ | Flatten at write time |
+
+**Workarounds:**
+- No JOINs: Denormalize data or use Spark/PyIceberg
+- No subqueries: Split into multiple queries
+- No aliases: Accept generated names, transform in app
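+
+For example, a "breakdown for the top region" lookup that would normally use a subquery can be split in two, with the application carrying the value between queries. A sketch assuming a hypothetical `sales.transactions` table with `region`, `category`, `amount`, and `sale_date` columns; `'emea'` is a placeholder for whatever the first query returns:
+
+```sql
+-- Query 1: find the top region by total amount
+SELECT region, SUM(amount)
+FROM sales.transactions
+WHERE sale_date >= '2024-01-01'
+GROUP BY region
+ORDER BY SUM(amount) DESC
+LIMIT 1;
+
+-- Query 2: the application substitutes the region returned by query 1
+SELECT category, SUM(amount)
+FROM sales.transactions
+WHERE sale_date >= '2024-01-01' AND region = 'emea'
+GROUP BY category
+ORDER BY SUM(amount) DESC
+LIMIT 20;
+```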
+
+## Common Errors
+
+### "Column not found"
+**Cause:** Typo, column doesn't exist, or case mismatch  
+**Solution:** `DESCRIBE namespace.table_name` to check schema
+
+### "Type mismatch"
+```sql
+-- ❌ Wrong types
+WHERE status = '200'              -- string instead of integer
+WHERE timestamp > '2025-01-01'    -- missing time/timezone
+
+-- ✅ Correct types
+WHERE status = 200
+WHERE timestamp > '2025-01-01T00:00:00Z'
+```
+
+### "ORDER BY column not in partition key"
+**Cause:** Ordering by non-partition column  
+**Solution:** Use partition key, aggregation, or remove ORDER BY. Check: `DESCRIBE table`
+
+### "Token authentication failed"
+```bash
+# Check/set token
+echo $WRANGLER_R2_SQL_AUTH_TOKEN
+export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>
+
+# Or append to .env file (>> avoids overwriting existing entries)
+echo "WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>" >> .env
+```
+
+### "Table not found"
+```sql
+-- Verify catalog and tables
+SHOW DATABASES;
+SHOW TABLES IN namespace_name;
+```
+
+Enable catalog: `npx wrangler r2 bucket catalog enable <bucket>`
+
+### "LIMIT exceeds maximum"
+Max LIMIT is 10,000. For pagination, use WHERE filters with partition keys.
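+
+One way to page past the cap is to walk the partition key in fixed windows and issue one query per window. A sketch assuming a hypothetical `logs.requests` table partitioned on `timestamp`; the one-hour window is an arbitrary choice and should be small enough that each window stays under 10,000 rows:
+
+```sql
+-- Page 1: first hour
+SELECT * FROM logs.requests
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND timestamp < '2025-01-15T01:00:00Z'
+LIMIT 10000;
+
+-- Page 2: next hour (half-open ranges avoid returning boundary rows twice)
+SELECT * FROM logs.requests
+WHERE timestamp >= '2025-01-15T01:00:00Z' AND timestamp < '2025-01-15T02:00:00Z'
+LIMIT 10000;
+```
+
+If a window returns exactly 10,000 rows, it may have been truncated; narrow the window and rerun it.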
+
+### "No data returned" (unexpected)
+**Debug steps:**
+1. `SELECT COUNT(*) FROM table` - verify data exists
+2. Remove WHERE filters incrementally
+3. `SELECT * FROM table LIMIT 10` - inspect actual data/types
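+
+The debug steps above, written out for a hypothetical `logs.requests` table whose query filtering on `timestamp`, `status`, and `method` came back empty:
+
+```sql
+-- 1. Confirm the table has any rows at all
+SELECT COUNT(*) FROM logs.requests;
+
+-- 2. Drop filters one at a time; here the method filter is removed first
+SELECT COUNT(*) FROM logs.requests
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404;
+
+-- 3. Inspect raw rows to check literal formats and column types
+SELECT * FROM logs.requests LIMIT 10;
+```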
+
+## Performance Issues
+
+### Slow Queries
+
+**Causes:** Too many partitions, large LIMIT, no filters, small files
+
+```sql
+-- ❌ Slow: No filters
+SELECT * FROM logs.requests LIMIT 10000;
+
+-- ✅ Fast: Filter on partition key
+SELECT * FROM logs.requests 
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND timestamp < '2025-01-16T00:00:00Z'
+LIMIT 1000;
+
+-- ✅ Faster: Multiple filters
+SELECT * FROM logs.requests 
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET'
+LIMIT 1000;
+```
+
+**File optimization:**
+- Target Parquet size: 100-500MB compressed
+- Pipelines roll interval: 300+ sec (prod), 10 sec (dev)
+- Run compaction to merge small files
+
+### Query Timeout
+
+**Solution:** Add restrictive WHERE filters, reduce time range, query smaller intervals
+
+```sql
+-- ❌ May time out: open-ended range covering a year or more
+SELECT status, COUNT(*) FROM logs.requests 
+WHERE timestamp >= '2024-01-01T00:00:00Z' GROUP BY status;
+
+-- ✅ Faster: Month-long aggregation
+SELECT status, COUNT(*) FROM logs.requests 
+WHERE timestamp >= '2025-01-01T00:00:00Z' AND timestamp < '2025-02-01T00:00:00Z'
+GROUP BY status;
+```
+
+## Best Practices
+
+### Partitioning
+- **Time-series:** Partition by day/hour on timestamp
+- **Avoid:** High-cardinality keys (user_id), >10,000 partitions
+
+```python
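+from pyiceberg.partitioning import PartitionSpec, PartitionField
+from pyiceberg.transforms import DayTransform
+
+# source_id must match the field ID of the timestamp column in the table schema (1 is assumed here)
+partition_spec = PartitionSpec(
+    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
+)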
+```
+
+### Query Writing
+- **Always use LIMIT** for early termination
+- **Filter on partition keys first** for pruning
+- **Combine filters with AND** for more pruning
+
+```sql
+-- Good
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET' LIMIT 100
+```
+
+### Type Safety
+- Quote strings: `'GET'` not `GET`
+- RFC3339 timestamps: `'2025-01-01T00:00:00Z'` not `'2025-01-01'`
+- ISO dates: `'2025-01-15'` not `'01/15/2025'`
+
+### Data Organization
+- **Pipelines:** Dev `roll_file_time: 10`, Prod `roll_file_time: 300+`
+- **Compression:** Use `zstd`
+- **Maintenance:** Compaction for small files, expire old snapshots
+
+## Debugging Checklist
+
+1. `npx wrangler r2 bucket catalog get <bucket>` - Verify catalog is enabled
+2. `echo $WRANGLER_R2_SQL_AUTH_TOKEN` - Check token
+3. `SHOW DATABASES` - List namespaces
+4. `SHOW TABLES IN namespace` - List tables
+5. `DESCRIBE namespace.table` - Check schema
+6. `SELECT COUNT(*) FROM namespace.table` - Verify data
+7. `SELECT * FROM namespace.table LIMIT 10` - Test simple query
+8. Add filters incrementally
+
+## See Also
+
+- [api.md](api.md) - SQL syntax
+- [patterns.md](patterns.md) - Query optimization
+- [configuration.md](configuration.md) - Setup
+- [Cloudflare R2 SQL Docs](https://developers.cloudflare.com/r2-sql/)
@@ -0,0 +1,222 @@
+# R2 SQL Patterns
+
+Common patterns, use cases, and integration examples for R2 SQL.
+
+## Wrangler CLI Query
+
+```bash
+# Basic query
+npx wrangler r2 sql query "my-bucket" "SELECT * FROM default.logs LIMIT 10"
+
+# Multi-line query
+npx wrangler r2 sql query "my-bucket" "
+  SELECT status, COUNT(*), AVG(response_time)
+  FROM logs.http_requests
+  WHERE timestamp >= '2025-01-01T00:00:00Z'
+  GROUP BY status
+  ORDER BY COUNT(*) DESC
+  LIMIT 100
+"
+
+# Keep the warehouse name in an environment variable
+export R2_SQL_WAREHOUSE="my-bucket"
+npx wrangler r2 sql query "$R2_SQL_WAREHOUSE" "SELECT * FROM default.logs LIMIT 10"
+```
+
+## HTTP API Query
+
+For programmatic access from external systems (not Workers - see gotchas.md).
+
+```bash
+curl -X POST https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/sql/query \
+  -H "Authorization: Bearer <your-token>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "warehouse": "my-bucket",
+    "query": "SELECT * FROM default.my_table WHERE status = 200 LIMIT 100"
+  }'
+```
+
+Response:
+```json
+{
+  "success": true,
+  "result": [{"user_id": "user_123", "timestamp": "2025-01-15T10:30:00Z", "status": 200}],
+  "errors": []
+}
+```
+
+## Pipelines Integration
+
+Stream data to Iceberg tables via Pipelines, then query with R2 SQL.
+
+```bash
+# Setup pipeline (select Data Catalog Table destination)
+npx wrangler pipelines setup
+
+# Key settings:
+# - Destination: Data Catalog Table
+# - Compression: zstd (recommended)
+# - Roll file time: 300+ sec (production), 10 sec (dev)
+
+# Send data to pipeline
+curl -X POST https://{stream-id}.ingest.cloudflare.com \
+  -H "Content-Type: application/json" \
+  -d '[{"user_id": "user_123", "event_type": "purchase", "timestamp": "2025-01-15T10:30:00Z", "amount": 29.99}]'
+
+# Query ingested data (wait for roll interval)
+npx wrangler r2 sql query "my-bucket" "
+  SELECT event_type, COUNT(*), SUM(amount)
+  FROM default.events
+  WHERE timestamp >= '2025-01-15T00:00:00Z'
+  GROUP BY event_type
+"
+```
+
+See [pipelines/patterns.md](../pipelines/patterns.md) for detailed setup.
+
+## PyIceberg Integration
+
+Create and populate Iceberg tables with PyIceberg, then query with R2 SQL.
+
+```python
+from pyiceberg.catalog.rest import RestCatalog
+import pyarrow as pa
+import pandas as pd
+
+# Setup catalog
+catalog = RestCatalog(
+    name="my_catalog",
+    warehouse="my-bucket",
+    uri="https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket",
+    token="<your-token>",
+)
+catalog.create_namespace_if_not_exists("analytics")
+
+# Create table
+schema = pa.schema([
+    pa.field("user_id", pa.string(), nullable=False),
+    pa.field("event_time", pa.timestamp("us", tz="UTC"), nullable=False),
+    pa.field("page_views", pa.int64(), nullable=False),
+])
+table = catalog.create_table(("analytics", "user_metrics"), schema=schema)
+
+# Append data
+df = pd.DataFrame({
+    "user_id": ["user_1", "user_2"],
+    "event_time": pd.to_datetime(["2025-01-15 10:00:00", "2025-01-15 11:00:00"], utc=True),
+    "page_views": [10, 25],
+})
+table.append(pa.Table.from_pandas(df, schema=schema))
+```
+
+Query with R2 SQL:
+```bash
+npx wrangler r2 sql query "my-bucket" "
+  SELECT user_id, SUM(page_views)
+  FROM analytics.user_metrics
+  WHERE event_time >= '2025-01-15T00:00:00Z'
+  GROUP BY user_id
+"
+```
+
+See [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) for advanced PyIceberg patterns.
+
+## Use Cases
+
+### Log Analytics
+```sql
+-- Error count by endpoint (filter in WHERE; CASE expressions and aliases are not supported)
+SELECT path, COUNT(*)
+FROM logs.http_requests
+WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z' AND status >= 400
+GROUP BY path ORDER BY COUNT(*) DESC LIMIT 20;
+
+-- Response time stats
+SELECT method, MIN(response_time_ms), AVG(response_time_ms), MAX(response_time_ms)
+FROM logs.http_requests WHERE timestamp >= '2025-01-15T00:00:00Z' GROUP BY method;
+
+-- Traffic by status
+SELECT status, COUNT(*) FROM logs.http_requests
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND method = 'GET'
+GROUP BY status ORDER BY COUNT(*) DESC;
+```
+
+### Fraud Detection
+```sql
+-- High-value transactions
+SELECT location, COUNT(*), SUM(amount), AVG(amount)
+FROM fraud.transactions WHERE transaction_timestamp >= '2025-01-01T00:00:00Z' AND amount > 1000.0
+GROUP BY location ORDER BY SUM(amount) DESC LIMIT 20;
+
+-- Flagged transactions
+SELECT merchant_category, COUNT(*), AVG(amount) FROM fraud.transactions
+WHERE is_fraud_flag = true AND transaction_timestamp >= '2025-01-01T00:00:00Z'
+GROUP BY merchant_category HAVING COUNT(*) > 10 ORDER BY COUNT(*) DESC;
+```
+
+### Business Intelligence
+```sql
+-- Sales by department
+SELECT department, SUM(revenue), AVG(revenue), COUNT(*) FROM sales.transactions
+WHERE sale_date >= '2024-01-01' GROUP BY department ORDER BY SUM(revenue) DESC LIMIT 10;
+
+-- Product performance
+SELECT category, COUNT(DISTINCT product_id), SUM(units_sold), SUM(revenue)
+FROM sales.product_sales WHERE sale_date BETWEEN '2024-10-01' AND '2024-12-31'
+GROUP BY category ORDER BY SUM(revenue) DESC;
+```
+
+## Connecting External Engines
+
+R2 Data Catalog exposes Iceberg REST API. Connect Spark, Snowflake, Trino, DuckDB, etc.
+
+```scala
+// Apache Spark example
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder()
+  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
+  .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
+  .config("spark.sql.catalog.my_catalog.uri", "https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket")
+  .config("spark.sql.catalog.my_catalog.warehouse", "my-bucket") // same warehouse value as the PyIceberg example
+  .config("spark.sql.catalog.my_catalog.token", "<token>")
+  .getOrCreate()
+
+spark.sql("SELECT * FROM my_catalog.default.my_table LIMIT 10").show()
+```
+
+See [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) for more engines.
+
+## Performance Optimization
+
+### Partitioning
+- **Time-series:** day/hour on timestamp
+- **Geographic:** region/country
+- **Avoid:** High-cardinality keys (user_id)
+
+```python
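+from pyiceberg.partitioning import PartitionSpec, PartitionField
+from pyiceberg.transforms import DayTransform
+
+# source_id must match the field ID of the timestamp column in the table schema (1 is assumed here)
+partition_spec = PartitionSpec(
+    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
+)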
+```
+
+### Query Optimization
+- **Always use LIMIT** for early termination
+- **Filter on partition keys first**
+- **Multiple filters** for better pruning
+
+```sql
+-- Better: Partition key filter plus additional column filters
+SELECT * FROM logs.requests 
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET' LIMIT 100;
+```
+
+### File Organization
+- **Pipelines roll:** Dev 10-30s, Prod 300+s
+- **Target Parquet:** 100-500MB compressed
+
+## See Also
+
+- [api.md](api.md) - SQL syntax reference
+- [gotchas.md](gotchas.md) - Limitations and troubleshooting
+- [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) - PyIceberg advanced patterns
+- [pipelines/patterns.md](../pipelines/patterns.md) - Streaming ingestion patterns