update skills

2026-05-29 12:55:14 -07:00 · 2026-03-17 16:53:22 -07:00
parent 0b0783ef8e
commit f9a530667e
389 changed files with 54512 additions and 1 deletions
@@ -0,0 +1,222 @@
+# R2 SQL Patterns
+
+Common patterns, use cases, and integration examples for R2 SQL.
+
+## Wrangler CLI Query
+
+```bash
+# Basic query
+npx wrangler r2 sql query "my-bucket" "SELECT * FROM default.logs LIMIT 10"
+
+# Multi-line query
+npx wrangler r2 sql query "my-bucket" "
+  SELECT status, COUNT(*), AVG(response_time)
+  FROM logs.http_requests
+  WHERE timestamp >= '2025-01-01T00:00:00Z'
+  GROUP BY status
+  ORDER BY COUNT(*) DESC
+  LIMIT 100
+"
+
+# Use environment variable
+export R2_SQL_WAREHOUSE="my-bucket"
+npx wrangler r2 sql query "$R2_SQL_WAREHOUSE" "SELECT * FROM default.logs"
+```
+
+## HTTP API Query
+
+For programmatic access from external systems (not Workers - see gotchas.md).
+
+```bash
+curl -X POST https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/sql/query \
+  -H "Authorization: Bearer <your-token>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "warehouse": "my-bucket",
+    "query": "SELECT * FROM default.my_table WHERE status = 200 LIMIT 100"
+  }'
+```
+
+Response:
+```json
+{
+  "success": true,
+  "result": [{"user_id": "user_123", "timestamp": "2025-01-15T10:30:00Z", "status": 200}],
+  "errors": []
+}
+```
+
+## Pipelines Integration
+
+Stream data to Iceberg tables via Pipelines, then query with R2 SQL.
+
+```bash
+# Setup pipeline (select Data Catalog Table destination)
+npx wrangler pipelines setup
+
+# Key settings:
+# - Destination: Data Catalog Table
+# - Compression: zstd (recommended)
+# - Roll file time: 300+ sec (production), 10 sec (dev)
+
+# Send data to pipeline
+curl -X POST https://{stream-id}.ingest.cloudflare.com \
+  -H "Content-Type: application/json" \
+  -d '[{"user_id": "user_123", "event_type": "purchase", "timestamp": "2025-01-15T10:30:00Z", "amount": 29.99}]'
+
+# Query ingested data (wait for roll interval)
+npx wrangler r2 sql query "my-bucket" "
+  SELECT event_type, COUNT(*), SUM(amount)
+  FROM default.events
+  WHERE timestamp >= '2025-01-15T00:00:00Z'
+  GROUP BY event_type
+"
+```
+
+See [pipelines/patterns.md](../pipelines/patterns.md) for detailed setup.
+
+## PyIceberg Integration
+
+Create and populate Iceberg tables with PyIceberg, then query with R2 SQL.
+
+```python
+from pyiceberg.catalog.rest import RestCatalog
+import pyarrow as pa
+import pandas as pd
+
+# Setup catalog
+catalog = RestCatalog(
+    name="my_catalog",
+    warehouse="my-bucket",
+    uri="https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket",
+    token="<your-token>",
+)
+catalog.create_namespace_if_not_exists("analytics")
+
+# Create table
+schema = pa.schema([
+    pa.field("user_id", pa.string(), nullable=False),
+    pa.field("event_time", pa.timestamp("us", tz="UTC"), nullable=False),
+    pa.field("page_views", pa.int64(), nullable=False),
+])
+table = catalog.create_table(("analytics", "user_metrics"), schema=schema)
+
+# Append data
+df = pd.DataFrame({
+    "user_id": ["user_1", "user_2"],
+    "event_time": pd.to_datetime(["2025-01-15 10:00:00", "2025-01-15 11:00:00"], utc=True),
+    "page_views": [10, 25],
+})
+table.append(pa.Table.from_pandas(df, schema=schema))
+```
+
+Query with R2 SQL:
+```bash
+npx wrangler r2 sql query "my-bucket" "
+  SELECT user_id, SUM(page_views)
+  FROM analytics.user_metrics
+  WHERE event_time >= '2025-01-15T00:00:00Z'
+  GROUP BY user_id
+"
+```
+
+See [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) for advanced PyIceberg patterns.
+
+## Use Cases
+
+### Log Analytics
+```sql
+-- Error rate by endpoint
+SELECT path, COUNT(*), SUM(CASE WHEN status >= 400 THEN 1 ELSE 0 END) as errors
+FROM logs.http_requests
+WHERE timestamp BETWEEN '2025-01-01T00:00:00Z' AND '2025-01-31T23:59:59Z'
+GROUP BY path ORDER BY errors DESC LIMIT 20;
+
+-- Response time stats
+SELECT method, MIN(response_time_ms), AVG(response_time_ms), MAX(response_time_ms)
+FROM logs.http_requests WHERE timestamp >= '2025-01-15T00:00:00Z' GROUP BY method;
+
+-- Traffic by status
+SELECT status, COUNT(*) FROM logs.http_requests
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND method = 'GET'
+GROUP BY status ORDER BY COUNT(*) DESC;
+```
+
+### Fraud Detection
+```sql
+-- High-value transactions
+SELECT location, COUNT(*), SUM(amount), AVG(amount)
+FROM fraud.transactions WHERE transaction_timestamp >= '2025-01-01T00:00:00Z' AND amount > 1000.0
+GROUP BY location ORDER BY SUM(amount) DESC LIMIT 20;
+
+-- Flagged transactions
+SELECT merchant_category, COUNT(*), AVG(amount) FROM fraud.transactions
+WHERE is_fraud_flag = true AND transaction_timestamp >= '2025-01-01T00:00:00Z'
+GROUP BY merchant_category HAVING COUNT(*) > 10 ORDER BY COUNT(*) DESC;
+```
+
+### Business Intelligence
+```sql
+-- Sales by department
+SELECT department, SUM(revenue), AVG(revenue), COUNT(*) FROM sales.transactions
+WHERE sale_date >= '2024-01-01' GROUP BY department ORDER BY SUM(revenue) DESC LIMIT 10;
+
+-- Product performance
+SELECT category, COUNT(DISTINCT product_id), SUM(units_sold), SUM(revenue)
+FROM sales.product_sales WHERE sale_date BETWEEN '2024-10-01' AND '2024-12-31'
+GROUP BY category ORDER BY SUM(revenue) DESC;
+```
+
+## Connecting External Engines
+
+R2 Data Catalog exposes Iceberg REST API. Connect Spark, Snowflake, Trino, DuckDB, etc.
+
+```scala
+// Apache Spark example
+val spark = SparkSession.builder()
+  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
+  .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
+  .config("spark.sql.catalog.my_catalog.uri", "https://<account-id>.r2.cloudflarestorage.com/iceberg/my-bucket")
+  .config("spark.sql.catalog.my_catalog.token", "<token>")
+  .getOrCreate()
+
+spark.sql("SELECT * FROM my_catalog.default.my_table LIMIT 10").show()
+```
+
+See [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) for more engines.
+
+## Performance Optimization
+
+### Partitioning
+- **Time-series:** day/hour on timestamp
+- **Geographic:** region/country
+- **Avoid:** High-cardinality keys (user_id)
+
+```python
+from pyiceberg.partitioning import PartitionSpec, PartitionField
+from pyiceberg.transforms import DayTransform
+
+PartitionSpec(PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day"))
+```
+
+### Query Optimization
+- **Always use LIMIT** for early termination
+- **Filter on partition keys first**
+- **Multiple filters** for better pruning
+
+```sql
+-- Better: Multiple filters on partition key
+SELECT * FROM logs.requests 
+WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET' LIMIT 100;
+```
+
+### File Organization
+- **Pipelines roll:** Dev 10-30s, Prod 300+s
+- **Target Parquet:** 100-500MB compressed
+
+## See Also
+
+- [api.md](api.md) - SQL syntax reference
+- [gotchas.md](gotchas.md) - Limitations and troubleshooting
+- [r2-data-catalog/patterns.md](../r2-data-catalog/patterns.md) - PyIceberg advanced patterns
+- [pipelines/patterns.md](../pipelines/patterns.md) - Streaming ingestion patterns