update skills

2026-03-17 16:53:22 -07:00
parent 0b0783ef8e
commit f9a530667e
389 changed files with 54512 additions and 1 deletions


@@ -0,0 +1,105 @@
# Cloudflare Pipelines
Streaming ETL platform that ingests events, transforms them with SQL, and loads the results into R2.
## Overview
Pipelines provides:
- **Streams**: Durable event buffers (HTTP/Workers ingestion)
- **Pipelines**: SQL-based transformations
- **Sinks**: R2 destinations (Iceberg tables or Parquet/JSON files)
**Status**: Open beta (Workers Paid plan)
**Pricing**: No charge beyond standard R2 storage/operations
## Architecture
```
Data Sources → Streams → Pipelines (SQL) → Sinks → R2
                  ↑            ↓              ↓
             HTTP/Workers  Transform   Iceberg/Parquet
```
| Component | Purpose | Key Feature |
|-----------|---------|-------------|
| Streams | Event ingestion | Structured (validated) or unstructured |
| Pipelines | Transform with SQL | Immutable after creation |
| Sinks | Write to R2 | Exactly-once delivery |
## Quick Start
```bash
# Interactive setup (recommended)
npx wrangler pipelines setup
```
**Minimal Worker example:**
```typescript
interface Env {
  STREAM: Pipeline;
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const event = { user_id: "123", event_type: "purchase", amount: 29.99 };
    // Fire-and-forget pattern
    ctx.waitUntil(env.STREAM.send([event]));
    return new Response('OK');
  }
} satisfies ExportedHandler<Env>;
```
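The `STREAM` binding above needs a matching entry in `wrangler.jsonc`; a minimal sketch, using the `<STREAM_ID>` placeholder covered in [configuration.md](./configuration.md):
```jsonc
// wrangler.jsonc (get <STREAM_ID> from `npx wrangler pipelines streams list`)
{
  "pipelines": [
    { "pipeline": "<STREAM_ID>", "binding": "STREAM" }
  ]
}
```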
## Which Sink Type?
```
Need SQL queries on data?
  → R2 Data Catalog (Iceberg)
    ✅ ACID transactions, time-travel, schema evolution
    ❌ More setup complexity (namespace, table, catalog token)

Just file storage/archival?
  → R2 Storage (Parquet)
    ✅ Simple, direct file access
    ❌ No built-in SQL queries

Using external tools (Spark/Athena)?
  → R2 Storage (Parquet with partitioning)
    ✅ Standard format, partition pruning for performance
    ❌ Must manage schema compatibility yourself
```
## Common Use Cases
- **Analytics pipelines**: Clickstream, telemetry, server logs
- **Data warehousing**: ETL into queryable Iceberg tables
- **Event processing**: Mobile/IoT with enrichment
- **Ecommerce analytics**: User events, purchases, views
## Reading Order
**New to Pipelines?** Start here:
1. [configuration.md](./configuration.md) - Setup streams, sinks, pipelines
2. [api.md](./api.md) - Send events, TypeScript types, SQL functions
3. [patterns.md](./patterns.md) - Best practices, integrations, complete example
4. [gotchas.md](./gotchas.md) - Critical warnings, troubleshooting
**Task-based routing:**
- Setup pipeline → [configuration.md](./configuration.md)
- Send/query data → [api.md](./api.md)
- Implement pattern → [patterns.md](./patterns.md)
- Debug issue → [gotchas.md](./gotchas.md)
## In This Reference
- [configuration.md](./configuration.md) - wrangler.jsonc bindings, schema definition, sink options, CLI commands
- [api.md](./api.md) - Pipeline binding interface, send() method, HTTP ingest, SQL function reference
- [patterns.md](./patterns.md) - Fire-and-forget, schema validation with Zod, integrations, performance tuning
- [gotchas.md](./gotchas.md) - Silent validation failures, immutable pipelines, latency expectations, limits
## See Also
- [r2](../r2/) - R2 storage backend for sinks
- [queues](../queues/) - Compare with Queues for async processing
- [workers](../workers/) - Worker runtime for event ingestion


@@ -0,0 +1,208 @@
# Pipelines API Reference
## Pipeline Binding Interface
```typescript
// From @cloudflare/workers-types
interface Pipeline {
  send(data: object | object[]): Promise<void>;
}

interface Env {
  STREAM: Pipeline;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const event = { user_id: "123", event_type: "view" };
    // send() returns Promise<void> - no result data
    await env.STREAM.send([event]);
    return new Response('OK');
  }
} satisfies ExportedHandler<Env>;
```
**Key points:**
- `send()` accepts single object or array
- Always returns `Promise<void>` (no confirmation data)
- Throws on network/validation errors (wrap in try/catch)
- Use `ctx.waitUntil()` for fire-and-forget pattern
## Writing Events
### Single Event
```typescript
await env.STREAM.send([{
  user_id: "12345",
  event_type: "purchase",
  product_id: "widget-001",
  amount: 29.99
}]);
```
### Batch Events
```typescript
const events = [
  { user_id: "user1", event_type: "view" },
  { user_id: "user2", event_type: "purchase", amount: 50 }
];

await env.STREAM.send(events);
```
**Limits:**
- Max 1 MB per request
- 5 MB/s per stream
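To stay under these limits when forwarding large batches, events can be split into smaller chunks before sending. A minimal sketch, using the recommended batch size of 100 events; the `chunk` helper is illustrative:
```typescript
// Illustrative helper: split a large array of events into fixed-size chunks
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Send 100 events at a time to stay well under the 1 MB request limit
for (const batch of chunk(events, 100)) {
  await env.STREAM.send(batch);
}
```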
### Fire-and-Forget Pattern
```typescript
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const event = { /* ... */ };
    // Don't block response on send
    ctx.waitUntil(env.STREAM.send([event]));
    return new Response('OK');
  }
};
```
### Error Handling
```typescript
try {
  await env.STREAM.send([event]);
} catch (error) {
  console.error('Pipeline send failed:', error);
  // Log to another system, retry, or return error response
  return new Response('Failed to track event', { status: 500 });
}
```
## HTTP Ingest API
### Endpoint Format
```
https://{stream-id}.ingest.cloudflare.com
```
Get `{stream-id}` from: `npx wrangler pipelines streams list`
### Request Format
**CRITICAL:** The request body must be a JSON array, even for a single event
```bash
# ✅ Correct
curl -X POST https://{stream-id}.ingest.cloudflare.com \
-H "Content-Type: application/json" \
-d '[{"user_id": "123", "event_type": "purchase"}]'
# ❌ Wrong - will fail
curl -X POST https://{stream-id}.ingest.cloudflare.com \
-H "Content-Type: application/json" \
-d '{"user_id": "123", "event_type": "purchase"}'
```
### Authentication
```bash
curl -X POST https://{stream-id}.ingest.cloudflare.com \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '[{"event": "data"}]'
```
**Required permission:** Workers Pipeline Send
Create token: Dashboard → Workers → API tokens → Create with Pipeline Send permission
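The same ingest call can be made from any HTTP client; a TypeScript sketch using `fetch`, with the endpoint, token, and event fields as placeholders:
```typescript
// Placeholders: substitute your stream's ingest endpoint and an API token
// that has the Workers Pipeline Send permission.
const INGEST_URL = "https://{stream-id}.ingest.cloudflare.com";
const API_TOKEN = "YOUR_API_TOKEN";

const response = await fetch(INGEST_URL, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${API_TOKEN}`,
  },
  // The body must be a JSON array, even for a single event
  body: JSON.stringify([{ user_id: "123", event_type: "purchase" }]),
});

if (!response.ok) {
  console.error("Ingest failed:", response.status, await response.text());
}
```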
### Response Codes
| Code | Meaning | Action |
|------|---------|--------|
| 200 | Accepted | Success |
| 400 | Invalid format | Check JSON array, schema match |
| 401 | Auth failed | Verify token valid |
| 413 | Payload too large | Split into smaller batches (<1 MB) |
| 429 | Rate limited | Back off, retry with delay |
| 5xx | Server error | Retry with exponential backoff |
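For 429 and 5xx responses, a simple exponential backoff usually suffices. A sketch reusing the placeholder `INGEST_URL` and `API_TOKEN` from above; the delays and attempt count are arbitrary choices, not platform requirements:
```typescript
// Retry ingest requests on 429/5xx with exponential backoff
async function sendWithRetry(events: object[], maxAttempts = 4): Promise<Response> {
  let delayMs = 500;
  for (let attempt = 1; ; attempt++) {
    const res = await fetch(INGEST_URL, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${API_TOKEN}`,
      },
      body: JSON.stringify(events),
    });
    // Only retry rate limits and server errors
    if ((res.status !== 429 && res.status < 500) || attempt === maxAttempts) {
      return res;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    delayMs *= 2; // double the delay on each attempt
  }
}
```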
## SQL Functions Quick Reference
Available in `INSERT INTO sink SELECT ... FROM stream` transformations:
| Function | Example | Use Case |
|----------|---------|----------|
| `UPPER(s)` | `UPPER(event_type)` | Normalize strings |
| `LOWER(s)` | `LOWER(email)` | Case-insensitive matching |
| `CONCAT(...)` | `CONCAT(user_id, '_', product_id)` | Generate composite keys |
| `CASE WHEN ... THEN ... END` | `CASE WHEN amount > 100 THEN 'high' ELSE 'low' END` | Conditional enrichment |
| `CAST(x AS type)` | `CAST(timestamp AS string)` | Type conversion |
| `COALESCE(x, y)` | `COALESCE(amount, 0.0)` | Default values |
| Math operators | `amount * 1.1`, `price / quantity` | Calculations |
| Comparison | `amount > 100`, `status IN ('active', 'pending')` | Filtering |
**Type names for `CAST`:** `string`, `int32`, `int64`, `float32`, `float64`, `bool`, `timestamp`
Full reference: [Pipelines SQL Reference](https://developers.cloudflare.com/pipelines/sql-reference/)
## SQL Transform Examples
### Filter Events
```sql
INSERT INTO my_sink
SELECT * FROM my_stream
WHERE event_type = 'purchase' AND amount > 100
```
### Select Specific Fields
```sql
INSERT INTO my_sink
SELECT user_id, event_type, timestamp, amount
FROM my_stream
```
### Transform and Enrich
```sql
INSERT INTO my_sink
SELECT
  user_id,
  UPPER(event_type) as event_type,
  timestamp,
  amount * 1.1 as amount_with_tax,
  CONCAT(user_id, '_', product_id) as unique_key,
  CASE
    WHEN amount > 1000 THEN 'high_value'
    WHEN amount > 100 THEN 'medium_value'
    ELSE 'low_value'
  END as customer_tier
FROM my_stream
WHERE event_type IN ('purchase', 'refund')
```
## Querying Results (R2 Data Catalog)
```bash
export WRANGLER_R2_SQL_AUTH_TOKEN=YOUR_CATALOG_TOKEN
npx wrangler r2 sql query "warehouse_name" "
  SELECT
    event_type,
    COUNT(*) as event_count,
    SUM(amount) as total_revenue
  FROM default.my_table
  WHERE event_type = 'purchase'
    AND timestamp >= '2025-01-01'
  GROUP BY event_type
  ORDER BY total_revenue DESC
  LIMIT 100"
```
**Note:** Iceberg tables support standard SQL queries with GROUP BY, JOINs, WHERE, ORDER BY, etc.


@@ -0,0 +1,98 @@
# Pipelines Configuration
## Worker Binding
```jsonc
// wrangler.jsonc
{
  "pipelines": [
    { "pipeline": "<STREAM_ID>", "binding": "STREAM" }
  ]
}
```
Get stream ID: `npx wrangler pipelines streams list`
## Schema (Structured Streams)
```json
{
  "fields": [
    { "name": "user_id", "type": "string", "required": true },
    { "name": "event_type", "type": "string", "required": true },
    { "name": "amount", "type": "float64", "required": false },
    { "name": "timestamp", "type": "timestamp", "required": true }
  ]
}
```
**Types:** `string`, `int32`, `int64`, `float32`, `float64`, `bool`, `timestamp`, `json`, `binary`, `list`, `struct`
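To keep Worker-side events aligned with the stream schema, the schema can be mirrored as a TypeScript type. A sketch for the example schema above; the type name is illustrative, and it assumes the `timestamp` field accepts an ISO 8601 string:
```typescript
// Mirrors the example schema.json: required fields are non-optional,
// and `amount` (required: false) becomes optional.
interface AnalyticsEvent {
  user_id: string;
  event_type: string;
  amount?: number;   // float64
  timestamp: string; // assumed ISO 8601 string for the `timestamp` field
}

const event: AnalyticsEvent = {
  user_id: "12345",
  event_type: "purchase",
  amount: 29.99,
  timestamp: new Date().toISOString(),
};
```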
## Stream Setup
```bash
# With schema
npx wrangler pipelines streams create my-stream --schema-file schema.json
# Unstructured (no validation)
npx wrangler pipelines streams create my-stream
# List/get/delete
npx wrangler pipelines streams list
npx wrangler pipelines streams get <ID>
npx wrangler pipelines streams delete <ID>
```
## Sink Configuration
**R2 Data Catalog (Iceberg):**
```bash
npx wrangler pipelines sinks create my-sink \
--type r2-data-catalog \
--bucket my-bucket --namespace default --table events \
--catalog-token $TOKEN \
--compression zstd --roll-interval 60
```
**R2 Raw (Parquet):**
```bash
npx wrangler pipelines sinks create my-sink \
--type r2 --bucket my-bucket --format parquet \
--path analytics/events \
--partitioning "year=%Y/month=%m/day=%d" \
--access-key-id $KEY --secret-access-key $SECRET
```
| Option | Values | Guidance |
|--------|--------|----------|
| `--compression` | `zstd`, `snappy`, `gzip` | `zstd` best ratio, `snappy` fastest |
| `--roll-interval` | Seconds | Low latency: 10-60, Query perf: 300 |
| `--roll-size` | MB | Larger = better compression |
## Pipeline Creation
```bash
npx wrangler pipelines create my-pipeline \
--sql "INSERT INTO my_sink SELECT * FROM my_stream WHERE event_type = 'purchase'"
```
**⚠️ Pipelines are immutable** - cannot modify SQL. Must delete/recreate.
## Credentials
| Type | Permission | Get From |
|------|------------|----------|
| Catalog token | R2 Admin Read & Write | Dashboard → R2 → API tokens |
| R2 credentials | Object Read & Write | `wrangler r2 bucket create` output |
| HTTP ingest token | Workers Pipeline Send | Dashboard → Workers → API tokens |
## Complete Example
```bash
npx wrangler r2 bucket create my-bucket
npx wrangler r2 bucket catalog enable my-bucket
npx wrangler pipelines streams create my-stream --schema-file schema.json
npx wrangler pipelines sinks create my-sink --type r2-data-catalog --bucket my-bucket ...
npx wrangler pipelines create my-pipeline --sql "INSERT INTO my_sink SELECT * FROM my_stream"
npx wrangler deploy
```


@@ -0,0 +1,80 @@
# Pipelines Gotchas
## Critical Issues
### Events Silently Dropped
**Most common issue.** Events accepted (HTTP 200) but never appear in sink.
**Causes:**
1. Schema validation fails - structured streams drop invalid events silently
2. Waiting for roll interval (10-300s) - expected behavior
**Solution:** Validate client-side with Zod:
```typescript
import { z } from 'zod';

const EventSchema = z.object({ user_id: z.string(), amount: z.number() });

try {
  const validated = EventSchema.parse(rawEvent);
  await env.STREAM.send([validated]);
} catch (e) { /* get immediate feedback */ }
```
### Pipelines Are Immutable
Cannot modify SQL after creation. Must delete and recreate.
```bash
npx wrangler pipelines delete old-pipeline
npx wrangler pipelines create new-pipeline --sql "..."
```
**Tip:** Use version naming (`events-pipeline-v1`) and keep SQL in version control.
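With the SQL tracked in a file, the delete/recreate step can read it directly; a sketch, where `pipeline_v2.sql` is an illustrative file name:
```bash
# Recreate the pipeline from SQL kept in version control
npx wrangler pipelines delete events-pipeline-v1
npx wrangler pipelines create events-pipeline-v2 --sql "$(cat pipeline_v2.sql)"
```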
### Worker Binding Not Found
**`env.STREAM` is `undefined`**
1. Use **stream ID** (not pipeline ID) in `wrangler.jsonc`
2. Redeploy after adding binding
```bash
npx wrangler pipelines streams list # Get stream ID
npx wrangler deploy
```
## Common Errors
| Error | Cause | Fix |
|-------|-------|-----|
| Events not in R2 | Roll interval not elapsed | Wait 10-300s, check `roll_interval` |
| Schema validation failures | Type mismatch, missing fields | Validate client-side |
| Rate limit (429) | >5 MB/s per stream | Batch events, request increase |
| Payload too large (413) | >1 MB request | Split into smaller batches |
| Cannot delete stream | Pipeline references it | Delete pipelines first |
| Sink credential errors | Token expired | Recreate sink with new credentials |
## Limits (Open Beta)
| Resource | Limit |
|----------|-------|
| Streams/Sinks/Pipelines per account | 20 each |
| Payload size | 1 MB |
| Ingest rate per stream | 5 MB/s |
| Event retention | 24 hours |
| Recommended batch size | 100 events |
## SQL Limitations
- **No JOINs** - single stream per pipeline
- **No window functions** - basic SQL only
- **No subqueries** - must use `INSERT INTO ... SELECT ... FROM`
- **No schema evolution** - cannot modify after creation
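Because the pipeline SQL cannot JOIN against other data, enrichment generally has to happen in the Worker before `send()`. A sketch, where the `USERS_KV` binding and its contents are hypothetical:
```typescript
interface Env {
  STREAM: Pipeline;
  USERS_KV: KVNamespace; // hypothetical KV binding mapping user_id -> plan
}

// Enrich the event in the Worker, since the pipeline SQL cannot JOIN
async function trackPurchase(env: Env, userId: string, amount: number): Promise<void> {
  const plan = (await env.USERS_KV.get(userId)) ?? "free";
  await env.STREAM.send([
    { user_id: userId, event_type: "purchase", amount, plan },
  ]);
}
```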
## Debug Checklist
- [ ] Stream exists: `npx wrangler pipelines streams list`
- [ ] Pipeline healthy: `npx wrangler pipelines get <ID>`
- [ ] SQL syntax matches schema
- [ ] Worker redeployed after binding added
- [ ] Waited for roll interval
- [ ] Accepted vs processed count matches (no validation drops)


@@ -0,0 +1,87 @@
# Pipelines Patterns
## Fire-and-Forget
```typescript
export default {
  async fetch(request, env, ctx) {
    const event = { user_id: '...', event_type: 'page_view', timestamp: new Date().toISOString() };
    ctx.waitUntil(env.STREAM.send([event])); // Don't block response
    return new Response('OK');
  }
};
```
## Schema Validation with Zod
```typescript
import { z } from 'zod';

const EventSchema = z.object({
  user_id: z.string(),
  event_type: z.enum(['purchase', 'view']),
  amount: z.number().positive().optional()
});

const validated = EventSchema.parse(rawEvent); // Throws on invalid
await env.STREAM.send([validated]);
```
**Why:** Structured streams drop invalid events silently. Client validation gives immediate feedback.
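If a malformed event should be logged and dropped rather than failing the request, `safeParse` is the non-throwing alternative; a sketch:
```typescript
// Non-throwing variant: log and drop invalid events instead of throwing
const result = EventSchema.safeParse(rawEvent);
if (result.success) {
  ctx.waitUntil(env.STREAM.send([result.data]));
} else {
  console.warn('Dropping invalid event:', result.error.issues);
}
```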
## SQL Transform Patterns
```sql
-- Filter early (reduce storage)
INSERT INTO my_sink
SELECT user_id, event_type, amount
FROM my_stream
WHERE event_type = 'purchase' AND amount > 10

-- Select only needed fields
INSERT INTO my_sink
SELECT user_id, event_type, timestamp FROM my_stream

-- Enrich with CASE
INSERT INTO my_sink
SELECT user_id, amount,
  CASE WHEN amount > 1000 THEN 'vip' ELSE 'standard' END as tier
FROM my_stream
```
## Pipelines + Queues Fan-out
```typescript
await Promise.all([
  env.ANALYTICS_STREAM.send([event]), // Long-term storage
  env.PROCESS_QUEUE.send(event)       // Immediate processing
]);
```
| Need | Use |
|------|-----|
| Long-term storage, SQL queries | Pipelines |
| Immediate processing, retries | Queues |
| Both | Fan-out pattern |
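The fan-out example above assumes both bindings are declared on `Env`; a sketch with illustrative binding names:
```typescript
interface Env {
  ANALYTICS_STREAM: Pipeline; // Pipelines stream binding
  PROCESS_QUEUE: Queue;       // Queues producer binding
}
```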
## Performance Tuning
| Goal | Config |
|------|--------|
| Low latency | `--roll-interval 10` |
| Query performance | `--roll-interval 300 --roll-size 100` |
| Cost optimal | `--compression zstd --roll-interval 300` |
## Schema Evolution
Pipelines are immutable. Use versioning:
```bash
# 1. Create v2 stream/sink/pipeline
npx wrangler pipelines streams create events-v2 --schema-file v2.json

# 2. Dual-write from the Worker during the transition:
#    await Promise.all([env.EVENTS_V1.send([event]), env.EVENTS_V2.send([event])]);

# 3. Query across versions with UNION ALL
```
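Since each version writes to its own table, querying across versions means a `UNION ALL` over both; a sketch in the `wrangler r2 sql` style shown in api.md, where the table names `events_v1` and `events_v2` are illustrative:
```bash
npx wrangler r2 sql query "warehouse_name" "
  SELECT user_id, event_type, amount FROM default.events_v1
  UNION ALL
  SELECT user_id, event_type, amount FROM default.events_v2"
```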