mirror of
https://github.com/ksyasuda/dotfiles.git
synced 2026-03-20 06:11:27 -07:00
update skills
This commit is contained in:
105
.agents/skills/cloudflare-deploy/references/pipelines/README.md
Normal file
105
.agents/skills/cloudflare-deploy/references/pipelines/README.md
Normal file
@@ -0,0 +1,105 @@
|
||||
# Cloudflare Pipelines
|
||||
|
||||
ETL streaming platform for ingesting, transforming, and loading data into R2 with SQL transformations.
|
||||
|
||||
## Overview
|
||||
|
||||
Pipelines provides:
|
||||
- **Streams**: Durable event buffers (HTTP/Workers ingestion)
|
||||
- **Pipelines**: SQL-based transformations
|
||||
- **Sinks**: R2 destinations (Iceberg tables or Parquet/JSON files)
|
||||
|
||||
**Status**: Open beta (Workers Paid plan)
|
||||
**Pricing**: No charge beyond standard R2 storage/operations
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
Data Sources → Streams → Pipelines (SQL) → Sinks → R2
|
||||
↑ ↓ ↓
|
||||
HTTP/Workers Transform Iceberg/Parquet
|
||||
```
|
||||
|
||||
| Component | Purpose | Key Feature |
|
||||
|-----------|---------|-------------|
|
||||
| Streams | Event ingestion | Structured (validated) or unstructured |
|
||||
| Pipelines | Transform with SQL | Immutable after creation |
|
||||
| Sinks | Write to R2 | Exactly-once delivery |
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Interactive setup (recommended)
|
||||
npx wrangler pipelines setup
|
||||
```
|
||||
|
||||
**Minimal Worker example:**
|
||||
```typescript
|
||||
interface Env {
|
||||
STREAM: Pipeline;
|
||||
}
|
||||
|
||||
export default {
|
||||
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
|
||||
const event = { user_id: "123", event_type: "purchase", amount: 29.99 };
|
||||
|
||||
// Fire-and-forget pattern
|
||||
ctx.waitUntil(env.STREAM.send([event]));
|
||||
|
||||
return new Response('OK');
|
||||
}
|
||||
} satisfies ExportedHandler<Env>;
|
||||
```
|
||||
|
||||
## Which Sink Type?
|
||||
|
||||
```
|
||||
Need SQL queries on data?
|
||||
→ R2 Data Catalog (Iceberg)
|
||||
✅ ACID transactions, time-travel, schema evolution
|
||||
❌ More setup complexity (namespace, table, catalog token)
|
||||
|
||||
Just file storage/archival?
|
||||
→ R2 Storage (Parquet)
|
||||
✅ Simple, direct file access
|
||||
❌ No built-in SQL queries
|
||||
|
||||
Using external tools (Spark/Athena)?
|
||||
→ R2 Storage (Parquet with partitioning)
|
||||
✅ Standard format, partition pruning for performance
|
||||
❌ Must manage schema compatibility yourself
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
- **Analytics pipelines**: Clickstream, telemetry, server logs
|
||||
- **Data warehousing**: ETL into queryable Iceberg tables
|
||||
- **Event processing**: Mobile/IoT with enrichment
|
||||
- **Ecommerce analytics**: User events, purchases, views
|
||||
|
||||
## Reading Order
|
||||
|
||||
**New to Pipelines?** Start here:
|
||||
1. [configuration.md](./configuration.md) - Setup streams, sinks, pipelines
|
||||
2. [api.md](./api.md) - Send events, TypeScript types, SQL functions
|
||||
3. [patterns.md](./patterns.md) - Best practices, integrations, complete example
|
||||
4. [gotchas.md](./gotchas.md) - Critical warnings, troubleshooting
|
||||
|
||||
**Task-based routing:**
|
||||
- Setup pipeline → [configuration.md](./configuration.md)
|
||||
- Send/query data → [api.md](./api.md)
|
||||
- Implement pattern → [patterns.md](./patterns.md)
|
||||
- Debug issue → [gotchas.md](./gotchas.md)
|
||||
|
||||
## In This Reference
|
||||
|
||||
- [configuration.md](./configuration.md) - wrangler.jsonc bindings, schema definition, sink options, CLI commands
|
||||
- [api.md](./api.md) - Pipeline binding interface, send() method, HTTP ingest, SQL function reference
|
||||
- [patterns.md](./patterns.md) - Fire-and-forget, schema validation with Zod, integrations, performance tuning
|
||||
- [gotchas.md](./gotchas.md) - Silent validation failures, immutable pipelines, latency expectations, limits
|
||||
|
||||
## See Also
|
||||
|
||||
- [r2](../r2/) - R2 storage backend for sinks
|
||||
- [queues](../queues/) - Compare with Queues for async processing
|
||||
- [workers](../workers/) - Worker runtime for event ingestion
|
||||
208
.agents/skills/cloudflare-deploy/references/pipelines/api.md
Normal file
208
.agents/skills/cloudflare-deploy/references/pipelines/api.md
Normal file
@@ -0,0 +1,208 @@
|
||||
# Pipelines API Reference
|
||||
|
||||
## Pipeline Binding Interface
|
||||
|
||||
```typescript
|
||||
// From @cloudflare/workers-types
|
||||
interface Pipeline {
|
||||
send(data: object | object[]): Promise<void>;
|
||||
}
|
||||
|
||||
interface Env {
|
||||
STREAM: Pipeline;
|
||||
}
|
||||
|
||||
export default {
|
||||
async fetch(request: Request, env: Env): Promise<Response> {
|
||||
// send() returns Promise<void> - no result data
|
||||
await env.STREAM.send([event]);
|
||||
return new Response('OK');
|
||||
}
|
||||
} satisfies ExportedHandler<Env>;
|
||||
```
|
||||
|
||||
**Key points:**
|
||||
- `send()` accepts single object or array
|
||||
- Always returns `Promise<void>` (no confirmation data)
|
||||
- Throws on network/validation errors (wrap in try/catch)
|
||||
- Use `ctx.waitUntil()` for fire-and-forget pattern
|
||||
|
||||
## Writing Events
|
||||
|
||||
### Single Event
|
||||
|
||||
```typescript
|
||||
await env.STREAM.send([{
|
||||
user_id: "12345",
|
||||
event_type: "purchase",
|
||||
product_id: "widget-001",
|
||||
amount: 29.99
|
||||
}]);
|
||||
```
|
||||
|
||||
### Batch Events
|
||||
|
||||
```typescript
|
||||
const events = [
|
||||
{ user_id: "user1", event_type: "view" },
|
||||
{ user_id: "user2", event_type: "purchase", amount: 50 }
|
||||
];
|
||||
await env.STREAM.send(events);
|
||||
```
|
||||
|
||||
**Limits:**
|
||||
- Max 1 MB per request
|
||||
- 5 MB/s per stream
|
||||
|
||||
### Fire-and-Forget Pattern
|
||||
|
||||
```typescript
|
||||
export default {
|
||||
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
|
||||
const event = { /* ... */ };
|
||||
|
||||
// Don't block response on send
|
||||
ctx.waitUntil(env.STREAM.send([event]));
|
||||
|
||||
return new Response('OK');
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
|
||||
```typescript
|
||||
try {
|
||||
await env.STREAM.send([event]);
|
||||
} catch (error) {
|
||||
console.error('Pipeline send failed:', error);
|
||||
// Log to another system, retry, or return error response
|
||||
return new Response('Failed to track event', { status: 500 });
|
||||
}
|
||||
```
|
||||
|
||||
## HTTP Ingest API
|
||||
|
||||
### Endpoint Format
|
||||
|
||||
```
|
||||
https://{stream-id}.ingest.cloudflare.com
|
||||
```
|
||||
|
||||
Get `{stream-id}` from: `npx wrangler pipelines streams list`
|
||||
|
||||
### Request Format
|
||||
|
||||
**CRITICAL:** Must send array, not single object
|
||||
|
||||
```bash
|
||||
# ✅ Correct
|
||||
curl -X POST https://{stream-id}.ingest.cloudflare.com \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '[{"user_id": "123", "event_type": "purchase"}]'
|
||||
|
||||
# ❌ Wrong - will fail
|
||||
curl -X POST https://{stream-id}.ingest.cloudflare.com \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"user_id": "123", "event_type": "purchase"}'
|
||||
```
|
||||
|
||||
### Authentication
|
||||
|
||||
```bash
|
||||
curl -X POST https://{stream-id}.ingest.cloudflare.com \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer YOUR_API_TOKEN" \
|
||||
-d '[{"event": "data"}]'
|
||||
```
|
||||
|
||||
**Required permission:** Workers Pipeline Send
|
||||
|
||||
Create token: Dashboard → Workers → API tokens → Create with Pipeline Send permission
|
||||
|
||||
### Response Codes
|
||||
|
||||
| Code | Meaning | Action |
|
||||
|------|---------|--------|
|
||||
| 200 | Accepted | Success |
|
||||
| 400 | Invalid format | Check JSON array, schema match |
|
||||
| 401 | Auth failed | Verify token valid |
|
||||
| 413 | Payload too large | Split into smaller batches (<1 MB) |
|
||||
| 429 | Rate limited | Back off, retry with delay |
|
||||
| 5xx | Server error | Retry with exponential backoff |
|
||||
|
||||
## SQL Functions Quick Reference
|
||||
|
||||
Available in `INSERT INTO sink SELECT ... FROM stream` transformations:
|
||||
|
||||
| Function | Example | Use Case |
|
||||
|----------|---------|----------|
|
||||
| `UPPER(s)` | `UPPER(event_type)` | Normalize strings |
|
||||
| `LOWER(s)` | `LOWER(email)` | Case-insensitive matching |
|
||||
| `CONCAT(...)` | `CONCAT(user_id, '_', product_id)` | Generate composite keys |
|
||||
| `CASE WHEN ... THEN ... END` | `CASE WHEN amount > 100 THEN 'high' ELSE 'low' END` | Conditional enrichment |
|
||||
| `CAST(x AS type)` | `CAST(timestamp AS string)` | Type conversion |
|
||||
| `COALESCE(x, y)` | `COALESCE(amount, 0.0)` | Default values |
|
||||
| Math operators | `amount * 1.1`, `price / quantity` | Calculations |
|
||||
| Comparison | `amount > 100`, `status IN ('active', 'pending')` | Filtering |
|
||||
|
||||
**String types for CAST:** `string`, `int32`, `int64`, `float32`, `float64`, `bool`, `timestamp`
|
||||
|
||||
Full reference: [Pipelines SQL Reference](https://developers.cloudflare.com/pipelines/sql-reference/)
|
||||
|
||||
## SQL Transform Examples
|
||||
|
||||
### Filter Events
|
||||
|
||||
```sql
|
||||
INSERT INTO my_sink
|
||||
SELECT * FROM my_stream
|
||||
WHERE event_type = 'purchase' AND amount > 100
|
||||
```
|
||||
|
||||
### Select Specific Fields
|
||||
|
||||
```sql
|
||||
INSERT INTO my_sink
|
||||
SELECT user_id, event_type, timestamp, amount
|
||||
FROM my_stream
|
||||
```
|
||||
|
||||
### Transform and Enrich
|
||||
|
||||
```sql
|
||||
INSERT INTO my_sink
|
||||
SELECT
|
||||
user_id,
|
||||
UPPER(event_type) as event_type,
|
||||
timestamp,
|
||||
amount * 1.1 as amount_with_tax,
|
||||
CONCAT(user_id, '_', product_id) as unique_key,
|
||||
CASE
|
||||
WHEN amount > 1000 THEN 'high_value'
|
||||
WHEN amount > 100 THEN 'medium_value'
|
||||
ELSE 'low_value'
|
||||
END as customer_tier
|
||||
FROM my_stream
|
||||
WHERE event_type IN ('purchase', 'refund')
|
||||
```
|
||||
|
||||
## Querying Results (R2 Data Catalog)
|
||||
|
||||
```bash
|
||||
export WRANGLER_R2_SQL_AUTH_TOKEN=YOUR_CATALOG_TOKEN
|
||||
|
||||
npx wrangler r2 sql query "warehouse_name" "
|
||||
SELECT
|
||||
event_type,
|
||||
COUNT(*) as event_count,
|
||||
SUM(amount) as total_revenue
|
||||
FROM default.my_table
|
||||
WHERE event_type = 'purchase'
|
||||
AND timestamp >= '2025-01-01'
|
||||
GROUP BY event_type
|
||||
ORDER BY total_revenue DESC
|
||||
LIMIT 100"
|
||||
```
|
||||
|
||||
**Note:** Iceberg tables support standard SQL queries with GROUP BY, JOINs, WHERE, ORDER BY, etc.
|
||||
@@ -0,0 +1,98 @@
|
||||
# Pipelines Configuration
|
||||
|
||||
## Worker Binding
|
||||
|
||||
```jsonc
|
||||
// wrangler.jsonc
|
||||
{
|
||||
"pipelines": [
|
||||
{ "pipeline": "<STREAM_ID>", "binding": "STREAM" }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Get stream ID: `npx wrangler pipelines streams list`
|
||||
|
||||
## Schema (Structured Streams)
|
||||
|
||||
```json
|
||||
{
|
||||
"fields": [
|
||||
{ "name": "user_id", "type": "string", "required": true },
|
||||
{ "name": "event_type", "type": "string", "required": true },
|
||||
{ "name": "amount", "type": "float64", "required": false },
|
||||
{ "name": "timestamp", "type": "timestamp", "required": true }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Types:** `string`, `int32`, `int64`, `float32`, `float64`, `bool`, `timestamp`, `json`, `binary`, `list`, `struct`
|
||||
|
||||
## Stream Setup
|
||||
|
||||
```bash
|
||||
# With schema
|
||||
npx wrangler pipelines streams create my-stream --schema-file schema.json
|
||||
|
||||
# Unstructured (no validation)
|
||||
npx wrangler pipelines streams create my-stream
|
||||
|
||||
# List/get/delete
|
||||
npx wrangler pipelines streams list
|
||||
npx wrangler pipelines streams get <ID>
|
||||
npx wrangler pipelines streams delete <ID>
|
||||
```
|
||||
|
||||
## Sink Configuration
|
||||
|
||||
**R2 Data Catalog (Iceberg):**
|
||||
```bash
|
||||
npx wrangler pipelines sinks create my-sink \
|
||||
--type r2-data-catalog \
|
||||
--bucket my-bucket --namespace default --table events \
|
||||
--catalog-token $TOKEN \
|
||||
--compression zstd --roll-interval 60
|
||||
```
|
||||
|
||||
**R2 Raw (Parquet):**
|
||||
```bash
|
||||
npx wrangler pipelines sinks create my-sink \
|
||||
--type r2 --bucket my-bucket --format parquet \
|
||||
--path analytics/events \
|
||||
--partitioning "year=%Y/month=%m/day=%d" \
|
||||
--access-key-id $KEY --secret-access-key $SECRET
|
||||
```
|
||||
|
||||
| Option | Values | Guidance |
|
||||
|--------|--------|----------|
|
||||
| `--compression` | `zstd`, `snappy`, `gzip` | `zstd` best ratio, `snappy` fastest |
|
||||
| `--roll-interval` | Seconds | Low latency: 10-60, Query perf: 300 |
|
||||
| `--roll-size` | MB | Larger = better compression |
|
||||
|
||||
## Pipeline Creation
|
||||
|
||||
```bash
|
||||
npx wrangler pipelines create my-pipeline \
|
||||
--sql "INSERT INTO my_sink SELECT * FROM my_stream WHERE event_type = 'purchase'"
|
||||
```
|
||||
|
||||
**⚠️ Pipelines are immutable** - cannot modify SQL. Must delete/recreate.
|
||||
|
||||
## Credentials
|
||||
|
||||
| Type | Permission | Get From |
|
||||
|------|------------|----------|
|
||||
| Catalog token | R2 Admin Read & Write | Dashboard → R2 → API tokens |
|
||||
| R2 credentials | Object Read & Write | `wrangler r2 bucket create` output |
|
||||
| HTTP ingest token | Workers Pipeline Send | Dashboard → Workers → API tokens |
|
||||
|
||||
## Complete Example
|
||||
|
||||
```bash
|
||||
npx wrangler r2 bucket create my-bucket
|
||||
npx wrangler r2 bucket catalog enable my-bucket
|
||||
npx wrangler pipelines streams create my-stream --schema-file schema.json
|
||||
npx wrangler pipelines sinks create my-sink --type r2-data-catalog --bucket my-bucket ...
|
||||
npx wrangler pipelines create my-pipeline --sql "INSERT INTO my_sink SELECT * FROM my_stream"
|
||||
npx wrangler deploy
|
||||
```
|
||||
@@ -0,0 +1,80 @@
|
||||
# Pipelines Gotchas
|
||||
|
||||
## Critical Issues
|
||||
|
||||
### Events Silently Dropped
|
||||
|
||||
**Most common issue.** Events accepted (HTTP 200) but never appear in sink.
|
||||
|
||||
**Causes:**
|
||||
1. Schema validation fails - structured streams drop invalid events silently
|
||||
2. Waiting for roll interval (10-300s) - expected behavior
|
||||
|
||||
**Solution:** Validate client-side with Zod:
|
||||
```typescript
|
||||
const EventSchema = z.object({ user_id: z.string(), amount: z.number() });
|
||||
try {
|
||||
const validated = EventSchema.parse(rawEvent);
|
||||
await env.STREAM.send([validated]);
|
||||
} catch (e) { /* get immediate feedback */ }
|
||||
```
|
||||
|
||||
### Pipelines Are Immutable
|
||||
|
||||
Cannot modify SQL after creation. Must delete and recreate.
|
||||
|
||||
```bash
|
||||
npx wrangler pipelines delete old-pipeline
|
||||
npx wrangler pipelines create new-pipeline --sql "..."
|
||||
```
|
||||
|
||||
**Tip:** Use version naming (`events-pipeline-v1`) and keep SQL in version control.
|
||||
|
||||
### Worker Binding Not Found
|
||||
|
||||
**`env.STREAM is undefined`**
|
||||
|
||||
1. Use **stream ID** (not pipeline ID) in `wrangler.jsonc`
|
||||
2. Redeploy after adding binding
|
||||
|
||||
```bash
|
||||
npx wrangler pipelines streams list # Get stream ID
|
||||
npx wrangler deploy
|
||||
```
|
||||
|
||||
## Common Errors
|
||||
|
||||
| Error | Cause | Fix |
|
||||
|-------|-------|-----|
|
||||
| Events not in R2 | Roll interval not elapsed | Wait 10-300s, check `roll_interval` |
|
||||
| Schema validation failures | Type mismatch, missing fields | Validate client-side |
|
||||
| Rate limit (429) | >5 MB/s per stream | Batch events, request increase |
|
||||
| Payload too large (413) | >1 MB request | Split into smaller batches |
|
||||
| Cannot delete stream | Pipeline references it | Delete pipelines first |
|
||||
| Sink credential errors | Token expired | Recreate sink with new credentials |
|
||||
|
||||
## Limits (Open Beta)
|
||||
|
||||
| Resource | Limit |
|
||||
|----------|-------|
|
||||
| Streams/Sinks/Pipelines per account | 20 each |
|
||||
| Payload size | 1 MB |
|
||||
| Ingest rate per stream | 5 MB/s |
|
||||
| Event retention | 24 hours |
|
||||
| Recommended batch size | 100 events |
|
||||
|
||||
## SQL Limitations
|
||||
|
||||
- **No JOINs** - single stream per pipeline
|
||||
- **No window functions** - basic SQL only
|
||||
- **No subqueries** - must use `INSERT INTO ... SELECT ... FROM`
|
||||
- **No schema evolution** - cannot modify after creation
|
||||
|
||||
## Debug Checklist
|
||||
|
||||
- [ ] Stream exists: `npx wrangler pipelines streams list`
|
||||
- [ ] Pipeline healthy: `npx wrangler pipelines get <ID>`
|
||||
- [ ] SQL syntax matches schema
|
||||
- [ ] Worker redeployed after binding added
|
||||
- [ ] Waited for roll interval
|
||||
- [ ] Accepted vs processed count matches (no validation drops)
|
||||
@@ -0,0 +1,87 @@
|
||||
# Pipelines Patterns
|
||||
|
||||
## Fire-and-Forget
|
||||
|
||||
```typescript
|
||||
export default {
|
||||
async fetch(request, env, ctx) {
|
||||
const event = { user_id: '...', event_type: 'page_view', timestamp: new Date().toISOString() };
|
||||
ctx.waitUntil(env.STREAM.send([event])); // Don't block response
|
||||
return new Response('OK');
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
## Schema Validation with Zod
|
||||
|
||||
```typescript
|
||||
import { z } from 'zod';
|
||||
|
||||
const EventSchema = z.object({
|
||||
user_id: z.string(),
|
||||
event_type: z.enum(['purchase', 'view']),
|
||||
amount: z.number().positive().optional()
|
||||
});
|
||||
|
||||
const validated = EventSchema.parse(rawEvent); // Throws on invalid
|
||||
await env.STREAM.send([validated]);
|
||||
```
|
||||
|
||||
**Why:** Structured streams drop invalid events silently. Client validation gives immediate feedback.
|
||||
|
||||
## SQL Transform Patterns
|
||||
|
||||
```sql
|
||||
-- Filter early (reduce storage)
|
||||
INSERT INTO my_sink
|
||||
SELECT user_id, event_type, amount
|
||||
FROM my_stream
|
||||
WHERE event_type = 'purchase' AND amount > 10
|
||||
|
||||
-- Select only needed fields
|
||||
INSERT INTO my_sink
|
||||
SELECT user_id, event_type, timestamp FROM my_stream
|
||||
|
||||
-- Enrich with CASE
|
||||
INSERT INTO my_sink
|
||||
SELECT user_id, amount,
|
||||
CASE WHEN amount > 1000 THEN 'vip' ELSE 'standard' END as tier
|
||||
FROM my_stream
|
||||
```
|
||||
|
||||
## Pipelines + Queues Fan-out
|
||||
|
||||
```typescript
|
||||
await Promise.all([
|
||||
env.ANALYTICS_STREAM.send([event]), // Long-term storage
|
||||
env.PROCESS_QUEUE.send(event) // Immediate processing
|
||||
]);
|
||||
```
|
||||
|
||||
| Need | Use |
|
||||
|------|-----|
|
||||
| Long-term storage, SQL queries | Pipelines |
|
||||
| Immediate processing, retries | Queues |
|
||||
| Both | Fan-out pattern |
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
| Goal | Config |
|
||||
|------|--------|
|
||||
| Low latency | `--roll-interval 10` |
|
||||
| Query performance | `--roll-interval 300 --roll-size 100` |
|
||||
| Cost optimal | `--compression zstd --roll-interval 300` |
|
||||
|
||||
## Schema Evolution
|
||||
|
||||
Pipelines are immutable. Use versioning:
|
||||
|
||||
```bash
|
||||
# Create v2 stream/sink/pipeline
|
||||
npx wrangler pipelines streams create events-v2 --schema-file v2.json
|
||||
|
||||
# Dual-write during transition
|
||||
await Promise.all([env.EVENTS_V1.send([event]), env.EVENTS_V2.send([event])]);
|
||||
|
||||
# Query across versions with UNION ALL
|
||||
```
|
||||
Reference in New Issue
Block a user