# R2 SQL Gotchas

Limitations, troubleshooting, and common pitfalls for R2 SQL.
## Critical Limitations

### No Workers Binding

You cannot call R2 SQL from Workers or Pages code; no binding exists.
```js
// ❌ This doesn't exist
export default {
  async fetch(request, env) {
    const result = await env.R2_SQL.query("SELECT * FROM table"); // Not possible
    return Response.json(result);
  }
};
```
Solutions:

- Call the HTTP API from external systems (not Workers)
- Use PyIceberg or Spark via the r2-data-catalog REST API
- For Workers, use D1 or an external database
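A minimal sketch of the HTTP-API route from an external system. The endpoint path and payload shape here are assumptions for illustration only; confirm both against the current R2 SQL documentation before sending anything.

```python
# Sketch: querying R2 SQL over HTTP from an external system (not a Worker).
# The endpoint URL and JSON body below are ASSUMPTIONS - verify them in the
# Cloudflare R2 SQL docs before use.
import json
import urllib.request

def build_r2_sql_request(account_id: str, bucket: str, token: str, sql: str):
    """Build (but do not send) an HTTP request for an R2 SQL query."""
    url = (  # hypothetical endpoint path - check the official docs
        f"https://api.sql.cloudflarestorage.com/api/v1/accounts/"
        f"{account_id}/r2-sql/query/{bucket}"
    )
    body = json.dumps({"query": sql}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_r2_sql_request("acct123", "my-bucket", "TOKEN",
                           "SELECT COUNT(*) FROM logs.requests")
# Send with urllib.request.urlopen(req) once the endpoint is confirmed.
```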
### ORDER BY Limitations

You can only order by:

- Partition key columns - always supported
- Aggregation functions - supported via a shuffle strategy

You cannot order by regular non-partition columns.
```sql
-- ✅ Valid: ORDER BY partition key
SELECT * FROM logs.requests ORDER BY timestamp DESC LIMIT 100;

-- ✅ Valid: ORDER BY aggregation
SELECT region, SUM(amount) FROM sales.transactions
GROUP BY region ORDER BY SUM(amount) DESC;

-- ❌ Invalid: ORDER BY non-partition column
SELECT * FROM logs.requests ORDER BY user_id;

-- ❌ Invalid: ORDER BY alias (must repeat the function)
SELECT region, SUM(amount) as total FROM sales.transactions
GROUP BY region ORDER BY total; -- Use ORDER BY SUM(amount) instead
```
Check the partition spec with `DESCRIBE namespace.table_name`.
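The rule above can be checked client-side before a query is sent. A minimal sketch, assuming the partition columns have already been read from `DESCRIBE` output (the helper name and regex are illustrative, not part of R2 SQL):

```python
# Sketch: pre-flight check that an ORDER BY target is legal for R2 SQL.
# partition_cols would come from DESCRIBE namespace.table_name output.
import re

def order_by_is_valid(order_expr: str, partition_cols: set) -> bool:
    """Allow partition-key columns or aggregation calls; reject anything else."""
    expr = order_expr.strip().removesuffix(" DESC").removesuffix(" ASC").strip()
    # Aggregations are allowed, but must be spelled out (no aliases).
    if re.fullmatch(r"(COUNT|SUM|AVG|MIN|MAX)\s*\(.*\)", expr, re.IGNORECASE):
        return True
    return expr in partition_cols

parts = {"timestamp"}
print(order_by_is_valid("timestamp DESC", parts))  # True: partition key
print(order_by_is_valid("SUM(amount)", parts))     # True: aggregation
print(order_by_is_valid("user_id", parts))         # False: regular column
print(order_by_is_valid("total", parts))           # False: alias
```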
### SQL Feature Limitations
| Feature | Supported | Notes |
|---|---|---|
| `SELECT`, `WHERE`, `GROUP BY`, `HAVING` | ✅ | Standard support |
| `COUNT`, `SUM`, `AVG`, `MIN`, `MAX` | ✅ | Standard aggregations |
| `ORDER BY` partition/aggregation | ✅ | See above |
| `LIMIT` | ✅ | Max 10,000 |
| Column aliases | ❌ | No `AS alias` |
| Expressions in SELECT | ❌ | No `col1 + col2` |
| `ORDER BY` non-partition | ❌ | Fails at runtime |
| JOINs, subqueries, CTEs | ❌ | Denormalize at write time |
| Window functions, `UNION` | ❌ | Use external engines |
| `INSERT`/`UPDATE`/`DELETE` | ❌ | Use PyIceberg/Pipelines |
| Nested columns, arrays, JSON | ❌ | Flatten at write time |
Workarounds:

- No JOINs: denormalize data at write time, or use Spark/PyIceberg
- No subqueries: split into multiple queries
- No aliases: accept generated column names and transform them in your application
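The denormalization workaround amounts to embedding the "joined" fields into each event before it is written. A minimal sketch with hypothetical record shapes:

```python
# Sketch: denormalizing at write time, since R2 SQL has no JOINs.
# The record shapes here are illustrative.
users = {
    "u1": {"plan": "pro", "country": "DE"},
    "u2": {"plan": "free", "country": "US"},
}

def denormalize(event: dict, users: dict) -> dict:
    """Merge user attributes into the event so no JOIN is needed at query time."""
    user = users.get(event["user_id"], {})
    return {**event,
            "user_plan": user.get("plan"),
            "user_country": user.get("country")}

row = denormalize({"user_id": "u1", "path": "/api", "status": 200}, users)
print(row["user_plan"])  # pro
```

Each written row is wider, but queries like `WHERE user_plan = 'pro'` then work without any join.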
## Common Errors

### "Column not found"

Cause: a typo, a column that doesn't exist, or a case mismatch.
Solution: run `DESCRIBE namespace.table_name` to check the schema.
### "Type mismatch"

```sql
-- ❌ Wrong types
WHERE status = '200'            -- string instead of integer
WHERE timestamp > '2025-01-01'  -- missing time/timezone

-- ✅ Correct types
WHERE status = 200
WHERE timestamp > '2025-01-01T00:00:00Z'
```
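A small helper can guarantee the RFC 3339 form instead of hand-writing timestamp literals. A minimal sketch (the helper name is illustrative):

```python
# Sketch: producing RFC 3339 timestamps with an explicit UTC offset, so
# timestamp predicates match the column type instead of being bare dates.
from datetime import datetime, timezone

def rfc3339(dt: datetime) -> str:
    """Format an aware datetime as a 'Z'-suffixed RFC 3339 string."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

start = rfc3339(datetime(2025, 1, 1, tzinfo=timezone.utc))
print(start)  # 2025-01-01T00:00:00Z
query = f"SELECT COUNT(*) FROM logs.requests WHERE timestamp > '{start}'"
```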
### "ORDER BY column not in partition key"

Cause: ordering by a non-partition column.
Solution: order by a partition key or an aggregation, or remove the ORDER BY. Check with `DESCRIBE namespace.table_name`.
### "Token authentication failed"

```sh
# Check/set token
echo $WRANGLER_R2_SQL_AUTH_TOKEN
export WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>

# Or use a .env file
echo "WRANGLER_R2_SQL_AUTH_TOKEN=<your-token>" > .env
```
### "Table not found"

```sql
-- Verify catalog and tables
SHOW DATABASES;
SHOW TABLES IN namespace_name;
```

Enable the catalog if needed: `npx wrangler r2 bucket catalog enable <bucket>`
### "LIMIT exceeds maximum"

The maximum LIMIT is 10,000. For pagination, use WHERE filters on partition keys instead.
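Pagination via a partition-key cursor can be sketched as follows, assuming `timestamp` is the partition key (table and column names are illustrative):

```python
# Sketch: paginating past the 10,000-row LIMIT cap by advancing a
# partition-key (timestamp) cursor in the WHERE clause.
def page_query(after_ts: str, page_size: int = 1000) -> str:
    """Build the next page's query; advance after_ts to the last row seen."""
    assert page_size <= 10_000, "R2 SQL caps LIMIT at 10,000"
    return (
        "SELECT * FROM logs.requests "
        f"WHERE timestamp > '{after_ts}' "
        f"ORDER BY timestamp LIMIT {page_size}"
    )

q = page_query("2025-01-15T00:00:00Z")
print(q)
# After each page, set after_ts to the last timestamp returned and repeat
# until a page comes back with fewer than page_size rows.
```

Note that `ORDER BY timestamp` is legal here only because `timestamp` is the partition key.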
### "No data returned" (unexpected)

Debug steps:

1. `SELECT COUNT(*) FROM table` - verify data exists
2. `SELECT * FROM table LIMIT 10` - inspect actual data and types
3. Remove WHERE filters incrementally
## Performance Issues

### Slow Queries

Causes: too many partitions, a large LIMIT, no filters, or many small files.
```sql
-- ❌ Slow: No filters
SELECT * FROM logs.requests LIMIT 10000;

-- ✅ Fast: Filter on partition key
SELECT * FROM logs.requests
WHERE timestamp >= '2025-01-15T00:00:00Z' AND timestamp < '2025-01-16T00:00:00Z'
LIMIT 1000;

-- ✅ Faster: Multiple filters
SELECT * FROM logs.requests
WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET'
LIMIT 1000;
```
File optimization:

- Target Parquet file size: 100-500 MB compressed
- Pipelines roll interval: 300+ seconds (prod), 10 seconds (dev)
- Run compaction to merge small files
### Query Timeout

Solution: add more restrictive WHERE filters, reduce the time range, or query smaller intervals and combine the results.
```sql
-- ❌ Times out: Year-long aggregation
SELECT status, COUNT(*) FROM logs.requests
WHERE timestamp >= '2024-01-01T00:00:00Z' GROUP BY status;

-- ✅ Faster: Month-long aggregation
SELECT status, COUNT(*) FROM logs.requests
WHERE timestamp >= '2025-01-01T00:00:00Z' AND timestamp < '2025-02-01T00:00:00Z'
GROUP BY status;
```
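Long ranges can still be covered by splitting them into month-sized windows and merging the partial `GROUP BY` counts client-side. A minimal sketch; `run_query` is a hypothetical stand-in for however you execute R2 SQL (HTTP API, wrangler, etc.):

```python
# Sketch: chunking a long aggregation into monthly windows and summing
# the per-window GROUP BY counts in the client.
from collections import Counter
from datetime import date

def month_windows(start: date, end: date):
    """Yield [start, end) month boundaries as RFC 3339 timestamp pairs."""
    cur = start
    while cur < end:
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        nxt = min(nxt, end)
        yield (f"{cur}T00:00:00Z", f"{nxt}T00:00:00Z")
        cur = nxt

def status_counts(run_query, start: date, end: date) -> Counter:
    totals = Counter()
    for lo, hi in month_windows(start, end):
        sql = ("SELECT status, COUNT(*) FROM logs.requests "
               f"WHERE timestamp >= '{lo}' AND timestamp < '{hi}' "
               "GROUP BY status")
        for status, n in run_query(sql):  # run_query yields (status, count) rows
            totals[status] += n
    return totals

windows = list(month_windows(date(2024, 1, 1), date(2024, 4, 1)))
print(len(windows))  # 3 monthly windows
```

Counts are additive across disjoint windows, so summing them is exact; the same trick does not work for `AVG` without also carrying the per-window counts.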
## Best Practices

### Partitioning

- Time-series data: partition by day or hour on the timestamp column
- Avoid: high-cardinality keys (e.g. user_id) and more than 10,000 partitions
```python
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform

# Partition by day on the timestamp column (source_id=1 in the table schema)
PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="day")
)
```
### Query Writing

- Always use LIMIT for early termination
- Filter on partition keys first to enable pruning
- Combine filters with AND for more pruning
```sql
-- Good
WHERE timestamp >= '2025-01-15T00:00:00Z' AND status = 404 AND method = 'GET' LIMIT 100
```
### Type Safety

- Quote strings: `'GET'`, not `GET`
- Use RFC 3339 timestamps: `'2025-01-01T00:00:00Z'`, not `'2025-01-01'`
- Use ISO dates: `'2025-01-15'`, not `'01/15/2025'`
### Data Organization

- Pipelines: dev `roll_file_time: 10`, prod `roll_file_time: 300+`
- Compression: use `zstd`
- Maintenance: run compaction for small files; expire old snapshots
## Debugging Checklist

1. `npx wrangler r2 bucket catalog enable <bucket>` - verify the catalog is enabled
2. `echo $WRANGLER_R2_SQL_AUTH_TOKEN` - check the token
3. `SHOW DATABASES` - list namespaces
4. `SHOW TABLES IN namespace` - list tables
5. `DESCRIBE namespace.table` - check the schema
6. `SELECT COUNT(*) FROM namespace.table` - verify data exists
7. `SELECT * FROM namespace.table LIMIT 10` - test a simple query
8. Add filters incrementally
## See Also

- api.md - SQL syntax
- patterns.md - query optimization
- configuration.md - setup
- Cloudflare R2 SQL docs