A production-grade distributed event analytics platform. Events stream through Kafka, are processed by PySpark, stored as Parquet on S3, and served via DuckDB-powered ad-hoc SQL queries over a FastAPI REST API — with Redis caching and ~100x faster query times than Athena.
┌────────────┐ ┌───────────────┐ ┌──────────────────────────────────┐
│ Producers │───▶│ Kafka │───▶│ PySpark Structured Streaming │
│ (web/mobile)│ │ (6 partitions│ │ 1-min tumbling window agg │
└────────────┘ │ event-logs) │ └──────────┬───────────────────────┘
└───────────────┘ │
┌───────────▼────────────┐
│ S3 (Parquet, snappy) │
│ date=YYYY-MM-DD/hour=HH│
└───────────┬────────────┘
│ httpfs extension
┌───────────▼────────────┐
│ DuckDB (in-process) │
│ read_parquet on S3 │
└───────────┬────────────┘
│
┌────────────────┐ ┌───────────▼────────────┐
│ Redis Cache │◀─│ FastAPI REST API │
│ (5-min TTL) │ │ /api/v1/query │
└────────────────┘ │ /api/v1/ingest │
└────────────────────────┘
┌────────────────┐
│ DynamoDB │◀── PySpark writes aggregates
│ event-aggregates│
└────────────────┘
| Layer | Technology |
|---|---|
| Streaming | Apache Kafka (Confluent), PySpark 3.5 |
| Serialization | PyArrow 15 (IPC stream, zero-copy) |
| Storage | S3 (Parquet/snappy), DynamoDB (aggregates) |
| Query Engine | DuckDB 0.10 with httpfs |
| API | FastAPI 0.111, Uvicorn |
| Caching | Redis 7 |
| Dev Infrastructure | LocalStack 3, Docker Compose |
| Observability | structlog (structured JSON) |
git clone https://github.com/Lkumar209/distributed-query-platform.git
cd distributed-query-platform
cp .env.example .env
make up # Start all services (LocalStack, Kafka, Redis, Spark, API)
make seed # Generate and publish 1M synthetic eventsWait ~30 seconds for services to be healthy, then:
curl http://localhost:8000/healthExpected: {"status":"ok","duckdb":true,"redis":true,"kafka":true}
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-secret-key" \
-d '{
"sql": "SELECT action, COUNT(*) AS cnt FROM events GROUP BY action ORDER BY cnt DESC",
"date_range": ["2024-01-01", "2024-01-07"],
"limit": 100
}'make install # pip install -r requirements-dev.txt
make test # pytest with coverage
make lint # ruff + mypy
make bench # Run DuckDB vs Athena benchmarks
make fmt # black formatter
make down # Stop and remove all containers + volumesDuckDB reads 10M-row Parquet files directly from S3 via the httpfs extension:
| Metric | DuckDB | Athena Baseline |
|---|---|---|
| p50 latency | ~1,000 ms | ~480,000 ms (8 min) |
| Speedup | ~480x faster | — |
| Cost model | Per-query CPU | Per-TB scanned |
See docs/benchmark_results.md for full results. See docs/architecture.md for design decisions.