Skip to content

Lkumar209/distributed-query-platform

Repository files navigation

DistributedQuery Platform

Python FastAPI DuckDB PySpark PyArrow Kafka License

A production-grade distributed event analytics platform. Events stream through Kafka, are processed by PySpark, stored as Parquet on S3, and served via DuckDB-powered ad-hoc SQL queries over a FastAPI REST API — with Redis caching and ~100x faster query times than Athena.

Architecture

┌────────────┐    ┌───────────────┐    ┌──────────────────────────────────┐
│  Producers  │───▶│     Kafka     │───▶│  PySpark Structured Streaming   │
│ (web/mobile)│    │  (6 partitions│    │  1-min tumbling window agg      │
└────────────┘    │   event-logs) │    └──────────┬───────────────────────┘
                  └───────────────┘               │
                                      ┌───────────▼────────────┐
                                      │   S3 (Parquet, snappy) │
                                      │ date=YYYY-MM-DD/hour=HH│
                                      └───────────┬────────────┘
                                                  │  httpfs extension
                                      ┌───────────▼────────────┐
                                      │    DuckDB (in-process) │
                                      │  read_parquet on S3    │
                                      └───────────┬────────────┘
                                                  │
                  ┌────────────────┐  ┌───────────▼────────────┐
                  │  Redis Cache   │◀─│    FastAPI REST API     │
                  │ (5-min TTL)    │  │  /api/v1/query         │
                  └────────────────┘  │  /api/v1/ingest        │
                                      └────────────────────────┘
                  ┌────────────────┐
                  │    DynamoDB    │◀── PySpark writes aggregates
                  │ event-aggregates│
                  └────────────────┘

Tech Stack

Layer Technology
Streaming Apache Kafka (Confluent), PySpark 3.5
Serialization PyArrow 15 (IPC stream, zero-copy)
Storage S3 (Parquet/snappy), DynamoDB (aggregates)
Query Engine DuckDB 0.10 with httpfs
API FastAPI 0.111, Uvicorn
Caching Redis 7
Dev Infrastructure LocalStack 3, Docker Compose
Observability structlog (structured JSON)

Quickstart

git clone https://github.com/Lkumar209/distributed-query-platform.git
cd distributed-query-platform
cp .env.example .env
make up          # Start all services (LocalStack, Kafka, Redis, Spark, API)
make seed        # Generate and publish 1M synthetic events

Wait ~30 seconds for services to be healthy, then:

curl http://localhost:8000/health

Expected: {"status":"ok","duckdb":true,"redis":true,"kafka":true}

Example Query

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-secret-key" \
  -d '{
    "sql": "SELECT action, COUNT(*) AS cnt FROM events GROUP BY action ORDER BY cnt DESC",
    "date_range": ["2024-01-01", "2024-01-07"],
    "limit": 100
  }'

Development

make install     # pip install -r requirements-dev.txt
make test        # pytest with coverage
make lint        # ruff + mypy
make bench       # Run DuckDB vs Athena benchmarks
make fmt         # black formatter
make down        # Stop and remove all containers + volumes

Benchmark Results Summary

DuckDB reads 10M-row Parquet files directly from S3 via the httpfs extension:

Metric DuckDB Athena Baseline
p50 latency ~1,000 ms ~480,000 ms (8 min)
Speedup ~480x faster
Cost model Per-query CPU Per-TB scanned

See docs/benchmark_results.md for full results. See docs/architecture.md for design decisions.

About

Production-grade distributed query platform: Kafka → PySpark → S3/DynamoDB → DuckDB → FastAPI

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages