A production-grade MLOps Experiment Tracker for logging experiments, runs, metrics, and artifacts — with real-time loss curve streaming, GPU node scheduling, and a React/Redux dashboard.
┌─────────────────────────────────────────────────────────────────┐
│                      React/Redux Frontend                       │
│  Dashboard ─ ExperimentList ─ RunDetail ─ GPUMonitor            │
│  useMetricsStream (SSE) ─ Recharts live loss curves             │
└───────────────────────────────┬─────────────────────────────────┘
                                │ HTTP / SSE
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│               Flask REST API (Gunicorn 4 workers)               │
│  /api/v1/experiments  /api/v1/runs  /api/v1/metrics             │
│  /api/v1/artifacts  /api/v1/gpu-nodes  /health  /readiness      │
│  API Key middleware ─ Marshmallow validation ─ structlog        │
└──────────────┬────────────────────────────┬─────────────────────┘
               │ SQLAlchemy ORM             │ redis-py pub/sub
               ▼                            ▼
┌──────────────────────┐   ┌──────────────────────────────────────┐
│  PostgreSQL 16       │   │  Redis 7                             │
│  experiments         │   │  run:{id}:metrics (pub/sub)          │
│  runs                │   │  run:{id}:latest_metrics (hash)      │
│  metrics (4M+ rows)  │   │  gpu:queue (sorted set)              │
│  artifacts           │   │                                      │
│  gpu_nodes           │   │  Background Thread: metric_aggregator│
│  views:              │   │    Subscribes run:*:metrics          │
│   experiment_summary │   │    Updates latest_metrics hash       │
│   run_metric_summary │   └──────────────────────────────────────┘
└──────────────────────┘
           │
┌──────────────────────┐
│  GPU Nodes           │
│  Available/Busy/     │
│  Offline             │
│  Scheduled via       │
│  Redis sorted set    │
└──────────────────────┘
CI/CD: Jenkins → ruff/mypy/tsc → pytest/vitest → Docker build → kubectl rolling deploy
Infra: Kubernetes (2 backend replicas, 2 frontend replicas, Postgres StatefulSet, Redis)
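
The metric_aggregator step in the diagram (subscribe to run:*:metrics, update the latest_metrics hash) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the JSON message shape ({"key", "value", "step"}) and the function names are hypothetical. Only the channel and hash key names come from the diagram.

```python
import json
from typing import Callable, Optional


def latest_metrics_key(channel: str) -> Optional[str]:
    """Map channel 'run:<id>:metrics' to hash key 'run:<id>:latest_metrics'."""
    parts = channel.split(":")
    if len(parts) != 3 or parts[0] != "run" or parts[2] != "metrics" or not parts[1]:
        return None
    return f"run:{parts[1]}:latest_metrics"


def apply_message(hset: Callable[[str, str, str], object], channel: str, data: str) -> None:
    """Write one published metric point into the run's latest-metrics hash.

    `hset` is any callable shaped like redis-py's Redis.hset(name, key, value).
    The payload format {"key": ..., "value": ..., "step": ...} is an assumption.
    """
    key = latest_metrics_key(channel)
    if key is None:
        return  # not a run metrics channel; ignore
    point = json.loads(data)
    hset(key, point["key"], str(point["value"]))


def run_aggregator(redis_url: str) -> None:
    """Blocking loop intended for a background thread (requires redis-py)."""
    import redis  # imported lazily so the helpers above work without redis-py

    client = redis.Redis.from_url(redis_url, decode_responses=True)
    pubsub = client.pubsub()
    pubsub.psubscribe("run:*:metrics")
    for message in pubsub.listen():
        if message["type"] == "pmessage":
            apply_message(client.hset, message["channel"], message["data"])
```

Passing `hset` in as a callable keeps the parsing logic testable without a running Redis; in the real loop it is simply `client.hset`.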
git clone https://github.com/Lkumar209/mlops-experiment-tracker.git
cd mlops-experiment-tracker
cp .env.example .env
make up # docker compose up -d
make migrate # flask db upgrade
make seed # seed 50 experiments × 200 runs × 100 steps × 4 metrics
open http://localhost:3000

export API_KEY=dev-api-key
export BASE=http://localhost:5000/api/v1
# Create an experiment
curl -s -X POST $BASE/experiments \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"name":"ResNet Ablation","tags":{"domain":"cv"}}' | jq .
# Create a run (replace {experiment_id} with the ID of the experiment created above)
curl -s -X POST $BASE/experiments/{experiment_id}/runs \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"name":"run-lr-0001","hyperparameters":{"lr":0.001,"batch_size":32}}' | jq .
# Log metrics (bulk, up to 1000 per request; replace {run_id} with the ID of the run created above)
curl -s -X POST $BASE/runs/{run_id}/metrics \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"metrics":[{"key":"train_loss","value":1.5,"step":0},{"key":"train_loss","value":1.2,"step":1}]}' | jq .
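
Because the bulk endpoint accepts at most 1000 metrics per request, a training script that logs more points has to split them into batches. The sketch below shows one way to do that with only the Python standard library. The function names and the `run_id` argument are hypothetical; the URL path, headers, and payload shape mirror the curl call above.

```python
import json
import urllib.request
from typing import Dict, Iterator, List

MAX_BATCH = 1000  # per-request limit stated for the bulk metrics endpoint


def batches(points: List[Dict], size: int = MAX_BATCH) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `points`, each with at most `size` items."""
    for start in range(0, len(points), size):
        yield points[start:start + size]


def build_request(base: str, api_key: str, run_id: str, batch: List[Dict]) -> urllib.request.Request:
    """Build the POST request for one batch, mirroring the curl example."""
    body = json.dumps({"metrics": batch}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/runs/{run_id}/metrics",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def log_metrics(base: str, api_key: str, run_id: str, points: List[Dict]) -> None:
    """Send all points in order; urlopen raises HTTPError on a 4xx/5xx response."""
    for batch in batches(points):
        with urllib.request.urlopen(build_request(base, api_key, run_id, batch)) as resp:
            resp.read()
```

With the environment from this section, a call would look like `log_metrics("http://localhost:5000/api/v1", "dev-api-key", run_id, points)`, where `points` is a list of `{"key", "value", "step"}` dictionaries.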
# Query a loss curve
curl -s $BASE/runs/{run_id}/metrics/train_loss \
-H "X-API-Key: $API_KEY" | jq .
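
The /runs/{run_id}/metrics/stream endpoint in this walkthrough speaks Server-Sent Events, so a non-browser consumer needs to parse the event stream itself. The sketch below is a minimal SSE reader using only the standard library. It assumes each event's data field carries one JSON object, which this README does not confirm, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event.

    Follows the basic SSE framing rules: a blank line ends an event,
    lines starting with ':' are comments, and multiple data lines in
    one event are joined with newlines.
    """
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is not part of the value
        if field == "data":
            data_lines.append(value)


def stream_metrics(base: str, api_key: str, run_id: str) -> Iterator[Dict]:
    """Connect to the stream endpoint and yield decoded events (JSON payload assumed)."""
    url = f"{base}/runs/{run_id}/metrics/stream?api_key={api_key}"
    with urllib.request.urlopen(url) as resp:
        text_lines = (raw.decode("utf-8") for raw in resp)
        for payload in iter_sse_data(text_lines):
            yield json.loads(payload)
```

An event that is still incomplete when the connection closes is dropped rather than yielded, which matches how browsers treat a truncated stream.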
# Stream metrics in real-time (SSE)
curl -N "$BASE/runs/{run_id}/metrics/stream?api_key=$API_KEY"

kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
# Create secrets first:
kubectl create secret generic mlops-secrets \
--from-literal=DATABASE_URL=postgresql://... \
--from-literal=REDIS_URL=redis://... \
--from-literal=SECRET_KEY=prod-secret \
--from-literal=API_KEY=prod-api-key \
--from-literal=POSTGRES_USER=postgres \
--from-literal=POSTGRES_PASSWORD=securepassword \
-n mlops-tracker
kubectl apply -f k8s/

- Install Docker Pipeline and Kubernetes CLI plugins.
- Add credentials: dockerhub-credentials (Docker Hub), kubeconfig (kubeconfig file), slack-webhook-url (secret text).
- Create a Multibranch Pipeline pointing to this repo.
- Pushes to main trigger full CI → Docker build → Kubernetes rolling deploy with auto-rollback.

mlops-experiment-tracker/
├── backend/ Flask API, SQLAlchemy models, services, migrations
├── frontend/ React/Redux SPA with Recharts, Vite, TypeScript
├── k8s/ Kubernetes manifests (namespace, deployments, ingress)
├── jenkins/ Declarative Jenkins CI/CD pipeline
├── scripts/ Seed scripts (50 experiments × 200 runs each, about 4M metric rows in total)
└── docs/ Architecture, API reference, deployment guides