Our MLOps Observability Stack
How we monitor ML models in production, from data drift detection to latency percentiles: the tools and patterns that stuck.
The Problem
Traditional application monitoring doesn’t capture ML-specific failure modes. Your API can return 200 OK while the model silently degrades.
What We Monitor
Model quality metrics
- Output distribution drift (KL divergence against baseline)
- Confidence score distributions
- Human feedback signals (thumbs up/down, corrections)
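The first bullet above, KL divergence against a baseline, can be sketched in a few lines of plain Python. The histogram counts, bin layout, and alert threshold below are hypothetical placeholders, not our production values:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) between two discrete distributions (e.g. binned model outputs)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def normalise(counts):
    total = sum(counts)
    return [c / total for c in counts]

# Baseline: output-class histogram captured at deploy time (hypothetical counts).
baseline = normalise([400, 350, 250])
# Live window: the same histogram over the last hour of predictions.
live = normalise([380, 360, 260])

drift = kl_divergence(live, baseline)
ALERT_THRESHOLD = 0.05  # tuned per model; an assumption here
if drift > ALERT_THRESHOLD:
    print(f"drift alert: KL={drift:.4f}")
```

KL divergence is asymmetric, so the direction matters: we compare the live window against the frozen baseline, not the other way round.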
Infrastructure metrics
- Inference latency (p50, p95, p99)
- GPU utilisation and memory pressure
- Queue depth and throughput
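In production these percentiles come from Prometheus histograms rather than hand-rolled code, but the nearest-rank definition behind p50/p95/p99 is worth spelling out. The latency window below is a hypothetical sample:

```python
import math

def percentile(sorted_samples, pct):
    """Nearest-rank percentile of a pre-sorted list of latency samples (ms)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]

# Hypothetical one-minute window of request latencies in milliseconds.
window_ms = sorted([12, 13, 14, 14, 15, 15, 16, 17, 18, 200])
p50, p95, p99 = (percentile(window_ms, p) for p in (50, 95, 99))
```

Note how a single 200 ms outlier leaves the median untouched but dominates p95 and p99; that gap between median and tail is exactly why we track all three.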
Data pipeline health
- Feature freshness and completeness
- Embedding index staleness
- Data schema validation failures
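Schema validation failures in the last bullet can be caught with a check as simple as the sketch below. The field names and types in `SCHEMA` are hypothetical, and a real pipeline would use a proper validation library, but the shape of the check is the same:

```python
def validate_row(row, schema):
    """Return a list of schema violations for one feature row."""
    errors = []
    for field, expected_type in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(row[field]).__name__}")
    return errors

SCHEMA = {"user_id": str, "age": int, "score": float}  # hypothetical schema

# A row where "age" arrived as a string: a classic silent-upstream-change symptom.
errors = validate_row({"user_id": "u1", "age": "42", "score": 0.9}, SCHEMA)
```

Counting these violations per batch gives a metric worth alerting on: a sudden jump usually means an upstream producer changed its output format.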
The Stack
We settled on a combination of:
- Prometheus + Grafana for infrastructure metrics
- Custom Python service for model quality metrics
- BigQuery for long-term analysis and drift detection
- PagerDuty for alerting (with sensible thresholds)
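To make "sensible thresholds" concrete, here is the shape of a Prometheus alerting rule that pages on tail latency. The metric name `inference_latency_seconds`, the 500 ms threshold, and the 10-minute hold are illustrative assumptions, not our actual config:

```yaml
groups:
  - name: inference
    rules:
      - alert: HighP99InferenceLatency
        # p99 over the last 5m of the (hypothetical) latency histogram
        expr: histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) > 0.5
        for: 10m            # must hold for 10 minutes before paging
        labels:
          severity: page    # routed to PagerDuty by Alertmanager
        annotations:
          summary: "p99 inference latency above 500ms for 10 minutes"
```

The `for: 10m` clause is what keeps the threshold sensible: a brief spike does not page anyone, a sustained regression does.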
Key Lesson
The most useful alert we built: “confidence distribution has shifted significantly in the last hour.” This caught three incidents before users noticed, including a corrupted embedding index that would have taken days to spot otherwise.
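The post does not specify how "shifted significantly" is computed; one plausible way to implement that check is a two-sample Kolmogorov-Smirnov statistic over confidence scores, sketched here with stdlib Python and hypothetical data:

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    ecdf = lambda xs, v: bisect_right(xs, v) / len(xs)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a) | set(b)))

# Hypothetical confidence scores: baseline window vs. the last hour.
baseline = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]
last_hour = [0.55, 0.60, 0.52, 0.58, 0.61, 0.57]  # collapsed confidences

shift = ks_statistic(baseline, last_hour)
```

A corrupted embedding index tends to produce exactly this signature: requests still return 200 OK, but the whole confidence distribution slides downward, which a CDF-gap test catches within a window.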
Don’t just monitor whether the system is running. Monitor whether it’s running well.