Our MLOps Observability Stack
How we monitor ML models in production, from data drift detection to latency percentiles: the tools and patterns that stuck.
The Problem
Traditional application monitoring doesn’t capture ML-specific failure modes. Your API can return 200 OK while the model silently degrades.
What We Monitor
Model quality metrics
- Output distribution drift (KL divergence against baseline)
- Confidence score distributions
- Human feedback signals (thumbs up/down, corrections)
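The first bullet above, KL divergence against a baseline, can be sketched in a few lines of plain Python. The histogram counts, bin layout, and alert threshold below are hypothetical placeholders, not our production values:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) between two discrete distributions (e.g. binned model outputs)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def normalise(counts):
    total = sum(counts)
    return [c / total for c in counts]

# Baseline: output-class histogram captured at deploy time (hypothetical counts).
baseline = normalise([400, 350, 250])
# Live window: the same histogram over the last hour of predictions.
live = normalise([380, 360, 260])

drift = kl_divergence(live, baseline)
ALERT_THRESHOLD = 0.05  # tuned per model; an assumption here
if drift > ALERT_THRESHOLD:
    print(f"drift alert: KL={drift:.4f}")
```

KL divergence is asymmetric, so the direction matters: we compare the live window against the frozen baseline, not the other way round.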
Infrastructure metrics
- Inference latency (p50, p95, p99)
- GPU utilisation and memory pressure
- Queue depth and throughput
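In production these percentiles come from Prometheus histograms rather than hand-rolled code, but the nearest-rank definition behind p50/p95/p99 is worth spelling out. The latency window below is a hypothetical sample:

```python
import math

def percentile(sorted_samples, pct):
    """Nearest-rank percentile of a pre-sorted list of latency samples (ms)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]

# Hypothetical one-minute window of request latencies in milliseconds.
window_ms = sorted([12, 13, 14, 14, 15, 15, 16, 17, 18, 200])
p50, p95, p99 = (percentile(window_ms, p) for p in (50, 95, 99))
```

Note how a single 200 ms outlier leaves the median untouched but dominates p95 and p99; that gap between median and tail is exactly why we track all three.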
Data pipeline health
- Feature freshness and completeness
- Embedding index staleness
- Data schema validation failures
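Schema validation failures in the last bullet can be caught with a check as simple as the sketch below. The field names and types in `SCHEMA` are hypothetical, and a real pipeline would use a proper validation library, but the shape of the check is the same:

```python
def validate_row(row, schema):
    """Return a list of schema violations for one feature row."""
    errors = []
    for field, expected_type in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(row[field]).__name__}")
    return errors

SCHEMA = {"user_id": str, "age": int, "score": float}  # hypothetical schema

# A row where "age" arrived as a string: a classic silent-upstream-change symptom.
errors = validate_row({"user_id": "u1", "age": "42", "score": 0.9}, SCHEMA)
```

Counting these violations per batch gives a metric worth alerting on: a sudden jump usually means an upstream producer changed its output format.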
The Stack
We settled on a combination of:
- Prometheus + Grafana for infrastructure metrics
- Custom Python service for model quality metrics
- BigQuery for long-term analysis and drift detection
- PagerDuty for alerting (with sensible thresholds)
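To make "sensible thresholds" concrete, here is the shape of a Prometheus alerting rule that pages on tail latency. The metric name `inference_latency_seconds`, the 500 ms threshold, and the 10-minute hold are illustrative assumptions, not our actual config:

```yaml
groups:
  - name: inference
    rules:
      - alert: HighP99InferenceLatency
        # p99 over the last 5m of the (hypothetical) latency histogram
        expr: histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket[5m])) by (le)) > 0.5
        for: 10m            # must hold for 10 minutes before paging
        labels:
          severity: page    # routed to PagerDuty by Alertmanager
        annotations:
          summary: "p99 inference latency above 500ms for 10 minutes"
```

The `for: 10m` clause is what keeps the threshold sensible: a brief spike does not page anyone, a sustained regression does.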
Key Lesson
The most useful alert we built: “confidence distribution has shifted significantly in the last hour.” This caught three incidents before users noticed, including a corrupted embedding index that would have taken days to spot otherwise.
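The post does not specify how "shifted significantly" is computed; one plausible way to implement that check is a two-sample Kolmogorov-Smirnov statistic over confidence scores, sketched here with stdlib Python and hypothetical data:

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    ecdf = lambda xs, v: bisect_right(xs, v) / len(xs)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a) | set(b)))

# Hypothetical confidence scores: baseline window vs. the last hour.
baseline = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]
last_hour = [0.55, 0.60, 0.52, 0.58, 0.61, 0.57]  # collapsed confidences

shift = ks_statistic(baseline, last_hour)
```

A corrupted embedding index tends to produce exactly this signature: requests still return 200 OK, but the whole confidence distribution slides downward, which a CDF-gap test catches within a window.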
Don’t just monitor whether the system is running. Monitor whether it’s running well.