Our MLOps Observability Stack

How we monitor ML models in production, from data drift detection to latency percentiles: the tools and patterns that stuck.


The Problem

Traditional application monitoring doesn’t capture ML-specific failure modes. Your API can return 200 OK while the model silently degrades.

What We Monitor

Model quality metrics

  • Output distribution drift (KL divergence against baseline)
  • Confidence score distributions
  • Human feedback signals (thumbs up/down, corrections)
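Output drift is the first of these in practice. A minimal sketch of the KL-divergence check, assuming both distributions arrive as per-bin counts over the same histogram bins (the function name and epsilon smoothing are illustrative, not our production code):

```python
import math

def kl_divergence(baseline, current, eps=1e-9):
    """D_KL(current || baseline) over matched histogram bins.

    Inputs are raw counts per bin; eps smooths empty bins so the
    log never sees a zero. Illustrative helper, not production code.
    """
    b_total = sum(baseline) or 1
    c_total = sum(current) or 1
    kl = 0.0
    for b, c in zip(baseline, current):
        p = c / c_total + eps  # current (live) distribution
        q = b / b_total + eps  # baseline distribution
        kl += p * math.log(p / q)
    return kl

# Identical shapes score near zero; a reversed distribution scores high.
same = kl_divergence([10, 20, 30], [100, 200, 300])
shifted = kl_divergence([10, 20, 30], [300, 200, 100])
```

Alerting then reduces to comparing the score against a tuned threshold per model.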

Infrastructure metrics

  • Inference latency (p50, p95, p99)
  • GPU utilisation and memory pressure
  • Queue depth and throughput
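For the latency percentiles, a nearest-rank sketch over raw samples shows what p50/p95/p99 mean concretely. This is a toy; in production the aggregation happens in Prometheus histogram buckets rather than over raw samples:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile (q in [0, 100]) over raw latency samples.

    Illustrative only; streaming systems use bucketed histograms
    instead of sorting raw samples.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# One slow outlier barely moves p50 but dominates p99.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 18, 12, 900]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The gap between p50 and p99 is usually the interesting signal: a healthy median with a blown-out tail often means GPU contention or queueing, not a model problem.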

Data pipeline health

  • Feature freshness and completeness
  • Embedding index staleness
  • Data schema validation failures
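Schema validation is the cheapest of these checks to implement. A minimal sketch, where the schema maps field names to expected types (the field names and helper are hypothetical, not our actual pipeline):

```python
def validate_record(record, schema):
    """Return a list of schema violations for one feature record.

    schema maps field name -> expected Python type. Hypothetical
    example; a real pipeline would also check ranges and nullability.
    """
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

# Illustrative schema for a feature record.
SCHEMA = {"user_id": str, "age": int, "score": float}
ok = validate_record({"user_id": "u1", "age": 30, "score": 0.9}, SCHEMA)
bad = validate_record({"user_id": "u1", "age": "30"}, SCHEMA)
```

Counting these violations per batch gives a natural metric to graph and alert on.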

The Stack

We settled on a combination of:

  • Prometheus + Grafana for infrastructure metrics
  • Custom Python service for model quality metrics
  • BigQuery for long-term analysis and drift detection
  • PagerDuty for alerting (with sensible thresholds)
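The custom quality-metrics service is the only bespoke piece. A stdlib-only sketch of its core idea: accumulate a model signal (here, confidence scores) and render it in Prometheus text exposition format so Grafana can scrape it. The class and metric names are illustrative, not our actual service:

```python
class ConfidenceCollector:
    """Accumulates confidence scores and renders them in Prometheus
    text exposition format. Stdlib-only sketch; metric names are
    illustrative, not our production service."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def observe(self, confidence):
        """Record one prediction's confidence score."""
        self.count += 1
        self.total += confidence

    def render(self):
        """Render metrics as Prometheus-style text lines."""
        mean = self.total / self.count if self.count else 0.0
        return (f"model_confidence_count {self.count}\n"
                f"model_confidence_mean {mean:.4f}\n")

collector = ConfidenceCollector()
collector.observe(0.8)
collector.observe(0.6)
exposition = collector.render()
```

In practice you would serve `render()` over HTTP (or use the `prometheus_client` library directly) and let Prometheus scrape it like any other target.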

Key Lesson

The most useful alert we built: “confidence distribution has shifted significantly in the last hour.” This caught three incidents before users noticed, including a corrupted embedding index that would have taken days to spot otherwise.
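The hourly check behind an alert like this can be sketched as binning confidence scores and comparing the last hour against a baseline. Here the distance measure is total variation distance, and the threshold and bin count are illustrative, not our tuned production values:

```python
def confidence_shift_alert(baseline_scores, recent_scores,
                           threshold=0.1, bins=10):
    """Flag when the recent confidence distribution has drifted from
    the baseline. Scores are assumed to lie in [0, 1]; threshold and
    binning are illustrative, not tuned production values."""

    def hist(scores):
        # Equal-width buckets over [0, 1], normalised to a distribution.
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        total = len(scores) or 1
        return [c / total for c in counts]

    p, q = hist(recent_scores), hist(baseline_scores)
    # Total variation distance: half the L1 distance between the two.
    tvd = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return tvd > threshold

# A wholesale shift from ~0.9 to ~0.5 confidence trips the alert.
fired = confidence_shift_alert([0.9] * 100, [0.5] * 100)
```

The corrupted-index incident looked exactly like this: the model kept answering, but every answer's confidence collapsed into a different part of the distribution.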

Don’t just monitor whether the system is running. Monitor whether it’s running well.
