Building LLM Evaluation Pipelines That Actually Work

Most LLM evaluation frameworks are over-engineered. Here's how I built a lightweight pipeline that caught real regressions in production.

AI MLOps Build Logs

The Problem

Every LLM evaluation framework I tried wanted me to adopt an entire ecosystem. I needed something simpler: a pipeline that could run against our staging environment, flag regressions, and integrate with our existing CI.

What I Built

The pipeline has three stages:

  1. Golden dataset generation: curated input/output pairs from production logs
  2. Automated scoring: a combination of exact match, semantic similarity, and LLM-as-judge
  3. Regression detection: statistical comparison against the baseline
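Each scorer in the loop below only needs a `name` attribute and a `score(response, expected)` method. Here's a minimal sketch of that interface; the class names are illustrative, and the token-overlap scorer is a cheap stand-in for a real embedding-based semantic similarity check, not what we run in production:

```python
class ExactMatchScorer:
    """Scores 1.0 when the response matches the expected output exactly."""
    name = "exact_match"

    def score(self, response: str, expected: str) -> float:
        return 1.0 if response.strip() == expected.strip() else 0.0


class TokenOverlapScorer:
    """Illustrative stand-in for semantic similarity: Jaccard overlap
    of lowercase word sets. Swap in an embedding model for real use."""
    name = "token_overlap"

    def score(self, response: str, expected: str) -> float:
        a = set(response.lower().split())
        b = set(expected.lower().split())
        return len(a & b) / len(a | b) if a | b else 1.0
```

Keeping the interface this small is what lets exact-match, similarity, and LLM-as-judge scorers run side by side in one loop.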

# The core evaluation loop is surprisingly simple
async def evaluate_batch(model, dataset, scorers):
    results = []
    for item in dataset:
        response = await model.generate(item.prompt)
        scores = {s.name: s.score(response, item.expected) for s in scorers}
        results.append(EvalResult(item=item, response=response, scores=scores))
    return EvalReport(results)
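For the third stage, "statistical comparison against the baseline" can be done without heavy dependencies. Here's one way to sketch it, using a one-sided permutation test on mean scores in pure Python (the function names and the 0.05 threshold are illustrative, not the pipeline's actual values):

```python
import random
import statistics


def regression_pvalue(baseline: list[float], candidate: list[float],
                      n_permutations: int = 10_000, seed: int = 0) -> float:
    """One-sided permutation test: how often does a random relabeling of the
    pooled scores produce a mean drop at least as large as the observed one?"""
    observed_drop = statistics.mean(baseline) - statistics.mean(candidate)
    pooled = baseline + candidate
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        b, c = pooled[:len(baseline)], pooled[len(baseline):]
        if statistics.mean(b) - statistics.mean(c) >= observed_drop:
            hits += 1
    return hits / n_permutations


def is_regression(baseline: list[float], candidate: list[float],
                  alpha: float = 0.05) -> bool:
    """Flag the candidate run if the score drop is unlikely under chance."""
    return regression_pvalue(baseline, candidate) < alpha
```

A permutation test has the nice property that it makes no distributional assumptions about the scores, which matters when scorers emit bimodal 0/1 values.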

Key Decisions

Why not use an existing framework? Most assume you’re evaluating a model in isolation. We needed to evaluate a model within our application, including retrieval, prompt construction, and post-processing.

LLM-as-judge calibration. The trick is using a stronger model as judge and regularly auditing its decisions against human labels. We found GPT-4o agreed with human reviewers 87% of the time on our specific tasks.
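The audit itself can stay simple: over a sample of items with human labels, compute raw agreement (the 87% figure above) and Cohen's kappa to discount agreement that would happen by chance. A sketch, with illustrative function names, assuming binary pass/fail judgments:

```python
def judge_agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of items where the LLM judge and the human reviewer agree."""
    if len(judge) != len(human):
        raise ValueError("label lists must be the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for chance: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(judge)
    po = sum(j == h for j, h in zip(judge, human)) / n
    p_j, p_h = sum(judge) / n, sum(human) / n
    pe = p_j * p_h + (1 - p_j) * (1 - p_h)  # chance agreement on pass + fail
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

Kappa is worth tracking alongside raw agreement: if most outputs pass, a judge that says "pass" to everything can score high agreement while adding no signal.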

Results

  • Caught 3 regressions before they hit production in the first month
  • Evaluation suite runs in under 4 minutes on CI
  • Zero false positives after calibration (though we review monthly)

What I’d Do Differently

Start with fewer, higher-quality test cases. We initially over-indexed on coverage and ended up with noisy signals. Twenty carefully chosen examples beat two hundred scraped ones.
