Building LLM Evaluation Pipelines That Actually Work
Most LLM evaluation frameworks are over-engineered. Here's how I built a lightweight pipeline that caught real regressions in production.
The Problem
Every LLM evaluation framework I tried wanted me to adopt an entire ecosystem. I needed something simpler: a pipeline that could run against our staging environment, flag regressions, and integrate with our existing CI.
What I Built
The pipeline has three stages:
- Golden dataset generation: curated input/output pairs from production logs.
- Automated scoring: a combination of exact match, semantic similarity, and LLM-as-judge.
- Regression detection: statistical comparison against the baseline.
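The third stage, regression detection, can be sketched as a permutation test on the mean score gap between the baseline run and the candidate run. This is an illustrative implementation, not the post's actual code; the function name, resample count, and threshold are my own choices.

```python
import random
import statistics

def detect_regression(baseline_scores, candidate_scores,
                      n_resamples=2000, alpha=0.05, seed=0):
    """Flag a regression if the candidate's mean score is significantly
    lower than the baseline's, via a simple permutation test."""
    rng = random.Random(seed)
    observed_gap = (statistics.mean(baseline_scores)
                    - statistics.mean(candidate_scores))
    pooled = baseline_scores + candidate_scores
    # Under the null hypothesis the two runs are exchangeable, so shuffle
    # the pooled scores and count how often a gap this large appears by chance.
    count = 0
    for _ in range(n_resamples):
        shuffled = rng.sample(pooled, len(pooled))
        gap = (statistics.mean(shuffled[:len(baseline_scores)])
               - statistics.mean(shuffled[len(baseline_scores):]))
        if gap >= observed_gap:
            count += 1
    p_value = count / n_resamples
    return observed_gap > 0 and p_value < alpha
```

A bootstrap on the score distributions, or a paired test when the same items appear in both runs, would work just as well; the point is to compare against a stored baseline rather than an absolute threshold.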
```python
# The core evaluation loop is surprisingly simple
async def evaluate_batch(model, dataset, scorers):
    results = []
    for item in dataset:
        response = await model.generate(item.prompt)
        scores = {s.name: s.score(response, item.expected) for s in scorers}
        results.append(EvalResult(item=item, response=response, scores=scores))
    return EvalReport(results)
```
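The `scorers` the loop iterates over are just objects with a `name` and a `score(response, expected)` method. Here is a hedged sketch of two of them; the class names are mine, and the token-overlap scorer is a deliberately crude stand-in for the semantic-similarity scorer (a real pipeline would use embedding cosine similarity).

```python
class ExactMatchScorer:
    name = "exact_match"

    def score(self, response: str, expected: str) -> float:
        # Normalize whitespace and case before comparing.
        return float(response.strip().lower() == expected.strip().lower())

class TokenOverlapScorer:
    # Jaccard overlap of tokens: a cheap proxy for semantic similarity,
    # shown here only to illustrate the scorer interface.
    name = "token_overlap"

    def score(self, response: str, expected: str) -> float:
        a = set(response.lower().split())
        b = set(expected.lower().split())
        return len(a & b) / len(a | b) if a | b else 1.0
```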
Key Decisions
Why not use an existing framework? Most assume you’re evaluating a model in isolation. We needed to evaluate a model within our application, including retrieval, prompt construction, and post-processing.
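Evaluating the model "within our application" amounts to putting the whole request path behind the same `generate()` interface the loop expects. A minimal sketch, with all names illustrative rather than taken from the post:

```python
class ApplicationUnderTest:
    """Wraps the full request path -- retrieval, prompt construction,
    model call, post-processing -- behind a single generate() method,
    so the evaluation loop exercises the system end to end."""

    def __init__(self, retriever, llm_client, postprocess):
        self.retriever = retriever
        self.llm_client = llm_client
        self.postprocess = postprocess

    async def generate(self, prompt: str) -> str:
        docs = self.retriever(prompt)                    # retrieval
        full_prompt = "\n".join(docs) + "\n\n" + prompt  # prompt construction
        raw = await self.llm_client(full_prompt)         # model call
        return self.postprocess(raw)                     # post-processing
```

This is what lets a regression in, say, the retriever show up in the evaluation scores even when the model itself is unchanged.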
LLM-as-judge calibration. The trick is using a stronger model as judge and regularly auditing its decisions against human labels. We found GPT-4o agreed with human reviewers 87% of the time on our specific tasks.
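The audit itself can be as simple as tracking the raw agreement rate between judge verdicts and human labels on a shared sample. A sketch (the function name is mine; the post does not show this code):

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of audited items where the LLM judge's verdict matches
    the human reviewer's. Raw agreement can be inflated by class
    imbalance, so a chance-corrected metric like Cohen's kappa is
    worth tracking alongside it."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```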
Results
- Caught 3 regressions before they hit production in the first month
- Evaluation suite runs in under 4 minutes on CI
- Zero false positives after calibration (though we review monthly)
What I’d Do Differently
Start with fewer, higher-quality test cases. We initially over-indexed on coverage and ended up with noisy signals. Twenty carefully chosen examples beat two hundred scraped ones.