Building LLM Evaluation Pipelines That Actually Work
Most LLM evaluation frameworks are over-engineered. Here's how I built a lightweight pipeline that caught real regressions in production.
The Problem
Every LLM evaluation framework I tried wanted me to adopt an entire ecosystem. I needed something simpler: a pipeline that could run against our staging environment, flag regressions, and integrate with our existing CI.
What I Built
The pipeline has three stages:
- Golden dataset generation: curated input/output pairs from production logs.
- Automated scoring: a combination of exact match, semantic similarity, and LLM-as-judge.
- Regression detection: statistical comparison against the baseline.
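The third stage, regression detection, can be sketched as a permutation test on the mean score gap between the baseline run and the candidate run. This is an illustrative implementation, not the post's actual code; the function name, resample count, and threshold are my own choices.

```python
import random
import statistics

def detect_regression(baseline_scores, candidate_scores,
                      n_resamples=2000, alpha=0.05, seed=0):
    """Flag a regression if the candidate's mean score is significantly
    lower than the baseline's, via a simple permutation test."""
    rng = random.Random(seed)
    observed_gap = (statistics.mean(baseline_scores)
                    - statistics.mean(candidate_scores))
    pooled = baseline_scores + candidate_scores
    # Under the null hypothesis the two runs are exchangeable, so shuffle
    # the pooled scores and count how often a gap this large appears by chance.
    count = 0
    for _ in range(n_resamples):
        shuffled = rng.sample(pooled, len(pooled))
        gap = (statistics.mean(shuffled[:len(baseline_scores)])
               - statistics.mean(shuffled[len(baseline_scores):]))
        if gap >= observed_gap:
            count += 1
    p_value = count / n_resamples
    return observed_gap > 0 and p_value < alpha
```

A bootstrap on the score distributions, or a paired test when the same items appear in both runs, would work just as well; the point is to compare against a stored baseline rather than an absolute threshold.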
```python
# The core evaluation loop is surprisingly simple
async def evaluate_batch(model, dataset, scorers):
    results = []
    for item in dataset:
        response = await model.generate(item.prompt)
        scores = {s.name: s.score(response, item.expected) for s in scorers}
        results.append(EvalResult(item=item, response=response, scores=scores))
    return EvalReport(results)
```
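The `scorers` the loop iterates over are just objects with a `name` and a `score(response, expected)` method. Here is a hedged sketch of two of them; the class names are mine, and the token-overlap scorer is a deliberately crude stand-in for the semantic-similarity scorer (a real pipeline would use embedding cosine similarity).

```python
class ExactMatchScorer:
    name = "exact_match"

    def score(self, response: str, expected: str) -> float:
        # Normalize whitespace and case before comparing.
        return float(response.strip().lower() == expected.strip().lower())

class TokenOverlapScorer:
    # Jaccard overlap of tokens: a cheap proxy for semantic similarity,
    # shown here only to illustrate the scorer interface.
    name = "token_overlap"

    def score(self, response: str, expected: str) -> float:
        a = set(response.lower().split())
        b = set(expected.lower().split())
        return len(a & b) / len(a | b) if a | b else 1.0
```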
Key Decisions
Why not use an existing framework? Most assume you’re evaluating a model in isolation. We needed to evaluate a model within our application, including retrieval, prompt construction, and post-processing.
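Evaluating the model "within our application" amounts to putting the whole request path behind the same `generate()` interface the loop expects. A minimal sketch, with all names illustrative rather than taken from the post:

```python
class ApplicationUnderTest:
    """Wraps the full request path -- retrieval, prompt construction,
    model call, post-processing -- behind a single generate() method,
    so the evaluation loop exercises the system end to end."""

    def __init__(self, retriever, llm_client, postprocess):
        self.retriever = retriever
        self.llm_client = llm_client
        self.postprocess = postprocess

    async def generate(self, prompt: str) -> str:
        docs = self.retriever(prompt)                    # retrieval
        full_prompt = "\n".join(docs) + "\n\n" + prompt  # prompt construction
        raw = await self.llm_client(full_prompt)         # model call
        return self.postprocess(raw)                     # post-processing
```

This is what lets a regression in, say, the retriever show up in the evaluation scores even when the model itself is unchanged.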
LLM-as-judge calibration. The trick is using a stronger model as judge and regularly auditing its decisions against human labels. We found GPT-4o agreed with human reviewers 87% of the time on our specific tasks.
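The audit itself can be as simple as tracking the raw agreement rate between judge verdicts and human labels on a shared sample. A sketch (the function name is mine; the post does not show this code):

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of audited items where the LLM judge's verdict matches
    the human reviewer's. Raw agreement can be inflated by class
    imbalance, so a chance-corrected metric like Cohen's kappa is
    worth tracking alongside it."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```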
Results
- Caught 3 regressions before they hit production in the first month
- Evaluation suite runs in under 4 minutes on CI
- Zero false positives after calibration (though we review monthly)
What I’d Do Differently
Start with fewer, higher-quality test cases. We initially over-indexed on coverage and ended up with noisy signals. Twenty carefully chosen examples beat two hundred scraped ones.