
LLM Evals and Grading: Building Production-Grade Evaluation Infrastructure

Most production LLM workloads ship without systematic evaluation. Quality drift happens silently; regressions surface in customer complaints. The eval stack that catches problems before users do isn't optional for serious production deployment.

By Andy Gaber, Founder, Digital Dashboard Hub

If you've shipped any LLM workload to production, you've experienced the awkward moment: a prompt change that 'felt better' in the 5 hand-tested examples broke 30% of production queries you didn't think to test. Or a model upgrade that seemed identical degraded a specific task quality you only noticed when customers complained. Both are symptoms of missing evaluation infrastructure — vibes-based quality assessment doesn't survive scale, model changes, or prompt iteration.

Below: the 5 components of a production-grade LLM eval stack, the techniques for each, when to build vs. buy each layer, and the canonical reference implementations. Sources include OpenAI Evals framework on GitHub, Anthropic's evals documentation, Stanford HELM benchmark framework, LangChain's evaluation guide, Promptfoo open-source eval framework at promptfoo.dev, Zheng et al. 2023 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' (arXiv:2306.05685), and the Eleuther AI lm-evaluation-harness on GitHub.

5 eval-stack components — what each catches

| Component | Catches | Cost to build | Where it lives |
| --- | --- | --- | --- |
| Golden datasets | Distribution shift, edge cases | 1-2 weeks per workload | Version-controlled with code |
| Rubric-based grading | Subjective quality dimensions | 1 week to define + calibrate | Eval framework (Promptfoo, custom) |
| Automated metrics | Objective regressions (schema, accuracy) | 3-5 days per workload | CI/CD pipeline |
| A/B testing | Confident prompt + model changes | 1-2 weeks initial setup | Promptfoo / Helicone / Phoenix |
| Production monitoring | Drift, prompt injection, provider regression | 1 week + ongoing maintenance | Observability platform |

Reference implementations and frameworks: [OpenAI Evals](https://github.com/openai/evals), [Eleuther AI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [Promptfoo](https://www.promptfoo.dev/), [Phoenix](https://docs.arize.com/phoenix), [Helicone](https://www.helicone.ai/), [LangSmith](https://www.langchain.com/langsmith), [Stanford HELM](https://crfm.stanford.edu/helm/latest/). Pick based on existing infrastructure compatibility.

Component 1 — Golden datasets (the foundation)

**What it is:** A curated set of representative inputs + expected-correct outputs for each production workload. The 'ground truth' against which you measure model + prompt changes. Typically 50-500 examples per workload depending on complexity.

**How to build:** Start by sampling 100 real production queries. For each, generate the expected-correct output (manually or via your best current process). Label edge cases, common mistakes, and tricky scenarios explicitly. Maintain in version control like code — golden datasets evolve as workloads evolve.
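A minimal sketch of how such a dataset could be stored, assuming one JSON object per line (JSONL) so that changes diff cleanly in version control. The `GoldenExample` record and its field names are hypothetical, not a standard schema from any of the frameworks named above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenExample:
    """One golden-dataset record. Field names are illustrative assumptions."""
    example_id: str
    input_text: str
    expected_output: str
    tags: list[str] = field(default_factory=list)  # e.g. "edge_case", "common_mistake"


def dump_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize one JSON object per line so version-control diffs stay readable."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def load_jsonl(text: str) -> list[GoldenExample]:
    """Parse JSONL text back into records, skipping blank lines."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


def by_tag(examples: list[GoldenExample], tag: str) -> list[GoldenExample]:
    """Select the examples carrying a given label, such as the tagged edge cases."""
    return [e for e in examples if tag in e.tags]
```

Tagging edge cases in the record itself makes it possible to report scores separately for the tricky subset, which an overall average can otherwise hide.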

**Common mistake:** Treating golden datasets as static. Production query distribution shifts; your golden set should be refreshed quarterly to reflect current real-world inputs. Stale golden datasets produce evaluation that doesn't predict production quality.

**Reference:** Stanford's HELM framework at crfm.stanford.edu/helm is the academic gold standard for systematic evaluation methodology; for production-tactical guidance, LangChain's evaluation guide is a starting point.


Component 2 — Rubric-based grading (for subjective quality)

**What it is:** Multi-dimensional scoring of LLM outputs against explicit criteria. Each dimension scores 1-5; total score is a weighted sum. Captures the texture of 'good' that automated metrics can't.

**Dimensions to include:** Vary by workload. Common: factual accuracy, format adherence, audience-appropriateness, completeness, specificity. For creative outputs: coherence, voice, novelty.
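The weighted scoring described above can be sketched in a few lines. This version divides by the total weight so the result stays on the same 1-5 scale as the individual dimensions; that normalization, the function name, and the example weights are assumptions for illustration.

```python
def weighted_rubric_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Return the weighted mean of 1-5 dimension scores (result is also on a 1-5 scale)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside the 1-5 range")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# Hypothetical weights for an extraction-style workload.
weights = {"factual_accuracy": 0.4, "format_adherence": 0.2, "completeness": 0.2, "specificity": 0.2}
scores = {"factual_accuracy": 5, "format_adherence": 4, "completeness": 3, "specificity": 4}
overall = weighted_rubric_score(scores, weights)
```

Keeping the per-dimension scores alongside the total is usually worth the storage: a flat overall score can mask one dimension improving while another regresses.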

**Who grades:** Three options. (1) Human raters — gold standard but expensive ($5-15 per output rated). (2) LLM-as-judge — cheap and fast but introduces evaluator biases. (3) Hybrid — humans calibrate the LLM-judge on a sample; LLM-judge scales the rest.

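**LLM-as-judge research:** Per Zheng et al. 2023, 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' (arXiv:2306.05685), GPT-4-class judges agree with human ratings at a rate of ~85%, comparable to human-human agreement on subjective tasks. The same paper documents judge biases, including position bias, verbosity bias, and self-enhancement bias, which is why the hybrid calibration step above matters. Strong frontier-model judges are now the cost-effective default for high-volume eval.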
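To make the calibration step concrete, here is a rough sketch of measuring how often an LLM judge matches human labels on a shared sample. Raw agreement is the number discussed above; Cohen's kappa is an addition not mentioned in this article, included because raw agreement looks inflated when one label dominates. The function names and labels are hypothetical.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Ten hypothetical calibration examples; the judge disagrees on one of them.
human_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
raw = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

With a calibration sample this small, a single disagreement moves the agreement rate by ten points, so treat a threshold check on a few dozen examples as a sanity test, not a precise measurement.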


Component 3 — Automated metrics (for objective dimensions)

**What it is:** Programmatic checks that don't require LLM-as-judge or human review. Schema validation, exact-match scoring, classification accuracy, BLEU/ROUGE scores for similarity-based tasks.

**Where they apply:** Structured extraction (does the output match the schema? are required fields populated correctly?), classification (precision/recall/F1), code generation (does it compile? do tests pass?), translation/paraphrasing (similarity scores against reference).

**Why they matter:** Free to compute at scale. Use them as the first-line filter — outputs that fail automated metrics never need expensive human/LLM evaluation; outputs that pass move to subjective rubric grading.
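As an illustration of these first-line checks, the sketch below uses only the Python standard library. The required-field mapping and the normalization rule for exact match are assumptions; a real workload would likely use a fuller schema validator and its own matching rules.

```python
import json


def passes_schema(raw_output: str, required: dict) -> bool:
    """True if the output is a JSON object whose required fields have the expected types."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(name), types) for name, types in required.items())


def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after collapsing whitespace and ignoring case."""
    def normalize(s: str) -> str:
        return " ".join(s.split()).casefold()
    return normalize(prediction) == normalize(reference)


def precision_recall_f1(predicted: list[str], actual: list[str], positive: str):
    """Precision, recall, and F1 for one class; each is 0.0 when its denominator is zero."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical schema for an invoice-extraction workload.
INVOICE_FIELDS = {"invoice_id": str, "total": (int, float)}
```

Running `passes_schema` before any rubric grading implements the filter described above: malformed outputs are counted as failures without spending judge or rater budget on them.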

**Reference:** Eleuther AI's lm-evaluation-harness at github.com/EleutherAI/lm-evaluation-harness is the most-used open-source framework for automated LLM benchmark evaluation.


Component 4 — A/B testing infrastructure (for prompt + model changes)

**What it is:** Running two versions of a prompt or two model configurations against the same evaluation set, with statistical comparison of outputs. The mechanism for confidently deploying changes.

**How it works:** For each example in the eval set, generate output from version A and version B. Use temperature 0 to reduce run-to-run variance (many hosted models are still not fully deterministic at temperature 0), or draw N samples per version and average their scores. Score both via your rubric or automated metrics. Use a paired statistical test (Wilcoxon signed-rank for ordinal scores, McNemar's for binary correctness) to determine whether the difference is significant.
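For the binary-correctness case, McNemar's test is small enough to sketch with the standard library. This is the exact (binomial) form of the test, which only looks at the examples where the two versions disagree; the function name and the sample counts are hypothetical.

```python
from math import comb


def mcnemar_exact(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results on the same examples."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both versions must be scored on the same examples")
    only_a = sum(a and not b for a, b in zip(a_correct, b_correct))
    only_b = sum(b and not a for a, b in zip(a_correct, b_correct))
    discordant = only_a + only_b
    if discordant == 0:
        return 1.0  # the versions never disagree, so there is no evidence of a difference
    smaller = min(only_a, only_b)
    # Under the null hypothesis each disagreement favors A or B with probability 0.5.
    tail = sum(comb(discordant, i) for i in range(smaller + 1)) / 2**discordant
    return min(1.0, 2 * tail)


# Hypothetical run over 50 examples: 30 both correct, 2 only A, 12 only B, 6 both wrong.
a_results = [True] * 30 + [True] * 2 + [False] * 12 + [False] * 6
b_results = [True] * 30 + [False] * 2 + [True] * 12 + [False] * 6
p_value = mcnemar_exact(a_results, b_results)
```

For ordinal rubric scores, `scipy.stats.wilcoxon` provides the signed-rank test on the paired score differences, if SciPy is available in your environment.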

**Sample size matters:** Per general A/B testing statistics, detecting a 5% quality difference with 95% confidence requires roughly 200-400 examples per variant for typical effect sizes. Smaller eval sets only detect larger differences.
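The rule of thumb above can be sanity-checked with a standard normal-approximation formula for a paired comparison. This is a rough sketch under stated assumptions: scores are rescaled to 0-1, the test is two-sided at 95% confidence, and 80% power is assumed (the article does not specify power). The standard deviation of the per-example score differences is something you would estimate from a pilot run; the values used here are placeholders.

```python
from math import ceil
from statistics import NormalDist


def paired_sample_size(min_diff: float, sd_of_diffs: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of paired examples needed to detect a mean score difference."""
    if min_diff <= 0 or sd_of_diffs <= 0:
        raise ValueError("min_diff and sd_of_diffs must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(((z_alpha + z_beta) * sd_of_diffs / min_diff) ** 2)


# Placeholder spreads for the per-example difference between version A and version B.
n_low_variance = paired_sample_size(min_diff=0.05, sd_of_diffs=0.25)
n_high_variance = paired_sample_size(min_diff=0.05, sd_of_diffs=0.35)
```

Under these assumptions the two placeholder spreads land at roughly the low and high ends of the range quoted above. The required count grows with the square of the spread, so noisier tasks need disproportionately more examples.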

**Reference implementation:** Promptfoo at promptfoo.dev is the most popular open-source LLM A/B testing framework as of 2026. Helicone at helicone.ai and Phoenix at phoenix.arize.com provide production observability + eval infrastructure.


Component 5 — Production monitoring (for drift detection)

**What it is:** Continuous evaluation of a sample of production outputs to catch quality drift. Eval infrastructure isn't only for pre-deployment testing; it's for production monitoring.

**How it works:** Sample 1-5% of production outputs daily. Run them through the same eval rubric you use for pre-deployment. Track quality scores over time. Alert on degradation.
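One way to sketch the sampling and alerting loop, assuming each production request has a stable ID and that daily mean rubric scores are already being recorded. Hash-based sampling is a design choice made here so the same request is always in or out of the sample; the function names, window, and thresholds are hypothetical.

```python
import hashlib
from statistics import mean


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < rate


def degraded(daily_scores: list[float], baseline: float,
             max_drop: float, window: int = 7) -> bool:
    """True when the mean of the last `window` daily scores is more than `max_drop` below baseline."""
    if len(daily_scores) < window:
        return False  # not enough history yet to judge a trend
    return baseline - mean(daily_scores[-window:]) > max_drop


# Hypothetical history: three stable days followed by a week of lower scores.
history = [4.2, 4.2, 4.2] + [3.8] * 7
should_alert = degraded(history, baseline=4.2, max_drop=0.2)
```

Averaging over a window instead of alerting on a single day trades detection speed for fewer false alarms; the right window depends on how many outputs the daily sample contains.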

**What causes drift:** Model provider updates (silent improvements or regressions), input distribution shift (your users' query patterns change), prompt-injection attacks, accumulated edge cases.

**Reference frameworks:** Phoenix open-source at phoenix.arize.com, Helicone at helicone.ai, LangSmith from LangChain — all provide production-grade eval + monitoring infrastructure with different feature trade-offs.

Shipping LLM workloads without systematic eval: vibes-based testing, regressions discovered in customer complaints, no confident model upgrades, prompt iteration based on gut feel.
Production eval stack (golden datasets + rubric + automated metrics + A/B + monitoring): regressions caught pre-deployment, model upgrades confident, prompt changes data-driven, quality drift detected at 1-2% impact instead of 30% impact at the customer-complaint stage.

Build the eval stack in production (4 steps)

1. Build golden datasets for your top 3 production workloads

   Sample 100 real queries per workload. Generate expected-correct outputs (manual or best-current-process). Tag edge cases. Maintain in version control. 1-2 weeks of work; the foundation for everything else.

2. Implement automated metrics for the objective dimensions

   Schema validation for structured outputs. Exact-match or fuzzy-match for extractive tasks. Precision/recall/F1 for classification. Compile + test pass for code generation. Per Eleuther AI's harness, these scale to thousands of examples cheaply.

3. Set up LLM-as-judge for subjective dimensions

   Define multi-dimensional rubric (4-7 dimensions per workload). Use frontier model (Claude Opus class, GPT-4 class) as judge. Calibrate against human ratings on 20-30 examples; verify judge agreement >85% before trusting at scale. Per Zheng et al. 2023 MT-Bench paper, frontier judges approach human-rater agreement on subjective tasks.

4. Wire continuous production monitoring + alerting

   Sample 1-5% of production outputs. Run through eval rubric daily. Alert on score degradation. Use Phoenix open-source or Helicone or build with LangSmith — pick based on your existing infrastructure stack.

Where to start building the eval stack

If you have no eval infrastructure today: Start with golden datasets — they're the foundation everything else builds on. Sample 100 real production queries per workload, generate expected-correct outputs, version control. 1-2 weeks; high ROI.

If you have golden datasets but no grading framework: Add rubric-based grading next. Define 4-7 dimensions per workload. Use frontier-model LLM-as-judge per Zheng 2023 MT-Bench for cost-effective subjective scoring.

If you have eval but no production monitoring: Wire continuous monitoring via Phoenix, Helicone, or LangSmith. 1-5% production sampling daily catches drift before users do.

If you want to A/B test prompt changes: Promptfoo at promptfoo.dev is the canonical open-source A/B framework. Run version A vs. version B across the eval set with paired statistical tests. Use the Code Prompt Builder to structure the prompts under test.

Frequently Asked Questions

What is an LLM eval and why do I need one?

Eval = systematic evaluation infrastructure that measures LLM output quality against expected results. Without it, you're shipping based on vibes — regressions only surface when customers complain. The 5-component stack (golden datasets, rubric grading, automated metrics, A/B testing, production monitoring) is the canonical setup. Reference frameworks: OpenAI Evals, Promptfoo, Eleuther AI lm-evaluation-harness, Stanford HELM.

Should I use LLM-as-judge or human raters?

Per Zheng et al. 2023 MT-Bench paper (arXiv:2306.05685), frontier-model LLM-as-judge (GPT-4 class, Claude Opus class) agrees with human ratings at ~85% rate — comparable to human-human agreement on subjective tasks. Use LLM-as-judge for high-volume eval at $0.05-0.20 per output graded. Use humans to calibrate the judge on a 20-30 example sample; periodically re-calibrate (quarterly) to catch judge drift.

How many examples do I need in a golden dataset?

Minimum 50 per workload for noise reduction; 200-400 if you want to detect 5% quality differences with statistical confidence; 1000+ for high-stakes production workloads or model-comparison studies. Per Stanford HELM methodology, the right number depends on the variance in your quality scores — high-variance tasks need more examples.

Which open-source eval framework should I use?

Depends on the layer. Promptfoo for prompt A/B testing — most popular, easy setup. Eleuther AI lm-evaluation-harness for academic benchmark evaluation. OpenAI Evals for OpenAI-aligned evaluation patterns. Phoenix for production observability + eval. Most production teams use 2-3 of these together for different layers.

How often should I run evals?

Pre-deployment: every prompt change, model swap, or system architecture change. Continuous: 1-5% production sampling daily for drift detection. Full eval set: weekly or after major changes. Golden dataset refresh: quarterly to keep current with production distribution shifts.

What's the cost of building eval infrastructure?

1-3 weeks of engineering time per workload for the initial stack (golden dataset + automated metrics + LLM-as-judge + monitoring hookup). Ongoing maintenance: ~5-10% of engineering capacity. The ROI: every model upgrade and prompt change becomes data-driven instead of vibes-driven, regressions are caught at 1-2% impact instead of at the 30% customer-complaint stage, and model upgrades become confident instead of risky.

Build the eval stack before shipping the next prompt change.
