The 4 components of a production-grade eval set
**Component 1 — Golden examples (30-50% of the eval set).** Known-good inputs with verified-correct expected outputs. Per Braintrust at braintrust.dev, these are the baseline correctness checks. They should cover the common-case workflow paths.
**Component 2 — Edge cases (20-30%).** Inputs that historically caused issues or that probe specific failure modes: empty input, very long input, ambiguous input, multi-step reasoning, and unusual punctuation or encoding. Per Anthropic's evaluation guidance at docs.anthropic.com, edge cases surface the failure modes that regression testing is meant to catch.
**Component 3 — Regression catches (20-30%).** Inputs that broke previous versions. Each production incident or quality regression should add an example to this category. Per LangSmith at docs.langchain.com, this category grows over time and becomes irreplaceable institutional knowledge: 'we will never re-ship this specific failure'.
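One way to make "each incident adds an example" routine is a small helper that appends the failing input to a file of regression cases. The sketch below is only an illustration under stated assumptions: the JSONL layout, the field names (`category`, `incident_id`, `input`, `expected`, `added`), and the function name `add_regression_case` are all hypothetical and not taken from LangSmith or any other tool.

```python
import json
from datetime import date
from pathlib import Path


def add_regression_case(path, incident_id, model_input, expected, today=None):
    """Append one regression case to a JSONL file; return False if it is a duplicate."""
    path = Path(path)
    existing = []
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = [json.loads(line) for line in f if line.strip()]

    # Skip inputs that are already recorded, so the set grows only with new failures.
    if any(case["input"] == model_input for case in existing):
        return False

    case = {
        "category": "regression",
        "incident_id": incident_id,  # hypothetical tracker ID, e.g. "INC-1"
        "input": model_input,        # the input that broke a previous version
        "expected": expected,        # the verified-correct behaviour
        "added": (today or date.today()).isoformat(),
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return True
```

Keeping the incident ID next to each case preserves the link between the example and the failure that motivated it, which is what makes the category useful as institutional knowledge.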
**Component 4 — Adversarial cases (10-20%).** Prompt injection attempts, jailbreak attempts, and content-policy edge cases. Per arXiv research on LLM red-teaming at arxiv.org, adversarial coverage is essential for any user-facing LLM system. The set should grow as new attack patterns emerge.
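The percentage ranges for the four components can be checked mechanically. The following is a minimal sketch, assuming each eval case carries one of four category labels; the label strings and the names `TARGET_SHARES` and `composition_report` are illustrative choices, and only the ranges themselves come from the text above.

```python
from collections import Counter

# Target share of the eval set per category as (low, high) fractions,
# copied from the ranges stated for the four components.
TARGET_SHARES = {
    "golden": (0.30, 0.50),
    "edge": (0.20, 0.30),
    "regression": (0.20, 0.30),
    "adversarial": (0.10, 0.20),
}


def composition_report(categories):
    """Return {category: (share, in_range)} for a list of category labels."""
    total = len(categories)
    if total == 0:
        raise ValueError("eval set is empty")

    counts = Counter(categories)
    unknown = set(counts) - set(TARGET_SHARES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")

    report = {}
    for name, (low, high) in TARGET_SHARES.items():
        share = counts[name] / total  # Counter returns 0 for missing labels
        report[name] = (share, low <= share <= high)
    return report


if __name__ == "__main__":
    # Illustrative label counts, not real data.
    labels = ["golden"] * 40 + ["edge"] * 25 + ["regression"] * 20 + ["adversarial"] * 15
    for name, (share, ok) in composition_report(labels).items():
        print(f"{name:12s} {share:6.1%} {'ok' if ok else 'out of range'}")
```

A check like this could run whenever cases are added, so that a growing regression or adversarial category does not quietly crowd out the golden examples.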