Component 1 — Golden datasets (the foundation)
**What it is:** A curated set of representative inputs paired with expected-correct outputs for each production workload. It is the 'ground truth' against which you measure the effect of model and prompt changes. Typically 50-500 examples per workload, depending on complexity.
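To make the idea concrete, here is a minimal sketch of how a golden example and a score against it might be represented. The names `GoldenExample`, `pass_rate`, and `run_model` are hypothetical, and normalized exact match is only one assumed way to score; many workloads need a rubric or a graded comparison instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenExample:
    """One curated input paired with its expected-correct output (hypothetical schema)."""
    example_id: str
    input_text: str
    expected_output: str


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not count as failures.
    return " ".join(text.lower().split())


def pass_rate(examples: List[GoldenExample], run_model: Callable[[str], str]) -> float:
    """Return the fraction of golden examples whose model output matches the expected output.

    `run_model` stands in for whatever model + prompt combination is under test.
    """
    if not examples:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1
        for ex in examples
        if _normalize(run_model(ex.input_text)) == _normalize(ex.expected_output)
    )
    return passed / len(examples)
```

Running `pass_rate` before and after a model or prompt change, on the same examples, gives a like-for-like comparison.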
**How to build:** Start by sampling 100 real production queries. For each, produce the expected-correct output, either manually or via your best current process. Explicitly label edge cases, common mistakes, and tricky scenarios. Keep the dataset in version control, as you would code, because golden datasets evolve as workloads evolve.
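The build steps above can be sketched as follows, assuming production queries are already available as a list of strings. The function names, the record fields (`input`, `expected_output`, `tags`), and the choice of JSONL are illustrative assumptions, picked because one record per line diffs cleanly in version control.

```python
import json
import random
from typing import Dict, List


def sample_queries(production_queries: List[str], k: int = 100, seed: int = 0) -> List[str]:
    """Draw up to k distinct queries; a fixed seed keeps the sample reproducible."""
    unique = sorted(set(production_queries))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


def make_record(query: str, expected_output: str, tags: List[str]) -> Dict:
    """Build one golden record; tags carry labels such as 'edge_case' (hypothetical names)."""
    return {"input": query, "expected_output": expected_output, "tags": sorted(tags)}


def to_jsonl(records: List[Dict]) -> str:
    """Serialize one JSON object per line with sorted keys, so diffs stay small and stable."""
    return "".join(
        json.dumps(record, sort_keys=True, ensure_ascii=False) + "\n" for record in records
    )
```

The resulting text can be written to a file in the repository and reviewed like any other change.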
**Common mistake:** Treating golden datasets as static. The production query distribution shifts over time, so refresh your golden set quarterly to reflect current real-world inputs. A stale golden dataset produces evaluation results that no longer predict production quality.
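One way to catch staleness early is to check both the age of the golden set and how far its mix of query types has drifted from recent production traffic. The sketch below assumes each query carries a category label and treats 90 days as an approximation of "quarterly"; the function names and the use of total variation distance are illustrative choices, not a prescribed method.

```python
from collections import Counter
from datetime import date
from typing import Iterable


def is_stale(last_refreshed: date, today: date, max_age_days: int = 90) -> bool:
    """True when the golden set is older than max_age_days (90 approximates one quarter)."""
    return (today - last_refreshed).days > max_age_days


def total_variation_distance(
    golden_labels: Iterable[str], production_labels: Iterable[str]
) -> float:
    """Distance between two category distributions: 0.0 means identical, 1.0 means disjoint."""
    golden = Counter(golden_labels)
    production = Counter(production_labels)
    golden_total = sum(golden.values())
    production_total = sum(production.values())
    if golden_total == 0 or production_total == 0:
        raise ValueError("both label collections must be non-empty")
    categories = set(golden) | set(production)
    # Counter returns 0 for missing categories, so unseen labels contribute their full share.
    return 0.5 * sum(
        abs(golden[c] / golden_total - production[c] / production_total)
        for c in categories
    )
```

A refresh could be triggered when either check fires; the distance threshold that counts as "too far" depends on the workload and has to be chosen by the team.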
**Reference:** Stanford's HELM framework at crfm.stanford.edu/helm is the academic gold standard for systematic evaluation methodology; for production-tactical guidance, LangChain's evaluation guide is a starting point.