1. Define what 'quality' means before writing a single eval
The first failure mode in LLM evaluation is measuring something generic, such as 'accuracy' or 'helpfulness', without grounding it in what your application actually needs. A customer support bot, a code assistant, and a medical summarizer each need a different definition of quality. Before touching any eval framework or benchmark, write down three to five dimensions your application must get right, each with a concrete failure example.
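One way to keep this written spec honest is to store it as data and check it mechanically. The sketch below is illustrative only: `Dimension`, `validate_spec`, and the support-bot entries are hypothetical names and examples invented here, not part of any eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One quality dimension plus a concrete example of getting it wrong."""
    name: str
    definition: str
    failure_example: str


def validate_spec(dimensions: list[Dimension]) -> list[str]:
    """Return the problems found in a spec; an empty list means it passes."""
    problems = []
    # The text asks for three to five dimensions per application.
    if not 3 <= len(dimensions) <= 5:
        problems.append(f"expected 3 to 5 dimensions, got {len(dimensions)}")
    names = [d.name for d in dimensions]
    if len(set(names)) != len(names):
        problems.append("dimension names must be unique")
    # Every dimension needs a concrete failure example, not just a label.
    for d in dimensions:
        if not d.failure_example.strip():
            problems.append(f"{d.name}: missing a concrete failure example")
    return problems


# Hypothetical spec for a RAG-based support bot (all content is made up).
SUPPORT_BOT_SPEC = [
    Dimension(
        name="factual_grounding",
        definition="Every claim is supported by the retrieved context.",
        failure_example="Says refunds take 60 days when the retrieved policy says 30.",
    ),
    Dimension(
        name="completeness",
        definition="Every part of the user's question is answered.",
        failure_example="User asks about price and shipping; only price is answered.",
    ),
    Dimension(
        name="format_adherence",
        definition="The reply follows the agreed response schema.",
        failure_example="Returns free text where a JSON object was required.",
    ),
]

print(validate_spec(SUPPORT_BOT_SPEC) or "spec looks complete")
```

Running a check like this in CI is one possible way to stop a dimension from being added without its failure example; the thresholds mirror the three-to-five guideline above and can be adjusted.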
Typical dimensions by application type: for RAG systems, measure factual grounding (is each claim supported by the retrieved context?), completeness (did it answer the whole question?), and format adherence (did it follow the schema?). For conversational assistants, measure helpfulness, safety, and tone. For code generation, measure correctness (does it run and produce the expected result?), efficiency, and security. For summarization, measure coverage, conciseness, and faithfulness. See Measuring Prompt Quality: A Practical Evaluation Guide for per-dimension rubric templates.
The key discipline: each dimension must be independently gradeable by a human or automated judge in under 30 seconds per sample. If a rater needs more than 30 seconds to assign a score, the dimension is too vague. Operationalize it with 2–3 example outputs at each score level before your evaluators see production data. This investment in spec clarity pays for itself many times over when you scale to automated judging, because a judge model can only be as consistent as the rubric it is given.
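As a rough sketch of what "2–3 example outputs at each score level" can look like in practice, the snippet below stores anchor examples per score and renders them into a grading prompt. Everything here is an assumption made for illustration: the `Rubric` class, the 1-to-3 scale, the prompt wording, and the example outputs are hypothetical, and the prompt is plain text that could be handed to a human rater or to whatever judge model you use.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    dimension: str
    question: str
    # Maps each score level to example outputs that deserve that score.
    anchors: dict[int, list[str]] = field(default_factory=dict)

    def problems(self, levels=(1, 2, 3)) -> list[str]:
        """List score levels that lack the 2-3 anchor examples the text asks for."""
        issues = []
        for level in levels:
            n = len(self.anchors.get(level, []))
            if not 2 <= n <= 3:
                issues.append(f"score {level}: expected 2-3 examples, got {n}")
        return issues

    def to_judge_prompt(self, output: str) -> str:
        """Render the rubric and one output into a grading prompt (plain text)."""
        lines = [f"Dimension: {self.dimension}", f"Question: {self.question}", ""]
        for level in sorted(self.anchors):
            for example in self.anchors[level]:
                lines.append(f"Score {level} example: {example}")
        lines += ["", f"Output to grade: {output}", "Reply with the score only."]
        return "\n".join(lines)


# Hypothetical rubric on an assumed 1 (fails) to 3 (passes) scale.
GROUNDING_RUBRIC = Rubric(
    dimension="factual_grounding",
    question="Is every claim supported by the retrieved context?",
    anchors={
        1: ["Invents a refund window not in the context.",
            "Cites a policy document that was never retrieved."],
        2: ["Main claim is supported, but one detail is unsupported.",
            "Correct answer with a speculative aside presented as fact."],
        3: ["Every statement maps to a passage in the context.",
            "Says the context does not cover the question and stops."],
    },
)

print(GROUNDING_RUBRIC.problems())  # empty list when every level has anchors
print(GROUNDING_RUBRIC.to_judge_prompt("Refunds are processed within 30 days."))
```

Keeping the anchors next to the question means human raters and an automated judge see the same examples, which makes disagreements easier to trace back to the rubric rather than to the grader.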