
The 7-Point Prompt Grading Rubric: Turn Prompt Iteration From Vibe to Measurable Comparison

Most prompt iteration happens by gut feel — read the output, decide it's better or worse, tweak again. The 7-point grading rubric replaces vibe with measurable dimensions. Same iteration loop, dramatically better outcomes.

By Andy Gaber, Founder, Digital Dashboard Hub

If you've iterated on a prompt for more than 10 minutes, you've probably hit the moment where you can't remember whether the current version is better than three versions ago. The output reads OK. The output before read OK. Neither one was clearly broken; neither was clearly excellent. You'd switch back to the previous version if you could remember exactly what was different. This is the universal failure mode of prompt iteration by vibe.

The fix is a grading rubric — a set of explicit dimensions you score each output against, on a 1–5 scale, summed into a comparable number. Same iteration loop, but now version 3 vs. version 6 is a comparison of 21 vs. 27 instead of 'they both read OK.' The rubric also surfaces WHICH dimension is weak, which tells you what to change in the next iteration. Vibe-based iteration tells you nothing about the next move; rubric-based iteration tells you exactly where to push.

Below: the 7-point rubric I use for the human-judged side of content generation, structured output, classification, and code generation workflows; how to score each dimension; the iteration protocol that uses the rubric; and four common mis-scoring patterns to avoid. The rubric isn't original: variants exist in academic prompt-engineering research (White et al. 2023, 'A Prompt Pattern Catalog', arXiv:2302.11382; Liu et al. 2023, NLP prompting survey, arXiv:2107.13586) and provider documentation (Anthropic's prompt engineering overview, OpenAI's prompt engineering best practices, and Google's prompting strategies for Gemini). The specific 7 dimensions and the 1–5 scoring approach have been refined through approximately 600 prompt iterations across 2024–2026.

**Research + further reading:** Additional authoritative sources informing this guide: LangChain at python.langchain.com, LlamaIndex at docs.llamaindex.ai, Pinecone at pinecone.io, Weaviate at weaviate.io, HuggingFace at huggingface.co. Cross-reference these for broader context, peer-reviewed research, and ongoing developments in this domain.

The 7 dimensions at a glance

| Dimension | What 5 looks like | What 1 looks like |
|---|---|---|
| 1. Specificity | Uses concrete details from input | Generic: fits any input |
| 2. Audience-appropriateness | Member of audience feels addressed precisely | Wrong expertise level / references |
| 3. Format adherence | Exactly matches requested structure | Ignores format requirements entirely |
| 4. Constraint compliance | Every constraint honored | Multiple constraints ignored |
| 5. Coherence | Tight logical flow paragraph-to-paragraph | Disconnected: reorderable without loss |
| 6. Insight / non-obviousness | Surfaces non-obvious connections | Most-common surface response |
| 7. Actionability | Reader can act immediately | Describes without pointing to action |

Total out of 35. Most production prompts cluster at 22–28 after 4–6 iteration cycles. A total of 30 or more is excellent; 20 or below means the output is breaking on multiple dimensions and needs structural revision, not tweaks. Further reading: [LangChain at python.langchain.com](https://python.langchain.com/), [LlamaIndex at docs.llamaindex.ai](https://docs.llamaindex.ai/), [Pinecone at pinecone.io](https://www.pinecone.io/learn/).

The 7 dimensions and what each scores

**Dimension 1 — Specificity (1–5).** Does the output address THIS specific case, or could it have been written for any similar input? Score 5 if the output uses concrete details from the input (audience, context, examples). Score 1 if the output is generic — would fit any input in the category. Most weak LLM outputs score 2–3 here; the lift from this dimension alone is large.

**Dimension 2 — Audience-appropriateness (1–5).** Do the language, examples, and level of complexity match the defined audience? Score 5 if a member of the target audience would feel addressed precisely. Score 1 if the output assumes the wrong expertise level or uses inappropriate references. Common failure: writing for 'beginners' but using jargon a beginner wouldn't recognize.

**Dimension 3 — Format adherence (1–5).** Does the output match the requested structure (length, sections, bullet vs. paragraph, headings, tone)? Score 5 if format is exactly what was specified. Score 1 if the output ignores format requirements entirely. The cheapest dimension to fix; LLMs respond well to specific format requirements when stated clearly.

**Dimension 4 — Constraint compliance (1–5).** Does the output respect explicit constraints (forbidden words, required inclusions, word count, etc.)? Score 5 if every constraint is honored. Score 1 if multiple constraints are ignored. Specific constraints (avoid words X, Y, Z) work better than vague constraints (avoid clichés); your prompt's constraint quality affects this score.

**Dimension 5 — Coherence (1–5).** Does the output hang together — paragraphs flow, claims connect, sections support each other? Score 5 for tight logical flow. Score 1 for disconnected paragraphs that could be reordered without losing meaning. Coherence drops in long outputs; weak prompts produce weak coherence at length.

**Dimension 6 — Insight / non-obviousness (1–5).** Does the output say something that wouldn't have been in the first response a competent human would write? Score 5 if the output surfaces non-obvious connections or analysis. Score 1 if it's the most-common surface-level response. The hardest dimension to score consistently because it's somewhat subjective; useful to score against 'would a competent expert in this domain learn anything from this output?'

**Dimension 7 — Actionability (1–5).** If the output is meant to drive action (recommendations, advice, decisions), is the action clear and executable? Score 5 if the reader can act immediately on the output. Score 1 if the output describes a situation without pointing to specific next moves. Many LLM outputs score high on description and low on actionability; the gap is fixable in the prompt.
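A minimal Python sketch of how one output's scorecard could be recorded, assuming only the seven dimensions and the 1–5 scale described above. The class name `RubricScore` and its field names are illustrative choices, not part of any library.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    """Scores for one output. Field names are illustrative assumptions."""

    specificity: int
    audience: int
    format_adherence: int
    constraint_compliance: int
    coherence: int
    insight: int
    actionability: int

    def __post_init__(self) -> None:
        # Every dimension is scored on the 1-5 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be an integer from 1 to 5, got {value!r}")

    def total(self) -> int:
        # Sum of the seven dimensions, so the possible range is 7 to 35.
        return sum(getattr(self, f.name) for f in fields(self))


score = RubricScore(
    specificity=4,
    audience=3,
    format_adherence=5,
    constraint_compliance=4,
    coherence=3,
    insight=2,
    actionability=3,
)
print(score.total())  # 24
```

Rejecting out-of-range values at construction time keeps a typo (a 0 or a 6) from silently skewing an average later.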


Scoring methodology that produces consistent results

**Rule 1 — Score each dimension independently.** Don't let your overall impression bias individual scores. Read the output, score Dimension 1 (Specificity) without looking at others; then 2, etc. Independent scoring captures the texture of where the output is strong and weak.

**Rule 2 — Use the full 1–5 range.** Most graders cluster around 3–4 because '5 feels too strong.' Force the use of 5 (genuinely excellent on that dimension) and 1 (genuinely failed). Without using the full range, the rubric loses its discriminating power.

**Rule 3 — Anchor scores to examples.** Write down what a 5 looks like for each dimension before scoring. 'Specificity 5 = mentions the audience's specific pain point from the input by name.' 'Specificity 1 = could be reused for any input in the category.' Anchoring stops drift across iterations.

**Rule 4 — Score across multiple outputs at once, not one at a time.** Run 10 outputs from version A and 10 from version B. Score all 20 in one session against the rubric. Comparison signal is stronger than absolute scoring because you can see the dimensions where one version consistently beats the other.

**Rule 5 — Compute total + dimension-specific deltas.** Total score (out of 35) is your headline metric. But also track per-dimension averages across the 10 outputs — if Version B is +0.4 on Specificity but -0.6 on Format, you've gained one thing and lost another, and the next iteration should preserve the gain while fixing the loss.
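Rule 5 can be sketched in a few lines of Python. This is an illustration under assumptions: the dimension keys, the function names, and the sample scores are all hypothetical, and only two outputs per version are shown to keep the example short (a real comparison would use 10).

```python
from statistics import mean

# Hypothetical short keys for the seven dimensions.
DIMENSIONS = [
    "specificity", "audience", "format", "constraints",
    "coherence", "insight", "actionability",
]


def dimension_averages(scores: list[dict[str, int]]) -> dict[str, float]:
    """Average each dimension across all scored outputs of one prompt version."""
    return {d: mean(s[d] for s in scores) for d in DIMENSIONS}


def average_total(scores: list[dict[str, int]]) -> float:
    """Headline metric: the average total (out of 35) across outputs."""
    return mean(sum(s[d] for d in DIMENSIONS) for s in scores)


def dimension_deltas(a: list[dict[str, int]], b: list[dict[str, int]]) -> dict[str, float]:
    """Per-dimension change going from version A to version B."""
    avg_a, avg_b = dimension_averages(a), dimension_averages(b)
    return {d: round(avg_b[d] - avg_a[d], 2) for d in DIMENSIONS}


# Made-up scores: version B gains on specificity and loses on format.
version_a = [
    dict(zip(DIMENSIONS, [3, 4, 5, 4, 4, 2, 3])),
    dict(zip(DIMENSIONS, [2, 4, 4, 4, 3, 3, 3])),
]
version_b = [
    dict(zip(DIMENSIONS, [4, 4, 4, 4, 4, 2, 3])),
    dict(zip(DIMENSIONS, [3, 4, 3, 4, 3, 3, 3])),
]

deltas = dimension_deltas(version_a, version_b)
print(average_total(version_a), average_total(version_b))  # 24 24
print(deltas["specificity"], deltas["format"])  # 1.0 -1.0
```

In this made-up data the headline totals are identical, yet the per-dimension deltas show a gain on Specificity paid for by a loss on Format. That is the trade the total alone would hide.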


The iteration protocol using the rubric

**Step 1 — Establish baseline.** Run your current prompt on 10 representative inputs. Score all 10 outputs against the rubric. Compute average per-dimension score and total. This is your baseline.

**Step 2 — Identify the weakest dimension.** Look at the per-dimension averages. The lowest score is the biggest opportunity. Don't try to lift everything; pick the dimension where the gap to a 4 average is largest.

**Step 3 — Make ONE targeted change to address that dimension.** If weak dimension is Format, add a more specific format instruction. If weak dimension is Specificity, add explicit instruction to use specific details from the input. Don't change multiple things at once — you won't know what helped.

**Step 4 — Re-run + re-score.** Run the new prompt on the same 10 inputs so the comparison is like for like (10 new inputs also work, but they add input variance on top of the prompt change). Score again. Compare per-dimension deltas. The targeted dimension should improve; other dimensions should be neutral. If any of the others degraded, you made a trade-off; decide if the trade is acceptable.

**Step 5 — Iterate until total exceeds quality threshold or you hit diminishing returns.** Most prompts iterate 4–8 cycles before hitting their realistic ceiling. After 8 cycles with marginal improvement, the prompt is probably near its ceiling and further iteration is diminishing-return work. Stop iterating; ship.
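Steps 2 and 5 of the protocol can be sketched as two small Python helpers. This is a hedged illustration: the function names are hypothetical, and the default `threshold` of 28 and the `min_gain` of 0.5 points are assumptions chosen for the example, since the protocol leaves the quality threshold and the meaning of "marginal improvement" to you.

```python
TARGET_AVERAGE = 4.0  # Step 2 measures each dimension's gap to a 4 average


def weakest_dimension(averages: dict[str, float]) -> tuple[str, float]:
    """Return the lowest-scoring dimension and its gap to the target average."""
    name = min(averages, key=averages.get)
    gap = round(max(0.0, TARGET_AVERAGE - averages[name]), 2)
    return name, gap


def should_stop(
    totals: list[float],
    threshold: float = 28.0,  # assumed quality threshold (out of 35)
    max_cycles: int = 8,      # cycle count named in Step 5
    min_gain: float = 0.5,    # assumed cut-off for "marginal improvement"
) -> bool:
    """totals[0] is the baseline average total; each later entry is one cycle."""
    if not totals:
        raise ValueError("need at least a baseline total")
    if totals[-1] >= threshold:
        return True  # quality threshold reached
    cycles = len(totals) - 1
    if cycles < max_cycles:
        return False  # still room to iterate
    # At or past max_cycles: stop once the latest cycle barely moved the total.
    return totals[-1] - totals[-2] < min_gain


baseline = {
    "specificity": 2.6, "audience": 3.8, "format": 4.4, "constraints": 4.1,
    "coherence": 3.5, "insight": 2.9, "actionability": 3.2,
}
print(weakest_dimension(baseline))         # ('specificity', 1.4)
print(should_stop([22.0, 23.5, 25.0]))     # False
print(should_stop([22.0, 24.5, 28.5]))     # True
```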


Four common mis-scoring patterns

**Pattern 1 — Halo effect from one strong dimension.** A response that's brilliantly specific tempts you to score every dimension high. Force yourself to score each dimension in isolation. A specifically brilliant response can still have weak coherence (Dimension 5) or weak actionability (Dimension 7); the rubric only works if you score them separately.

**Pattern 2 — Confirmation bias toward the version you wrote.** If you wrote the new prompt, you'll subconsciously want it to win. Mitigate by blinding scores when possible — have someone else label the outputs as 'A' or 'B' without telling you which version produced which, then score, then reveal.

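**Pattern 3 — Scoring at the wrong altitude.** The rubric is for evaluating prompt quality, not output quality on dimensions the prompt didn't ask about. If the prompt didn't request actionability, don't score Dimension 7. Only score dimensions the prompt was actually shaping toward, and keep the scored set fixed across versions: a total over 6 dimensions is out of 30, not 35, so it isn't comparable to a 7-dimension total.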

**Pattern 4 — Insufficient sample size.** Scoring 2 outputs gives you noise, not signal. 10 outputs is the minimum reliable sample size for noticing patterns. Below 10, individual output variance overwhelms the prompt-version variance you're trying to measure.
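The sample-size point can be illustrated with the standard error of a mean, which shrinks with the square root of the number of outputs scored. The 3-point standard deviation below is an assumed figure for the spread of totals between outputs of a single prompt, used only for illustration; measure your own spread before relying on these numbers.

```python
from math import sqrt


def standard_error(sd: float, n: int) -> float:
    """Standard error of an average computed from n independent outputs."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / sqrt(n)


ASSUMED_SD = 3.0  # hypothetical spread of totals across outputs of one prompt

for n in (2, 10, 20):
    print(n, round(standard_error(ASSUMED_SD, n), 2))
# 2 2.12
# 10 0.95
# 20 0.67
```

Under that assumption, an average over 2 outputs is uncertain by roughly 2 points, which is as large as many version-to-version differences. At 10 outputs the uncertainty drops below 1 point, and doubling to 20 buys a much smaller further reduction.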


When to use this rubric vs. other quality measures

**Use this rubric when:** iterating on a prompt for content generation, long-form writing, advice/recommendation outputs, or any task where output quality is subjectively assessed. The 7 dimensions cover the bulk of what 'good' means for these tasks.

**Don't use this rubric for:** structured extraction with binary correct/incorrect outcomes (use accuracy metrics), classification (use precision/recall/F1), code generation (use compilation + functional tests), or any task where automated metrics are stronger than human rubric scoring.

**Hybrid approach for production systems:** automated metrics where they exist; this rubric for the human-judgment dimensions automated metrics don't capture. Many production LLM workflows benefit from both — accuracy/precision metrics for what's measurable; rubric scores for craft and audience-fit dimensions.

(Note: variants of this rubric exist in academic prompt-engineering literature; nothing here is original framework. The specific 7-dimension cut and the 1–5 scoring approach are what worked across the workflows I've iterated on; other configurations work too.)

Iterating by vibe ('the output reads better now'): no memory of what changed, no signal on which dimension to push next, no comparison signal across iterations. The default approach and the slowest path to a good prompt.
Iterating with the 7-point rubric: explicit scores per dimension, clear next-iteration target (lowest-scoring dimension), comparison signal across versions, faster convergence to the prompt's quality ceiling.

Apply the rubric to a real prompt this week

  1. **Pick a prompt you've been iterating on for 3+ cycles.** A prompt you've worked on by feel for a while is the ideal first rubric application. You'll have a sense of whether the rubric's scores match your intuition; calibration helps the rubric become reliable for future work.

  2. **Run 10 outputs from your current version + score against the 7 dimensions.** Score each output independently across all 7 dimensions on the 1–5 scale. Total per output is out of 35. Average across the 10 outputs to get per-dimension averages and overall average. This is the baseline you'll compare future versions against.

  3. **Identify the weakest dimension and make ONE change to target it.** The lowest-scoring dimension is your biggest opportunity. Make a single targeted prompt change to address it. If weak on Format, strengthen the format instruction. If weak on Specificity, require explicit use of input details. One change only; multiple simultaneous changes obscure what helped.

  4. **Re-run + re-score + compare deltas.** Run the new prompt on 10 outputs. Score against the rubric. The targeted dimension should improve; other dimensions should be roughly stable. If a different dimension degraded, you made a trade; decide if it's worth keeping. Iterate 4–8 cycles maximum; further iteration produces diminishing returns.

Where to use the rubric this week

If you're iterating on a prompt right now: stop the next iteration, score the current version on 10 outputs, then decide what to change based on the weakest dimension. The change you'd make from rubric-scored data is usually different from the change vibe would suggest.

If you score and the prompt is at 22–28: you're in normal production territory. Pick the weakest dimension, iterate. Don't expect every prompt to push past 30 in total; some dimensions hit ceilings determined by the task itself, not the prompt.

If your prompt scores under 18: the issue is probably structural (wrong role / wrong format / missing audience), not tweakable. Rebuild from a structured template like the Blog Post Outline Generator and rescore — the structural rebuild often jumps the score 8–12 points in one move.

If you can't get past 28 after 8 iteration cycles: you're probably at the prompt's ceiling. The remaining gap may need a different technique (RAG to add information, fine-tuning to shape behavior, or a different model). Stop tweaking the prompt; address the upstream constraint.

Frequently Asked Questions

Why score prompts against a rubric instead of judging by gut?

Vibe-based iteration produces no memory of what changed, no signal on which dimension to push next, and no comparison across versions. After 5 cycles of vibe iteration you can't tell whether the current prompt is better than version 2. The rubric replaces vibe with measurable comparison — version 3 scoring 24 vs. version 6 scoring 29 is unambiguous, and the per-dimension averages tell you exactly which dimension to improve next. Same iteration loop, dramatically better outcomes.

What are the 7 dimensions?

Specificity (does the output address this specific case?), audience-appropriateness (matches the defined audience?), format adherence (matches the requested structure?), constraint compliance (respects explicit constraints?), coherence (paragraphs flow and connect?), insight/non-obviousness (surfaces non-obvious analysis?), and actionability (reader can act immediately?). Each on a 1–5 scale; total out of 35.

How many outputs do I need to score to get reliable signal?

10 is the minimum reliable sample size. Below 10, individual output variance overwhelms the prompt-version variance you're trying to measure. Scoring 2 outputs gives you noise. 10 outputs lets you compute per-dimension averages that compare meaningfully across prompt versions; 20 is stronger but the marginal value over 10 diminishes.

What's a good rubric score?

Most production prompts cluster at 22–28 after 4–6 iteration cycles. A total of 30 or more is excellent; 20 or below means the output is breaking on multiple dimensions and needs structural revision rather than tweaks. Don't expect every prompt to push past 30 in total; some dimensions hit ceilings determined by the task itself (e.g., a classification task may not have high actionability potential), not the prompt's quality.

What if I can't push past 28 no matter how I iterate?

You're probably at the prompt's ceiling. The remaining gap may need a different technique — RAG to add information the model lacks, fine-tuning to shape behavior the prompt can't reliably produce, or a different model entirely. When 8 iteration cycles produce marginal improvement, stop tweaking the prompt and address the upstream constraint.

Should I use this rubric for classification or extraction tasks?

No, generally. Classification has cleaner automated metrics (accuracy, precision, recall, F1). Extraction has structural validity metrics (does the output match the schema?). The 7-point rubric is for tasks with subjective quality dimensions — content generation, advice outputs, long-form writing. Use automated metrics where they exist; use the rubric for what they can't capture (craft, audience-fit, insight).
