The 6 factors and how to score them
**Factor 1 — Specificity Gap (0–3).** Does the prompt ask for specific facts the model may not know? Score 0 if the request is general/inferential (an analysis, an opinion, a structural summary). Score 1 if it asks for common facts (capital cities, well-known historical dates). Score 2 if it asks for niche facts (specific industry statistics, particular individuals' biographical details, technical specifications). Score 3 if it asks for facts the model almost certainly doesn't have reliably (real-time data, post-training-cutoff events, internal company information).
**Factor 2 — Citation Invitation (0–3).** How much citation structure does the prompt impose on the model? The less structure, the higher the score. Score 0 if the prompt requires structured citations with source titles and exact quotes that downstream validation can check. Score 1 if the prompt mentions citations vaguely ('cite sources where applicable'). Score 2 if the prompt doesn't mention citations but the task implies factual claims. Score 3 if the prompt asks for confident factual output without any citation structure (this is the highest-risk configuration for factual hallucination).
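The score-0 configuration (structured citations whose exact quotes can be checked downstream) can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the `Citation` type, the `verify_citations` helper, the `sources` mapping, and the whitespace and case normalization rule are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: a source title plus an exact quote."""
    source_title: str
    quote: str


def _normalize(text: str) -> str:
    # Assumption: collapse whitespace and ignore case so that line wrapping
    # in the source text does not cause false mismatches.
    return " ".join(text.split()).lower()


def verify_citations(citations, sources):
    """Return the citations that cannot be validated against `sources`.

    `sources` maps source titles to their full text (hypothetical input).
    A citation fails if its quote is blank, its title is unknown, or the
    quote does not appear in the named source after normalization.
    """
    failures = []
    for c in citations:
        quote = _normalize(c.quote)
        source_text = sources.get(c.source_title)
        if not quote or source_text is None or quote not in _normalize(source_text):
            failures.append(c)
    return failures


# Illustrative data only.
sources = {"Style Guide": "Use short sentences.\nAvoid jargon where possible."}
citations = [
    Citation("Style Guide", "Avoid jargon where possible."),  # found
    Citation("Style Guide", "Always use jargon."),            # quote not in source
    Citation("Missing Doc", "Use short sentences."),          # unknown source title
]
failed = verify_citations(citations, sources)
print([c.quote for c in failed])
```

A non-empty result means at least one claim could not be tied back to its stated source, which is the kind of signal a prompt at score 2 or 3 never produces.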
**Factor 3 — Recency Requirement (0–3).** Does the task require information that may postdate the model's training cutoff? Score 0 if the topic is timeless (definitions, historical facts before 2023, mathematical reasoning). Score 1 if the topic is mostly stable with occasional updates (well-established frameworks, classical references). Score 2 if the topic evolves quarterly (industry trends, software versions, technology landscape). Score 3 if the topic changes daily or weekly (current events, prices, stock data, recent product releases).
**Factor 4 — Niche Depth (0–3).** How deep into a specialized domain does the prompt go? Score 0 for general-public-knowledge topics. Score 1 for college-level specialist topics (basic economics, common medical conditions). Score 2 for professional-specialist topics (specific drug interactions, advanced legal doctrine, niche technical domains). Score 3 for expert-specialist topics with a thin web corpus (rare conditions, narrow professional sub-specialties, recently emerged technical patterns). Deeper niche = thinner training data = higher hallucination risk.
**Factor 5 — Claim Type (0–3).** What kind of claims is the prompt asking for? Score 0 if the output is opinions, analysis, or structured transformations (paraphrase, summarize, classify). Score 1 if the output mixes opinion and verifiable claims. Score 2 if the output is primarily verifiable factual claims (statistics, attributions, dates). Score 3 if the output is high-confidence factual claims meant for downstream use (legal advice, medical recommendations, financial guidance). Higher claim certainty + higher downstream stakes = higher hallucination risk and cost.
**Factor 6 — Output Length (0–3).** How long is the expected output? Score 0 for short outputs (under 200 words). Score 1 for moderate outputs (200–800 words). Score 2 for long outputs (800–2500 words). Score 3 for very long outputs (2500+ words). Longer outputs accumulate hallucination opportunities: each paragraph is another chance for a confident-but-wrong claim. The risk is not quite linear in length; it tends to grow faster, because as the output gets longer the original prompt and source context must compete with the earlier generated paragraphs for the model's attention.
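As a rough sketch, the six scores can be recorded and combined like this. Two things here are assumptions rather than part of the rubric: the word ranges share the endpoint 800, which is assigned to the higher bucket to match "under 200" and "2500+", and the six scores are combined as an unweighted sum (0–18), since this section does not say how they are aggregated. The names `output_length_score` and `FactorScores` are hypothetical.

```python
from dataclasses import dataclass, fields


def output_length_score(expected_words: int) -> int:
    """Map an expected word count to the Factor 6 score (0-3).

    Assumption: buckets are half-open, so 200, 800, and 2500 each fall
    into the higher bucket.
    """
    if expected_words < 0:
        raise ValueError("expected_words must be non-negative")
    if expected_words < 200:
        return 0
    if expected_words < 800:
        return 1
    if expected_words < 2500:
        return 2
    return 3


@dataclass(frozen=True)
class FactorScores:
    """One 0-3 score per factor, in the order the factors are listed."""
    specificity_gap: int
    citation_invitation: int
    recency_requirement: int
    niche_depth: int
    claim_type: int
    output_length: int

    def __post_init__(self):
        # Reject anything outside the 0-3 scale used by every factor.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (0, 1, 2, 3):
                raise ValueError(f"{f.name} must be 0-3, got {value!r}")

    def total(self) -> int:
        # Assumption: unweighted sum, giving a range of 0-18.
        return sum(getattr(self, f.name) for f in fields(self))


# Hypothetical prompt: a 1,200-word report on last week's stock moves,
# with no citation requirements.
scores = FactorScores(
    specificity_gap=3,       # real-time data
    citation_invitation=3,   # confident output, no citation structure
    recency_requirement=3,   # changes daily/weekly
    niche_depth=1,
    claim_type=2,            # primarily verifiable factual claims
    output_length=output_length_score(1200),
)
print(scores.total())
```

Keeping the per-factor scores alongside the total makes it easy to see which factor to address first, for example by adding citation structure to bring Factor 2 down.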