What each hallucination benchmark actually measures (and why the numbers diverge)
**Vectara HHEM**, the Hughes Hallucination Evaluation Model leaderboard at https://github.com/vectara/hallucination-leaderboard, is the most-cited LLM hallucination benchmark in 2026. The methodology is narrow but useful: given a source document and a model-generated summary, an evaluator model scores how faithfully the summary is grounded in the source. HHEM does not measure open-ended factuality. It measures whether the model added claims that are not in the document. Hallucination rates for frontier-tier models cluster in the low single digits, roughly 1 to 5 percent, and the open-weight tier sits a few points higher. HHEM is the right benchmark for RAG, document summarization, and customer support; it is the wrong benchmark for open-ended Q&A.
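To make the headline number concrete, here is a minimal sketch of how a leaderboard-style hallucination rate can be derived from per-summary evaluator scores. It assumes the evaluator returns a consistency score between 0 and 1 and that a summary counts as hallucinated when it scores below 0.5. The threshold, the function name `hallucination_rate`, and the scores are illustrative assumptions, not Vectara's code.

```python
from typing import Sequence


def hallucination_rate(consistency_scores: Sequence[float], threshold: float = 0.5) -> float:
    """Percentage of summaries whose consistency score falls below the threshold."""
    if not consistency_scores:
        raise ValueError("need at least one scored summary")
    flagged = sum(1 for score in consistency_scores if score < threshold)
    return 100.0 * flagged / len(consistency_scores)


# Hypothetical evaluator outputs for eight summaries (1.0 = fully grounded).
scores = [0.98, 0.91, 0.12, 0.87, 0.99, 0.95, 0.76, 0.93]
rate = hallucination_rate(scores)
print(f"hallucination rate: {rate:.1f}%")
```

Because the rate is a count over a cutoff, two models with similar average scores can still land on different leaderboard numbers if one of them produces a few badly ungrounded summaries.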
**TruthfulQA** at https://github.com/sylinrl/TruthfulQA is the canonical 'will the model repeat common misconceptions' benchmark. It is an 817-question multiple-choice set covering finance, health, law, politics, and folklore. The standard scores are MC1 (one correct answer among the choices; the model is right only if it ranks that answer first) and MC2 (several correct answers; the score is the share of probability the model places on them). Frontier models score in the 70s and 80s on MC2: Claude and GPT-5 are top-tier, Llama and Mistral trail. TruthfulQA punishes models that confidently repeat plausible-sounding falsehoods (e.g., 'Vitamin C cures the common cold'), and rewards models that hedge or refuse. This is the right benchmark for consumer-facing chat and the wrong one for grounded enterprise summarization.
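The difference between the two multiple-choice scores is easier to see in code. The sketch below assumes you already have a log-probability per answer choice from your model. The function names and the numbers are hypothetical, and the official harness handles details (prompting, choice sets per task) that are omitted here.

```python
import math
from typing import Sequence


def mc1(logprobs: Sequence[float], labels: Sequence[int]) -> float:
    """1.0 if the single highest-scoring choice is labeled true, else 0.0."""
    best = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return float(labels[best] == 1)


def mc2(logprobs: Sequence[float], labels: Sequence[int]) -> float:
    """Share of probability mass on true choices, normalized over all choices."""
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, label in zip(probs, labels) if label == 1)
    return true_mass / sum(probs)


# Hypothetical MC1 item: exactly one true choice (index 0), but the model prefers index 2.
mc1_score = mc1([-1.0, -2.0, -0.5], [1, 0, 0])

# Hypothetical MC2 item: two true choices (indices 0 and 1) and one false choice.
mc2_score = mc2([-1.0, -2.0, -0.5], [1, 1, 0])

print(mc1_score, round(mc2_score, 3))
```

MC1 is all-or-nothing per question, while MC2 gives partial credit, which is one reason the two scores for the same model can sit far apart.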
**HaluEval** (https://arxiv.org/abs/2305.11747, the foundational hallucination evaluation paper) is a 35,000-sample benchmark. It combines 30,000 task-specific examples covering question answering, dialog, and summarization, whose hallucinated variants were generated with ChatGPT and then filtered, with 5,000 general user queries whose ChatGPT responses were annotated by humans. It is broader than HHEM and harder to game. The HaluEval scores you see in vendor marketing are usually the dialog or summarization subset, so confirm which subset the vendor cites before comparing across reports. The dataset is canonical and the methodology survives scrutiny better than vendor-internal evals.
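Since a single HaluEval number can hide which subset it came from, it helps to report accuracy per subset. The sketch below assumes each record holds a subset name, the model's yes/no judgment on whether a sample contains a hallucination, and the gold label. The record layout and the data are hypothetical, not the dataset's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (subset, model_says_hallucinated, gold_is_hallucinated)
Record = Tuple[str, bool, bool]


def accuracy_by_subset(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of correct hallucination judgments, grouped by subset."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for subset, predicted, gold in records:
        totals[subset] += 1
        correct[subset] += int(predicted == gold)
    return {subset: correct[subset] / totals[subset] for subset in totals}


# Hypothetical judgments from one model.
records = [
    ("qa", True, True), ("qa", False, False), ("qa", True, False), ("qa", False, False),
    ("dialog", True, True), ("dialog", False, True),
    ("summarization", True, True), ("summarization", False, False),
]
per_subset = accuracy_by_subset(records)
print(per_subset)
```

Comparing a vendor's dialog-only score against another vendor's summarization-only score is then visibly an apples-to-oranges comparison.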
**FELM** (Factuality Evaluation of large Language Models) extends the methodology to domain-expert factuality — math, reasoning, writing, science, technology, and world knowledge. It is the closest thing to a 'will this model hallucinate on my professional task' benchmark. The paper at https://arxiv.org/abs/2310.00741 documents the methodology. Frontier models lead on math and code; open-weight models close the gap on writing and general reasoning but lag on niche science.
**FACTOR** (Factual Assessment via Corpus TransfORmation) measures hallucination by perturbing factually correct text into subtly wrong variants and checking whether the model assigns a higher likelihood to the original than to each variant. It is the most adversarial benchmark in the stack and the one most likely to expose models that pattern-match rather than ground claims. Frontier models pass FACTOR comfortably; smaller distilled models often fail.
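A likelihood-based check of this kind can be sketched as follows. The example assumes you have per-token log-probabilities for the factual completion and for each perturbed variant, and it compares them by mean log-probability. The length normalization, the function names, and the numbers are assumptions of this sketch rather than a reproduction of the benchmark's code.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-likelihood of one completion."""
    return sum(token_logprobs) / len(token_logprobs)


def factor_style_accuracy(examples: List[List[Sequence[float]]]) -> float:
    """Each example lists token log-probs per completion; index 0 is the factual one.

    An example counts as correct only if the factual completion scores
    strictly higher than every perturbed variant.
    """
    correct = 0
    for completions in examples:
        factual = mean_logprob(completions[0])
        variants = [mean_logprob(c) for c in completions[1:]]
        correct += int(all(factual > v for v in variants))
    return correct / len(examples)


# Hypothetical token log-probs from a model under test.
examples = [
    [[-0.2, -0.3, -0.1], [-0.9, -1.1], [-0.5, -0.4, -0.6]],  # factual completion wins
    [[-1.2, -0.8], [-0.3, -0.5]],                            # a perturbed variant wins
]
accuracy = factor_style_accuracy(examples)
print(accuracy)
```

Because the variants differ from the original by small edits, a model that scores text by surface plausibility rather than by the facts it encodes tends to lose on exactly these items.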
**RAGAS** (https://github.com/explodinggradients/ragas) is the production-grade RAG evaluation framework. It measures faithfulness (is the answer grounded in the retrieved context), answer relevancy, context precision, and context recall. RAGAS is not a model benchmark — it is a pipeline benchmark. You can have GPT-5 as your generator and still score badly on RAGAS faithfulness if your retrieval layer surfaces the wrong chunks. This is the benchmark to run on your own production data, not the one to argue about on Twitter.
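The point about retrieval can be illustrated with a toy version of a faithfulness score: the fraction of claims in the answer that the retrieved context supports. This sketch does not call the RAGAS library. The substring check is a deliberately crude stand-in for the LLM-based claim verification a real pipeline would use, and all names and data are hypothetical.

```python
from typing import List


def is_supported(claim: str, chunks: List[str]) -> bool:
    """Toy check: the claim appears verbatim (case-insensitive) in some chunk."""
    return any(claim.lower() in chunk.lower() for chunk in chunks)


def faithfulness(claims: List[str], chunks: List[str]) -> float:
    """Supported claims divided by total claims in the generated answer."""
    if not claims:
        raise ValueError("answer contains no claims to verify")
    return sum(is_supported(c, chunks) for c in claims) / len(claims)


# The same generated answer, broken into claims, scored against two retrievals.
claims = ["refunds are issued within 14 days", "shipping is free over $50"]
good_chunks = [
    "Policy: refunds are issued within 14 days of purchase.",
    "Shipping is free over $50 in the US.",
]
bad_chunks = [
    "Policy: refunds are issued within 14 days of purchase.",
    "Our offices are closed on public holidays.",
]
good_score = faithfulness(claims, good_chunks)
bad_score = faithfulness(claims, bad_chunks)
print(good_score, bad_score)
```

The generator and its answer are identical in both runs; only the retrieved chunks change, and the score drops with them.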
**Galileo Luna** at https://www.galileo.ai/hallucinationindex is a vendor-published hallucination index that tests model behavior across short, medium, and long context lengths with retrieved documents. The methodology is documented; the numbers update quarterly. Treat it as one data point alongside HHEM and HaluEval, not as the single source of truth — but it is the best long-context-specific hallucination benchmark in 2026.