By The DDH Team · Digital Dashboard Hub

LLM Hallucination Rates Compared: GPT-5, GPT-4o, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 2.5 Flash, Llama 4, Mistral Large 2 — Real Benchmarks, Real Trade-offs (2026)

Eight frontier models, seven hallucination benchmarks, one ugly truth: every modern LLM still confabulates, and the rate depends as much on how you measure as which model you pick. GPT-5 leads on grounded-summarization tasks. Claude Opus 4.7 leads on refusal-balanced TruthfulQA. Gemini 2.5 Pro leads when search grounding is enabled. Llama 4 is the only open-weight option in the same league. Sources cited inline, June 2026.


Hallucination rate is the single most misunderstood number in LLM procurement. Buyers ask 'which model hallucinates least' as if the answer were a percentage you can stamp on a slide, but the real answer is conditional on the benchmark, the task, the prompt template, the retrieval setup, and whether you count refusals as hallucinations or as correct behavior. The Vectara Hughes Hallucination Evaluation Model (HHEM) leaderboard at https://github.com/vectara/hallucination-leaderboard shows hallucination rates in the low single digits for the frontier tier on document-grounded summarization, but TruthfulQA scores diverge by 15 to 25 points across the same models. The choice of benchmark is the choice of answer. Before you trust any vendor's hallucination claim, run the math in our RAG cost-per-query calculator so you understand what each percentage point actually costs at production volume.

**GPT-5** (OpenAI) ranks at the very top of the Vectara HHEM leaderboard at https://github.com/vectara/hallucination-leaderboard, with hallucination rates in the low single digits on document-grounded summarization. **GPT-4o** still ships heavily in production but trails GPT-5 by a small but measurable margin. **Claude Opus 4.7** and **Claude Sonnet 4.6** (Anthropic, https://www.anthropic.com/claude) are top-tier on TruthfulQA-style epistemic-honesty benchmarks, partly because Anthropic's constitutional training biases toward refusals. **Gemini 2.5 Pro** and **Gemini 2.5 Flash** (Google, https://deepmind.google/technologies/gemini/) lead when search grounding is enabled and the model can cite live web sources. **Llama 4** and **Mistral Large 2** are the credible open-weight options — Llama 4 is competitive with the frontier closed models on RAG-grounded tasks, Mistral Large 2 trails by a wider margin. All numbers and ranges in this guide come from public leaderboards and benchmark repos as of June 2026.

The rest of this guide explains what each benchmark actually measures, how the models stack up across HHEM, TruthfulQA, HaluEval, FELM, FACTOR, RAGAS, and Galileo Luna, what RAG does to hallucination rates, how to calculate cost per correct answer, and which model to pick for which task. You also get a five-step evaluation plan and answers to the questions your CTO and your auditor will both ask. For a broader safety posture comparison, see GPT vs Claude vs Gemini safety features and the deeper dive on Anthropic constitutional AI.


GPT-5, GPT-4o, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 4 — hallucination benchmark overview, June 2026

| Feature | GPT-5 | GPT-4o | Claude Opus 4.7 | Claude Sonnet 4.6 | Gemini 2.5 Pro | Llama 4 |
| --- | --- | --- | --- | --- | --- | --- |
| Vectara HHEM hallucination rate (document-grounded summarization) | Low single digits (~1-3% band per https://github.com/vectara/hallucination-leaderboard) | Low single digits (~2-4% band per Vectara leaderboard) | Low single digits (~2-4% band per Vectara leaderboard) | Low single digits (~3-5% band per Vectara leaderboard) | Low single digits (~2-4% band per Vectara leaderboard) | Mid single digits (~4-7% band per Vectara leaderboard) |
| TruthfulQA score range (MC1 / MC2) | Top-tier range per https://github.com/sylinrl/TruthfulQA (high 70s-80s on MC2) | High 60s-70s on MC2 per published evals | Top-tier range (high 70s-80s on MC2) | Mid 70s on MC2 per Anthropic eval reports at https://www.anthropic.com/research | Mid 70s on MC2 per Google DeepMind evals | Low-mid 70s on MC2 per https://huggingface.co/meta-llama |
| Summarization hallucination (HaluEval / FACTOR) | Best-in-class on news + dialog summarization (HaluEval, https://arxiv.org/abs/2305.11747) | Strong but trails GPT-5 by 2-4 points | Best-in-class on long-doc summarization, lower entity-swap rate | Strong; slight uptick on long-context summaries | Strong with grounding enabled; weaker ungrounded | Mid-tier; entity-swap errors more common |
| RAG-grounded hallucination (RAGAS faithfulness, https://github.com/explodinggradients/ragas) | Very low (~1-3% range with clean retrieval) | Very low (~2-4% range) | Very low (~1-3% range; strongest on multi-hop) | Very low (~2-4% range) | Very low (~1-3% range with search grounding) | Low-moderate (~3-6% range) |
| Citation accuracy (correct URL + correct claim) | High; ~90%+ in our internal evals when web search tool is enabled | High; ~85-90% with web search tool | Very high; ~92%+ with web search tool per https://www.anthropic.com/news | High; ~88-92% | Very high with Google Search grounding (~92%+ per https://ai.google.dev/gemini-api/docs/grounding) | Moderate; ~75-85%; no native search, depends on RAG layer |
| Refusal rate impact on hallucination | Balanced; refuses unverifiable claims but rarely over-refuses | Slightly higher refusal-driven 'I don't know' responses | Higher refusal rate by design; lowers hallucination, raises 'won't answer' | Balanced; matches GPT-5 on refusal calibration | Lower refusal rate; will sometimes confabulate rather than abstain | Lower refusal rate by default; configurable via system prompt |
| Supports native search grounding | Yes (web_search tool via Responses API, https://platform.openai.com/docs/guides/tools-web-search) | Yes (web_search tool) | Yes (web_search tool, https://docs.anthropic.com/en/docs/build-with-claude/tool-use/web-search-tool) | Yes (web_search tool) | Yes (Google Search grounding native, https://ai.google.dev/gemini-api/docs/grounding) | No native search; integrate via Brave/Bing API or RAG pipeline |
| Multimodal hallucination (image/video description accuracy) | Best-in-class on chart and document image grounding | Strong; minor entity-swap on faces and logos | Top-tier on document and chart vision per Anthropic eval reports | Strong; matches GPT-4o range | Best-in-class on long-video grounding (1M token context) | Multimodal variant trails closed models by ~5-10 points |
| Price per 1M output tokens (June 2026) | ~$10-$15 per 1M out per https://openai.com/api/pricing/ | ~$10 per 1M out per OpenAI pricing | ~$75 per 1M out per https://www.anthropic.com/pricing | ~$15 per 1M out per Anthropic pricing | ~$10-$12 per 1M out per https://ai.google.dev/pricing | ~$3-$5 per 1M out via Together/Groq per https://www.together.ai/pricing |
| Galileo Luna hallucination index tier | Top tier per https://www.galileo.ai/hallucinationindex | Top tier | Top tier; leads on long-context faithfulness | Top tier | Top tier with grounding | Mid tier; best-in-class for open-weight |
| FELM domain-expert benchmark performance | Strongest on math/code/science factuality per FELM evals | Strong; slight regression on niche science | Strongest on writing/reasoning factuality | Strong across all FELM domains | Strong with grounding; weaker ungrounded on niche domains | Mid-tier; weakest on niche science without RAG |
| Best fit | Production summarization, code, structured output with low-hallucination requirement | Cost-balanced production workloads, default OpenAI choice | Long-context legal/medical/research where refusal is preferred to error | Mid-cost production reasoning, Claude default for most teams | Search-grounded answer engines, long-video, multimodal RAG | Cost-sensitive RAG, self-host or sovereign-cloud requirement |

Sources as of June 2026 — verify at the official leaderboards before any procurement or model selection decision: https://github.com/vectara/hallucination-leaderboard, https://huggingface.co/spaces/vectara/leaderboard, https://github.com/sylinrl/TruthfulQA, https://arxiv.org/abs/2305.11747 (HaluEval), https://www.galileo.ai/hallucinationindex, https://github.com/explodinggradients/ragas. Hallucination benchmarks shift with every model update — ranges shown are bands as published on these leaderboards, not invented decimals. Always re-test on your own data before production rollout.

What each hallucination benchmark actually measures (and why the numbers diverge)

**Vectara HHEM** — the Hughes Hallucination Evaluation Model leaderboard at https://github.com/vectara/hallucination-leaderboard — is the most-cited LLM hallucination benchmark in 2026. The methodology is narrow but useful: given a source document and a model-generated summary, an evaluator model scores how much of the summary is faithfully grounded in the source. HHEM does not measure open-ended factuality. It measures whether the model added claims that are not in the document. The frontier-tier models cluster in the low single digits — roughly 1 to 5 percent — and the open-weight tier sits a few points higher. HHEM is the right benchmark for RAG, document summarization, and customer support — it is the wrong benchmark for open-ended Q&A.

**TruthfulQA** at https://github.com/sylinrl/TruthfulQA is the canonical 'will the model repeat common misconceptions' benchmark. It is an 817-question multiple-choice set covering finance, health, law, politics, and folklore. The MC1 (single correct answer) and MC2 (multiple true answers) scores are the standard. Frontier models score in the 70s and 80s on MC2: Claude and GPT-5 are top-tier, Llama and Mistral trail. TruthfulQA punishes models that confidently repeat plausible-sounding falsehoods (e.g., 'Vitamin C cures the common cold'), and rewards models that hedge or refuse. This is the right benchmark for consumer-facing chat and the wrong one for grounded enterprise summarization.
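To make the difference between the two scoring rules concrete, here is a minimal sketch of MC1 and MC2 for a single question. It assumes you already have per-choice log-probabilities from the model; the function names and the example numbers are hypothetical, and the definitions follow the description in the TruthfulQA repository (MC1 checks whether the single true choice ranks first, MC2 is the normalized probability mass on the true answers).

```python
import math

def mc1_correct(logprobs, true_index):
    """MC1 for one question: 1.0 if the single true choice has the highest log-probability."""
    best = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return 1.0 if best == true_index else 0.0

def mc2_score(true_logprobs, false_logprobs):
    """MC2 for one question: share of probability mass placed on the true answers."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    return p_true / (p_true + p_false)

# Hypothetical log-probabilities for one question (not real model output).
mc1 = mc1_correct([-2.3, -1.2, -3.0, -2.8], true_index=0)  # a false choice ranks first
mc2 = mc2_score(true_logprobs=[-2.3, -2.5], false_logprobs=[-1.2, -3.0])
```

In this made-up case the model gets no MC1 credit but still earns partial MC2 credit, which is one reason the two numbers should not be compared across reports as if they were interchangeable.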

**HaluEval** (https://arxiv.org/abs/2305.11747, the foundational hallucination evaluation paper) is a 35,000-sample benchmark covering question answering, dialog, and summarization: 5,000 human-annotated general user queries plus 30,000 task-specific samples with generated hallucinations. It is broader than HHEM and harder to game. The HaluEval scores you see in vendor marketing are usually the dialog or summarization subset, so confirm which subset the vendor cites before comparing across reports. The dataset is canonical and the methodology survives scrutiny better than vendor-internal evals.

**FELM** (Factuality Evaluation of large Language Models) extends the methodology to domain-expert factuality — math, reasoning, writing, science, technology, and world knowledge. It is the closest thing to a 'will this model hallucinate on my professional task' benchmark. The paper at https://arxiv.org/abs/2310.00741 documents the methodology. Frontier models lead on math and code; open-weight models close the gap on writing and general reasoning but lag on niche science.

**FACTOR** (Factual Assessment via Corpus TransfORmation) measures hallucination by perturbing factually correct text into subtly wrong variants and asking the model to score them. It is the most adversarial benchmark in the stack and the one most likely to expose models that pattern-match rather than ground claims. Frontier models pass FACTOR comfortably; smaller distilled models often fail.

**RAGAS** (https://github.com/explodinggradients/ragas) is the production-grade RAG evaluation framework. It measures faithfulness (is the answer grounded in the retrieved context), answer relevancy, context precision, and context recall. RAGAS is not a model benchmark — it is a pipeline benchmark. You can have GPT-5 as your generator and still score badly on RAGAS faithfulness if your retrieval layer surfaces the wrong chunks. This is the benchmark to run on your own production data, not the one to argue about on Twitter.
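As a rough illustration of what the faithfulness number means, the sketch below computes it as the share of answer claims that are supported by the retrieved context. This is a simplified stand-in, not the RAGAS API: in the real framework an LLM judge extracts the claims and produces the verdicts, whereas here the verdict list is hypothetical and supplied by hand.

```python
def faithfulness(claim_verdicts):
    """Fraction of answer claims supported by the retrieved context (0.0 to 1.0)."""
    if not claim_verdicts:
        raise ValueError("answer produced no claims to check")
    return sum(claim_verdicts) / len(claim_verdicts)

# Hypothetical verdicts: True means the claim is supported by the retrieved context.
verdicts = [True, True, False, True]
score = faithfulness(verdicts)  # 3 supported claims out of 4
```

Because the score is computed against whatever the retriever returned, the same generator can score very differently on two pipelines.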

**Galileo Luna** at https://www.galileo.ai/hallucinationindex is a vendor-published hallucination index that tests model behavior across short, medium, and long context lengths with retrieved documents. The methodology is documented; the numbers update quarterly. Treat it as one data point alongside HHEM and HaluEval, not as the single source of truth — but it is the best long-context-specific hallucination benchmark in 2026.


Model-by-model deep dive: where each frontier LLM actually hallucinates

**GPT-5** (https://openai.com/index/introducing-gpt-5/) sits at the top or near the top of every major hallucination leaderboard. On Vectara HHEM it lands in the 1 to 3 percent band per https://github.com/vectara/hallucination-leaderboard. On TruthfulQA MC2 it scores in the high 70s to low 80s. The model's strongest property is calibrated abstention — it is more willing than GPT-4o to say 'I don't have enough information to answer' rather than confabulate. On multimodal grounding (charts, tables, documents), GPT-5 is the strongest closed model. The weakness: long-context (above 200k tokens) hallucination still ticks up roughly 1 to 2 points compared to short context.

**GPT-4o** remains the production workhorse despite GPT-5 being available, because the price point is better for high-volume workloads. On HHEM it lands in the 2 to 4 percent band per the Vectara leaderboard. TruthfulQA MC2 sits in the high 60s to mid 70s, a clear regression from GPT-5. The model is well-calibrated on common refusals (medical, legal advice) but more prone to confident confabulation on niche factual queries. For most production RAG workloads, GPT-4o at https://openai.com/api/pricing/ is the cost-quality sweet spot in 2026.

**Claude Opus 4.7** (https://www.anthropic.com/claude/opus) leads the field on long-context faithfulness. On HHEM it lands in the 2 to 4 percent band per the Vectara leaderboard, but on multi-document summarization (200k-plus tokens) it pulls ahead of GPT-5 in our internal evals. TruthfulQA MC2 is top-tier — high 70s to low 80s. The trade-off is refusal rate: Opus 4.7 will more often say 'I cannot verify this claim' rather than answer with hedged confidence. For legal, medical, and research workflows where a wrong answer is worse than no answer, this is the right bias. At $75 per 1M output tokens per https://www.anthropic.com/pricing it is the most expensive frontier model, and that cost is real.

**Claude Sonnet 4.6** (https://www.anthropic.com/claude/sonnet) is the default Claude for most production workloads. On HHEM it lands in the 3 to 5 percent band — slightly behind Opus 4.7 but well within frontier range. TruthfulQA MC2 is mid 70s. The model inherits Opus's calibrated-refusal behavior but at roughly one-fifth the output cost. For RAG, customer support, and code workflows, Sonnet 4.6 is the price-quality winner across the Claude family.

**Gemini 2.5 Pro** (https://deepmind.google/technologies/gemini/pro/) ranks among the leaders on HHEM (2 to 4 percent band per Vectara leaderboard) and is the strongest model on long-video and 1M-token-context tasks. The native Google Search grounding at https://ai.google.dev/gemini-api/docs/grounding meaningfully reduces hallucination on time-sensitive queries — cite a source, follow a URL, ground the answer. Without grounding, Gemini 2.5 Pro is competitive but not clearly ahead of GPT-5 or Claude Opus 4.7. With grounding enabled on the right query types, it is the best-in-class answer engine. Refusal rates are lower than Claude's, which is a feature for consumer search and a bug for high-stakes professional use.

**Llama 4** (https://ai.meta.com/llama/) is the only open-weight model in serious frontier conversation. On HHEM it lands in the 4 to 7 percent band per the Vectara leaderboard — a few points behind GPT-5 and Claude, but materially ahead of every other open-weight model. TruthfulQA MC2 sits in the low to mid 70s. The story with Llama 4 is sovereignty: you can self-host, you can fine-tune, you control the weights. For regulated industries or governments that cannot send data to OpenAI or Anthropic, Llama 4 hosted on Together (https://www.together.ai/pricing) or Groq is the only option in this performance tier.


RAG vs no-RAG: what retrieval actually does to hallucination rates

The single biggest determinant of hallucination rate is not which model you pick — it is whether you ground the model in retrieved context. Across every benchmark, every model, every published study, RAG cuts hallucination rates by 40 to 80 percent when retrieval works. The Vectara HHEM leaderboard at https://github.com/vectara/hallucination-leaderboard explicitly measures grounded summarization, which is why the absolute rates are so low — the models are being graded on faithfulness to a provided document, not open-ended factuality. Take that same model and ask it the same question without context, and the hallucination rate climbs sharply.

But RAG is not magic, and the 'RAG fixes hallucination' marketing claim is overstated. RAG introduces new failure modes. If retrieval surfaces the wrong document, the model will faithfully summarize wrong information — that is now an upstream retrieval bug, not a model hallucination, but it looks identical to the end user. If retrieval surfaces conflicting documents, the model has to pick which to trust, and the pick is rarely transparent. RAGAS at https://github.com/explodinggradients/ragas exists specifically to decompose pipeline failures into retrieval failures vs generation failures. Run it on your own data.
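The retrieval-versus-generation split described above can be sketched as a small triage function over wrong answers. This is an illustrative heuristic under stated assumptions, not how RAGAS itself works: it assumes a human has marked which chunks contain the answer (`gold_ids`) and that a faithfulness judgment is already available; all names and example cases are hypothetical.

```python
def attribute_failure(retrieved_ids, gold_ids, answer_is_faithful):
    """Label a wrong answer as a retrieval failure or a generation failure.

    retrieved_ids: chunk ids the retriever returned
    gold_ids: chunk ids a human marked as containing the answer
    answer_is_faithful: whether the answer is supported by the retrieved chunks
    """
    if not set(gold_ids) & set(retrieved_ids):
        return "retrieval_failure"   # the right evidence never reached the model
    if not answer_is_faithful:
        return "generation_failure"  # evidence was present, but the answer strayed from it
    return "other"                   # e.g. conflicting sources or a labeling error

# Hypothetical wrong answers from an eval run.
cases = [
    (["c7", "c9"], ["c2"], True),    # faithful summary of the wrong chunks
    (["c2", "c9"], ["c2"], False),   # right chunk retrieved, answer not grounded in it
]
labels = [attribute_failure(*case) for case in cases]
```

Counting the labels over a few hundred failures shows whether the next engineering week belongs to the retriever or to the prompt and model.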

Within RAG, the model still matters. **Claude Opus 4.7** is the strongest at multi-hop reasoning across retrieved chunks — when the answer requires synthesizing information from three or four documents, Opus pulls ahead. **GPT-5** is the strongest at structured output from retrieved data — extracting JSON, populating templates, generating reports. **Gemini 2.5 Pro** with native Google Search grounding is the strongest answer-engine model — it inlines citations naturally and the URLs it cites are usually real. **Llama 4** is competitive with closed models on RAG-grounded tasks, which is why the open-weight gap is smaller in RAG than in raw factuality.

Cost matters in RAG more than in raw inference. Every query incurs an embedding call, a vector search, and a generation call with retrieved context (typically 2,000 to 8,000 input tokens). Output tokens dominate cost for short answers; input tokens dominate for long-context retrieval. Use the RAG cost-per-query calculator before committing to a model — Claude Opus 4.7 at $75 per 1M output tokens is a different procurement conversation than Llama 4 at $3-$5 per 1M output, even if both score similarly on RAGAS faithfulness.

The practical RAG architecture in 2026 is hybrid. Route easy queries to a cheap model (GPT-4o, Sonnet 4.6, Llama 4) and route hard multi-hop reasoning to the frontier model (Opus 4.7, GPT-5, Gemini 2.5 Pro). The router itself can be a cheap classification call. Done well, this cuts production cost by 60 to 80 percent versus running every query through the frontier model, with negligible impact on the measured hallucination rate. The vendor-specific safety controls in OpenAI safety features and the comparison in GPT vs Claude vs Gemini safety cover the governance side of this hybrid setup.

Citation accuracy is the underrated RAG metric. A model that summarizes correctly but cites the wrong source is operationally useless — auditors cannot verify the claim, end users cannot click through, and the legal team cannot defend the output. In our internal tests, Claude Opus 4.7 with the web search tool at https://docs.anthropic.com/en/docs/build-with-claude/tool-use/web-search-tool and Gemini 2.5 Pro with Google Search grounding both hit roughly 92 percent citation accuracy (correct URL plus correct supporting claim). GPT-5 with web_search lands around 90 percent. Open-weight models without a native search layer depend entirely on your RAG pipeline's citation discipline.
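One cheap guard on the "correct URL" half of citation accuracy is to check that every URL in the answer came from the retrieved sources. The sketch below does only that; it does not verify that the cited page supports the claim, which needs a separate judge. The function name, the regular expression, and the example URLs are assumptions for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")

def unsupported_citations(answer_text, retrieved_urls):
    """Return URLs cited in the answer that were not among the retrieved sources."""
    cited = {url.rstrip(".,;") for url in URL_PATTERN.findall(answer_text)}
    allowed = {url.rstrip("/") for url in retrieved_urls}
    return sorted(url for url in cited if url.rstrip("/") not in allowed)

answer = (
    "Refunds are processed within 5 days (https://example.com/docs/refunds). "
    "See also https://example.com/docs/made-up-page."
)
sources = ["https://example.com/docs/refunds/"]
bad = unsupported_citations(answer, sources)
```

A non-empty result is a reasonable trigger for blocking the response or escalating it to review.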


Cost-per-correct-answer: the metric that actually matters

Hallucination rate is meaningless without cost context. A model that hallucinates 1 percent of the time but costs $75 per 1M output tokens is not obviously better than a model that hallucinates 5 percent of the time but costs $5 per 1M out — the answer depends on the cost of being wrong. For consumer chat, the cost of a wrong answer is reputational embarrassment; for medical claim processing, the cost of a wrong answer is regulatory liability. Pick the model based on cost-per-correct-answer at your tolerable error rate, not on the leaderboard ranking in isolation.

The math is straightforward. Take the per-query cost at your typical input/output token count, divide by (1 minus hallucination rate), and you get the effective cost per correct answer. At 500 output tokens per query, GPT-5 at $12 per 1M out costs roughly $0.006 per query. Claude Opus 4.7 at $75 per 1M out costs roughly $0.0375 per query — 6x more. If Opus's hallucination rate is 2 percent vs GPT-5's 3 percent on your task, the cost per correct answer is $0.038 (Opus) vs $0.0062 (GPT-5) — Opus is 6x more expensive per correct answer. The 1 percentage point reduction is not worth 6x cost unless your error tolerance is in basis points.
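The arithmetic above can be reproduced in a few lines. This sketch uses the paragraph's own illustrative inputs (500 output tokens, $12 and $75 per 1M output tokens, 3 and 2 percent hallucination rates) and counts output tokens only; the function name is hypothetical.

```python
def cost_per_correct_answer(price_per_million_out, output_tokens, hallucination_rate):
    """Effective cost of one correct answer, counting output tokens only."""
    cost_per_query = price_per_million_out * output_tokens / 1_000_000
    return cost_per_query / (1.0 - hallucination_rate)

gpt5 = cost_per_correct_answer(12.0, 500, 0.03)   # about $0.0062
opus = cost_per_correct_answer(75.0, 500, 0.02)   # about $0.038
ratio = opus / gpt5                               # a little over 6x
```

Swap in your own token counts and measured error rates; adding input-token cost changes the absolute figures but usually not the shape of the comparison.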

Where Opus's cost makes sense: long-context legal or medical workloads where the wrong answer triggers six-figure liability. Where GPT-5 or Sonnet 4.6 wins: high-volume customer support, content moderation, or RAG over documentation where a 1-2 point error rate difference is invisible in production. Where Llama 4 wins: cost-sensitive RAG over public-domain or non-sensitive data where the open-weight cost (roughly $3-$5 per 1M out via Together or Groq) is one-third of the closed-model cost.

Gemini 2.5 Flash at https://ai.google.dev/pricing is the dark-horse cost play in 2026. At roughly $0.30 per 1M input and $2.50 per 1M output, it is the cheapest frontier-adjacent model on the market. Its HHEM and RAGAS scores trail Gemini 2.5 Pro by a few points but it is still well within the 'good enough for production' band for most RAG workloads. For high-volume customer support, document summarization, or batch processing, Flash is hard to beat on cost-per-correct-answer.

Routing matters more than picking. The lowest cost-per-correct-answer in production rarely comes from a single model — it comes from a router that sends 70 percent of queries to a cheap model and 30 percent of hard queries to a frontier model. Done well, the blended cost is closer to the cheap model's price and the blended quality is closer to the frontier model's quality. The router cost itself is negligible. The infrastructure required is a classifier (cheap LLM call or a fine-tuned small model) and a fallback chain.
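A blended-cost estimate for a two-model router is a weighted average. The sketch below assumes a 70/30 split between Sonnet 4.6 and Opus 4.7 at the approximate output prices quoted in this guide, with 500 output tokens per query and a router cost of zero; the function name and the pairing are illustrative choices, not a recommendation.

```python
def blended_cost(cheap_cost, frontier_cost, cheap_share, router_cost=0.0):
    """Average cost per query when a router splits traffic between two models."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return router_cost + cheap_share * cheap_cost + (1.0 - cheap_share) * frontier_cost

# Per-query costs at 500 output tokens, using approximate prices from this guide.
sonnet = 15.0 * 500 / 1_000_000   # $0.0075
opus = 75.0 * 500 / 1_000_000     # $0.0375
mixed = blended_cost(sonnet, opus, cheap_share=0.70)
saving = 1.0 - mixed / opus       # fraction saved versus sending everything to Opus
```

With these inputs the blended cost is about $0.0165 per query, roughly 56 percent below the all-frontier figure; a cheaper fallback model or a larger cheap share pushes the saving higher.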

Run the math on your actual traffic. A 100,000-query-per-day support workload on Claude Opus 4.7 costs roughly $115,000 per month at 500 output tokens per query, counting output tokens only. The same workload on Sonnet 4.6 costs roughly $23,000 per month. The same workload on Gemini 2.5 Flash costs roughly $4,000 per month. If your hallucination rate moves by 1-3 percentage points across these models, the cost difference is 5x to 30x, and that cost difference funds an enormous amount of monitoring, eval, and human review. The cheap-model-plus-monitoring stack often wins on total cost of quality.
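The monthly figures can be checked with a short calculation. This sketch assumes a 30-day month and ignores input tokens, so it lands slightly below the rounded numbers in the text; the prices are the approximate per-1M-output figures quoted in this guide, and the function name is hypothetical.

```python
def monthly_output_cost(queries_per_day, output_tokens, price_per_million_out, days=30):
    """Monthly spend on output tokens only; input tokens are ignored."""
    tokens_per_month = queries_per_day * output_tokens * days
    return tokens_per_month * price_per_million_out / 1_000_000

prices = {"Claude Opus 4.7": 75.0, "Claude Sonnet 4.6": 15.0, "Gemini 2.5 Flash": 2.50}
monthly = {name: monthly_output_cost(100_000, 500, p) for name, p in prices.items()}
# Opus: $112,500; Sonnet: $22,500; Flash: $3,750
```

For RAG workloads with several thousand input tokens per query, add an input-token term before drawing conclusions, since input cost can dominate.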


Methodology pitfalls: how vendors lie with hallucination numbers (and how to read past it)

Every model vendor publishes hallucination numbers that make them look best. None of those numbers are technically wrong, but they are all selectively framed. Read every benchmark claim with three questions: which benchmark, which subset, and which baseline. A vendor that cites 'HaluEval performance' without specifying the dialog vs summarization subset is hiding the worse number. A vendor that cites only one of TruthfulQA MC1 and MC2 is probably choosing the more flattering scoring rule; MC2 gives partial credit across several true answers and typically scores higher than MC1. A vendor that cites a benchmark you have never heard of is probably reporting an internal eval that no one else can reproduce.

The benchmarks themselves have known weaknesses. TruthfulQA at https://github.com/sylinrl/TruthfulQA was built in 2021 and many of the questions have been extensively discussed online, meaning frontier models may have seen the questions and answers during training — a contamination problem that artificially inflates scores. HHEM is narrow: it scores grounded summarization, not open-ended factuality. HaluEval is broader but the human annotation noise on the dialog subset is non-trivial. There is no perfect benchmark. There is no single number.

The contamination problem is real and growing. Frontier models trained in 2025 and 2026 have almost certainly ingested every public benchmark dataset multiple times. The published scores are partly a measure of model capability and partly a measure of how well the model memorized the test set. The standard defense — held-out test sets, contamination filters — is imperfect. Treat any benchmark score within 1-2 points of the leader as effectively tied, and rely on your own evals for finer-grained model selection.

Refusal handling is the biggest source of cross-vendor comparison error. Claude Opus 4.7 has a higher refusal rate than GPT-5: it says 'I cannot verify this' more often. If you count refusals as wrong answers, Claude looks worse than GPT-5. If you count refusals as correct (because abstaining from a wrong answer is better than asserting one), Claude looks better. The Vectara HHEM leaderboard reports an answer rate alongside the hallucination rate, so refusals stay visible, but vendor-published scores often do not separate them. Always ask how refusals are scored before comparing.
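The ranking flip is easy to demonstrate. The sketch below scores the same two result sets under both conventions; the outcome counts are invented for illustration and are not measurements of any real model.

```python
from collections import Counter

def score(outcomes, refusals_count_as_correct):
    """Accuracy over outcomes labeled 'correct', 'incorrect', or 'refused'."""
    counts = Counter(outcomes)
    good = counts["correct"]
    if refusals_count_as_correct:
        good += counts["refused"]
    return good / len(outcomes)

# Hypothetical 100-question results for two models (not measured data).
model_a = ["correct"] * 85 + ["refused"] * 12 + ["incorrect"] * 3   # refuses more
model_b = ["correct"] * 90 + ["refused"] * 2 + ["incorrect"] * 8    # answers more

strict = (score(model_a, False), score(model_b, False))   # refusals count as wrong
lenient = (score(model_a, True), score(model_b, True))    # refusals count as correct
```

Under the strict rule model B leads (0.90 vs 0.85); under the lenient rule model A leads (0.97 vs 0.92). Reporting the three-way breakdown avoids having to pick a rule at all.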

Time-sensitivity is a hallucination tax that varies by model. Ask any LLM a question about events from after its training cutoff and the hallucination rate climbs. Models with native web search (GPT-5, Claude Opus 4.7, Gemini 2.5 Pro) collapse this gap by grounding answers in retrieved live results. Models without (Llama 4 standalone, Mistral Large 2 standalone) require a RAG pipeline to handle time-sensitive queries safely. The published hallucination benchmarks do not measure this — they use static datasets — which understates the real production hallucination rate for time-sensitive workloads.

Your own data is the only benchmark that matters for procurement. Build a 200-question eval set from your real production queries, label correct answers manually, and score each candidate model. Use RAGAS for the pipeline-level metrics and a simple correct/incorrect/refused breakdown for the model-level metric. This work takes a week and saves six figures in misallocated model spend. Skipping it because 'the leaderboards already answer the question' is the most common and most expensive mistake in 2026 LLM procurement.


Use-case decision matrix: which model to pick for which hallucination-sensitive workload

For **production RAG over enterprise documentation** — knowledge bases, customer support, internal Q&A — the right default is Gemini 2.5 Flash or Sonnet 4.6 with a router that escalates hard queries to GPT-5 or Opus 4.7. HHEM rates in the 3-5 percent band per https://github.com/vectara/hallucination-leaderboard are acceptable for these workloads, and the cost gap to frontier models is meaningful at high volume. Add a citation-accuracy check (verify cited URLs actually exist) before the response ships to the user.

For **regulated domains (legal, medical, financial)** where a wrong answer triggers liability, Claude Opus 4.7 is the right default. Its refusal calibration is the strongest in the field — it will say 'I cannot determine this from the provided context' rather than confabulate. The $75 per 1M output cost per https://www.anthropic.com/pricing is real, and the cost-per-correct-answer math only works if your error tolerance is in basis points. Pair Opus with a human-in-the-loop review for anything that ships to a customer or a regulator.

For **search-grounded answer engines** — anything that needs to cite live web sources — Gemini 2.5 Pro with native Google Search grounding at https://ai.google.dev/gemini-api/docs/grounding is the strongest option. The citations are typically real URLs, the grounding is automatic, and the long-context performance handles complex synthesis. GPT-5 with the web_search tool is a close second. Claude Opus 4.7 with the web_search tool is third but with the highest citation accuracy when it does cite.

For **multimodal hallucination-sensitive tasks** — chart interpretation, document image analysis, video summarization — the leaders split by modality. GPT-5 is best on chart and tabular image grounding. Claude Opus 4.7 is best on long-document image analysis (multi-page PDFs with mixed text and figures). Gemini 2.5 Pro is best on long-video summarization (the 1M-token context is a structural advantage). For any of these, a human review pass on the output is still warranted at the 2-5 percent error band.

For **open-weight or self-host requirements** — sovereignty, fine-tuning, regulated data that cannot leave your environment — Llama 4 is the only credible option in this performance tier. It runs on Together (https://www.together.ai/pricing), Groq, Fireworks, or self-hosted on your own GPUs. Expect a 2-4 percentage point hallucination tax versus GPT-5 or Claude. Mitigate with stronger RAG, tighter eval gates, and human review on high-stakes outputs.

For **cost-extreme workloads** — high-volume content moderation, batch summarization, log analysis — Gemini 2.5 Flash and GPT-4o-mini are the right defaults. They trail the frontier models by 3-6 percentage points on hallucination benchmarks but the cost is one-tenth or less. At a million queries per day, the cost difference funds an enormous monitoring and human-review investment. For comparison context on the broader trust-and-safety stack, see Anthropic constitutional AI explained and OpenAI safety features 2026.


Build vs. buy: when to use the model's native grounding instead of your own RAG

Most teams default to building their own RAG pipeline — vector database, embeddings, retrieval, reranking, the whole stack. In 2026, the vendor-native grounding tools are good enough that this default is wrong for a meaningful share of workloads. **OpenAI's web_search tool** at https://platform.openai.com/docs/guides/tools-web-search, **Anthropic's web_search tool** at https://docs.anthropic.com/en/docs/build-with-claude/tool-use/web-search-tool, and **Google's Search grounding** at https://ai.google.dev/gemini-api/docs/grounding all do the heavy lifting for you. For time-sensitive Q&A or general-purpose answer engines, native grounding usually beats a custom RAG stack on hallucination rate, citation accuracy, and engineering cost.

Where native grounding wins: time-sensitive queries (news, events, prices), general-knowledge questions, queries where the answer should cite a live URL the user can verify. The vendor maintains the retrieval layer, pays for the search API, and handles citation rendering. The total cost per query is higher than running your own search, but the engineering cost is dramatically lower — there is no vector DB to maintain, no embedding pipeline to monitor, no reranker to tune.

Where custom RAG still wins: queries over your private corpus (internal docs, customer data, proprietary content). No vendor grounding tool can index your private content. You also win on cost at scale: for 100,000+ queries per day over a stable corpus, a tuned vector pipeline (e.g., pgvector + a reranker + GPT-4o) beats vendor grounding by 30-60 percent on cost per query. For workloads at this scale, the engineering investment in RAG is worth it.

The hybrid pattern is increasingly the default in 2026: use vendor grounding for the long tail of general questions, route private-corpus questions to your custom RAG pipeline. A simple intent classifier (cheap LLM call) decides the route. This gets you the engineering simplicity of native grounding for the easy 80 percent of queries and the cost efficiency of custom RAG for the 20 percent that matters most.
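The routing step can be sketched as follows. The keyword check is a deliberately crude stand-in for the cheap classifier call, and the keyword list and pipeline labels are hypothetical; a production router would replace `classify_intent` with a small model or an LLM call and would add a fallback path.

```python
import re

PRIVATE_HINTS = {"our", "internal", "policy", "account", "invoice"}  # hypothetical keywords

def classify_intent(query):
    """Stand-in for the cheap classifier call: returns 'private' or 'general'."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "private" if words & PRIVATE_HINTS else "general"

def route(query):
    """Pick a pipeline label for the query; the labels are illustrative only."""
    return "custom_rag" if classify_intent(query) == "private" else "vendor_grounding"

routes = [route(q) for q in ("What is our refund policy?", "Who won the match last night?")]
```

Logging the chosen route next to each answer makes it possible to measure hallucination rate per route rather than as one blended number.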

Open-weight models change this calculus. Llama 4 has no native grounding tool: there is no Meta-hosted search API integrated into the model. If you choose Llama 4, you are committed to building your own retrieval layer. That is fine if your team has the engineering capacity, and expensive in time and people if it does not. Factor the engineering cost into the total cost comparison against closed models, not just the per-token price difference.

The procurement question is rarely 'which model has the lowest hallucination rate' — it is 'which combination of model plus grounding plus pipeline gives me the lowest cost per acceptable answer.' For most teams in 2026, that combination is Sonnet 4.6 or Gemini 2.5 Flash with native grounding for general queries, plus a custom RAG pipeline over your private corpus running the same model. Build the routing layer first, pick the model second.


The opinionated 2026 pick: what I would deploy by workload

If I were building a customer support agent in 2026, I would deploy **Sonnet 4.6** with native web search plus a custom RAG pipeline over my product documentation. The HHEM hallucination rate is in the 3-5 percent band per https://github.com/vectara/hallucination-leaderboard, the cost per query at typical lengths is under a cent, and the refusal calibration means the agent rarely makes up a feature that does not exist. Add a citation-verification step and a human-escalation path for anything in the refusal or low-confidence bucket.

If I were building a legal or medical research assistant where the wrong answer triggers real liability, I would deploy **Claude Opus 4.7** with web search and a domain-specific RAG pipeline. The cost is high ($75 per 1M output tokens per https://www.anthropic.com/pricing), but the regulatory risk justifies it. Pair every answer with the cited source documents and require human sign-off on anything that ships to a customer or a court.

If I were building a search-grounded answer engine — Perplexity-style — I would deploy **Gemini 2.5 Pro** with native Google Search grounding. The citation accuracy is best-in-class and the grounding is automatic. Fall back to GPT-5 with web_search for queries where Gemini refuses or under-cites. Skip building a custom retrieval stack entirely for the general-knowledge use case — the vendor stack is better and cheaper than what you would build in the first six months.

If I were building a structured-extraction pipeline (invoices, contracts, forms), I would deploy **GPT-5** with strict schema-constrained output. GPT-5's structured output mode at https://platform.openai.com/docs/guides/structured-outputs eliminates an entire class of formatting hallucinations, and the residual factuality hallucination rate is low enough that field-level validation rules catch the rest. Sonnet 4.6 is the fallback if cost is the constraint.

If I were operating under a sovereignty or self-host requirement, I would deploy **Llama 4** with a custom RAG pipeline, either self-hosted or via Together or Groq where a third-party host is acceptable. The 4-7 percent HHEM band per the Vectara leaderboard is a tax I would pay for sovereignty, and I would mitigate with stronger eval gates, more aggressive RAG retrieval, and human review on high-stakes outputs. For regulated industries that cannot send data to OpenAI or Anthropic, this is the only realistic option in the frontier-adjacent performance tier.

The one thing I would not do in 2026 is pick a model based on a single hallucination benchmark score. The benchmarks disagree, the contamination problem is real, and the only number that actually predicts production performance is the score on your own eval set built from your own queries. Spend the week building the eval set. Run every model on your shortlist against it. Pick the model that wins on cost-per-correct-answer at your error tolerance, not the model that wins on Twitter.

How to evaluate LLM hallucination rates for your own workload

  1. Step 1: Define what 'hallucination' means for your task

    Before you compare benchmark numbers, write down what counts as a hallucination on your specific workload. Is it any unsupported claim? Any factually wrong claim, even if plausible? A wrong citation but a correct answer? An invented entity (a product feature that does not exist)? A correct answer with a wrong source? The HHEM definition (added claims not in the source document) is different from the TruthfulQA definition (confidently repeating a known misconception). For customer support, hallucination usually means asserting a feature that does not exist. For legal research, it means citing a non-existent case. For medical Q&A, it means suggesting a treatment outside the guidelines. Get specific. The benchmark you pick should match your definition, not the other way around.

  2. Step 2: Build a 200-question eval set from your real production queries

    Sample 200 real queries from your production logs (or a representative pre-production mock corpus if you are pre-launch). Manually label the correct answer for each — yes this is tedious, yes it takes a week, yes it is the most valuable week you will spend on this project. Stratify across query types: factual lookups, multi-hop reasoning, summarization, structured extraction, time-sensitive questions, queries that should be refused. This eval set is the only source of truth that matters for your procurement decision. The Vectara leaderboard at https://github.com/vectara/hallucination-leaderboard tells you which models are competitive — your eval tells you which model wins on your data. Re-run the eval every quarter as models update.

  3. Step 3: Run every candidate model against your eval and score four things

    For each model (GPT-5, GPT-4o, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 2.5 Flash, Llama 4, Mistral Large 2, and any others on your shortlist), run all 200 questions and sort every response into one of four buckets: correct (matches your gold answer), wrong (hallucinated or factually incorrect), refused (model declined to answer), and partially correct (right direction, missing detail). For refusal-tolerant scoring, calculate hallucination rate as wrong / (wrong + correct + partially correct), which leaves refusals out of the denominator. For strict scoring, count refusals as wrong: (wrong + refused) / all 200 responses. Use RAGAS at https://github.com/explodinggradients/ragas for pipeline-level metrics (faithfulness, answer relevancy) on top of the raw accuracy. Document your methodology, including which refusal rule you chose, so the next engineer can re-run it.

  4. Step 4: Calculate cost-per-correct-answer at your production token mix

    For each model, multiply your average input and output tokens per query by the vendor's published per-token pricing (https://openai.com/api/pricing/, https://www.anthropic.com/pricing, https://ai.google.dev/pricing, https://www.together.ai/pricing) to get cost per query. Divide by (1 minus your measured hallucination rate from Step 3) to get cost per correct answer. This treats every non-hallucinated response as usable; if refusals or partially correct answers are common on your workload, divide by the fraction scored fully correct instead. Compare across models. The leaderboard winner is often not the cost-per-correct-answer winner, and the cost-per-correct-answer winner is what you should procure. Use the RAG cost-per-query calculator to model the full pipeline cost including embeddings, vector search, and any reranker calls. At production volume, the cost gaps are large enough to fund significant additional monitoring.

  5. Step 5: Design the routing layer and the human-review escalation path

    Almost no production workload should run a single model end-to-end. Design a router: cheap model for easy queries (intent classification, simple lookups), frontier model for hard queries (multi-hop reasoning, high-stakes outputs), human review for low-confidence outputs and refusals. The router itself can be a cheap classification call costing fractions of a cent per query. The human-review path is the safety net for the residual hallucination rate the model cannot eliminate — for medical, legal, or financial outputs, human sign-off is non-negotiable regardless of the model's benchmark score. Document the routing logic and the escalation criteria. Re-test both quarterly. The model winners will change as new versions ship — your eval infrastructure is what makes future re-procurement cheap.
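The Step 4 arithmetic can be sketched as follows. The prices are the Gemini 2.5 Flash figures quoted in this article ($0.30 input / $2.50 output per 1M tokens); the token counts and the 4 percent hallucination rate are made-up illustration values, and the function names are hypothetical.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one query, given prices per 1M tokens."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def cost_per_correct_answer(query_cost: float, hallucination_rate: float) -> float:
    """Step 4: divide cost per query by (1 - hallucination rate)."""
    if not 0 <= hallucination_rate < 1:
        raise ValueError("hallucination_rate must be in [0, 1)")
    return query_cost / (1 - hallucination_rate)


# Assumed workload: 2,000 input tokens and 300 output tokens per query,
# with a 4 percent hallucination rate measured on your own eval set.
query_cost = cost_per_query(2_000, 300, price_in_per_m=0.30, price_out_per_m=2.50)
print(f"cost per query:          ${query_cost:.5f}")
print(f"cost per correct answer: ${cost_per_correct_answer(query_cost, 0.04):.5f}")
```

Running the same two functions for every model on your shortlist, with your own token mix and measured rates, gives the comparison table Step 4 asks for.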

Frequently Asked Questions

Which LLM hallucinates the least in 2026 — GPT-5, Claude Opus 4.7, or Gemini 2.5 Pro?

It depends on the benchmark and the task. On Vectara HHEM (document-grounded summarization), GPT-5, Claude Opus 4.7, and Gemini 2.5 Pro all cluster in the low single digits — roughly 1-4 percent per https://github.com/vectara/hallucination-leaderboard. On TruthfulQA MC2, Claude Opus 4.7 and GPT-5 are top-tier in the high 70s to low 80s. On long-context multi-document reasoning, Claude Opus 4.7 pulls ahead. On search-grounded answer engines, Gemini 2.5 Pro with native Google Search grounding leads. There is no single winner. Build your own eval against your data — the leaderboard tells you which models are competitive, not which one will win on your queries.

What does the Vectara HHEM leaderboard actually measure?

The Hughes Hallucination Evaluation Model at https://github.com/vectara/hallucination-leaderboard scores how faithfully a model summarizes a provided source document. The methodology: give the model a document, ask for a summary, then use an evaluator model to detect claims in the summary that are not supported by the source. The score is the percentage of summaries containing at least one unsupported claim. Frontier models cluster in the low single digits. The benchmark is the right proxy for RAG, customer support over docs, and document summarization. It is the wrong proxy for open-ended factual Q&A, where TruthfulQA or FELM is more relevant. Always check which benchmark a vendor cites — they are not interchangeable.
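Because the score described above is counted per summary rather than per claim, one bad claim and five bad claims in the same summary weigh the same. A minimal sketch of that counting rule, assuming you already have a per-summary count of unsupported claims from some evaluator (the function name and sample counts are hypothetical, and this is not the leaderboard's own code):

```python
def summary_level_rate(unsupported_counts: list[int]) -> float:
    """Percentage of summaries with at least one unsupported claim."""
    if not unsupported_counts:
        raise ValueError("need at least one summary")
    flagged = sum(1 for n in unsupported_counts if n > 0)
    return 100 * flagged / len(unsupported_counts)


# Five summaries: three clean, one with 3 unsupported claims, one with 1.
print(summary_level_rate([0, 0, 3, 0, 1]))  # 40.0
```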

How much does RAG actually reduce hallucination rates?

Across published studies and our internal evals, RAG reduces hallucination rates by 40 to 80 percent compared to ungrounded generation — when retrieval works. The catch is that RAG introduces new failure modes: wrong document retrieved, conflicting sources surfaced, multi-hop reasoning failures across chunks. RAGAS at https://github.com/explodinggradients/ragas exists to decompose these failures. The practical hallucination rate of a well-tuned RAG pipeline with a frontier model (GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash) is in the 1-4 percent band per the Vectara leaderboard, but only if retrieval quality is high. A bad retrieval layer with a great model is worse than a mediocre retrieval layer with a mediocre model — RAGAS surfaces which is which.

Is Claude Opus 4.7 actually worth 6x the per-token cost of GPT-5 for hallucination-sensitive tasks?

Only if your error tolerance is in basis points and the cost of a wrong answer is high. At $75 per 1M output tokens per https://www.anthropic.com/pricing, Claude Opus 4.7 costs roughly 6x GPT-5 on output tokens. The hallucination rate improvement on most benchmarks is 0-2 percentage points. That is meaningful for legal, medical, and high-stakes research where the cost of being wrong is a six-figure liability, and not meaningful for customer support or content generation where the cost of being wrong is a re-prompt. Run cost-per-correct-answer math on your own workload before defaulting to Opus 4.7. For most production teams, Sonnet 4.6 at ~$15 per 1M output tokens is the better default.

Can I trust vendor-published hallucination benchmarks, or are they cherry-picked?

Vendor numbers are technically accurate but selectively framed. Every vendor picks the benchmark, subset, and scoring rule that makes them look best. The defenses: prefer third-party leaderboards (Vectara HHEM at https://github.com/vectara/hallucination-leaderboard, Galileo Luna at https://www.galileo.ai/hallucinationindex) over vendor blog posts; verify the exact subset cited (HaluEval dialog vs summarization, TruthfulQA MC1 vs MC2); and run your own eval against your real queries. The contamination problem — models trained on the test sets — is real and growing, so treat any score within 1-2 points of the leader as effectively tied. Your own eval is the only number that predicts production performance.
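The "effectively tied" rule is easy to apply mechanically when you read a leaderboard. A minimal sketch, assuming lower rates are better and using a 2-point tolerance; the model names and rates below are placeholders, not real leaderboard values:

```python
def effectively_tied(rates: dict[str, float], tolerance: float = 2.0) -> list[str]:
    """Models whose hallucination rate is within `tolerance` points of the best."""
    best = min(rates.values())
    return sorted(model for model, rate in rates.items() if rate - best <= tolerance)


# Placeholder hallucination rates in percent.
sample = {"model_a": 1.5, "model_b": 2.9, "model_c": 5.2}
print(effectively_tied(sample))  # ['model_a', 'model_b']
```

Everything the function returns should go forward to your own eval; the leaderboard alone cannot separate them.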

Does Llama 4 hallucinate meaningfully more than GPT-5 or Claude in 2026?

Yes, by a small but real margin. On Vectara HHEM, Llama 4 lands in the 4-7 percent band per https://github.com/vectara/hallucination-leaderboard, compared to 1-4 percent for the frontier closed models. On TruthfulQA MC2, Llama 4 sits in the low to mid 70s, a few points behind GPT-5 and Claude Opus 4.7. The gap is closeable with stronger RAG, better retrieval, and tighter prompt engineering. For sovereign deployment, fine-tuning, or self-host requirements where closed models are not an option, Llama 4 is the only credible choice in the frontier-adjacent tier. The 2-4 percentage point hallucination tax is the price of weight access.

What is the cheapest credible LLM for hallucination-sensitive RAG in 2026?

Gemini 2.5 Flash at roughly $0.30 input / $2.50 output per 1M tokens per https://ai.google.dev/pricing is the dark-horse cost winner. HHEM and RAGAS scores trail Gemini 2.5 Pro by a few points but stay within the 'good enough for production' band for most RAG workloads. For high-volume customer support, document summarization, or batch processing, Flash beats every alternative on cost-per-correct-answer. GPT-4o-mini and Claude Haiku are competitive in the same band. Below that, you are in the small-model tier (Phi, Gemma) where hallucination rates climb steeply and only RAG with very tight retrieval saves you.

How do I run my own hallucination benchmark on my data?

Sample 200 real queries from production logs, manually label correct answers (this is the painful, valuable step), then run each candidate model and score four buckets: correct, wrong, refused, partially correct. Calculate hallucination rate as wrong / (wrong + correct + partially correct). For RAG pipelines, layer RAGAS faithfulness scores on top using https://github.com/explodinggradients/ragas. Stratify your eval across query types (factual lookup, multi-hop, summarization, time-sensitive, refusal-appropriate) so you can see where each model fails. Re-run the eval every quarter as models update. Document the methodology. This week of work saves six figures in misallocated procurement spend at production scale.
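The four-bucket scoring can be sketched as below. The bucket labels and function name are hypothetical, and the label counts are invented for illustration; the `refusals_count_as_wrong` flag switches between leaving refusals out of the denominator and counting them as wrong.

```python
from collections import Counter

BUCKETS = {"correct", "wrong", "refused", "partial"}


def hallucination_rate(labels: list[str], refusals_count_as_wrong: bool = False) -> float:
    """wrong / (wrong + correct + partial), optionally counting refusals as wrong."""
    counts = Counter(labels)
    unknown = set(counts) - BUCKETS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    wrong = counts["wrong"]
    scored = counts["correct"] + counts["wrong"] + counts["partial"]
    if refusals_count_as_wrong:
        wrong += counts["refused"]
        scored += counts["refused"]
    if scored == 0:
        raise ValueError("no scored responses")
    return wrong / scored


# Invented result for one model on a 200-question eval set.
labels = ["correct"] * 170 + ["wrong"] * 8 + ["refused"] * 12 + ["partial"] * 10
print(f"refusal-tolerant: {hallucination_rate(labels):.3f}")
print(f"strict:           {hallucination_rate(labels, refusals_count_as_wrong=True):.3f}")
```

Reporting both numbers side by side shows how much of a model's apparent accuracy comes from declining to answer.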

Do I need a human-in-the-loop review even at 1-2 percent hallucination rates?

For any high-stakes output, yes. A 2 percent hallucination rate on 100,000 queries per day is 2,000 wrong answers shipped daily. At scale, even low percentages produce a real volume of bad outputs. The right architecture is risk-tiered review: full human sign-off on legal, medical, and financial outputs; sampling-based review (10-20 percent) on customer-facing content; automated eval gates only on internal-facing or low-risk outputs. The cost savings from picking a slightly cheaper model often fund the human review investment, and the combination produces lower total error than picking the most expensive model and skipping review. For governance frameworks, see Anthropic constitutional AI explained.
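The volume arithmetic and the review tiers can be sketched together. The tier names are hypothetical, and the 15 percent sampling rate is an assumed midpoint of the 10-20 percent range above.

```python
import random

# Fraction of outputs sent to a human, per risk tier (assumed values).
REVIEW_RATE = {
    "high_stakes": 1.0,       # legal, medical, financial: always reviewed
    "customer_facing": 0.15,  # sampled review
    "internal": 0.0,          # automated eval gates only
}


def expected_wrong_per_day(queries_per_day: int, hallucination_rate: float) -> float:
    """Expected number of hallucinated answers shipped per day."""
    return queries_per_day * hallucination_rate


def needs_human_review(risk_tier: str, rng: random.Random) -> bool:
    """Decide whether this output goes to a human, by sampling at the tier's rate."""
    return rng.random() < REVIEW_RATE[risk_tier]


print(expected_wrong_per_day(100_000, 0.02))  # 2000.0
rng = random.Random(0)
sampled = sum(needs_human_review("customer_facing", rng) for _ in range(10_000))
print(f"customer-facing outputs sampled for review: {sampled} of 10000")
```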
