By The DDH Team · Digital Dashboard Hub

When RAG Fails: 7 Root Causes, Real Symptoms, and Proven Fixes

By The DDH Team at Digital Dashboard Hub · Updated June 21, 2026

Retrieval-Augmented Generation promised to ground language models in your actual documents, eliminating the hallucination problem once and for all. In practice, teams deploy RAG, watch it underperform, and spend weeks debugging a system where the failure could be happening at any of a dozen layers. The embarrassing part is that most RAG failures are not exotic. They fall into a small, well-understood set of patterns that repeat across industries, embedding providers, and vector databases. Recognizing which pattern you are dealing with is the difference between a week of targeted fixes and months of unfocused experimentation.

This guide covers seven failure modes in detail. For each one you will find the telltale symptom that distinguishes it from the others, the underlying engineering reason it happens, and the specific techniques that address the root cause rather than masking symptoms. The failure modes are ordered roughly by how early in the pipeline they occur, starting with document processing and moving through retrieval, ranking, index maintenance, and generation-layer defenses. Most production quality problems involve two or three overlapping modes rather than just one, so a later section covers how fixes compound when you stack them correctly.

If you are evaluating your retrieval architecture from scratch, the RAG architecture decision tree is a good companion to this piece. For the specific embedding model choices that affect vocabulary mismatch and semantic relevance, see the embedding model leaderboard 2026. Teams debugging chunking decisions specifically will want to read chunking strategies 2026 alongside failure mode 1 below.

RAG failure modes at a glance

Failure mode	Telltale symptom	Primary fix	Effort
Bad chunking	Right doc retrieved, wrong section; boundary questions fail	Semantic chunking or parent-child architecture	Medium
Vocabulary mismatch	Terminology gaps between queries and documents; jargon fails	HyDE + hybrid BM25/dense + domain embedding models	Medium
No reranker	Positions 3-10 in retrieval beat position 1 for answer quality	Cross-encoder reranker on top-100 results	Low
Stale data	Bot cites superseded policies; users correct the bot	Freshness metadata + delta re-embedding pipeline	High
Hallucination on missing context	Confident answers to questions the corpus does not contain	Refusal prompting + cosine similarity confidence gate	Low
Retrieval poisoning	Injected documents manipulate outputs; off-topic answers appear	Ingestion validation + source allowlist + trust namespaces	High
Semantic similarity does not equal relevance	Topically related but answer-irrelevant docs dominate retrieval	Instruction-tuned embeddings + reranker	Medium

Effort ratings assume an existing vector database and embedding pipeline. 'Low' means a configuration or prompt change; 'Medium' means a pipeline modification; 'High' means an ongoing operational process.

A failure taxonomy for RAG systems

Retrieval-Augmented Generation is a pipeline, not a single model. Documents are chunked and embedded at index time. At query time the query is embedded, a nearest-neighbor search retrieves candidate chunks, and an LLM synthesizes an answer from those candidates. Failure can enter at every stage: the chunking strategy, the embedding model, the retrieval step, the ranking of results, the LLM's use of retrieved context, and the freshness of the index. Because the stages are loosely coupled, a failure in one stage often looks like a failure in another, which is why debugging RAG systems is notoriously difficult without a structured taxonomy.

The seven failure modes in this guide correspond to distinct engineering problems at distinct pipeline layers. They are not exhaustive — you can also fail at the infrastructure layer with slow vector queries, or at the query parsing layer with ambiguous question routing — but they cover the majority of quality defects seen in production deployments. Each failure mode has a characteristic symptom fingerprint, which is the most efficient diagnostic tool: if you can match the symptom to the mode, you can skip the generic search and go directly to the known remediation.

One important framing note: a wrong answer from a RAG system is not necessarily a retrieval failure. It may be a generation failure, meaning the retrieval was correct but the LLM misused the context. Separating retrieval quality from generation quality is the first diagnostic step. Evaluate retrieval in isolation by measuring recall@10 and precision@5 on a labeled eval set before assuming the LLM is at fault. The RAG cost calculator can help estimate the cost implications of different retrieval configurations as you experiment.

Failure mode 1: Bad chunking

The symptom of bad chunking is deceptively specific: the bot retrieves the correct document but the wrong passage within it. Questions about a topic that spans a natural section boundary — say, a refund policy that begins on page 4 and continues on page 5 — will fail consistently because the relevant information is split across two chunks, neither of which alone is sufficient to answer the question. You might also see questions answered only partially, or correctly for one half of a compound question but not the other, which is a strong signal that the answer concept crossed a chunk boundary.

The root cause is almost always fixed-size chunking applied naively. Splitting at exactly 512 tokens regardless of sentence or paragraph structure means that mid-sentence splits are common, and mid-concept splits are all but guaranteed on any document longer than a few chunks. The embedding model then creates a vector that represents a semantically incomplete fragment, and that fragment's retrieval score will be lower than it should be even when the full concept was present in the original document. This is an information loss that happens at index time and cannot be recovered at retrieval time without reprocessing.

The best fixes depend on your document structure and performance requirements. Semantic chunking — splitting on embedding-similarity boundaries rather than character counts — preserves concept integrity at the cost of variable chunk sizes and higher preprocessing compute. A more practical intermediate approach is a recursive character splitter with meaningful overlap: splitting on paragraph breaks first, then sentence breaks, then character count as a last resort, with a 15-20% token overlap so boundary concepts appear in both adjacent chunks. For question-answering use cases specifically, a parent-child architecture works well: embed small child chunks for high-precision retrieval, but return the full parent chunk to the LLM for context. This gives you the retrieval precision of small chunks without the context truncation problem. See chunking strategies 2026 for benchmark comparisons across these approaches on standard RAG datasets.
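The overlap idea above can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes whitespace-separated words are an acceptable stand-in for model tokens, it packs whole sentences rather than cutting mid-sentence, and the names chunk_text, max_tokens, and overlap_ratio are hypothetical. A sentence longer than max_tokens is kept whole, and a chunk can slightly exceed the limit once overlap is carried in.

```python
import re

def count_tokens(text):
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())

def chunk_text(text, max_tokens=200, overlap_ratio=0.15):
    """Pack whole sentences into chunks of roughly max_tokens, repeating
    trailing sentences at the start of the next chunk as overlap."""
    sentences = []
    for para in re.split(r"\n\s*\n", text):          # paragraph breaks first
        para = para.strip()
        if para:
            sentences.extend(re.split(r"(?<=[.!?])\s+", para))  # then sentences
    overlap_budget = int(max_tokens * overlap_ratio)
    chunks, current = [], []
    for sent in sentences:
        size = sum(count_tokens(s) for s in current)
        if current and size + count_tokens(sent) > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is used.
            carried, used = [], 0
            for prev in reversed(current):
                if used + count_tokens(prev) > overlap_budget:
                    break
                carried.insert(0, prev)
                used += count_tokens(prev)
            current = carried
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a real tokenizer you would swap count_tokens for the tokenizer's own length function; the packing logic stays the same.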

Failure mode 2: Vocabulary mismatch between query and corpus

The symptom here is systematic failure on specific terminology. A query like 'myocardial infarction treatment protocols' returns nothing useful while 'heart attack treatment' retrieves the right document. Or a financial corpus answers questions about 'equity compensation' but fails on 'stock options' even though the documents use both terms interchangeably. Abbreviation mismatches are especially common in technical domains: an IT helpdesk bot whose documentation uses 'AD' throughout will fail on queries that spell out 'Active Directory.' The diagnostic test is to group failed queries and look for categories that share a terminology pattern.

Dense embeddings partially address vocabulary mismatch because they operate in semantic space rather than keyword space. A well-trained general embedding model knows that 'myocardial infarction' and 'heart attack' are semantically close and will map their query vectors to similar positions. But this generalization breaks down at the edges of the training distribution. Highly specialized domain vocabulary — recent product names, proprietary jargon, regulatory abbreviations, medical subspecialty terminology — may be poorly represented in a general embedding model trained on web text. When a term appears infrequently or inconsistently in pre-training data, the model's representation of it is unreliable.

There are three distinct fixes with different trade-off profiles. The first is query rewriting: use an LLM to expand the user query with synonyms and related terminology before embedding it, or implement Hypothetical Document Embeddings (HyDE), where you prompt the LLM to generate a hypothetical answer document, embed that document instead of the raw query, and use the answer-space vector for retrieval. HyDE works because the hypothetical answer tends to use corpus-native language even when the user's query does not. The second fix is hybrid search, which combines BM25 keyword matching with dense retrieval and lets BM25 handle exact-match terminology that the dense model misses. For a detailed implementation guide see hybrid search: BM25 plus dense. The third fix is domain-specific embedding models: Voyage AI publishes voyage-finance-2 and voyage-law-2, which are trained specifically on domain corpora. For highly specialized domains, the difference in retrieval quality between a general model and a domain model can be substantial enough to justify the switch.
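One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat RRF as an assumption of this sketch; the constant k=60 is a conventional default rather than a tuned value, and the function name is hypothetical. Each input is simply a list of document IDs ordered best-first.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ID lists into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in;
    documents found by both retrievers accumulate a higher total.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["a", "b", "c"]    # hypothetical keyword results
dense_ranking = ["c", "a", "d"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF only looks at ranks, it avoids having to normalize BM25 scores against cosine similarities, which live on different scales.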

Failure mode 3: No reranker in the pipeline

This failure mode produces a specific and diagnosable pattern: when you manually inspect the retrieval results, the most useful document for answering the question is at position 3, 5, or 7 rather than position 1. The top-1 retrieved document is technically relevant — it is about the right topic — but it is not the best answer to the specific question. If you measure answer quality as a function of which retrieved document the LLM is given, you find that the LLM's answer improves noticeably when you manually select a lower-ranked document. This is not a coincidence; it is the signature of a system that is doing reasonable retrieval but poor ranking.

The root cause is a fundamental mismatch between what embedding models optimize for and what RAG systems need. Embedding models are trained to place semantically similar texts near each other in vector space. 'Semantic similarity' means 'about the same topic,' not 'one of these best answers the other.' A chunk about customer churn rates will be very close in embedding space to a query about customer churn, but so will a chunk about customer churn measurement methodology, and a chunk about reducing churn, and a general overview of churn as a metric. All of these are topically similar. Only one of them might best answer 'what was our Q3 churn rate?' The embedding model cannot distinguish between them on this dimension because it was not trained to.

One team's experience illustrates the impact directly: 'We added a reranker and overnight precision@5 jumped from 54% to 78%. The embedding retrieval was surfacing the right pool of docs; the reranker found the best one. We had been blaming the embeddings when the architecture gap was one layer higher.'

The fix is to add a cross-encoder reranker between retrieval and generation. A bi-encoder embedding model encodes the query and each document separately and compares the resulting vectors; a cross-encoder takes the query and a candidate document together and scores how well the document answers that specific query. The standard pattern is to retrieve the top 100 candidates by embedding similarity, rerank them with the cross-encoder, and pass the top 5-10 to the LLM. Cohere rerank-v3.5, Voyage rerank-2, and the open-source BGE-reranker-v2 family are all solid options. The latency cost is real (cross-encoder inference typically adds 100-300ms, depending on candidate count and batch size), but the precision improvement typically justifies it in any user-facing system. Compare model options at Cohere vs Voyage vs OpenAI embeddings.
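The retrieve-then-rerank pattern can be sketched independently of any vendor. In the sketch below, score_pair is a hypothetical callable standing in for a cross-encoder (in practice, a call to a hosted reranker or a local model), and toy_overlap_scorer is only a placeholder so the example runs; it is not how a real cross-encoder scores.

```python
def retrieve_then_rerank(query, candidates, score_pair, first_stage_k=100, final_k=5):
    """candidates: list of (chunk_text, embedding_similarity) tuples.

    Stage 1 keeps the top first_stage_k by embedding similarity.
    Stage 2 re-orders that shortlist by a joint (query, chunk) score.
    """
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:first_stage_k]
    reranked = sorted(shortlist, key=lambda c: score_pair(query, c[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]


def toy_overlap_scorer(query, passage):
    # Placeholder for a cross-encoder: counts shared lowercase words.
    return len(set(query.lower().split()) & set(passage.lower().split()))


candidates = [
    ("churn measurement methodology overview", 0.91),
    ("Q3 churn rate was 4.2%", 0.84),
    ("office parking policy", 0.10),
]
best = retrieve_then_rerank("what was our Q3 churn rate", candidates,
                            toy_overlap_scorer, final_k=1)
```

The embedding stage ranks the methodology chunk first because it is topically closest; the second stage promotes the chunk that actually contains the answer.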

Failure mode 4: Stale data in the index

Stale data is one of the most insidious RAG failures because it can go undetected for weeks. The bot retrieves documents confidently, cites sources that appear legitimate, and produces grammatically coherent answers. The answers are just wrong because they reflect an earlier state of the world. The telltale symptom is users correcting the bot: 'that policy changed last year,' 'that pricing is out of date,' 'we no longer support that feature.' In support contexts, stale data is a trust-destruction event because users receive wrong information from a system that appears authoritative.

The operational story is common: 'Our customer support bot kept retrieving 2022 refund policies after we rewrote them in 2024. Users were getting incorrect information for 6 weeks before we found the root cause: the re-embedding pipeline ran manually, not automatically on document update.' The root cause is an index that is treated as static infrastructure rather than as a living mirror of the source document state. Many teams embed their document corpus once during initial deployment and then treat the index as finished. Document sources — wikis, policy documents, product documentation, database snapshots — continue to change, but the index does not.

Fixing stale data requires an operational process, not just a technical patch. The foundation is metadata: every chunk must carry a last_modified timestamp that reflects when its source document last changed. Without this metadata, you cannot even measure staleness, let alone address it. Once metadata is in place, you have two retrieval-time options: freshness-weighted retrieval that penalizes old vectors in scoring, or hard cutoffs that exclude chunks whose source was modified before a configurable threshold. The more robust fix is a delta-indexing pipeline that monitors source documents for changes — via webhooks, polling, or a change data capture feed — and re-embeds and re-indexes only the changed documents within 24 hours. Combined with an explicit date-awareness instruction in the system prompt ('prefer information from documents modified after [date]'), this brings staleness to a manageable level. The pipeline investment is higher than most fixes in this guide, but the alternative is quietly wrong answers that erode user trust.
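The two retrieval-time options above, freshness weighting and a hard cutoff, might look like this once every chunk carries a last_modified timestamp. The exponential half-life decay and the 180-day default are assumptions of this sketch; any monotonic penalty on age would fit the description.

```python
from datetime import datetime, timedelta, timezone

def freshness_weighted_score(similarity, last_modified, now, half_life_days=180.0):
    """Halve a chunk's score for every half_life_days of age."""
    age_days = max((now - last_modified).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)

def passes_cutoff(last_modified, cutoff):
    """Hard filter: keep only chunks whose source changed on or after cutoff."""
    return last_modified >= cutoff


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
old_policy = freshness_weighted_score(0.90, now - timedelta(days=360), now)
new_policy = freshness_weighted_score(0.70, now, now)
```

Here the older chunk starts with the higher similarity but ends up ranked below the current one, which is the behavior you want when a policy has been rewritten.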

Failure mode 5: Hallucination on missing context

Every RAG system has a coverage boundary: questions outside that boundary cannot be answered from the corpus. The failure mode occurs when the LLM produces a confident-sounding answer anyway, drawing on parametric memory (knowledge baked into model weights during pre-training) rather than signaling that the retrieved context does not contain the answer. The symptom is specific: users report fabricated citations, invented facts, or answers that reference real-sounding but non-existent documents. The distinguishing feature is that the question is genuinely outside the corpus — it is not a retrieval quality problem, it is a generation guard problem.

The root cause is that language models have a strong inductive bias toward producing answers. When given a question and context that does not answer the question, many LLMs will attempt to synthesize an answer from partial evidence rather than refusing. This is a desirable behavior in some contexts — interpolating across incomplete information is useful — but it is the wrong default for a system intended to ground answers in specific documents. The problem is compounded by retrieval systems that always return some result even when the cosine similarity is very low: the LLM receives low-quality context and, rather than recognizing it as noise, builds on it.

The most effective defense is layered. Start with refusal prompting: the system prompt should explicitly instruct the LLM to respond 'I don't have information about that in the provided documents' when retrieved context does not contain sufficient evidence to answer. This is a surprisingly effective and underused technique. The second layer is a retrieval confidence gate: before passing retrieved chunks to the LLM, check the maximum cosine similarity score in the result set. If it falls below a calibrated threshold — typically 0.55-0.65, though this varies by domain and embedding model — route the query to a fallback path such as a human handoff, a search engine suggestion, or a templated refusal response. The threshold requires calibration against your eval set rather than using a universal value. Combining these two defenses — prompting the LLM to refuse and gating on retrieval confidence — reduces hallucination on out-of-domain questions substantially without degrading performance on in-domain questions.
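A minimal sketch of the two layers together is shown below. The threshold default of 0.6 is just a point inside the range mentioned above and would need calibration on your own eval set; the function and constant names are hypothetical, and the refusal wording is one possible phrasing.

```python
import math

REFUSAL_INSTRUCTION = (
    "If the provided documents do not contain enough information to answer "
    "the question, respond with: I do not have that information in the "
    "available documents."
)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def confidence_gate(query_vec, chunks, threshold=0.6):
    """chunks: list of (text, vector) pairs from the retriever.

    Returns ("answer", texts ordered best-first) when the best match clears
    the threshold, otherwise ("fallback", []) so the caller can route to a
    human handoff or a templated refusal instead of calling the LLM.
    """
    scored = sorted(((cosine_similarity(query_vec, vec), text) for text, vec in chunks),
                    reverse=True)
    if not scored or scored[0][0] < threshold:
        return "fallback", []
    return "answer", [text for _, text in scored]
```

On the "answer" path, REFUSAL_INSTRUCTION would be included in the system prompt so the model still has a way to decline if the context turns out to be insufficient.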

Failure mode 6: Retrieval poisoning and adversarial documents

Retrieval poisoning is the RAG equivalent of SQL injection: an attacker influences the system's behavior through the content it ingests rather than through the query itself. The symptom is anomalous: the bot produces outputs that seem optimized for a different purpose than answering the user's question, such as off-brand content, competitor references, unusual call-to-action language, or answers that seem designed to mislead. In some cases, poisoning is not deliberate; poorly curated corpora that ingest low-quality external content can produce similar symptoms through accidental rather than adversarial contamination.

The root cause is a lack of trust boundaries in the ingestion pipeline. Every document that enters the vector index gains the ability to influence retrieval results. In a closed-corpus RAG system where you control all source documents, the risk is much lower, though insider changes and accidental contamination remain possible. In any system that ingests external content (user-submitted documents, web crawls, third-party data feeds), it becomes a real attack surface. An optimized adversarial document can be crafted to have high cosine similarity to common query types while containing manipulated content, effectively hijacking the retrieval step for those queries.

Defense requires treating ingestion as a security boundary rather than a data plumbing problem. The immediate fixes are a source allowlist (only documents from approved sources are eligible for indexing), content classification before indexing (a lightweight classifier that flags documents containing off-topic or suspicious content for human review), and document-level trust scores stored as metadata. The trust score can then be used as a filter or boosting signal at retrieval time: high-trust sources (your own documentation) are weighted more heavily than low-trust sources (user-submitted content). Namespace separation provides an additional layer: high-trust and low-trust documents live in separate collections within the vector database, and retrieval policies specify how much weight each namespace gets. This is higher operational overhead than most other fixes in this list, but it is the correct architecture for any RAG system with external content inputs.
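As an illustration of the allowlist and trust-score ideas above, the sketch below admits documents only from approved sources at ingestion and down-weights low-trust content at retrieval. The source names, the trust levels, and the specific weights are all hypothetical placeholders, not recommended values.

```python
# Hypothetical approved sources and per-level weights.
ALLOWED_SOURCES = {"docs.example.internal", "policies.example.internal"}
TRUST_WEIGHTS = {"high": 1.0, "low": 0.6}

def admit_for_indexing(doc):
    """Ingestion check: only documents from allowlisted sources get indexed."""
    return doc.get("source") in ALLOWED_SOURCES

def rank_with_trust(results):
    """results: dicts with 'id', 'similarity', and 'trust_level' metadata.

    Multiplies similarity by the trust weight and drops any result whose
    trust level is missing or unrecognized.
    """
    weighted = []
    for r in results:
        weight = TRUST_WEIGHTS.get(r.get("trust_level"), 0.0)
        if weight > 0.0:
            weighted.append((r["similarity"] * weight, r["id"]))
    return [doc_id for _, doc_id in sorted(weighted, reverse=True)]
```

Dropping unlabeled results is a deliberately strict default: a chunk with no provenance is treated as untrusted rather than given the benefit of the doubt.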

Failure mode 7: Semantic similarity does not equal relevance

This failure mode is subtle because the retrieval appears to be working correctly on inspection. If you query for 'customer churn analysis' and look at the top-10 retrieved documents, they are all about customer churn — which looks correct. But the LLM's answer is vague and non-specific because 'about customer churn' is not the same as 'answers the question about customer churn.' You are retrieving all the topically related documents rather than the specifically answer-relevant ones. The symptom is answers that are correct in domain but wrong in specificity: technically accurate but not actually useful.

The root cause is that standard embedding models are trained to optimize for topical proximity, not for question-answering relevance. The objective function during training is something like 'similar documents should be close in vector space,' where similarity is defined by co-occurrence, paraphrase datasets, or contrastive signals from the training corpus. None of these objectives specifically train the model to place 'Q3 churn was 4.2%' close to 'what was our Q3 churn rate?' — which is the question-answering retrieval signal we actually want.

The most practical fix is the same as for failure mode 3: a cross-encoder reranker. The reranker is trained specifically on (question, passage) pairs and learns to score relevance in the question-answering sense rather than the topical-proximity sense. If you are already adding a reranker for failure mode 3, you address failure mode 7 simultaneously. For teams that want to address this at the embedding layer rather than the reranking layer, instruction-aware embedding models are the answer. NVIDIA's NV-Embed-v2 accepts an explicit task instruction alongside the query, and Voyage's voyage-3-large distinguishes queries from documents through an input type setting. In both cases you tell the model that the text is a question looking for an answering passage rather than generic text, and the resulting vectors are optimized for retrieval rather than general similarity. Late-interaction models like ColBERT take this further by preserving token-level interaction signals rather than compressing the full document into a single vector, which enables finer-grained relevance matching. The trade-off is storage cost: ColBERT requires storing per-token embeddings rather than a single vector per chunk.
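To make the late-interaction idea concrete, here is a sketch of the MaxSim scoring used by ColBERT-style models: each query token is matched to its most similar document token, and those maxima are summed. The tiny hand-made vectors are purely illustrative; a real model would produce normalized per-token embeddings with many more dimensions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]      # two query-token vectors
doc_full = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
doc_partial = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first
```

The document that matches every query token scores higher than one that matches a single token twice, which is the fine-grained signal a single pooled vector tends to blur. It also shows where the storage cost comes from: every document token keeps its own vector.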

Stacking fixes for production-grade RAG

The seven failure modes are not independent. Fixing one often reduces the severity of another, and stacking multiple fixes produces compounding improvements that exceed the sum of individual gains. The most important stacking insight is that chunking quality determines the ceiling for all downstream fixes. If your chunks are semantically incoherent, the best reranker in the world cannot rescue retrieval, because the information needed to answer the question may simply not be present in any single chunk. Fix chunking first, measure your recall@10 baseline, then add the reranker, then add the confidence gate. This ordering is important because each fix changes the error distribution that the next fix needs to handle.

The second key stacking insight is that metadata quality multiplies the value of freshness, trust, and confidence gating. Timestamps enable freshness filtering. Source provenance enables trust scoring. Both enable the confidence gate to make better routing decisions. Teams that add metadata to their index at the beginning of their RAG buildout find that every subsequent improvement is easier to implement and measure. Teams that skip metadata at the start often find themselves reprocessing their entire corpus later, which is expensive and disruptive. Adding metadata is low engineering effort up front and high leverage across the pipeline.

A practical production stack, ordered by implementation sequence, looks like this: start with semantic or overlap-aware chunking; add comprehensive metadata (source, last_modified, document_type, trust_level) at index time; implement hybrid BM25 plus dense retrieval if your domain has vocabulary gaps; add a cross-encoder reranker on the top-100 candidates; implement the retrieval confidence gate with a tuned cosine threshold; add refusal prompting to the system prompt; and set up delta re-indexing for document freshness. This is a week to two weeks of engineering for a team starting from a basic vector search implementation, and the improvement in precision and recall is typically large enough to justify the investment before any user-visible product changes.

Measuring and monitoring your RAG pipeline

Improvements are only meaningful if you can measure them. The foundational RAG evaluation metrics are recall@k (what fraction of the time does the correct chunk appear in the top-k retrieved results) and precision@k (what fraction of the top-k retrieved results are actually relevant). To compute these you need a held-out evaluation set: a collection of (question, relevant_chunk_id) pairs drawn from your actual use case. Building a 100-200 question eval set from production queries with human relevance labels is the highest-leverage investment you can make before tuning any RAG parameter. Without it, all changes are guesses.
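Both metrics are short enough to compute without a framework. The sketch below follows the definitions above, assuming one labeled relevant chunk per question for recall@k; the data shapes (a list of pairs and a dict of ranked chunk IDs) and function names are hypothetical.

```python
def recall_at_k(eval_pairs, results, k=10):
    """eval_pairs: list of (question, relevant_chunk_id).
    results: dict mapping question -> ranked list of retrieved chunk ids.
    Returns the fraction of questions whose relevant chunk is in the top k.
    """
    if not eval_pairs:
        return 0.0
    hits = sum(1 for question, relevant in eval_pairs
               if relevant in results.get(question, [])[:k])
    return hits / len(eval_pairs)

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top = retrieved_ids[:k]
    return sum(1 for chunk_id in top if chunk_id in relevant_ids) / k
```

Running these before and after each pipeline change gives you the before/after numbers that the rest of this guide relies on.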

Beyond retrieval metrics, you should track generation quality metrics: answer faithfulness (does the answer only make claims supported by the retrieved context?), answer relevance (does the answer address the question asked?), and context precision (of the retrieved chunks passed to the LLM, what fraction were actually relevant to the answer, and were the relevant ones ranked near the top?). The RAGAS framework provides automated metrics for all of these, though automated metrics should be validated against human judgment before being used to drive decisions. Faithfulness in particular is a useful proxy for hallucination on missing context (failure mode 5): low faithfulness scores suggest the LLM is drawing on parametric memory rather than retrieved context.

Monitoring in production requires tracking the distribution of cosine similarity scores for production queries. Sudden drops in the average max similarity score indicate either a shift in user query distribution (users are asking questions the corpus was not built to answer) or a staleness problem (the corpus no longer reflects the current state of the topic). Both warrant investigation. A weekly automated report of the 50 lowest-similarity queries is a simple and effective early warning system for coverage gaps. For cost-per-query benchmarking as you tune these parameters, the RAG cost calculator provides a useful framework.
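A weekly report along these lines needs nothing more than a log of each query's best similarity score. The sketch assumes you already record (query_text, max_similarity) per request; the log shape and function name are hypothetical.

```python
import heapq
from statistics import mean

def weekly_similarity_report(query_log, n=50):
    """query_log: list of (query_text, max_similarity) for the week.

    Returns the average best-match score plus the n queries with the
    weakest retrieval, which are the likeliest coverage gaps.
    """
    if not query_log:
        return {"mean_max_similarity": None, "lowest": []}
    lowest = heapq.nsmallest(n, query_log, key=lambda row: row[1])
    return {
        "mean_max_similarity": mean(score for _, score in query_log),
        "lowest": lowest,
    }
```

Comparing mean_max_similarity week over week surfaces the sudden drops described above, while the lowest list gives reviewers concrete queries to inspect.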

Diagnosing your RAG pipeline

1. Establish baseline recall@10 on a held-out eval set
Before changing anything in your pipeline, measure where you are. Build or sample 100-200 (question, relevant_chunk_id) pairs from production queries or domain expert annotation. Run your current retrieval against this eval set and compute recall@10: what fraction of questions have the correct chunk in the top-10 results. This number is your diagnostic baseline. A recall@10 below 0.70 indicates retrieval is the primary problem. A recall@10 above 0.85 with poor answer quality suggests the problem is in ranking, generation, or context utilization rather than retrieval. You cannot know which path to take without this measurement.
2. Add metadata and timestamps to all chunks
Audit your current chunk schema. Every chunk should carry at minimum: source_url or document_id, last_modified timestamp, document_type (policy, FAQ, product spec, etc.), and a trust_level indicator if your corpus has mixed sources. If chunks are missing these fields, reprocess the corpus to add them before making any other changes. Metadata is the substrate that freshness filtering, trust scoring, and staleness detection are built on. Adding it retroactively is more expensive than adding it during initial indexing; do it once, do it completely, and your subsequent improvements will be faster to implement and easier to measure.
3. Add a cross-encoder reranker layer
After verifying that recall@10 is acceptable (the right document appears in the top-10), add a cross-encoder reranker to improve precision. Retrieve the top 50 or 100 candidates by embedding similarity, then apply the reranker to score each (query, candidate) pair and select the top 5-10 for the LLM. Cohere rerank-v3.5, Voyage rerank-2, and the BGE-reranker-v2-m3 open-source model are all viable starting points. Measure precision@5 before and after. In most systems, adding a reranker is the single highest-ROI change you can make after chunking quality is addressed, with precision@5 improvements of 15-30 percentage points reported across multiple published evaluations.
4. Implement refusal prompting for low-confidence retrievals
Add two defenses against hallucination on missing context. First, update your system prompt to explicitly instruct the LLM to refuse when context is insufficient: something like 'If the provided documents do not contain enough information to answer the question, respond with: I do not have that information in the available documents.' Second, implement a pre-LLM confidence gate: check the maximum cosine similarity score in the retrieved result set, and if it falls below a threshold calibrated on your eval set (typically 0.55-0.65), route to a fallback path rather than sending the low-quality context to the LLM. Calibrate the threshold by finding the similarity score that separates correct retrievals from incorrect ones on your eval set, not by using a universal value.
5. Set up a freshness pipeline for document updates
Implement automated detection and re-indexing for changed source documents. The mechanism depends on your document sources: webhook callbacks for wiki platforms, filesystem watchers for local document stores, polling with ETag or Last-Modified header comparison for web sources, or change data capture for database-backed content. When a source document changes, re-chunk and re-embed only the affected document and replace its chunks in the index. Store the re-embedding timestamp in chunk metadata and log it for monitoring. Set up a weekly audit that lists all chunks whose last_modified timestamp is more than 30 days older than their source document's known modification date. This audit will catch gaps in your freshness pipeline before they become user-visible incidents.
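For the freshness step, a polling-style variant can be sketched with content hashes, alongside the weekly staleness audit. Hash comparison is an assumption of this sketch (the mechanisms named above are webhooks, filesystem watchers, ETag or Last-Modified polling, and change data capture), and the dictionary shapes and function names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(source_docs, indexed_hashes):
    """source_docs: {document_id: current text}.
    indexed_hashes: {document_id: hash stored when it was last embedded}.
    Returns (ids to re-chunk and re-embed, ids to delete from the index).
    """
    changed = sorted(doc_id for doc_id, text in source_docs.items()
                     if indexed_hashes.get(doc_id) != content_hash(text))
    deleted = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return changed, deleted

def stale_chunks(chunks, source_modified, max_lag=timedelta(days=30)):
    """Audit: chunk ids whose last_modified trails the source's known
    modification time by more than max_lag."""
    return [c["chunk_id"] for c in chunks
            if source_modified[c["document_id"]] - c["last_modified"] > max_lag]
```

New documents show up in the changed list automatically, because they have no stored hash to match against.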

Frequently Asked Questions

How do I know if my RAG system is failing?

The clearest signals are: users correcting the bot's answers, questions about topics you know are in the corpus returning 'I don't know' responses, answers that are vague and topically correct but not specifically useful, and users reporting fabricated citations or invented information. For a systematic diagnosis, build a small held-out eval set of 50-100 questions with known correct answers and measure recall@10 and answer faithfulness. Low recall@10 (below 0.70) points to a retrieval problem. Acceptable recall but low faithfulness points to hallucination or generation-layer issues. Low precision@5 despite high recall points to a missing reranker.

What is HyDE (Hypothetical Document Embeddings) and does it actually work?

HyDE addresses vocabulary mismatch by generating a hypothetical answer document with an LLM before performing retrieval. Instead of embedding the user's raw query, you prompt the LLM to write a short passage that would answer the question, then embed that passage and use it for nearest-neighbor search. The hypothesis is that a generated answer will use the same vocabulary as real answers in your corpus, making the retrieval vector more similar to relevant documents than the original query vector would be. In benchmarks on knowledge-intensive tasks, HyDE improves retrieval quality by 5-15% depending on domain. It adds one LLM call per query, so the latency and cost trade-off should be evaluated against your specific use case. For highly specialized domains with large vocabulary gaps, the improvement can be substantially larger.

How much does adding a reranker improve RAG accuracy?

Published evaluations from Cohere and Voyage AI, along with academic benchmarks, show precision@5 improvements of 15-30 percentage points when a cross-encoder reranker is added to a system that previously ranked by embedding similarity alone. The improvement is largest when initial retrieval recall is good but precision is poor, that is, when the correct documents are being retrieved but not ranked first. The war story earlier in this post reflects a real pattern: jumping from 54% to 78% precision@5 overnight is not unusual for a system that had never used a reranker. The latency cost is typically 100-300ms for reranking 50-100 candidates, which is acceptable for most user-facing applications.
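
The second-stage step can be sketched as follows. `score_pair` is a hypothetical stand-in for whatever relevance scorer you use; in practice it could wrap a cross-encoder model (for instance, `CrossEncoder.predict` in the sentence-transformers library) or a hosted rerank endpoint.

```python
from typing import Callable, List, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],  # hypothetical relevance scorer
    top_n: int = 5,
) -> List[str]:
    """Re-score first-stage candidates against the query and keep the best top_n."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]
```

The first stage stays cheap and recall-oriented (50-100 candidates from embedding similarity); only this smaller set pays for the more expensive pairwise scoring.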

What is retrieval poisoning and how real a threat is it?

Retrieval poisoning is an attack where an adversary inserts documents into your RAG corpus that are crafted to appear highly relevant to specific query types but contain manipulated content. When a poisoned document ranks highly for a targeted query, the LLM uses it as context and produces a manipulated answer. The threat is most concrete in systems that ingest external content: user-submitted documents, web crawls, third-party feeds. For purely internal corpora where you control all source documents, the risk is lower but not zero: insider threats and accidental contamination are still real. The primary defenses are ingestion validation with content classifiers, source allowlists, and namespace separation by trust level.
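
One way those three defenses might fit together at ingestion time is sketched below. The host names, namespace labels, and the `looks_malicious` classifier are all hypothetical; a production classifier would be considerably more involved than a callable returning a boolean.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical allowlists: first-party hosts and vetted third-party hosts.
TRUSTED_HOSTS = {"docs.example.com"}
ALLOWED_HOSTS = TRUSTED_HOSTS | {"partners.example.com"}


@dataclass
class Document:
    source_url: str
    text: str


def assign_namespace(
    doc: Document,
    looks_malicious: Callable[[str], bool],  # hypothetical content classifier
) -> Optional[str]:
    """Return the index namespace for a document, or None to reject it."""
    host = urlparse(doc.source_url).hostname or ""
    if host not in ALLOWED_HOSTS:
        return None  # source allowlist: unknown origins never reach the index
    if looks_malicious(doc.text):
        return None  # ingestion validation: flagged content is dropped
    # Namespace separation: queries can later be restricted by trust level.
    return "trusted" if host in TRUSTED_HOSTS else "external"
```

Keeping the trust level in the namespace means a sensitive query path can search only the `trusted` namespace without re-validating documents at query time.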

How do I prevent LLM hallucination in a RAG system?

There are three complementary defenses. First, refusal prompting: explicitly instruct the LLM in the system prompt to acknowledge when the provided context does not contain sufficient information to answer a question. This is the most immediate and lowest-cost fix. Second, retrieval confidence gating: check the maximum cosine similarity score before passing context to the LLM, and route low-confidence retrievals to a fallback path rather than asking the LLM to generate from poor context. Third, faithfulness evaluation in production: use an automated metric or a lightweight LLM judge to flag answers that contain claims not supported by the retrieved documents. No single defense eliminates hallucination entirely; the combination of all three is significantly more effective than any one alone.
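
The first two defenses are simple enough to sketch directly. The prompt wording and the route names are illustrative assumptions, and the 0.6 default is only a placeholder for a general-purpose embedding model that should be calibrated against your own eval set.

```python
from typing import Sequence

# Refusal prompting: an illustrative system prompt, not a canonical wording.
REFUSAL_SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    "If the context does not contain enough information to answer, "
    "say so explicitly instead of guessing."
)


def route_query(similarity_scores: Sequence[float], threshold: float = 0.6) -> str:
    """Confidence gating: decide whether retrieval is strong enough to generate.

    similarity_scores are the cosine similarities of the retrieved chunks.
    Returns "generate" or "fallback" (hypothetical route names).
    """
    if not similarity_scores or max(similarity_scores) < threshold:
        return "fallback"  # e.g. a canned "I don't know" or a human handoff
    return "generate"
```

The gate runs before the LLM call, so a weak retrieval costs nothing in generation tokens and never gives the model a chance to improvise from poor context.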

What is a retrieval confidence threshold and how do I choose one?

A retrieval confidence threshold is a minimum cosine similarity score that the top-retrieved document must meet before the query proceeds to the LLM. If the best-matching document has a similarity score below the threshold, the system routes to a fallback rather than generating from weak context. Choosing the threshold requires a labeled eval set: compute the cosine similarity score for the top-1 retrieved document for each question in your eval set, and separate questions where the correct document was retrieved from questions where it was not. Find the similarity score that best separates these two groups; this is your calibrated threshold. A threshold of 0.6 is a reasonable starting point for general-purpose embedding models, but no single value is universal: domain-specific corpora and models may require different values. Expect to recalibrate after any change to the embedding model or corpus.
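
The calibration step can be sketched as a small search over candidate cutoffs. This version assumes "best separates" means maximizing classification accuracy, which is one reasonable reading among several (you might prefer to weight false accepts more heavily), and the sample scores are made up.

```python
from typing import List, Tuple


def calibrate_threshold(samples: List[Tuple[float, bool]]) -> float:
    """Pick the cutoff that best separates hits from misses.

    Each sample is (top-1 cosine similarity, whether the correct document
    was retrieved). A score >= cutoff is treated as a predicted hit.
    """
    if not samples:
        raise ValueError("need at least one labeled sample")
    best_cutoff, best_accuracy = 0.0, -1.0
    for cutoff in sorted({score for score, _ in samples}):
        agree = sum((score >= cutoff) == was_hit for score, was_hit in samples)
        accuracy = agree / len(samples)
        if accuracy > best_accuracy:  # on ties, keep the lower cutoff
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff


# Hypothetical labeled scores from an eval run.
labeled = [(0.9, True), (0.8, True), (0.7, True), (0.5, False), (0.4, False)]
threshold = calibrate_threshold(labeled)
```

Rerun this whenever the embedding model or corpus changes, since the score distribution shifts with both.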

How often should I re-embed my entire corpus?

Full corpus re-embedding is expensive and usually unnecessary if you have a delta re-indexing pipeline that keeps individual documents current. The cases where full re-embedding is justified are: a major upgrade to the embedding model (for example, switching from a first-generation to a third-generation model), a significant expansion of the corpus domain that suggests the index distribution has shifted substantially, or a discovered data quality issue that affected a large fraction of chunks at index time. For routine operations, a delta pipeline that re-embeds changed documents within 24 hours and a weekly audit that checks for staleness should be sufficient. If your corpus changes rapidly, with daily updates to many documents, invest in the delta pipeline early rather than relying on periodic full re-embeds.
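
A delta pipeline needs some way to detect which documents changed. The sketch below assumes content hashing for that purpose, which is one common approach rather than something this post prescribes; the function names and the shape of the stored hash map are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_delta(
    current_docs: Dict[str, str],    # doc_id -> current text from the source
    indexed_hashes: Dict[str, str],  # doc_id -> hash stored at last index time
) -> Tuple[List[str], List[str]]:
    """Return (doc_ids to re-embed, doc_ids to delete from the index)."""
    to_embed = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(to_embed), sorted(to_delete)
```

The same comparison can back the weekly staleness audit: any document whose stored hash no longer matches the source has slipped past the delta pipeline.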

Are there domains where RAG consistently underperforms fine-tuning?

RAG tends to underperform fine-tuning in two scenarios: highly procedural tasks where the model needs to internalize a process rather than retrieve a fact (for example, code generation following specific style patterns), and domains where the 'knowledge' is implicit rather than document-representable (for example, calibrated numerical judgment or tacit expert intuition). For factual question-answering over a known corpus — which is the primary use case for enterprise RAG — RAG outperforms fine-tuning on knowledge freshness and interpretability because retrieved context is auditable and updatable without retraining. The most capable production systems often combine both: a fine-tuned model for domain vocabulary and reasoning style, and RAG for factual grounding.
