A failure taxonomy for RAG systems
Retrieval-Augmented Generation is a pipeline, not a single model. Documents are chunked and embedded at index time. At query time, the query is embedded, a nearest-neighbor search retrieves candidate chunks, and an LLM synthesizes an answer from those candidates. Failure can enter at every stage: the chunking strategy, the embedding model, the retrieval step, the ranking of results, the LLM's use of retrieved context, and the freshness of the index. Because the stages are loosely coupled, a failure in one stage often looks like a failure in another, which is why RAG systems are notoriously difficult to debug without a structured taxonomy.
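To make the stage boundaries concrete, here is a minimal sketch of the pipeline described above. It is illustrative only: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the brute-force cosine ranking stands in for a vector index, and every function name here is hypothetical. The LLM call is left out; `build_prompt` only shows where retrieved chunks would be handed to the generator.

```python
import math
from collections import Counter

def chunk(text, size=50):
    """Split a document into chunks of at most `size` words (naive strategy)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: lowercase term counts. A real system would call a model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents):
    """Index time: chunk every document and store (chunk, embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(index, query, k=2):
    """Query time: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Generation input: the retrieved chunks become the LLM's context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The index is rebuilt nightly from the document store",
    "Embeddings are computed with a sentence encoder",
]
index = build_index(docs)
top = retrieve(index, "when is the index rebuilt", k=1)
print(build_prompt("when is the index rebuilt", top))
```

Each function corresponds to one place where failure can enter, so swapping or instrumenting a single function is a cheap way to test one stage without disturbing the others.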
The seven failure modes in this guide correspond to distinct engineering problems at distinct pipeline layers. They are not exhaustive: you can also fail at the infrastructure layer with slow vector queries, or at the query parsing layer with ambiguous question routing. They do, however, cover the majority of quality defects seen in production deployments. Each failure mode has a characteristic symptom fingerprint, which is the most efficient diagnostic tool: if you can match the symptom to the mode, you can skip open-ended debugging and go directly to the known remediation.
One important framing note: a wrong answer from a RAG system is not necessarily a retrieval failure. It may be a generation failure, meaning the retrieval was correct but the LLM misused the context. Separating retrieval quality from generation quality is the first diagnostic step. Evaluate your retrieval in isolation on a labeled eval set before assuming the LLM is at fault: check recall@10 (the share of all relevant chunks that appear in the top 10 results) and precision@5 (the share of the top 5 results that are relevant). The RAG cost calculator can help estimate the cost implications of different retrieval configurations as you experiment.
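The retrieval-in-isolation check can be sketched in a few lines. This is a minimal example, assuming each eval query comes with a ranked list of retrieved chunk IDs and a hand-labeled set of relevant chunk IDs; the function names and the sample IDs are hypothetical. Precision here is divided by k even when fewer than k results come back, which is one common convention.

```python
from typing import Sequence, Set

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k result slots filled by a relevant chunk."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k

# Hypothetical labeled example: ranked retriever output and gold labels.
retrieved = ["c7", "c2", "c9", "c4", "c1", "c3", "c8", "c5", "c6", "c0"]
relevant = {"c2", "c4", "c5", "c11"}

print(recall_at_k(retrieved, relevant, k=10))    # 0.75 (c11 was never retrieved)
print(precision_at_k(retrieved, relevant, k=5))  # 0.4  (c2 and c4 in the top 5)
```

In practice these values would be averaged over every query in the eval set. Low recall@10 points at chunking, embedding, or index freshness, while high recall with wrong answers shifts suspicion to ranking or generation.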