Phase 1: Why hybrid outperforms either approach — BEIR benchmark evidence
The BEIR benchmark (Thakur et al., 2021, arxiv.org/abs/2104.08663) is a widely used evaluation suite for zero-shot information retrieval. It spans 18 datasets across diverse domains and tasks: biomedical (TREC-COVID, NFCorpus), financial (FiQA), news (TREC-NEWS), argument retrieval (ArguAna), fact checking (FEVER, SciFact), and more. The metric is NDCG@10 (normalized discounted cumulative gain over the top 10 results): higher is better and the range is 0-1. Scores are often quoted in points, where 1 point equals 0.01 NDCG@10; differences above roughly 1 point are usually treated as meaningful.
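To make the metric concrete, here is a minimal sketch of NDCG@10 for a single query. It assumes the linear-gain form of DCG (gain equals the graded relevance label); some tools use an exponential gain of 2^rel - 1 instead, so absolute numbers can differ between implementations. The function names and the example labels are hypothetical and are not taken from the BEIR codebase.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevance labels.

    Assumption: linear gain (the label itself), log2(rank + 1) discount.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: labels of the returned documents, in returned order.
    judged_relevances: labels of all judged documents for the query.
    """
    ideal = sorted(judged_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query with three judged documents (labels 2, 1, 0).
# The system returned the best document first but put the partially
# relevant document last.
ranked = [2, 0, 1]
judged = [2, 1, 0]
score = ndcg_at_k(ranked, judged)
print(round(score, 4))  # prints 0.9502
```

A benchmark score is the mean of this per-query value over all queries in a dataset, which is why a gap of a few points reflects a consistent ranking improvement and not a single lucky query.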
On BEIR, hybrid BM25 + dense retrieval outperforms either approach alone by 5-12 NDCG@10 points (0.05-0.12 on the 0-1 scale) on most datasets. The intuition is that the two retrievers fail on different queries. A TREC-COVID query like 'what is the effect of COVID-19 on hemoglobin levels' benefits from both BM25 (an exact match on the term 'hemoglobin') and dense retrieval (semantic similarity to passages on COVID pathophysiology). FiQA financial queries benefit from BM25 on exact ticker symbols and from dense retrieval on conceptual questions.
Published results on BEIR from Formal et al. (2021, SPLADE paper) and subsequent work show that SPLADE (a learned sparse model that uses a neural language model to weight query and document terms and to expand them with related vocabulary terms) + dense retrieval achieves state-of-the-art on 12 of 18 BEIR datasets. In production systems without SPLADE infrastructure, a classical BM25 + dense hybrid fused with reciprocal rank fusion (RRF) recovers most of the gain, typically landing within 1-3 NDCG@10 points of the SPLADE hybrid at a fraction of the operational complexity.
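As an illustration of the RRF step, the sketch below fuses two ranked lists of document IDs. Each document scores the sum of 1 / (k + rank) over the lists it appears in, so only ranks are used and the raw BM25 and embedding scores never need to be normalized against each other. The constant k = 60 follows the original RRF paper (Cormack et al., 2009); the function name, document IDs, and alphabetical tie-break are assumptions made for this example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; 60 is the value used in the original RRF paper.
    Returns document IDs sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top-3 results from each retriever for the same query.
bm25_top = ["d3", "d1", "d2"]
dense_top = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_top, dense_top])
print(fused)  # prints ['d1', 'd3', 'd4', 'd2']
```

Documents that both retrievers rank highly (d1, d3) rise to the top, while documents found by only one retriever are kept but ranked lower. In practice each input list would be longer than the final cutoff (for example the top 100 from each retriever) so that the fusion has enough overlap to work with.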
Practical evidence from production teams (Cohere, Pinecone, and Elastic blogs, all 2024-2026) points the same way: hybrid consistently outperforms dense-only on enterprise RAG datasets, particularly for (1) product catalogs with SKUs and model numbers; (2) technical documentation with function names and error codes; (3) legal/compliance documents with specific clause references; and (4) medical records with procedure codes. Dense-only retrieval remains competitive on general QA over natural-language corpora without specialized terminology.