Cross-encoder architecture: why rerankers work at all
To understand why rerankers exist, you need to understand the trade-off that embedding-based retrieval makes. Embedding models (bi-encoders) encode the query and each document independently into fixed-dimension vectors. Similarity search finds the documents whose vectors are closest to the query vector. The key word is independently — the query and document are never processed together. The model has no way to reason about how well a specific document answers a specific query; it can only compare vector proximity in an abstract embedding space.
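A minimal sketch of the bi-encoder flow described above, using only the Python standard library. The three-dimensional vectors, the document IDs, and the helper names `cosine` and `top_k` are illustrative assumptions; a real system would get its vectors from an embedding model and search them with a vector index. Note that scoring only ever sees two finished vectors, never the query and document text together.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Rank pre-computed document vectors by similarity to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings, computed offline, one per document.
DOC_VECS = {
    "doc_finance": [0.9, 0.1, 0.0],
    "doc_chemistry": [0.1, 0.9, 0.0],
    "doc_sports": [0.0, 0.1, 0.9],
}

# The query is embedded once, on its own; this vector is made up.
QUERY_VEC = [0.8, 0.2, 0.0]

RESULTS = top_k(QUERY_VEC, DOC_VECS, k=2)
```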
Cross-encoders are architecturally different. They take the query and a candidate document as a single concatenated input, [CLS] query [SEP] document [SEP], and output a single relevance score. Because query and document are processed together, every query token can attend to every document token and vice versa, so the score reflects how this specific document relates to this specific query. The interaction is direct. This is why cross-encoders typically outperform bi-encoders on precision metrics: given a query about 'corporate bonds', they can tell that a document using the word 'bond' in a financial context is relevant and one using it in a chemical context is not.
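To make the input format concrete, here is a small sketch of how a (query, document) pair could be packed into one BERT-style sequence. Whitespace splitting stands in for a real subword tokenizer, and the function name `build_pair_input` is hypothetical. The segment IDs follow the BERT convention of marking the first and second segment; not every cross-encoder uses them.

```python
def build_pair_input(query, document):
    """Concatenate a query and a document into one BERT-style sequence.

    Returns (tokens, segment_ids). Whitespace splitting is an assumption
    made for readability; real models use a subword tokenizer.
    """
    query_tokens = query.lower().split()
    doc_tokens = document.lower().split()
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query, and the first [SEP];
    # segment 1 covers the document and the final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids

TOKENS, SEGMENTS = build_pair_input("corporate bonds", "Bond yields rose sharply")
```

The model sees this whole sequence at once, which is what lets attention connect 'bonds' in the query to 'bond' and 'yields' in the document.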
The cost of this accuracy is inference time. A bi-encoder encodes the query once and then computes dot-product similarity against pre-computed document vectors in milliseconds. A cross-encoder must run a forward pass for each (query, document) pair: if you have 100 candidate documents, you run 100 forward passes. That is O(N) model forward passes per query, where N is the candidate count. For a 1-billion-document corpus, running a cross-encoder over every document is computationally infeasible. This is why rerankers are almost always deployed as a second stage: first-pass retrieval (BM25 or embedding similarity) narrows the corpus to 50-200 candidates, then the cross-encoder reranks those candidates precisely.
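The two-stage pattern can be sketched as follows. Both scorers here are toy stand-ins: `term_overlap` plays the role of BM25 or embedding similarity, and `toy_pair_scorer` plays the role of a cross-encoder forward pass. All names, the corpus, and the parameter values are assumptions for illustration. The call counter shows that the expensive scorer runs once per candidate, not once per corpus document.

```python
def term_overlap(query, document):
    """Toy first-stage scorer: how many query terms appear in the document."""
    doc_terms = set(document.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in doc_terms)

def retrieve_then_rerank(query, corpus, first_stage, pair_scorer, n_candidates, k_final):
    """Narrow the corpus with a cheap scorer, then rerank the survivors."""
    # Stage 1: cheap scoring over the whole corpus, keep the top n_candidates.
    candidates = sorted(corpus, key=lambda doc: first_stage(query, doc), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: one expensive pair_scorer call per candidate.
    reranked = sorted(candidates, key=lambda doc: pair_scorer(query, doc), reverse=True)
    return reranked[:k_final]

CALLS = {"pair_scorer": 0}

def toy_pair_scorer(query, document):
    """Hypothetical stand-in for a cross-encoder forward pass."""
    CALLS["pair_scorer"] += 1
    # Rewards documents that contain the full query as a phrase.
    return 1.0 if query.lower() in document.lower() else 0.0

CORPUS = [
    "corporate bonds pay a fixed coupon",
    "covalent bonds share electrons; corporate labs study them",
    "the stock market fell today",
    "football scores from the weekend",
]

TOP = retrieve_then_rerank("corporate bonds", CORPUS, term_overlap, toy_pair_scorer,
                           n_candidates=2, k_final=1)
```

In this toy corpus the first two documents tie on term overlap, and only the pair scorer separates the financial one from the chemical one.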
The latency math: a typical cross-encoder forward pass on a T4 GPU takes 10-50ms depending on document length. Reranking 100 documents sequentially therefore takes 1-5 seconds, which is too slow for a user-facing request. Production deployments either (1) batch all 100 pairs into a single GPU inference call, which reduces wall-clock time to 50-200ms, or (2) use a hosted API (Cohere, Voyage) that handles batching server-side. Self-hosted BGE needs batching logic in your application; hosted APIs abstract this away.
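The sequential estimate above is plain multiplication, and the batching logic a self-hosted deployment needs starts with chunking the pairs. The sketch below shows both. The helper names and the batch size of 32 are assumptions, and the code does not model batched GPU latency, which depends on hardware and sequence length.

```python
def sequential_latency_s(n_pairs, per_pass_ms):
    """Wall-clock seconds when each pair gets its own forward pass, one at a time."""
    return n_pairs * per_pass_ms / 1000

def make_batches(pairs, batch_size):
    """Split (query, document) pairs into lists of at most batch_size items."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

# 100 hypothetical (query, document) pairs for one user query.
PAIRS = [("corporate bonds", f"candidate document {i}") for i in range(100)]

LOW = sequential_latency_s(len(PAIRS), 10)    # 100 passes at 10 ms each
HIGH = sequential_latency_s(len(PAIRS), 50)   # 100 passes at 50 ms each
BATCHES = make_batches(PAIRS, batch_size=32)  # batch size is an assumption
```

Each batch would then go to the model as one inference call, so the number of calls drops from 100 to 4 in this example.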
Why the quality lift is so consistent: on the BEIR benchmark (Benchmarking Information Retrieval, 18 heterogeneous datasets covering news, biomedical, financial, scientific, and general web corpora), reranking on top of first-stage BM25 or dense retrieval typically adds 3-10 nDCG@10 points. The lift varies by dataset: on some BEIR tasks (TREC-COVID and FiQA, for example) it exceeds 10 points, while on others it is much smaller. The average lift across BEIR tasks is 5-8 nDCG@10 points when replacing BM25-only retrieval with BM25 + cross-encoder rerank. This is not a marginal improvement: it moves systems from 'borderline useful' to 'reliably useful' on real retrieval tasks.
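For readers unfamiliar with the metric, here is a small sketch of nDCG@10. It assumes linear gain (the relevance label itself, not 2^rel - 1) and assumes the input list contains every judged-relevant document, so that sorting it gives the ideal ranking. The relevance labels are made up. A "point" in the figures above is one hundredth of the 0-1 score.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    # rank is 0-based, so the 1-based position i has discount log2(i + 1).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels in ranked order, before and after reranking.
BEFORE = ndcg_at_k([0, 1, 0, 1])
AFTER = ndcg_at_k([1, 1, 0, 0])

# Lift expressed in nDCG@10 points (score difference times 100).
LIFT_POINTS = (AFTER - BEFORE) * 100
```

Moving the two relevant documents from positions 2 and 4 to positions 1 and 2 raises the score because the logarithmic discount penalizes relevant results that appear lower in the list.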
Why not just use a bigger embedding model instead of a reranker? Bigger embedding models (e.g., upgrading from voyage-3-lite to voyage-3-large) typically add 3-6 nDCG@10 points. Adding a reranker on top adds another 5-10 points. The two gains compound rather than substitute for each other. The optimal architecture for quality-critical RAG is therefore a strong embedding model for first-stage recall plus a cross-encoder reranker for second-stage precision.