Rerankers are the second-stage filter in a modern retrieval pipeline. After your embedding model returns the top-50 candidates from the vector DB, a reranker scores each (query, document) pair with a cross-encoder model that reads both pieces of text together. That is far more accurate than bi-encoder embeddings, which encode query and document independently. The result is a re-ordered list whose top-5 is dramatically more likely to contain the correct chunk. Pricing in 2026 falls into clear tiers, though the billing units differ: Cohere Rerank v3, the quality leader, costs $1.00 per 1M reranked pairs; Voyage Rerank-1 runs roughly $0.05 per 1,000 pairs (i.e., $50 per 1M pairs); Jina Reranker v2 costs $0.02 per 1M tokens, counted across query and document rather than per pair; and MixedBread's open-weights rerank model hosted via Together AI lands near $0.0005 per 1M tokens, the cheapest production-grade option.
The unit matters. On Cohere and Voyage, reranker bills count pairs, not tokens. A 'pair' is one query combined with one candidate document. If you retrieve top-50 from the vector DB and rerank them against a single query, that is 50 pairs, not 50 × document_length tokens. Jina's token-based pricing reads differently: a typical 500-token document plus a 50-token query is 550 tokens per pair, so 50 pairs × 550 tokens = 27,500 tokens per query. At Jina's $0.02/1M tokens that is $0.00055 per query for the rerank step. At Cohere Rerank v3, 50 pairs × $1/1M = $0.00005 per query. At Voyage Rerank-1, 50 pairs × $50/1M = $0.0025 per query. Across these three, the cheapest (Cohere) is 50x cheaper than the most expensive (Voyage), but all are sub-cent.
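The arithmetic above can be reproduced with a short script. This is a minimal sketch: the prices and token counts are the figures quoted in this article, hard-coded rather than fetched from any provider, and the function names are illustrative.

```python
# Figures quoted in the article; adjust them to your own traffic and price sheet.
PAIRS_PER_QUERY = 50   # top-50 candidates reranked per query
DOC_TOKENS = 500       # typical document length
QUERY_TOKENS = 50      # typical query length


def cost_per_pair_billing(pairs: int, usd_per_million_pairs: float) -> float:
    """Cost when the provider bills per (query, document) pair."""
    return pairs * usd_per_million_pairs / 1_000_000


def cost_per_token_billing(pairs: int, doc_tokens: int, query_tokens: int,
                           usd_per_million_tokens: float) -> float:
    """Cost when the provider bills per token; the query is counted once per pair."""
    total_tokens = pairs * (doc_tokens + query_tokens)
    return total_tokens * usd_per_million_tokens / 1_000_000


rerank_cost = {
    "cohere": cost_per_pair_billing(PAIRS_PER_QUERY, 1.00),
    "voyage": cost_per_pair_billing(PAIRS_PER_QUERY, 50.00),
    "jina": cost_per_token_billing(PAIRS_PER_QUERY, DOC_TOKENS, QUERY_TOKENS, 0.02),
}

for name, usd in rerank_cost.items():
    print(f"{name}: ${usd:.6f} per query")
```

Keeping the two billing models in separate functions makes it harder to compare a per-pair price against a per-token price by accident.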
A typical RAG retrieval pipeline at scale prices out cleanly. For a single user query: embed the query string (~50 tokens × $0.02/1M for text-embedding-3-small) = $0.000001. Vector search against the index is a fixed infrastructure cost; call it $0.00001 of amortized Pinecone serverless time per query at 1M queries/month. Rerank the top-50 with Cohere Rerank v3 = $0.00005. Pass the top-5 reranked chunks plus the user query into the LLM call: at GPT-4.1 ($2/1M input, $8/1M output) with 3,000 input tokens and 500 output tokens, that is $0.006 + $0.004 = $0.010 per query. The LLM call is effectively the entire bill, roughly 160x larger than the three retrieval steps combined ($0.000061).
Reranker quality gain often exceeds the gain from upgrading the embedding model. On a representative internal-knowledge-base eval (50,000 chunks, 200 hand-labeled queries), text-embedding-3-small alone returned recall@5 of 78%. Upgrading to text-embedding-3-large (a 6.5x cost increase, from $0.02/1M to $0.13/1M tokens) lifted it to 83%. Keeping text-embedding-3-small and adding Cohere Rerank v3 lifted recall@5 to 91%, a 13-point gain at $0.00005 per query. The two prices are in different units, so compare them per query: a 50-token query costs about $0.0000065 to embed with text-embedding-3-large, versus about $0.000051 for text-embedding-3-small plus a 50-pair rerank. At query time the reranker path therefore costs more, but both figures are negligible next to the LLM call, and the reranker delivers 8 points more recall. This pattern repeats across most public retrieval benchmarks where rerank ablations are reported.
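For readers who want to run the same comparison on their own data, recall@k can be computed as below. This is a minimal sketch that assumes each query has exactly one labeled correct chunk. The chunk IDs and rankings are made-up toy data, not the eval described above.

```python
from typing import Sequence


def recall_at_k(ranked_ids_per_query: Sequence[Sequence[str]],
                relevant_id_per_query: Sequence[str],
                k: int = 5) -> float:
    """Fraction of queries whose labeled chunk appears in the top-k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)


# Toy data: one labeled chunk per query (hypothetical IDs).
relevant = ["c1", "c4", "c1", "c7"]

# Rankings from the embedding step alone.
baseline = [
    ["c1", "c2", "c3", "c4", "c5", "c6"],
    ["c9", "c8", "c7", "c6", "c5", "c4"],  # labeled chunk sits at rank 6
    ["c2", "c3", "c4", "c5", "c6", "c1"],  # labeled chunk sits at rank 6
    ["c7", "c1", "c2", "c3", "c4", "c5"],
]

# Same candidates after a rerank pass that rescues the second query.
reranked = [
    ["c1", "c2", "c3", "c4", "c5", "c6"],
    ["c4", "c9", "c8", "c7", "c6", "c5"],
    ["c2", "c3", "c4", "c5", "c6", "c1"],
    ["c7", "c1", "c2", "c3", "c4", "c5"],
]

baseline_recall = recall_at_k(baseline, relevant, k=5)
reranked_recall = recall_at_k(reranked, relevant, k=5)
print(f"recall@5 baseline={baseline_recall:.2f} reranked={reranked_recall:.2f}")
```

Running both configurations over the same labeled query set gives the kind of side-by-side number quoted in the paragraph above.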
The mechanism is straightforward. Embeddings compress a document's meaning into a fixed vector before ever seeing the query, so they cannot adapt their representation to the question being asked. A cross-encoder reranker reads the query and the candidate document together and produces a relevance score conditioned on the specific query. That query-conditioned view demotes near-misses that the embedding step ranks highly for irrelevant reasons (shared topic keywords, similar phrasing, popular concepts). On corpora with high lexical overlap between irrelevant documents, such as legal filings, support tickets, and academic papers in adjacent subfields, the reranker gap over embeddings alone often reaches 15-20 points of recall@5.
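The control flow of the two stages can be sketched in a few lines. Everything here is a toy under stated assumptions: the vectors are hand-written rather than produced by an embedding model, and `toy_pair_scorer` is a word-overlap stand-in for a real cross-encoder, used only to show that stage two sees the query and document together.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             doc_vecs: Dict[str, Sequence[float]],
             top_n: int) -> List[str]:
    """Stage 1: rank documents by similarity of independently computed vectors."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_n]


def rerank(query: str,
           candidate_ids: Sequence[str],
           docs: Dict[str, str],
           score_pair: Callable[[str, str], float],
           top_k: int) -> List[str]:
    """Stage 2: score each (query, document) pair jointly and keep the best top_k."""
    ranked = sorted(candidate_ids, key=lambda d: score_pair(query, docs[d]), reverse=True)
    return ranked[:top_k]


def toy_pair_scorer(query: str, document: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the document."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Hypothetical corpus with hand-written vectors.
docs = {
    "a": "reset your password from the login page",
    "b": "password policy for administrators",
    "c": "quarterly revenue report",
}
doc_vecs = {"a": [0.8, 0.6, 0.0], "b": [0.9, 0.4, 0.1], "c": [0.0, 0.1, 1.0]}
query, query_vec = "how to reset password", [1.0, 0.5, 0.0]

candidates = retrieve(query_vec, doc_vecs, top_n=2)
final = rerank(query, candidates, docs, toy_pair_scorer, top_k=1)
print(candidates, final)
```

In this toy, stage one places document "b" first because its vector sits slightly closer to the query vector, and stage two promotes "a" once the scorer reads the query and text together. In a real pipeline `score_pair` would call a hosted or self-hosted reranker.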
Rerankers do not help in every case. There are three patterns where the reranker pass is wasted spend. First, very small corpora (under 5,000 chunks): the embedding model alone reliably returns the right chunk in the top-5 because there are so few candidates to confuse it. Second, corpora where the embedding model is already at 95%+ recall@10: the reranker has little signal left to extract, and the latency penalty (50-200ms per query for a remote rerank call) starts to hurt UX. Third, pipelines that already combine lexical (BM25) and semantic (vector) retrieval with reciprocal rank fusion: the hybrid step covers most of the failure modes a reranker would catch, and the marginal recall gain typically drops below 2 points. Measure before adding the pass.
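For reference, the reciprocal rank fusion step named in the third pattern can be sketched as follows. Each document's fused score is the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the value commonly used in the literature, and the document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; ranks start at 1, higher fused score is better."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy rankings from a lexical retriever and a vector retriever.
bm25_ranking = ["d3", "d1", "d2"]
vector_ranking = ["d1", "d2", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because the fusion uses only ranks, it needs no score calibration between BM25 and cosine similarity, which is part of why it is a cheap baseline to measure a reranker against.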
Worked dollar math for a production RAG app at 1M queries per month. Without reranker: 1M × ($0.000001 embed + $0.00001 vector search + $0.010 LLM) = $10,011/month, with about 78% top-5 recall. With Cohere Rerank v3: 1M × ($0.000001 embed + $0.00001 vector search + $0.00005 rerank + $0.010 LLM) = $10,061/month, with 91% top-5 recall. The reranker adds $50/month, about 0.5% of total spend, and adds 13 points of recall. With Voyage Rerank-1 the rerank line jumps to $2,500/month, about 20% of a $12,511 total, with marginally higher recall on Voyage-internal evals. With MixedBread open-weights via Together, the rerank line is about $14/month at the same volume (27,500 tokens per query at $0.0005/1M), effectively free relative to the LLM bill. The cheapest reranker is rarely the best on quality, but every option in 2026 is small enough that the choice should be driven by recall@k on your own eval, not by $/1M.
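The monthly figures can be checked with a small script. It is a sketch using only the per-query costs quoted in this article. The MixedBread line assumes the same 27,500 tokens per query as the Jina example, and the names are illustrative.

```python
QUERIES_PER_MONTH = 1_000_000

# Per-query costs quoted in the article (USD).
EMBED = 0.000001
VECTOR_SEARCH = 0.00001
LLM = 0.010

# Per-query rerank cost; the MixedBread figure assumes 27,500 tokens per query.
RERANK_PER_QUERY = {
    "no reranker": 0.0,
    "Cohere Rerank v3": 50 * 1.00 / 1_000_000,
    "Voyage Rerank-1": 50 * 50.00 / 1_000_000,
    "MixedBread via Together": 27_500 * 0.0005 / 1_000_000,
}


def monthly_bill(rerank_per_query: float, queries: int = QUERIES_PER_MONTH):
    """Return (total, rerank_line, rerank_share) for one month, in USD."""
    rerank_line = rerank_per_query * queries
    total = (EMBED + VECTOR_SEARCH + LLM) * queries + rerank_line
    return total, rerank_line, rerank_line / total


bills = {name: monthly_bill(cost) for name, cost in RERANK_PER_QUERY.items()}
for name, (total, rerank_line, share) in bills.items():
    print(f"{name:<24} total ${total:>10,.2f}  rerank ${rerank_line:>9,.2f}  share {share:.1%}")
```

Swapping in your own query volume and token counts shows quickly whether the rerank line ever becomes a meaningful share of the bill.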
Two practical notes for budgeting. First, reranker latency adds up: Cohere Rerank v3 returns in 80-150ms for 50 candidates; Voyage Rerank-1 lands closer to 200ms; open-weights rerankers self-hosted on a single GPU can return in 30-50ms but require you to operate the infrastructure. If your end-to-end query budget is 800ms, a remote rerank pass of 80-200ms burns 10-25% of it, and more on a tighter budget. Second, reranking is one of the few RAG components that benefits from caching at the pair level: identical (query, document) pairs return identical scores, so a small Redis cache in front of the reranker often cuts the bill 30-50% on apps with repeated queries. See the GPT vs Claude vs Gemini cost calculator to size the LLM step that dominates the rest of the stack.
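The pair-level cache described above can be sketched as a thin wrapper around whatever function scores a pair. This is an illustration under assumptions: an in-memory dict stands in for Redis, there is no expiry or eviction, and `PairScoreCache` and `fake_scorer` are hypothetical names rather than part of any reranker SDK.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


class PairScoreCache:
    """Caches reranker scores keyed by the exact (query, document) text."""

    def __init__(self, score_pair: Callable[[str, str], float]) -> None:
        self._score_pair = score_pair
        self._store: Dict[str, float] = {}  # stand-in for a Redis instance
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str, document: str) -> str:
        # Length-prefix the query so ("ab", "c") and ("a", "bc") hash differently.
        q_bytes = query.encode("utf-8")
        digest = hashlib.sha256()
        digest.update(len(q_bytes).to_bytes(8, "big"))
        digest.update(q_bytes)
        digest.update(document.encode("utf-8"))
        return digest.hexdigest()

    def score(self, query: str, document: str) -> float:
        key = self._key(query, document)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._score_pair(query, document)  # the billable call
        self._store[key] = value
        return value


# Fake scorer that records each billable call, for demonstration only.
calls: List[Tuple[str, str]] = []


def fake_scorer(query: str, document: str) -> float:
    calls.append((query, document))
    return float(len(document))


cache = PairScoreCache(fake_scorer)
first = cache.score("reset password", "reset your password from the login page")
second = cache.score("reset password", "reset your password from the login page")
print(f"billable calls={len(calls)} hits={cache.hits} misses={cache.misses}")
```

A cached score is only valid for the reranker model version that produced it, so a production key would likely also include a model identifier.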