By The DDH Team · Digital Dashboard Hub

Cohere Rerank vs Voyage Rerank vs BGE Rerankers (2026): The Honest RAG Pipeline Comparison

By The DDH Team at Digital Dashboard Hub · Updated June 21, 2026

Reranking is the second stage of a two-stage RAG pipeline: your embedding model retrieves the top-100 candidate documents cheaply, then a cross-encoder reranker scores each (query, document) pair jointly and re-orders the list. The result is a top-10 that is markedly more precise than what embedding similarity alone produces. The quality lift is consistent (typically 3-10 nDCG@10 points over first-stage-only retrieval on BEIR benchmarks, with larger gains on some datasets) and is the single cheapest improvement available in a production RAG system. The three options that actually matter in 2026 are Cohere rerank-v3.5, Voyage rerank-2, and the BGE reranker family from BAAI.

The pricing models are structurally different in ways that matter at scale. Cohere rerank-v3.5 prices per query ($1 per 1,000 queries, regardless of document length or count, up to 100 documents per query). Voyage rerank-2 prices per token ($0.05 per 1M reranked tokens, which is cheaper for short documents and short candidate lists and more expensive for long ones). BGE rerankers are free to download under the Apache 2.0 license and run on your own GPU infrastructure; the cost is GPU-hours, not API fees. At 10M queries/month, Cohere costs $10,000; Voyage costs roughly $1,000-3,000 depending on document length (at about 20 candidates per query; the bill scales linearly with candidate count); BGE on a single T4 GPU costs ~$290. That 97% gap between Cohere and self-hosted BGE is the core economic argument for self-hosting, and it is why engineering teams doing high-volume retrieval seriously evaluate BGE.

Below: the cross-encoder architecture explained (why rerankers work at all), a full pricing matrix sourced from vendor pages, BEIR benchmark figures from the MTEB leaderboard, the 1M/10M/100M query cost models, multilingual coverage, latency and throughput analysis, the BGE model variants and hardware requirements, migration from no-reranker to two-stage pipeline, and worked cost scenarios. For embedding cost context, see our embeddings cost calculator and the Cohere vs Voyage vs OpenAI embeddings comparison. If you are also evaluating vector databases, the Pinecone vs Weaviate vs Qdrant comparison covers the downstream storage layer.

Reranker comparison — June 2026

Reranker	Pricing model	Cost at 10M queries/mo	Architecture	Multilingual
Cohere rerank-v3.5	$1 per 1,000 queries (up to 100 docs/query)	~$10,000/month	Cross-encoder, cloud API	100+ languages, cross-lingual
Voyage rerank-2	$0.05 per 1M tokens; rerank-2-lite $0.02/1M	$1,000–3,000/month (doc-length dependent)	Cross-encoder, cloud API	English + major European languages
bge-reranker-v2-m3 (self-hosted)	No API fee; GPU-hours (1× T4 GPU, GCP/AWS)	~$290/month (GPU cost, not per-query)	Cross-encoder, self-hosted	Multilingual, strong cross-lingual
bge-reranker-v2-gemma (self-hosted)	No API fee; GPU-hours (1× L4/A100 GPU)	$1,400–2,400/month (GPU cost)	Cross-encoder (Gemma-2B base), self-hosted	Multilingual, stronger than m3

Source, as of June 2026: Cohere pricing (https://cohere.com/pricing); Voyage AI pricing (https://docs.voyageai.com/docs/reranker); BGE rerankers on Hugging Face (https://huggingface.co/BAAI); GCP T4 on-demand pricing ~$0.40/hr (https://cloud.google.com/compute/gpus-pricing); AWS L4 ~$2-3/hr (https://aws.amazon.com/ec2/pricing/). Cohere query pricing: each query may rerank up to 100 documents at the flat $1/1k price. Voyage token pricing: total tokens = sum of (query + each candidate document) across all rerank calls. BGE GPU cost assumes continuous 24/7 operation — spot/preemptible instances reduce cost by 60-80% at the price of occasional interruption. BEIR benchmark scores: verify current rankings at https://huggingface.co/spaces/mteb/leaderboard under the reranking category before procurement. All figures as of June 2026.

Cross-encoder architecture: why rerankers work at all

To understand why rerankers exist, you need to understand the trade-off that embedding-based retrieval makes. Embedding models (bi-encoders) encode the query and each document independently into fixed-dimension vectors. Similarity search finds the documents whose vectors are closest to the query vector. The key word is independently — the query and document are never processed together. The model has no way to reason about how well a specific document answers a specific query; it can only compare vector proximity in an abstract embedding space.

Cross-encoders are architecturally different. They take the query and a candidate document as a single concatenated input — [CLS] query [SEP] document [SEP] — and output a single relevance score. Because query and document are processed together, the model can attend to the specific words in the query when scoring the document, and attend to the specific words in the document when scoring the query. The interaction is direct. This is why cross-encoders consistently outperform bi-encoders on precision metrics: they can detect that a document uses the word 'bond' in a financial context (relevant) vs a chemical context (not relevant) based on the specific query asking about 'corporate bonds.'

The cost of this accuracy is inference time. A bi-encoder encodes a query once and then does dot-product similarity against pre-computed document vectors in milliseconds. A cross-encoder must run a forward pass for each (query, document) pair — if you have 100 candidate documents, you run 100 forward passes. This is O(N) per query, where N is the candidate count. At a 1-billion-document corpus, running a cross-encoder over all documents is computationally infeasible. This is why rerankers are always second-stage: first-pass retrieval (BM25 or embedding similarity) narrows candidates to 50-200 documents, then the cross-encoder reranks those candidates precisely.

The latency math: a typical cross-encoder forward pass on a T4 GPU takes 10-50ms depending on document length. Reranking 100 documents sequentially = 1-5 seconds, which is too slow for user-facing latency. Production deployments either (1) batch all 100 pairs into a single GPU inference call (reduces wall-clock time to 50-200ms) or (2) use the cloud API (Cohere, Voyage) which handles batching server-side. Self-hosted BGE needs batching logic in your application; cloud APIs abstract this away.
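The sequential figure above is simple arithmetic, sketched below. This is a back-of-the-envelope model, not a benchmark: the 10-50 ms per-pass range is the one quoted in the text, and the function name is illustrative.

```python
def sequential_latency_s(num_pairs: int, per_pass_ms: float) -> float:
    """Wall-clock seconds when (query, document) pairs are scored one at a time."""
    return num_pairs * per_pass_ms / 1000.0


# Per-pass range (10-50 ms on a T4) is the figure quoted in the text above.
fast = sequential_latency_s(100, 10)
slow = sequential_latency_s(100, 50)
print(f"100 candidates, sequential: {fast:.0f}-{slow:.0f} s")
```

Batched inference does not follow this linear model, which is why the batched wall-clock time quoted above is so much lower than the sequential one.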

Why the quality lift is so consistent: the BEIR benchmark (Benchmarking Information Retrieval, 18 heterogeneous datasets covering news, biomedical, scientific, financial, community Q&A, and general question-answering corpora) consistently shows that reranking on top of first-stage BM25 or dense retrieval adds 3-10 nDCG@10 points. On some BEIR tasks (TREC-COVID, FiQA, SCIDOCS) the lift exceeds 10 points. The average lift across all BEIR tasks is 5-8 nDCG@10 when replacing BM25-only retrieval with BM25 + cross-encoder rerank. This is not a marginal improvement: it moves systems from 'borderline useful' to 'reliably useful' on real retrieval tasks.

Why not just use a bigger embedding model instead of a reranker? Bigger embedding models (e.g., upgrading from voyage-3-lite to voyage-3-large) add 3-6 nDCG@10 points. Adding a reranker adds another 5-10 points. They compound rather than substitute. The optimal architecture for quality-critical RAG is: strong embedding model for first-stage recall + cross-encoder reranker for second-stage precision.

Cohere rerank-v3.5: the API-first reranker

Cohere rerank-v3.5 launched in late 2024 as the successor to rerank-v3, with improved multilingual performance and better handling of long documents. It is Cohere's primary go-to-market product alongside their embed-v4.0 embeddings, and its pricing model is the simplest of any reranker: $1 per 1,000 queries, where one query can rerank up to 100 documents. Whether you pass 10 documents or 100 documents per query, you pay $0.001.

The per-query pricing model is unusual in the API world and has a key implication: Cohere's reranker is highly cost-efficient when you rerank many documents per query, and less efficient if you only rerank a few. If you send 100 candidates per query, you pay $0.001/query and get 100 scored pairs. If you send 10 candidates per query (because your first-stage retrieval only returns 10 useful candidates), you still pay $0.001/query but you get only 10 scored pairs — effectively paying 10x the per-pair cost.
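To make the per-pair effect concrete, here is a minimal sketch. It assumes the flat $1 per 1,000 queries price and the 100-document cap quoted in this article; the function name is illustrative.

```python
COHERE_USD_PER_QUERY = 1.0 / 1000  # $1 per 1,000 queries, as quoted in this article


def cohere_cost_per_pair(docs_per_query: int) -> float:
    """Effective cost of one scored (query, document) pair under flat per-query pricing."""
    if not 1 <= docs_per_query <= 100:
        raise ValueError("the flat price covers 1 to 100 documents per query")
    return COHERE_USD_PER_QUERY / docs_per_query


for n in (10, 50, 100):
    print(f"{n:>3} docs/query -> ${cohere_cost_per_pair(n):.6f} per pair")
```

The fewer candidates you send, the more each scored pair costs, so the flat price rewards filling the candidate list.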

BEIR benchmarks: as of June 2026, Cohere rerank-v3.5 scores among the top cross-encoder rerankers on the MTEB leaderboard's reranking tasks. nDCG@10 on the BEIR test suite is competitive with the best open-source rerankers. Verify the exact current position at https://huggingface.co/spaces/mteb/leaderboard under the reranking category before making procurement decisions, since these rankings update as new models are submitted.

Multilingual is Cohere's headline differentiator. rerank-v3.5 supports 100+ languages and, critically, cross-lingual reranking: a query in English can rerank documents in French, Spanish, Japanese, Arabic, or Korean without the query or documents being pre-translated. This is valuable for global enterprise RAG systems where the document corpus spans languages.

Integration: Cohere rerank works with any embedding source. You do not need to use Cohere embed-v4.0 as your embedding model to use Cohere rerank-v3.5 as your reranker. Cohere's SDK exposes a simple co.rerank() call that accepts the raw query string and a list of document strings — the reranker receives the text, not the vectors. Integrates with LangChain, LlamaIndex, Haystack, and raw Python.
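A thin wrapper around the rerank call might look like the sketch below. It assumes a client that behaves like the Cohere Python SDK's `cohere.ClientV2`, where `rerank()` returns an object whose `.results` entries carry `.index` and `.relevance_score`. The wrapper name and the `"rerank-v3.5"` model string are assumptions to check against the current SDK documentation.

```python
from typing import Any, List, Tuple


def rerank_with_cohere(client: Any, query: str, docs: List[str], top_n: int = 10,
                       model: str = "rerank-v3.5") -> List[Tuple[float, str]]:
    """Return (relevance_score, document) pairs, best first.

    `client` is assumed to expose rerank(model=, query=, documents=, top_n=)
    the way cohere.ClientV2 does; raw text is sent, not vectors.
    """
    response = client.rerank(model=model, query=query, documents=docs, top_n=top_n)
    return [(r.relevance_score, docs[r.index]) for r in response.results]


# Usage (needs the `cohere` package and an API key, so it is not executed here):
#   import cohere
#   co = cohere.ClientV2(api_key="YOUR_KEY")
#   top = rerank_with_cohere(co, "corporate bond yields", candidate_texts, top_n=5)
```

Passing the client in as an argument keeps the wrapper independent of where your embeddings come from and makes it easy to stub in tests.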

Limitations: the $1/1k query cost becomes meaningful at high volume. 10M queries/month = $10,000. 100M queries/month = $100,000. For consumer-facing applications with high query volume, this is a significant budget item. Cloud API means latency is network-dependent and varies with Cohere's infrastructure; there is no SLA for inference latency published on the public pricing tier. For latency-sensitive applications, test empirically.

Voyage rerank-2: token-based pricing changes the math

Voyage rerank-2 takes a structurally different approach to pricing. Instead of charging per query, Voyage charges per token: $0.05 per 1M reranked tokens. The rerank-2-lite variant is $0.02 per 1M tokens. The total token count for a rerank call is the sum of (query tokens + document tokens) across all pairs. If your query is 20 tokens and you rerank 100 documents averaging 200 tokens each, the token count is 20 × 100 (query repeated) + 200 × 100 (documents) = 22,000 tokens per query call.

At 22,000 tokens per query × $0.05/1M = $0.0011 per query. Roughly comparable to Cohere at $0.001 per query for this document length. But the comparison shifts dramatically with document length. For short documents (50-word snippets, ~65 tokens): 20 × 100 + 65 × 100 = 8,500 tokens × $0.05/1M = $0.000425 per query — less than half Cohere's cost. For long documents (500-word passages, ~650 tokens): 20 × 100 + 650 × 100 = 67,000 tokens × $0.05/1M = $0.00335 per query — 3.35x more expensive than Cohere.
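The token arithmetic above can be reproduced with a short sketch. It uses the counting rule described in this section (query tokens are counted once per document) and the $0.05 per 1M token price quoted here; the function names are illustrative.

```python
def voyage_tokens_per_query(query_tokens: int, doc_tokens: int, num_docs: int) -> int:
    """Total billed tokens: the query is counted once per candidate document."""
    return (query_tokens + doc_tokens) * num_docs


def voyage_cost_per_query(query_tokens: int, doc_tokens: int, num_docs: int,
                          usd_per_million: float = 0.05) -> float:
    tokens = voyage_tokens_per_query(query_tokens, doc_tokens, num_docs)
    return tokens * usd_per_million / 1_000_000


# The three document lengths discussed above, 100 candidates and a 20-token query.
for label, doc_tokens in (("short", 65), ("medium", 200), ("long", 650)):
    cost = voyage_cost_per_query(20, doc_tokens, 100)
    print(f"{label:>6}: {voyage_tokens_per_query(20, doc_tokens, 100):>6} tokens, ${cost:.6f}/query")
```

Swapping in your own average document length and candidate count shows quickly which side of the Cohere flat price you land on.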

This makes Voyage's pricing model optimal for short-document RAG (FAQ retrieval, product catalog search, short customer reviews) and uncompetitive for long-document RAG (contract retrieval, research papers, technical documentation). Know your average document length before defaulting to either model. Use our embeddings cost calculator to model total RAG pipeline cost including reranker.

rerank-2-lite at $0.02/1M tokens is the budget option with slightly reduced quality. Voyage positions rerank-2-lite for latency-sensitive applications where throughput matters more than precision — smaller model, faster inference, meaningfully lower cost. At 22,000 tokens per query, rerank-2-lite costs $0.00044 per query, competitive with BGE self-hosted at mid-volume.

BEIR benchmarks: Voyage rerank-2 scores competitively with Cohere rerank-v3.5 on English BEIR tasks. Multilingual support covers English and major European languages; cross-lingual is more limited than Cohere. Domain-specialized reranker variants (rerank-2-finance, rerank-2-law) are only implicitly referenced in Voyage documentation and may be available for enterprise customers; confirm availability with sales before planning production use.

Integration: Voyage exposes a similar reranking API to Cohere. The raw text is passed (not vectors), and a relevance score is returned per document. LangChain and LlamaIndex integrations available.

BGE rerankers: self-hosted, Apache 2, and 97% cheaper at scale

The BGE reranker family (BGE stands for BAAI General Embedding; the models are developed by the Beijing Academy of Artificial Intelligence) is a set of open-source cross-encoder models available on Hugging Face under the Apache 2.0 license, free for commercial use with no API fee. The relevant models for production reranking in 2026 are: bge-reranker-v2-m3 (~568M parameters, multilingual, T4-GPU compatible), bge-reranker-v2-gemma (Gemma-2B base, ~2B parameters, stronger but requires L4/A100), and bge-reranker-large (560M parameters, English-optimized, the well-understood baseline).

The hardware requirements are concrete. bge-reranker-v2-m3 at 568M parameters fits on a T4 GPU (16 GB VRAM) and scores roughly 500-1,000 (query, document) pairs per second depending on document length and batch size. On GCP, a T4 instance (n1-standard-4 + T4) runs ~$0.40/hr on-demand, $0.14/hr on spot. At 24/7 on-demand: $0.40 × 24 × 30 = $288/month, call it $290. At spot: ~$100/month with occasional interruption. bge-reranker-v2-gemma at 2B parameters requires an A100 (40 GB) or L4 (24 GB): L4 on GCP is ~$0.70/hr on-demand ($504/month), A100 is ~$2.50-3.00/hr ($1,800-2,160/month).

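The throughput math at 10M queries/month: 10M queries ÷ 30 days ÷ 24 hours ÷ 3,600 seconds = ~3.9 queries/second average. GPU throughput is measured in (query, document) pairs, so multiply by the candidate count: at 20 candidates per query that is ~78 pairs/second, and at 100 candidates ~390 pairs/second. bge-reranker-v2-m3 on a single T4 sustains 500-1,000 pairs/second, which leaves roughly 6-13x headroom at 20 candidates but little at 100. Peak traffic (say 10x average = 39 QPS) is ~780 pairs/second even at 20 candidates, at the edge of one T4's range, so budget for a second instance or auto-scaling at peak. At average load the cost is $290/month vs Cohere's $10,000/month. That is a 97% cost reduction.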
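The load arithmetic can be sketched as follows. The sourcing notes at the end of this guide quote T4 throughput in pairs per second, so the sketch converts query volume into pair volume before comparing. The 30-day month and the $0.40/hr T4 price are the assumptions used in this article, and the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, as in the arithmetic above


def average_qps(queries_per_month: float) -> float:
    return queries_per_month / SECONDS_PER_MONTH


def pairs_per_second(queries_per_month: float, candidates_per_query: int) -> float:
    """Reranker load: every query scores one pair per candidate document."""
    return average_qps(queries_per_month) * candidates_per_query


def gpu_monthly_cost(hourly_usd: float, gpus: int = 1) -> float:
    return hourly_usd * 24 * 30 * gpus


qps = average_qps(10_000_000)
print(f"average load: {qps:.1f} queries/s")
for candidates in (20, 50, 100):
    print(f"{candidates:>3} candidates -> {pairs_per_second(10_000_000, candidates):.0f} pairs/s")
print(f"one on-demand T4: ${gpu_monthly_cost(0.40):.0f}/month")
```

Compare the pairs-per-second figure for your own candidate count against the measured throughput of your GPU before assuming one instance is enough.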

BEIR performance: bge-reranker-v2-m3 is competitive with Cohere rerank-v3.5 on English BEIR tasks, with nDCG@10 scores within 1-3 points on most individual BEIR datasets according to the MTEB leaderboard (verify at https://huggingface.co/spaces/mteb/leaderboard). On multilingual tasks, bge-reranker-v2-m3's cross-lingual training gives it competitive performance vs Cohere's multilingual claims. bge-reranker-v2-gemma improves over m3 by 2-5 nDCG@10 points on English tasks, at the cost of 4x the hardware footprint.

What you give up by self-hosting: operational overhead (deployment, monitoring, auto-scaling, CUDA driver maintenance, model updates), cold-start latency if you run on spot instances, and the engineering time to build the inference service. A minimal FastAPI wrapper around BGE reranker is roughly 50-100 lines of Python; frameworks like Hugging Face Text Embeddings Inference (TEI) make deployment substantially easier. The cloud providers (GCP, AWS, Azure) all offer pre-built inference endpoints that reduce operational burden significantly.

When BGE self-hosted beats the API economically: the break-even point varies by query volume. At 1M queries/month, Cohere costs $1,000 and BGE T4 costs $290 — BGE wins by $710. At 100k queries/month, Cohere costs $100 and BGE T4 costs $290 — Cohere wins unless you need the GPU for other things. The economic inflection point is roughly 300k-500k queries/month for a dedicated T4 instance. Below that, the API is cheaper when you account for engineering time. Above that, BGE is cheaper, and the gap grows linearly with volume.
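The break-even volume is the fixed monthly GPU cost divided by the per-query API price. A minimal sketch, using the $290/month T4 figure and the $0.001 per query Cohere price from this article (the function name is illustrative):

```python
def break_even_queries(gpu_monthly_usd: float, api_usd_per_query: float) -> float:
    """Monthly query volume at which a fixed GPU bill equals the API bill."""
    return gpu_monthly_usd / api_usd_per_query


# GPU-only cost, then two hypothetical all-in figures that add infrastructure overhead.
for monthly in (290, 400, 600):
    volume = break_even_queries(monthly, 0.001)
    print(f"${monthly}/month self-hosted breaks even at ~{volume:,.0f} queries/month")
```

The GPU-only break-even sits just under 300k queries/month. Once non-GPU infrastructure and engineering time are included the threshold moves higher, which is consistent with the 300k-500k range given above.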

BEIR benchmarks: what the numbers actually say

BEIR (Benchmarking Information Retrieval) is the standard evaluation suite for information retrieval models: 18 heterogeneous datasets covering news (TREC-NEWS, Robust04), biomedical (TREC-COVID, BioASQ, NFCorpus), scientific literature (SCIDOCS, SciFact), financial (FiQA-2018), community Q&A (CQADupStack, Quora), and general question answering (MS MARCO, Natural Questions, HotpotQA). BEIR has no dedicated code or legal dataset, so treat its scores as indirect evidence for those domains. The standard metric is nDCG@10 (normalized Discounted Cumulative Gain at rank 10), which measures how well the top-10 results are ordered by relevance. Higher is better; the maximum is 1.0 (often reported on a 0-100 scale, so 0.05 is 5 points); real-world strong reranker scores are in the 0.55-0.75 range depending on the dataset.
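For readers who want to see what nDCG@10 computes, here is a minimal sketch. It assumes linear gain (the relevance label itself) and, as a simplification, builds the ideal ranking from the labels in the list passed in. A full evaluation builds the ideal from every judged document for the query, so use an established evaluation library for reported numbers.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: each label is divided by log2(position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given order divided by the DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Labels are listed in ranked order: 1 = relevant, 0 = not relevant.
before = [0, 1, 0, 0, 1]  # hypothetical first-stage order
after = [1, 1, 0, 0, 0]   # hypothetical order after reranking
print(f"before: {ndcg_at_k(before):.3f}  after: {ndcg_at_k(after):.3f}")
```

Moving relevant documents toward the top raises the score even when the set of retrieved documents is unchanged, which is exactly the effect a reranker is meant to have.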

A note on reporting: BEIR consists of 18 datasets, and vendors can cherry-pick the datasets on which their models perform best. When evaluating a model, check the average across all BEIR datasets, not just the cherry-picked highlights. The MTEB leaderboard reports the full breakdown, so use it. Verify current standings at https://huggingface.co/spaces/mteb/leaderboard under the 'Reranking' category.

What the reranking task measures in MTEB: the standard MTEB reranking setup starts from an existing first-stage retrieval result (usually BM25 or a dense retriever) and measures how well the reranker re-orders those candidates. An nDCG@10 gain of 5 points over the baseline is a meaningful improvement for user-facing RAG. A gain of 10+ points is exceptional.

Baseline context: BM25 (pure lexical retrieval, no embedding or reranking) achieves nDCG@10 averages in the 0.40-0.50 range on BEIR tasks. Strong embedding-only (no reranker) retrieval achieves 0.50-0.60. Adding a cross-encoder reranker pushes to 0.55-0.70. The reranker contribution is especially large on tasks where lexical similarity is misleading (domain jargon, paraphrase queries, cross-lingual retrieval).

Data caveats: benchmark scores reflect controlled evaluation conditions with specific document lengths and query styles. Production performance may differ. A pipeline built on MS MARCO-fine-tuned models may overfit MS MARCO-style queries. If your retrieval task is domain-specific, look for BEIR subsets that match your domain (for example the biomedical, scientific, or financial sets) and weight those scores more heavily; for domains BEIR does not cover directly, such as legal or code, your own evaluation set matters even more. Always run an empirical evaluation on your own held-out (query, relevant-doc) pairs before committing to a production reranker.

As of June 2026: all three major options (Cohere rerank-v3.5, Voyage rerank-2, bge-reranker-v2-m3/gemma) achieve strong BEIR scores and are meaningfully better than no reranking. The quality differences between them are smaller than the quality difference between having a reranker and not having one. Choose primarily on cost model and operational requirements; trust that quality is competitive across all three.

Cost scenarios at 1M, 10M, and 100M queries per month

**1M queries/month scenario** (mid-scale SaaS RAG, e.g., 10,000 daily active users averaging 3-4 queries each). Cohere rerank-v3.5: $1,000/month. Voyage rerank-2 (avg 200-token documents, ~20 candidates per query): ~$220/month; at 100 candidates per query the same workload is ~$1,100/month, on par with Cohere. BGE bge-reranker-v2-m3 on a T4 (on-demand): $290/month. BGE on spot: ~$100/month. Verdict: at 1M queries, Voyage token pricing wins on raw cost for typical document lengths and short candidate lists; BGE self-hosted is competitive on cost but adds engineering overhead. Cohere is 3.5-5x more expensive for general-purpose use cases.

**10M queries/month scenario** (high-scale consumer or enterprise RAG, e.g., 100k DAU, 3-4 queries each). Cohere: $10,000/month. Voyage rerank-2 (avg 200-token docs, ~20 candidates per query): ~$2,200/month. BGE bge-reranker-v2-m3 on a single T4 (on-demand): $290/month (3.9 QPS average × 20 candidates is ~78 pairs/second, well within a single T4's 500-1,000 pairs/second). BGE on spot: ~$100/month. Verdict: the cost gap at this scale is the core reason BGE self-hosting exists. BGE on-demand is 97% cheaper than Cohere and 87% cheaper than Voyage. Engineering overhead to deploy BGE is ~1-2 engineer-days, meaning ROI is positive in the first month at this volume.

**100M queries/month scenario** (high-scale search or platform, e.g., a search product or a large-scale AI assistant). Cohere: $100,000/month. Voyage rerank-2 (avg 200-token docs, ~20 candidates per query): ~$22,000/month. BGE bge-reranker-v2-m3: at 100M queries/month, average QPS is 39, or ~780 pairs/second at 20 candidates per query. That sits at the edge of what a single T4 sustains (500-1,000 pairs/second), so plan on 2× T4 always-on (~$580/month) for average load, with auto-scaling to cover peaks (10x average = 390 QPS, or ~7,800 pairs/second, which is roughly 8-16 T4s while the peak lasts). Even 16 T4s running all month (~$4,640) is a small fraction of the $100,000/month Cohere bill.

**Long-document exception**: Voyage token pricing becomes expensive with long documents. If your average document is 1,000 tokens (not 200), Voyage rerank-2 at 10M queries = ~$11,000/month, comparable to Cohere. If your average document is 2,000 tokens, Voyage = ~$22,000/month, worse than Cohere. (Both figures scale the 200-token scenario above, which corresponds to roughly 20 candidates per query.) BGE self-hosted is still ~$290/month per GPU regardless of document length, since you pay per GPU-hour rather than per token; longer documents do lower pairs-per-second throughput, so re-check headroom before assuming one T4 is enough. For long-document RAG (legal, technical), BGE's cost advantage over API rerankers is even larger.

**Infrastructure cost footnote**: BGE self-hosting costs above assume GPU-only. Add ~$50-100/month per instance for CPU, memory, disk, and networking. Add Kubernetes cluster cost if using managed auto-scaling (~$50-200/month for GKE Autopilot at this scale). A realistic BGE production deployment at 10M queries/month is $400-600/month all-in, still roughly 94-96% cheaper than Cohere. See our vector DB cost calculator to model the full RAG infrastructure budget.
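The scenarios above can be recomputed for your own volume with a small model like the one below. It is a sketch under this article's assumptions: $1 per 1,000 Cohere queries, $0.05 per 1M Voyage tokens, a 20-token query, and an on-demand T4 at $0.40/hr. The Voyage figures in the scenarios match a candidate list of about 20 documents per query; that count is inferred from the arithmetic, not stated on any vendor page.

```python
from typing import Dict


def monthly_costs(queries: int, candidates: int, doc_tokens: int,
                  query_tokens: int = 20, t4_instances: int = 1) -> Dict[str, float]:
    """Approximate monthly bill in USD for each option (GPU cost only for BGE)."""
    cohere = queries * 1.0 / 1000                          # flat per-query price
    tokens = queries * candidates * (query_tokens + doc_tokens)
    voyage = tokens * 0.05 / 1_000_000                     # per-token price
    bge = t4_instances * 0.40 * 24 * 30                    # always-on T4s
    return {"cohere": cohere, "voyage": voyage, "bge_t4": bge}


for volume in (1_000_000, 10_000_000, 100_000_000):
    costs = monthly_costs(volume, candidates=20, doc_tokens=200)
    line = ", ".join(f"{name} ${usd:,.0f}" for name, usd in costs.items())
    print(f"{volume:>11,} queries/month: {line}")
```

Raising `candidates` to 100 or `doc_tokens` to 1,000 shows how quickly the token-priced option moves relative to the flat-priced one. The BGE column stays fixed only while `t4_instances` is enough to carry the pair load.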

**Decision threshold summary**: below 300k queries/month, API rerankers (Cohere or Voyage) are usually cheaper than a dedicated GPU instance. Between 300k-1M queries/month, Voyage token pricing is competitive; BGE self-hosted on spot is cheaper. Above 1M queries/month, BGE self-hosted wins on cost unless you have no ML infrastructure capability or multilingual requirements that BGE cannot meet.

Multilingual reranking: cross-lingual retrieval differences

Multilingual reranking is harder than multilingual embedding. An embedding model can represent query and document in a shared multilingual space, allowing cross-lingual similarity search at retrieval time. But cross-encoder rerankers process (query, document) pairs jointly — for cross-lingual reranking to work well, the model must understand the semantic relationship between a query in Language A and a document in Language B, which requires more sophisticated multilingual training than same-language reranking.

Cohere rerank-v3.5 is the strongest multilingual option. Cohere explicitly trains for cross-lingual reranking with 100+ language support. A query in English reranking documents in German, Japanese, Arabic, or Korean is a supported use case, not an edge case. For global enterprise RAG where document corpora span multiple languages, Cohere's multilingual reranker is a meaningful architectural advantage.

Voyage rerank-2 supports English and major European languages (French, Spanish, German, Italian, Portuguese, Dutch). Cross-lingual support between, say, English and Mandarin or English and Arabic is not a core capability. For primarily English or English-European multilingual use cases, Voyage is adequate. For broader multilingual needs, Cohere or BGE.

bge-reranker-v2-m3 (the 'm3' refers to Multi-Linguality, Multi-Functionality, Multi-Granularity) is trained specifically for cross-lingual retrieval. The model handles code-switching queries and cross-language document pairs. Its scores on multilingual retrieval tasks are competitive with Cohere (BEIR itself is English-only, so check a multilingual benchmark for this comparison). The v2-m3 variant is the BGE option to choose for multilingual needs; the standard bge-reranker-large is English-optimized and not appropriate for multilingual reranking.

Practical cross-lingual test: a query 'What is the interest rate policy?' should surface a relevant Spanish-language document discussing 'la política de tipos de interés' even without translation. Cross-encoder models with strong multilingual training handle this; models trained on English-only data do not. If cross-lingual reranking matters for your use case, test empirically on representative query-document language pairs before deploying.

Language-pair coverage gap: even the best multilingual rerankers underperform on low-resource language pairs (e.g., English-Swahili, English-Yoruba, English-Tamil). If your document corpus includes low-resource languages, verify performance with actual test queries before assuming multilingual claims hold.

Latency, throughput, and the two-stage pipeline in practice

Reranking adds latency to the query path. The total RAG query latency is: (1) first-stage retrieval [10-50ms for dense or BM25 retrieval], plus (2) reranker inference [50-500ms depending on model, batch size, and candidate count], plus (3) LLM generation [500ms-5s]. Reranking is typically not the latency bottleneck but it is a non-trivial addition that needs to be budgeted.

Cloud API latency (Cohere, Voyage): real-world p50 reranker API latency is typically 100-300ms for a 100-document rerank request at typical document lengths, measured from request submission to response received. This includes network round-trip plus server-side inference. p95 can be 500ms-1s. Do not assume sub-100ms for production SLA budgeting without empirical measurement.

Self-hosted BGE latency: bge-reranker-v2-m3 on a T4 GPU with proper batching (all 100 (query, document) pairs in a single batch forward pass) takes 50-200ms wall-clock time depending on document length. The key is batching — processing pairs sequentially (one at a time) is 10-50x slower than batched inference. Any production BGE deployment must implement proper batching. Hugging Face Text Embeddings Inference (TEI) handles this correctly out of the box.
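The batching logic itself is small. The sketch below is model-agnostic: `score_batch` is a hypothetical stand-in for whatever batched scoring call your inference stack provides, and the batch size of 32 is an arbitrary starting point to tune against GPU memory.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (query, document)


def score_in_batches(pairs: Sequence[Pair],
                     score_batch: Callable[[Sequence[Pair]], List[float]],
                     batch_size: int = 32) -> List[float]:
    """Score all pairs with a few batched calls instead of one call per pair."""
    scores: List[float] = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[start:start + batch_size]))
    return scores


# Toy scorer so the sketch runs without a GPU: counts calls and scores by length.
calls = []


def toy_score_batch(batch: Sequence[Pair]) -> List[float]:
    calls.append(len(batch))
    return [float(len(doc)) for _, doc in batch]


pairs = [("what is a bond", "doc " * (i + 1)) for i in range(100)]
scores = score_in_batches(pairs, toy_score_batch)
print(f"{len(scores)} scores from {len(calls)} batched calls")
```

With 100 pairs and a batch size of 32, the model is invoked four times instead of one hundred, and the scores come back in the same order as the input pairs.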

Optimizing latency for reranking: reduce the candidate count (rerank top-20 instead of top-100; going from 100 to 20 candidates cuts the number of scored pairs by 80%, typically with <5% quality loss for most tasks). Use reranker-lite variants (Voyage rerank-2-lite, bge-reranker-base) for latency-critical applications. Apply reranking only to query categories that need precision; keyword lookups on structured data, for example, do not benefit.

Async reranking pattern: for applications that can tolerate slightly higher latency (chatbot RAG, where LLM generation dominates the response time anyway), trigger the reranker call in parallel with the other pre-generation work, such as loading conversation history and building the prompt template. The reranked list must be ready before the final context is assembled and generation starts, so this overlaps reranker latency with the rest of context assembly rather than with LLM inference itself. It hides part of the reranker latency without changing the result.

Throughput at scale: for batch offline reranking jobs (re-scoring document collections, offline index enrichment), throughput matters more than latency. BGE on a single T4 achieves 500-1,000 pairs/second with optimal batching. A 1M-document reranking job at 100 candidates each = 100M pair scores needed; at 750 pairs/second = 37 hours on one T4. If you run offline reranking jobs, size accordingly or parallelize across multiple GPU instances.
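The sizing arithmetic for an offline job can be sketched as below. The 750 pairs/second figure is the midpoint used above, and the sketch assumes throughput scales linearly across GPUs, which is an idealization.

```python
def offline_job_hours(num_queries: int, candidates_per_query: int,
                      pairs_per_second: float, gpus: int = 1) -> float:
    """Hours to score every (query, document) pair, assuming linear scaling across GPUs."""
    total_pairs = num_queries * candidates_per_query
    return total_pairs / (pairs_per_second * gpus) / 3600


for gpus in (1, 4, 8):
    hours = offline_job_hours(1_000_000, 100, 750, gpus=gpus)
    print(f"{gpus} x T4: {hours:.1f} hours")
```

Measure your own pairs-per-second figure first, since document length and batch size move it substantially.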

Migrating from no-reranker to a two-stage pipeline

Most RAG systems start with single-stage dense retrieval: embed all documents, store vectors, query with embedding similarity. Adding a reranker is the single highest-ROI improvement available to a mature RAG system. The migration pattern is well-established.

Step 1: add the reranker call after your existing retrieval step. Your existing embedding retrieval returns top-100 candidates. Pass those 100 candidates and the original query to the reranker API (or self-hosted model). The reranker returns relevance scores; re-sort the candidates by score. Use the top-5 or top-10 of the re-sorted list as the context for your LLM. Zero changes to your embedding model, vector store, or LLM — the reranker inserts as a post-processing step.
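Step 1 can be sketched as a function that wraps your existing retriever. The `retrieve` and `score` callables here are hypothetical stand-ins: in a real system the first would query your vector store and the second would call a reranker API or a self-hosted cross-encoder. The toy versions below only exist so the sketch runs on its own.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]
Scorer = Callable[[str, str], float]


def two_stage_search(query: str, retrieve: Retriever, score: Scorer,
                     first_stage_k: int = 100, final_k: int = 10) -> List[str]:
    """Retrieve candidates cheaply, then re-sort them with a pairwise scorer."""
    candidates = retrieve(query, first_stage_k)
    scored = [(score(query, doc), doc) for doc in candidates]
    # Stable sort: ties keep their first-stage order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]


# Toy stand-ins so the sketch runs without a vector store or a model.
CORPUS = [
    "bond prices fall as rates rise",
    "covalent bond energy in chemistry",
    "corporate bond yields and credit risk",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    return CORPUS[:k]  # pretend this is the embedding search result


def toy_score(query: str, doc: str) -> float:
    # Word overlap is a crude placeholder for a cross-encoder relevance score.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


print(two_stage_search("corporate bond yields", toy_retrieve, toy_score, final_k=2))
```

Because the reranker is just a function applied after retrieval, the embedding model, vector store, and LLM code stay untouched.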

Step 2: validate the quality improvement on your evaluation set. Before deploying to production, run your held-out (query, relevant-doc) eval set through the pipeline with and without reranking. Measure recall@5, recall@10, nDCG@10 before and after. A production-quality reranker should add at least 3-8 nDCG@10 points; if it does not, you may have a first-stage retrieval problem (returning completely irrelevant candidates that no reranker can salvage) or your eval set is not representative.
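A before-and-after comparison for Step 2 needs only a relevance-judgment map and the ranked lists from each pipeline. The sketch below computes mean recall@k; the document IDs and data layout (`runs` and `qrels` dictionaries keyed by query ID) are hypothetical choices for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every query that has relevance judgments."""
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


qrels = {"q1": {"d3"}, "q2": {"d7", "d9"}}
without_rerank = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d4", "d9"]}
with_rerank = {"q1": ["d3", "d1", "d2"], "q2": ["d7", "d9", "d4"]}

for name, runs in (("without rerank", without_rerank), ("with rerank", with_rerank)):
    print(f"{name}: recall@2 = {mean_recall_at_k(runs, qrels, 2):.2f}")
```

Run the same comparison at several cut-offs. A reranker that only helps at k=10 but not at k=5 may not improve the context your LLM actually sees.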

Step 3: tune the candidate count. More candidates for the reranker to process = higher quality ceiling + more cost and latency. Typical production values are 20-50 candidates (first-stage retrieval returns top-50, reranker scores all 50, top-5 goes to LLM). 100 candidates is the ceiling for most use cases — going beyond 100 rarely helps because the additional candidates from dense retrieval are increasingly noisy. Find your specific quality/cost/latency trade-off empirically.

Step 4: monitor reranker quality over time. Corpus drift (content updates, new document types) can change the distribution of first-stage retrieval candidates, which affects reranker performance. Add reranker confidence score distributions to your observability stack; a sudden shift in average reranker scores is a signal that first-stage retrieval quality has changed.
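One simple way to watch for such a shift is to compare the mean of recent top-1 reranker scores against a baseline window. The sketch below is a rough heuristic, not a rigorous statistical test: the three-standard-error threshold and the window sizes are assumptions to tune for your traffic.

```python
from statistics import mean, stdev
from typing import Sequence


def score_shift_alert(baseline_scores: Sequence[float], recent_scores: Sequence[float],
                      z_threshold: float = 3.0) -> bool:
    """True when the recent mean sits more than z_threshold standard errors from baseline.

    Needs at least two baseline scores and one recent score.
    """
    base_mean = mean(baseline_scores)
    base_sd = stdev(baseline_scores)
    recent_mean = mean(recent_scores)
    if base_sd == 0:
        return recent_mean != base_mean
    standard_error = base_sd / len(recent_scores) ** 0.5
    return abs(recent_mean - base_mean) / standard_error > z_threshold


baseline = [0.60, 0.70, 0.65, 0.72, 0.68, 0.63, 0.70, 0.66]
print(score_shift_alert(baseline, baseline))                  # same distribution
print(score_shift_alert(baseline, [0.30, 0.35, 0.32, 0.31]))  # scores collapsed
```

An alert here says the inputs to the reranker have changed, not why. The usual next step is to inspect a sample of recent first-stage candidates.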

Common mistake during migration: not accounting for reranker latency in p99 SLA budgets. The median reranker call is fast; the 99th percentile (long documents, slow API day, cold GPU) can be 2-5x slower. Test your latency SLA under realistic load before going to production.

When to use each reranker: decision criteria

Use Cohere rerank-v3.5 when: (1) your query volume is under 1M/month and you value simplicity over cost; (2) you need multilingual or cross-lingual reranking with 100+ language support; (3) you are already using Cohere embeddings and want a native integrated solution; (4) you need the flat per-query pricing model to simplify billing (budget per query is predictable); (5) you have no ML infrastructure capability and need a fully managed reranker with Cohere's support.

Use Voyage rerank-2 when: (1) your documents are short (under 200 tokens average) and you want to optimize cost — token-based pricing is favorable for short docs; (2) you are already using Voyage embeddings and want a native pairing; (3) you need domain-specialized reranking (finance, law) and have verified with Voyage sales that domain-specific variants are available for your use case; (4) you want token-based pricing to more accurately model cost proportional to work performed.

Use BGE self-hosted when: (1) your query volume exceeds 500k/month and cost is a priority — the ROI on self-hosting becomes positive at this volume; (2) you need maximum data privacy (documents never leave your infrastructure); (3) you need the lowest possible latency via local inference with no network round-trip; (4) you have existing ML infrastructure (Kubernetes cluster, GPU node pools) that can host a small inference service; (5) you are building a product where the reranker is a competitive cost differentiator and 97% savings at 10M queries/month materially affects your unit economics.

Use bge-reranker-v2-gemma specifically when: quality is more important than cost within the self-hosted option set and you have A100 or L4 GPU capacity. The 2-5 nDCG@10 point improvement over bge-reranker-v2-m3 is meaningful for high-stakes retrieval (medical, legal, financial RAG where wrong answers carry real consequences).

Hybrid architecture: nothing prevents using Cohere or Voyage for low-volume production traffic while migrating to BGE for high-volume paths. A tiered reranker routing layer — Cohere for premium users (low volume, SLA matters), BGE for high-volume standard tier (cost matters) — is a reasonable production pattern for platforms with heterogeneous user segments. This is compatible with a vector DB setup that uses a single index but separate reranker routes per tier.
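The routing layer for such a tiered setup can be very small. In the sketch below the tier names and the two reranker callables are hypothetical placeholders; in practice each would wrap an API client or a call to your self-hosted inference service.

```python
from typing import Callable, Dict, List

Reranker = Callable[[str, List[str]], List[str]]


def make_router(routes: Dict[str, Reranker], default_tier: str) -> Callable[[str], Reranker]:
    """Build a function that maps a user tier to the reranker serving it."""
    if default_tier not in routes:
        raise KeyError(f"default tier {default_tier!r} has no reranker")

    def pick(user_tier: str) -> Reranker:
        return routes.get(user_tier, routes[default_tier])

    return pick


# Placeholder rerankers: each would call a real service in production.
def api_reranker(query: str, docs: List[str]) -> List[str]:
    return sorted(docs)            # stand-in for a managed API call


def self_hosted_reranker(query: str, docs: List[str]) -> List[str]:
    return sorted(docs, key=len)   # stand-in for a local model call


router = make_router({"premium": api_reranker, "standard": self_hosted_reranker},
                     default_tier="standard")
print(router("premium").__name__, router("standard").__name__, router("trial").__name__)
```

Unknown tiers fall back to the default route, so adding a new user segment does not require touching the retrieval code.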

What not to do: do not skip reranking because the embedding model is good. The best embedding model in the world is still a bi-encoder producing an approximation of relevance. A cross-encoder reranker on even 20 candidates will improve precision measurably. The $0.001/query cost of reranking is the cheapest retrieval quality investment available.

Sourcing and data caveats

Pricing data sourced from: Cohere pricing page at https://cohere.com/pricing (fetched June 2026); Voyage AI reranker docs at https://docs.voyageai.com/docs/reranker (fetched June 2026); BGE models on Hugging Face at https://huggingface.co/BAAI (fetched June 2026). GPU pricing from GCP compute pricing at https://cloud.google.com/compute/gpus-pricing and AWS EC2 pricing at https://aws.amazon.com/ec2/pricing/ (fetched June 2026).

BEIR benchmark data sourced from the MTEB leaderboard at https://huggingface.co/spaces/mteb/leaderboard, reranking category, fetched June 2026. MTEB leaderboard rankings update continuously as new models are submitted — the relative positioning of models can shift between publication and reading. Treat relative rankings as directional guidance; verify absolute scores before making procurement decisions.

GPU cost model assumptions: on-demand pricing for a single T4 GPU on GCP (n1-standard-4 + T4) at ~$0.40/hr. Spot instance prices fluctuate and are excluded from primary analysis. Prices vary by region — us-central1 pricing used as reference. AWS g4dn.xlarge (T4-based) is approximately the same price range. Azure NC4as T4 v3 is similarly priced. Your actual GPU cost may differ based on reserved instance discounts, committed use discounts, or negotiated enterprise pricing.

BGE throughput estimates (500-1,000 pairs/second on T4) are based on reported benchmarks and community testing with bge-reranker-v2-m3 at typical document lengths (128-512 tokens) using batch inference. Your actual throughput will vary with document length, batch size, CUDA version, and model quantization. Measure empirically on your workload before capacity planning.
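To connect the GPU price and throughput assumptions above to a capacity plan, a rough calculation helps. This is a sketch under stated assumptions: 100 candidates per query, a 30-day month, 730 billable hours, and perfectly uniform traffic. None of these are measured values, and real traffic has peaks.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month


def required_pairs_per_second(queries_per_month: float, candidates_per_query: int) -> float:
    """Average (query, document) pairs the reranker must score each second."""
    return queries_per_month * candidates_per_query / SECONDS_PER_MONTH


def gpu_monthly_cost(usd_per_hour: float = 0.40, hours_per_month: float = 730) -> float:
    """On-demand cost of one always-on GPU instance."""
    return usd_per_hour * hours_per_month


def average_utilization(queries_per_month: float, candidates_per_query: int,
                        gpu_pairs_per_second: float) -> float:
    """Fraction of the GPU's estimated throughput consumed by the average load."""
    return required_pairs_per_second(queries_per_month, candidates_per_query) / gpu_pairs_per_second


if __name__ == "__main__":
    need = required_pairs_per_second(10_000_000, 100)
    print("pairs/second needed:", round(need, 1))
    print("utilization at 500 pairs/s:", round(average_utilization(10_000_000, 100, 500), 2))
    print("utilization at 1,000 pairs/s:", round(average_utilization(10_000_000, 100, 1000), 2))
    print("GPU USD/month:", round(gpu_monthly_cost(), 2))
```

If average utilization lands close to 1.0 at the low end of the throughput estimate, plan for a second instance or a smaller candidate count, since this model leaves no room for traffic peaks.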

All cost scenarios in this guide are approximate models meant to illustrate relative economics. Actual production costs will differ based on infrastructure choices, reserved capacity, data transfer costs, and operational overhead. The BEIR benchmark figures and vendor pricing are as of June 2026; before procurement, confirm that the figures in this guide still match the vendor pricing pages and the MTEB leaderboard.

How to choose a reranker for your RAG pipeline

1. Estimate your query volume and run the cost model
The fundamental reranker choice is API vs self-hosted, and that choice hinges on query volume. Below ~300k queries/month, a managed API (Cohere or Voyage) is cheaper than a dedicated GPU instance after accounting for engineering time. Above 500k-1M queries/month, BGE self-hosted on a T4 GPU is cheaper, often by 90-97%, and the ROI on self-hosting engineering effort is positive within the first month. Run the math with your specific query volume, average document count per rerank call, and average document length before defaulting to a vendor.

2. Identify your document characteristics: length and language
Document length drives the Cohere vs Voyage cost comparison. Voyage token pricing is cheaper for short documents (under 200 tokens) and more expensive for long documents (over 500 tokens). Cohere's flat per-query pricing is neutral on document length. Multilingual requirements favor Cohere rerank-v3.5 (100+ languages, cross-lingual) or bge-reranker-v2-m3 (also multilingual with cross-lingual training). Voyage rerank-2 is best for English and major European languages.

3. Validate quality on your specific retrieval task
BEIR benchmarks tell you average quality across 18 diverse datasets. Your task may behave differently. Build a held-out evaluation set of 100-500 (query, relevant-document) pairs from your actual corpus. Measure nDCG@10 for first-stage retrieval alone, then with Cohere, Voyage, and BGE reranking. The model that adds the most nDCG@10 points on your specific task is the right model, regardless of BEIR averages. This 30-minute evaluation exercise prevents months of suboptimal retrieval.

4. Decide on two-stage architecture: candidate count and latency budget
Set your candidate count (how many documents does first-stage retrieval return for reranking?) and latency budget. Start with 50-100 candidates, enough to capture most relevant documents without exploding reranker cost. For latency-sensitive applications (user-facing real-time search), target total reranker latency under 200ms and consider the async reranking pattern (run reranker in parallel with LLM context assembly). For batch offline jobs, throughput matters more than latency, so optimize for GPU utilization.

5. Wire in observability from day one
Reranker quality degrades silently when first-stage retrieval changes (new document types, domain drift, embedding model updates). Log average reranker scores per query, track the distribution of top-1 reranker scores over time, and alert when the distribution shifts significantly. Set up your eval set as a regression test that runs on each deployment. Monitor GPU utilization and inference latency (p50, p95, p99) for self-hosted BGE. Rerankers are a second-stage black box, so make the black box observable.
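The evaluation in step 3 needs only a few lines of code once you have ranked document IDs per query. The sketch below computes nDCG@10 with linear gains (some toolkits use exponential gains instead, so absolute numbers may differ). The data shapes, `runs` as query ID to ranked document IDs and `qrels` as query ID to relevance labels, are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query; documents missing from `relevance` count as 0."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(runs: Dict[str, List[str]], qrels: Dict[str, Dict[str, float]], k: int = 10) -> float:
    """Average nDCG@k over every query that has relevance labels."""
    if not qrels:
        return 0.0
    scores = [ndcg_at_k(runs.get(qid, []), rels, k) for qid, rels in qrels.items()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    qrels = {"q1": {"d3": 1.0}, "q2": {"d7": 1.0}}
    first_stage = {"q1": ["d1", "d3", "d9"], "q2": ["d7", "d2", "d4"]}
    reranked = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d4", "d2"]}
    print("first stage:", mean_ndcg(first_stage, qrels))
    print("reranked:   ", mean_ndcg(reranked, qrels))
```

Run the same function over the first-stage output and over each reranker's output, then compare the means. Keeping the script in your repository also gives you the regression test described in step 5.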

Frequently Asked Questions

What is a cross-encoder reranker and how does it differ from an embedding model?

An embedding model (bi-encoder) encodes queries and documents independently into vectors; similarity is computed as vector distance. A cross-encoder reranker processes the (query, document) pair jointly in a single forward pass, producing a relevance score. Joint processing lets the model attend to specific query terms when scoring the document and vice versa, which is why cross-encoders consistently outperform bi-encoders on precision metrics. The trade-off is that cross-encoders cannot pre-compute document representations: they must run a full forward pass per (query, document) pair, so the cost is O(N) forward passes per query for N candidates. This is why rerankers are run as a second stage: first-stage retrieval narrows candidates to 50-100, then the reranker scores those candidates precisely.
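The structural difference shows up clearly in code. The sketch below is a toy two-stage pipeline, not a real model: document vectors are assumed to be pre-computed, and `score_pair` is a hypothetical stand-in for a cross-encoder forward pass that you would replace with an API call or a self-hosted model.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def first_stage(query_vec: Vector, doc_vecs: List[Vector], top_n: int) -> List[int]:
    """Bi-encoder stage: document vectors were computed offline, so each
    query costs only one cheap similarity computation per document."""
    order = sorted(range(len(doc_vecs)), key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    return order[:top_n]


def second_stage(query: str, docs: List[str], candidate_ids: List[int],
                 score_pair: Callable[[str, str], float], top_k: int) -> List[int]:
    """Cross-encoder stage: one joint scoring call per (query, document)
    pair, i.e. len(candidate_ids) forward passes for every query."""
    scored = [(score_pair(query, docs[i]), i) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]


def toy_score_pair(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: count of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


if __name__ == "__main__":
    docs = ["t4 gpu hourly pricing", "reranker latency budget", "team lunch schedule"]
    doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
    candidates = first_stage([1.0, 0.0], doc_vecs, top_n=2)
    print(second_stage("reranker latency", docs, candidates, toy_score_pair, top_k=1))
```

Nothing in the second stage can be cached across queries, which is why the candidate count from the first stage directly sets reranker cost and latency.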

How much does adding a reranker actually improve RAG quality?

Across BEIR's 18 heterogeneous retrieval datasets, adding a cross-encoder reranker on top of first-stage BM25 or dense retrieval adds 5-10 nDCG@10 points on average, with some tasks seeing 10-15 point improvements. On tasks where lexical similarity is misleading (medical queries, financial jargon, cross-lingual retrieval), the improvement is largest. For user-facing RAG, the practical effect is that the relevant document appears in the top-3 results much more reliably, which means the LLM generates more accurate answers with fewer hallucinations from irrelevant context.

How much cheaper is BGE self-hosted vs Cohere rerank at 10M queries/month?

At 10M queries/month, Cohere rerank-v3.5 costs $10,000/month. A single T4 GPU running bge-reranker-v2-m3 on GCP (on-demand) costs approximately $290/month, about 97% cheaper. 10M queries/month is an average load of about 3.9 QPS; at 100 candidates per query that is roughly 390 pairs/second, below the 500-1,000 pairs/second T4 throughput estimate, so a single T4 can carry the average load, though traffic peaks may need headroom or a smaller candidate count. The engineering cost to deploy BGE self-hosted is roughly 1-2 engineer-days for an initial deployment using Hugging Face Text Embeddings Inference. The ROI on self-hosting is positive in the first month at this query volume.

Can I use Cohere rerank-v3.5 with Voyage or OpenAI embeddings?

Yes. Cross-encoder rerankers receive raw text (query string + document strings), not embedding vectors. Cohere rerank-v3.5 works with any embedding source — Voyage, OpenAI, or any open-source model. The reranker re-scores documents based on text, independently of how those documents were retrieved. Cross-vendor pairings (Voyage embed + Cohere rerank, or OpenAI embed + Voyage rerank) are common and work correctly.

Which reranker is best for multilingual RAG?

For API-based multilingual reranking, use Cohere rerank-v3.5: it supports 100+ languages with explicit cross-lingual capability (for example, an English query against French documents). For self-hosted multilingual reranking, use bge-reranker-v2-m3, which is also trained cross-lingually and has competitive multilingual BEIR scores. Voyage rerank-2 is best for English and major European language pairs; it is not the right choice for broader multilingual needs.

What is the difference between bge-reranker-v2-m3 and bge-reranker-v2-gemma?

bge-reranker-v2-m3 is ~568M parameters, multilingual, runs on a T4 GPU (16 GB VRAM), and achieves strong BEIR nDCG@10 scores competitive with Cohere rerank-v3 on most tasks. bge-reranker-v2-gemma uses the Gemma-2B architecture (~2B parameters), requires an A100 or L4 GPU, and outperforms m3 by approximately 2-5 nDCG@10 points on English retrieval tasks. Choose m3 when cost efficiency and multilingual coverage matter most; choose gemma when maximum quality on English retrieval is the priority and you have A100/L4 GPU access.

At what query volume does BGE self-hosting make economic sense?

The break-even point is roughly 300k-500k queries/month for a dedicated T4 GPU instance (on-demand). Below that, API rerankers are cheaper when you include engineering overhead. Above 500k queries/month, BGE is cheaper, and the gap grows linearly. At 1M queries/month, BGE saves approximately $710/month vs Cohere ($1,000 - $290). At 10M queries/month, BGE saves approximately $9,710/month. For organizations with existing ML infrastructure (Kubernetes + GPU nodes already in use for other workloads), the break-even is lower because fixed infrastructure cost is shared.
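The break-even figures above follow from a one-line formula. This is a minimal sketch using the guide's numbers ($0.001 per Cohere query, about $290/month for an on-demand T4); the `overhead_usd` parameter is a hypothetical placeholder for monthly engineering and operations cost, which you should estimate for your own team.

```python
def api_monthly_cost(queries: float, usd_per_query: float = 0.001) -> float:
    """Monthly cost of a flat per-query reranking API."""
    return queries * usd_per_query


def monthly_savings(queries: float, gpu_monthly_usd: float = 290.0,
                    usd_per_query: float = 0.001) -> float:
    """API cost minus self-hosted GPU cost; negative means the API is cheaper."""
    return api_monthly_cost(queries, usd_per_query) - gpu_monthly_usd


def break_even_queries(gpu_monthly_usd: float = 290.0, usd_per_query: float = 0.001,
                       overhead_usd: float = 0.0) -> float:
    """Query volume at which self-hosting costs the same as the API.
    overhead_usd is a hypothetical monthly engineering/ops allowance."""
    return (gpu_monthly_usd + overhead_usd) / usd_per_query


if __name__ == "__main__":
    print("savings at 1M:", monthly_savings(1_000_000))
    print("savings at 10M:", monthly_savings(10_000_000))
    print("break-even, GPU only:", break_even_queries())
    print("break-even, +$200 overhead:", break_even_queries(overhead_usd=200.0))
```

On GPU cost alone the break-even is about 290k queries/month. Adding an assumed $200/month of overhead pushes it to about 490k, which is one way to read the 300k-500k range.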

Should I rerank on all queries or only some?

Not all queries benefit equally from reranking. Queries where first-stage retrieval reliably returns highly relevant results (exact keyword matches against well-structured data) benefit less than ambiguous semantic queries. A practical optimization is to selectively apply reranking: use a lightweight first-stage quality signal (e.g., top-1 embedding similarity score) to route only low-confidence retrievals through the reranker. This can cut reranker call volume by 40-60% with minimal quality degradation. Empirically validate the routing threshold on your eval set before deploying selective reranking in production.
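Selective reranking can be sketched as a confidence gate in front of the reranker. The function names and the example threshold below are hypothetical, and the `rerank` callable stands in for whichever backend you use; the right threshold has to come from your own eval set.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (document text, first-stage similarity score)


def selective_rerank(query: str, candidates: List[Candidate],
                     rerank: Callable[[str, List[str]], List[str]],
                     threshold: float, top_k: int = 10) -> List[str]:
    """Skip the reranker when the best first-stage score clears the threshold."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    docs = [doc for doc, _ in ranked]
    if not ranked or ranked[0][1] >= threshold:
        return docs[:top_k]            # high confidence: keep first-stage order
    return rerank(query, docs)[:top_k]  # low confidence: pay for the reranker


def rerank_fraction(top1_scores: Sequence[float], threshold: float) -> float:
    """Share of queries that would still be sent to the reranker."""
    if not top1_scores:
        return 0.0
    return sum(score < threshold for score in top1_scores) / len(top1_scores)


if __name__ == "__main__":
    logged_top1 = [0.91, 0.62, 0.88, 0.55, 0.79]
    # 0.85 is an arbitrary example threshold, not a recommendation.
    print("fraction reranked:", rerank_fraction(logged_top1, 0.85))
```

Running `rerank_fraction` over logged top-1 scores shows how much call volume a given threshold would remove. Pair that with an nDCG@10 comparison on your eval set to confirm the quality cost is acceptable.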
