By The DDH Team · Digital Dashboard Hub

Embedding Model Leaderboard 2026: MTEB Rankings, RAG-Specific Recall, and the Open vs API Trade-off

By The DDH Team at Digital Dashboard Hub · Updated June 21, 2026

Choosing the wrong embedding model is one of the most expensive silent errors in an AI stack. A model that ranks highly on a general benchmark may underperform by five or more points on the specific retrieval task your RAG pipeline actually needs — and because the embedding layer is upstream of everything else, a bad choice quietly degrades every answer your system produces. This guide uses the public MTEB v2 leaderboard as of June 2026 to rank the top ten models, then reorders them by MTEB-Retrieval specifically, which is the subset that predicts real-world dense retrieval quality.

All numbers cited here are approximations drawn from the public Massive Text Embedding Benchmark leaderboard maintained on Hugging Face. The leaderboard is updated continuously as new submissions arrive, so treat every score as a snapshot rather than a permanent verdict. Before making infrastructure decisions, verify current rankings at huggingface.co/spaces/mteb/leaderboard and cross-check provider pricing pages for current API costs, since both change frequently.

This post covers the full top-ten breakdown, why the RAG-specific reordering matters, a deep dive into open-source heavyweights like NV-Embed-v2 and gte-Qwen2-7B-instruct, an honest analysis of API models from OpenAI, Voyage, Cohere, and Jina, domain-specialized variants for code and legal corpora, and a total-cost-of-ownership framework for the open vs API decision. For a direct side-by-side of the leading API providers, see Cohere vs Voyage vs OpenAI embeddings. For RAG system design decisions upstream of the embedding layer, the RAG architecture decision tree 2026 is the recommended starting point.

MTEB v2 embedding model leaderboard — June 2026

Model	Provider	MTEB avg	MTEB-Retrieval	Dims	Max tokens	License	API price (USD per 1M tokens)
NV-Embed-v2	NVIDIA	~72.3	~61	4096	32,768	Apache 2.0	Self-hosted only
voyage-3-large	Voyage AI / MongoDB	~70.1	~62–63	2048	32,000	Proprietary	~$0.18
gte-Qwen2-7B-instruct	Alibaba	~69.5	~59	3584	32,768	Apache 2.0	Self-hosted only
Cohere embed-v4.0	Cohere	~68.2	~60–61	1024	128,000	Proprietary	~$0.10
bge-multilingual-gemma2	BAAI	~67.4	~56	3584	8,192	Apache 2.0	Self-hosted only
voyage-3	Voyage AI / MongoDB	~66.5	~57	1024	32,000	Proprietary	~$0.06
mxbai-embed-large-v1	MixedBread.ai	~65.3	~55	1024	512	Apache 2.0	Self-hosted only
text-embedding-3-large	OpenAI	~64.6	~58–59	3072	8,191	Proprietary	~$0.13
jina-embeddings-v3	Jina AI	~64.1	~54	1024	8,192	CC BY-NC 4.0	~$0.02
text-embedding-3-small	OpenAI	~62.3	~55	1536	8,191	Proprietary	~$0.02

All scores are approximations from the public MTEB v2 leaderboard as of June 2026. Rankings shift as new models are submitted. Verify current standings at huggingface.co/spaces/mteb/leaderboard. API pricing is approximate and subject to change; always check the provider's current pricing page before budgeting. Self-hosted models have no per-token API fee but require GPU infrastructure investment. Dimension counts reflect default output; some models support Matryoshka-style dimension truncation.

Why MTEB matters — and where it falls short

The Massive Text Embedding Benchmark is the closest thing the embedding world has to a standardized exam. With its public leaderboard hosted on Hugging Face, MTEB v2 covers over 50 tasks across retrieval, classification, clustering, semantic textual similarity, summarization, and reranking, drawing on datasets from multiple languages and domains. A model's MTEB average score aggregates performance across all of these, giving a single headline number that lets you compare models without running your own experiments on every task type.

The benchmark's strength is also its weakness. Because the average pools more than 50 heterogeneous tasks, a model can score well by excelling at classification and clustering while being mediocre at the dense retrieval task that powers most RAG pipelines. The inverse is also true: a model tuned specifically for passage retrieval may rank lower on the overall leaderboard while outperforming the headline winner on the exact task you care about. This is why the MTEB-Retrieval subset deserves separate attention, which the next section covers in detail.

There are additional limits worth naming. MTEB primarily evaluates English-language performance, though the MMTEB multilingual extension covers additional languages for some models. It tests on held-out academic datasets, which may not reflect your domain's vocabulary, query style, or document length distribution. And it evaluates embedding quality in isolation — not how that quality translates into end-to-end RAG answer quality when combined with a specific chunking strategy and retriever. For a more practical grounding, pairing MTEB scores with your own corpus evaluation (described in the steps section below) is strongly recommended.

The full leaderboard breakdown

NV-Embed-v2 from NVIDIA holds the top position on MTEB v2 average at approximately 72.3 as of June 2026. This 7.85-billion-parameter model builds on the instruction-following design of the NV-Embed architecture, producing 4096-dimensional vectors with a 32,768-token context window. Its strength lies in following task-specific instructions at inference time, allowing the same model weights to serve retrieval, classification, and semantic similarity tasks with differentiated prompting. The trade-off is a significant GPU memory requirement: an A100 or equivalent is the practical minimum for production inference at reasonable throughput.

Voyage AI's voyage-3-large sits in second at approximately 70.1 MTEB average, but as the next section explains, it leads on the MTEB-Retrieval subset at roughly 62–63 — meaning it is the better choice for most RAG applications despite its lower headline ranking. Alibaba's gte-Qwen2-7B-instruct takes third at approximately 69.5, benefiting from the Qwen2 language model backbone to produce rich 3584-dimensional embeddings. Cohere's embed-v4.0 ranks fourth at approximately 68.2, with the notable differentiator of native 128,000-token context — a meaningful advantage for embedding long documents without pre-chunking.

The remainder of the top ten — bge-multilingual-gemma2, voyage-3, mxbai-embed-large-v1, text-embedding-3-large, jina-embeddings-v3, and text-embedding-3-small — span a range from roughly 67 down to 62. Within this band, the differences in practical retrieval quality are often smaller than the differences implied by the absolute numbers, and operational factors like latency, API reliability, token limits, and total cost often outweigh a two-point score gap. The embeddings cost calculator can help model the cost side of this comparison.

MTEB-Retrieval reranking: why the order changes for RAG

MTEB-Retrieval is the subset of MTEB tasks focused specifically on dense passage retrieval — the task of finding the most semantically relevant documents in a corpus given a natural-language query. This is the core competency for RAG pipelines, vector database search, and semantic search applications. The reordering it produces compared to the overall average is significant and worth understanding before committing to any model.

voyage-3-large leads the retrieval subset at approximately 62–63, climbing above NV-Embed-v2 (which scores around 61 on retrieval despite leading the overall average). Cohere embed-v4.0 follows closely at roughly 60–61. gte-Qwen2-7B-instruct drops more substantially to approximately 59, suggesting that its broad capability advantages across classification and clustering tasks account for more of its headline score than its retrieval performance. OpenAI's text-embedding-3-large, despite ranking eighth overall, scores approximately 58–59 on retrieval, narrowing the gap with models that rank significantly above it on the aggregate.

The practical takeaway is this: if you are building a RAG system, the MTEB-Retrieval column is the one to sort by, not the MTEB average. The divergence between the two rankings is not noise — it reflects genuine differences in how different model architectures and training objectives handle the query-document matching problem. Models that excel at classification tasks use their embedding space differently than models optimized for retrieval, and the aggregate average obscures this. For RAG-specific guidance on how the embedding layer interacts with retrieval architecture choices, see the RAG architecture decision tree 2026 and the GraphRAG vs Vector RAG comparison.
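
The reordering can be reproduced from the table's own numbers. The sketch below hard-codes the approximate June 2026 scores and sorts by each column; collapsing ranges such as 62–63 to their midpoint is an assumption made only for this illustration, and the scores should be refreshed from the live leaderboard before drawing conclusions.

```python
# Approximate (MTEB avg, MTEB-Retrieval) scores copied from the table above.
# Ranges are collapsed to their midpoint (an assumption for this example).
SCORES = {
    "NV-Embed-v2": (72.3, 61.0),
    "voyage-3-large": (70.1, 62.5),
    "gte-Qwen2-7B-instruct": (69.5, 59.0),
    "Cohere embed-v4.0": (68.2, 60.5),
    "bge-multilingual-gemma2": (67.4, 56.0),
    "voyage-3": (66.5, 57.0),
    "mxbai-embed-large-v1": (65.3, 55.0),
    "text-embedding-3-large": (64.6, 58.5),
    "jina-embeddings-v3": (64.1, 54.0),
    "text-embedding-3-small": (62.3, 55.0),
}


def ranking(column: int) -> list[str]:
    """Return model names sorted by the given score column, best first."""
    return sorted(SCORES, key=lambda name: SCORES[name][column], reverse=True)


by_average = ranking(0)
by_retrieval = ranking(1)

# Show how each model's position changes between the two orderings.
for name in by_retrieval:
    avg_rank = by_average.index(name) + 1
    ret_rank = by_retrieval.index(name) + 1
    print(f"{name:26s} average #{avg_rank:<2d} -> retrieval #{ret_rank}")
```

Sorting on the second column moves voyage-3-large to first place and lifts text-embedding-3-large from eighth to fifth, which is the shift the section describes.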

Open-source models deep dive: NV-Embed-v2, gte-Qwen2-7B-instruct, bge-multilingual-gemma2

NV-Embed-v2 is the current open-source benchmark leader and the model to beat for teams with GPU infrastructure. Released under Apache 2.0, it is freely usable for commercial applications without per-token fees. The model's instruction-following architecture means you prompt it differently depending on the task — for retrieval, the query should be prefixed with a retrieval-specific instruction string, while document embeddings are typically computed without instruction prefixing. This asymmetric prompting is well-documented in NVIDIA's model card on Hugging Face. At 7.85 billion parameters and 4096-dimensional output, it is memory-intensive: expect approximately 16GB of GPU VRAM at half-precision, making an A100 40GB or H100 the realistic minimum for production batch inference. Quantized versions are available and reduce memory requirements, with modest quality trade-offs.
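
The 16GB figure is simply parameter count times bytes per parameter. A minimal sketch of that arithmetic follows; it counts the weights only, so activations, batch buffers, and framework overhead (not modeled here) come on top, which is why a 40GB card is the more comfortable floor.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NV_EMBED_V2_PARAMS = 7.85e9  # parameter count quoted in the text

fp16_gb = weight_memory_gb(NV_EMBED_V2_PARAMS, 2)  # half precision: 2 bytes each
int8_gb = weight_memory_gb(NV_EMBED_V2_PARAMS, 1)  # 8-bit quantization: 1 byte each

print(f"fp16 weights: {fp16_gb:.2f} GB")
print(f"int8 weights: {int8_gb:.2f} GB")
```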

gte-Qwen2-7B-instruct from Alibaba ranks third overall and benefits from the Qwen2-7B language model backbone, which was pre-trained on a large multilingual corpus before embedding-specific fine-tuning. The 3584-dimensional output is dense and requires proportionally larger vector indices, which has downstream effects on both memory and query latency in production. Its 32,768-token maximum context is a practical advantage for embedding longer documents without chunking, though embedding quality at the extreme end of the context window degrades, as with most long-context models. Like NV-Embed-v2, it is Apache 2.0 licensed and requires self-hosted GPU infrastructure.
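
To make the index-size effect concrete, the sketch below estimates raw vector storage for a few of the dimension counts in the table. It assumes float32 storage (4 bytes per value) and ignores the extra overhead of whatever approximate-nearest-neighbor index sits on top, so treat the results as a lower bound.

```python
def raw_index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Size of the raw vectors in decimal GB, before any ANN index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


ONE_MILLION = 1_000_000
MODELS = [
    ("gte-Qwen2-7B-instruct", 3584),
    ("voyage-3-large", 2048),
    ("Cohere embed-v4.0", 1024),
]

for model, dims in MODELS:
    size = raw_index_size_gb(ONE_MILLION, dims)
    print(f"{model:24s} {size:6.2f} GB per 1M vectors")
```

At one vector per chunk, a 3584-dimensional model needs 3.5 times the storage of a 1024-dimensional one for the same corpus.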

bge-multilingual-gemma2 from BAAI occupies a different niche. While its 67.4 MTEB average is strong and its retrieval score of approximately 56 is competitive, its primary advantage is multilingual coverage. Built on the Gemma2 backbone, it supports dozens of languages with notably stronger cross-lingual retrieval performance than English-focused models. However, its 8,192-token context window is shorter than the other open-source leaders, and its 3584-dimensional output carries the same index size implications as gte-Qwen2. Teams building multilingual RAG applications or semantic search over non-English corpora should benchmark this model alongside Cohere's embed-v4.0, which offers similar multilingual breadth via API.

API models analysis: OpenAI, Voyage AI, Cohere, and Jina

OpenAI's text-embedding-3-large remains the most widely deployed embedding model in production systems as of mid-2026, reflecting its combination of reasonable quality, ecosystem integration, and operator familiarity. At approximately 64.6 MTEB average and 58–59 on retrieval, it is not the top performer in either ranking, but it benefits from extremely high uptime, predictable latency, and native integration with LangChain, LlamaIndex, and most major RAG frameworks. The 3072-dimensional output is large enough to represent nuanced semantic distinctions, and Matryoshka representation learning means you can truncate to smaller dimensions (1536, 768, etc.) with modest quality loss if index storage is a constraint. OpenAI's embeddings rate limits are relevant at scale and worth reviewing before designing your ingestion pipeline. Pricing at approximately $0.13/1M tokens is moderate: not the cheapest option, but competitive for the quality level it delivers.
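
Matryoshka truncation amounts to keeping the leading components of a vector and rescaling it to unit length so cosine and dot-product scores stay comparable. The helper below is a minimal client-side sketch with a toy 4-dimensional vector; the function name is illustrative, and the technique is only meaningful for models actually trained with Matryoshka Representation Learning. For the text-embedding-3 models, OpenAI's API also accepts a `dimensions` request parameter that returns shortened vectors directly.

```python
import math


def truncate_and_renormalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy unit vector standing in for real model output
short = truncate_and_renormalize(full, 2)
print(len(short), short)
```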

Voyage AI's voyage-3-large is the strongest API option for pure retrieval performance, leading the MTEB-Retrieval subset at approximately 62–63 despite ranking second on the overall average. Voyage AI was acquired by MongoDB in 2025 and its models are now available both through the Voyage API and natively within MongoDB Atlas Vector Search. The 2048-dimensional output is smaller than NV-Embed-v2 or gte-Qwen2, which reduces index storage requirements while maintaining strong retrieval quality — a favorable trade-off. At approximately $0.18/1M tokens, it is the most expensive API option on the list, but the retrieval quality premium is measurable. voyage-3 at approximately $0.06/1M is a lower-cost alternative from the same provider with somewhat lower retrieval performance, suitable for applications where cost efficiency outweighs peak performance. For a detailed comparison, see Cohere vs Voyage vs OpenAI embeddings.

Cohere's embed-v4.0 offers the most distinctive feature profile of any API model: a 128,000-token context window, roughly four times the 32K windows of the next-longest models on this list and about sixteen times the 8K windows of the OpenAI and Jina models, and leading multilingual performance on the MMTEB benchmark across more than 100 languages. The 1024-dimensional output is smaller than the open-source leaders, which helps with index size and query speed. At approximately $0.10/1M tokens, it is mid-range on price. The long-context capability is particularly useful for embedding entire documents rather than chunked passages, though this changes the chunking strategy decisions described in chunking strategies 2026. Jina AI's jina-embeddings-v3 rounds out the API category at approximately $0.02/1M tokens, the joint-lowest API price on the list alongside text-embedding-3-small. Its CC BY-NC 4.0 license means it is free for non-commercial self-hosting but requires a commercial license or API payment for production commercial use.

Domain-specialized models: when general benchmarks do not tell the full story

MTEB measures performance across general-domain academic datasets, and the rankings above are meaningful for typical text corpora — web content, news, product descriptions, conversational data. But for highly specialized corpora, domain-specific embedding models trained on in-domain text can outperform general-purpose leaders by five to ten points on the relevant retrieval task. This gap is large enough to materially affect answer quality in production RAG systems.

voyage-code-3 is the clearest example. On the CodeSearchNet benchmark, which tests code-to-code and natural-language-to-code retrieval, voyage-code-3 outperforms general models including its sibling voyage-3-large by a significant margin. The model is trained on a large corpus of code in multiple programming languages and understands function signatures, variable names, and code documentation patterns in ways that general models do not. For any RAG application over a code repository, documentation codebase, or technical knowledge base with substantial code content, benchmarking voyage-code-3 before committing to a general model is strongly advised.

Voyage AI also maintains voyage-finance-2 for financial documents and voyage-law-2 for legal corpora, both of which show meaningful performance improvements over general models on their respective domain benchmarks. Cohere's embed-v4.0, while a general-purpose model, leads on multilingual benchmarks and is the recommended starting point for any application requiring embedding quality across non-English languages; its MMTEB scores reflect consistent multilingual retrieval performance that none of the open-source models on this list currently match at comparable operational cost. When evaluating whether a domain-specialized model is justified, the practical test is to build a 200–500 pair evaluation set from your actual corpus and compare recall@10 directly, rather than relying on public benchmark scores as a proxy.

The open vs API TCO calculation

The make-vs-buy decision for embedding models comes down to a total cost of ownership calculation that most teams do not run carefully enough before committing. The API cost is visible and linear: at OpenAI's text-embedding-3-large rate of approximately $0.13/1M tokens, embedding 100 million tokens per month costs $13. With self-hosted NV-Embed-v2, the per-token fee is zero, but the infrastructure cost is not. A single A100 80GB instance on a major cloud GPU provider runs approximately $2–4 per hour, translating to roughly $1,500–$3,000 per month for a single GPU at continuous operation. Dividing that monthly GPU bill by $0.13 per million tokens puts the break-even point between self-hosting NV-Embed-v2 and using text-embedding-3-large at roughly 11–23 billion tokens per month, a scale most applications do not reach.

The math shifts when you compare against more expensive API options. voyage-3-large at $0.18/1M tokens reaches cost parity with a single A100 at around 8–17 billion tokens per month depending on the GPU cost and utilization rate. At 50 million tokens per month, API wins decisively on cost for any provider on this list: the bill is under $10. At 500 million tokens per month, even voyage-3-large costs only about $90, still far below a dedicated GPU, so the comparison only becomes sensitive to the specific GPU instance type, utilization rate, and batching efficiency once volume reaches the billions of tokens per month. Teams already operating GPU clusters for LLM inference can attach an embedding model to existing infrastructure at near-zero marginal cost, which changes the calculation entirely.
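
The break-even arithmetic fits in two small functions. This sketch uses the approximate prices and the $1,500–$3,000 monthly GPU range quoted above; it assumes a single GPU can actually serve the volume in question, which should be checked against measured throughput.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """API bill in dollars for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


def break_even_tokens(gpu_monthly_cost: float, price_per_million: float) -> float:
    """Monthly token volume at which the API bill equals the GPU bill."""
    return gpu_monthly_cost / price_per_million * 1_000_000


# Approximate list prices per 1M tokens, taken from the table in this post.
PRICES = [("text-embedding-3-large", 0.13), ("voyage-3-large", 0.18)]

for provider, price in PRICES:
    low = break_even_tokens(1_500, price) / 1e9
    high = break_even_tokens(3_000, price) / 1e9
    print(f"{provider}: break-even at {low:.1f}-{high:.1f} billion tokens/month")

print(f"100M tokens on text-embedding-3-large: ${api_monthly_cost(100e6, 0.13):.2f}")
```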

There are non-cost factors that favor API models independent of the token volume calculation. API providers handle model versioning, scaling, uptime, and inference optimization. Self-hosted models require engineering time for deployment, monitoring, scaling policy design, and ongoing maintenance — costs that are real but rarely appear in a back-of-envelope TCO calculation. On the other side, self-hosted models give you data residency control, no rate limit exposure, and the ability to fine-tune on proprietary data. The embeddings cost calculator can help model the cost side at your specific token volumes; add 20–40% to any self-hosted estimate to account for engineering overhead and operational margin.
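
The 20–40% overhead rule can be folded into a rough self-hosted estimate. The sketch below assumes continuous operation at about 730 hours per month and treats the overhead as a flat multiplier on GPU rental; both are simplifications, and the function name and defaults are illustrative rather than taken from any calculator.

```python
HOURS_PER_MONTH = 730  # average month, assuming the GPU runs continuously


def self_hosted_monthly_cost(gpu_hourly_rate: float, num_gpus: int = 1,
                             overhead: float = 0.30) -> float:
    """GPU rental for a month plus a fractional engineering/ops overhead."""
    base = gpu_hourly_rate * num_gpus * HOURS_PER_MONTH
    return base * (1 + overhead)


# Bracket the estimate with the low and high ends of the overhead range.
for rate in (2.0, 4.0):
    low = self_hosted_monthly_cost(rate, overhead=0.20)
    high = self_hosted_monthly_cost(rate, overhead=0.40)
    print(f"${rate:.0f}/hour GPU: ${low:,.0f}-${high:,.0f} per month with overhead")
```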

How to build your own embedding model benchmark

MTEB scores are a useful prior, but the only reliable way to know which embedding model will perform best on your corpus is to measure directly. The process is straightforward and does not require large resources. Start by sampling 200–500 query-document pairs from your actual data — ideally a mix of queries that represent real user questions and the documents that should surface as top results. Include both easy positives (clearly relevant) and hard negatives (plausibly relevant but incorrect) to make the evaluation discriminating. Label relevance at a binary or three-tier level (relevant, partially relevant, not relevant).

For each model under evaluation, embed all queries and documents, then compute recall@k (typically k=5 and k=10) and mean reciprocal rank. These are the metrics that translate most directly into end-user experience in a RAG system — recall@10 tells you how often the correct document appears in the top 10 retrieved results, which is the effective recall ceiling for the downstream LLM. Run this evaluation for each candidate model and compare. If you are testing chunking strategies in parallel, this benchmark structure also gives you the infrastructure to measure chunking impact independently of model choice; the chunking strategies 2026 post covers this in detail.
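
Both metrics take only a few lines to compute. The sketch below assumes each query has a ranked list of retrieved document ids and a labeled set of relevant ids; the ids and the `evaluate` helper are hypothetical. Recall here is the fraction of relevant documents found, which reduces to a simple hit rate when every query has exactly one correct document.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(results: dict[str, list[str]], labels: dict[str, set[str]],
             k: int = 10) -> dict[str, float]:
    """Average recall@k and MRR over every labeled query."""
    n = len(labels)
    recall = sum(recall_at_k(results[q], labels[q], k) for q in labels) / n
    mrr = sum(reciprocal_rank(results[q], labels[q]) for q in labels) / n
    return {f"recall@{k}": recall, "mrr": mrr}


# Toy example with hypothetical query and document ids.
labels = {"q1": {"d1"}, "q2": {"d7", "d9"}}
results = {"q1": ["d3", "d1", "d5"], "q2": ["d7", "d2", "d4"]}
scores = evaluate(results, labels, k=3)
print(scores)
```

Running `evaluate` once per candidate model over the same `labels` gives directly comparable numbers.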

Once you have a baseline, run the benchmark on a monthly cadence. Strong new models appear on the MTEB leaderboard frequently, and the open-source tier in particular has been moving quickly. A discipline of re-evaluating when a new model posts a 3+ point MTEB-Retrieval improvement over your current model keeps you from falling behind as better options become available. Keep your evaluation set stable over time so improvements are attributable to the model rather than evaluation drift. If your corpus changes significantly, for example through domain expansion or new document types, update the evaluation set accordingly to keep it representative.

Sourcing and how to verify current rankings

Every score in this article is derived from the public MTEB v2 leaderboard hosted on Hugging Face at huggingface.co/spaces/mteb/leaderboard. The leaderboard accepts community submissions and is updated continuously, which means rankings can change between the date this was written and the date you are reading it. New model releases from major labs and from independent researchers regularly displace existing entries. The numbers here should be treated as a snapshot from approximately June 2026, not as permanent ground truth.

For API pricing, the approximate figures cited in the table above were sourced from provider pricing pages as of June 2026. Pricing changes with product updates, competitive pressure, and volume discounts. Before budgeting any embedding-dependent system, verify current pricing directly at the provider: platform.openai.com/docs/pricing for OpenAI, docs.voyageai.com/docs/embeddings for Voyage AI, cohere.com/pricing for Cohere, and jina.ai/embeddings for Jina AI. Volume pricing, committed-use discounts, and enterprise agreements can alter the per-token cost significantly from published list prices.

Model architecture details, dimension counts, and context window lengths are sourced from the respective model cards on Hugging Face and provider documentation pages. Dimension counts reflect default output; models supporting Matryoshka Representation Learning can produce smaller vectors by truncating the full output, typically with a modest quality loss that grows as the truncation becomes more aggressive. The Apache 2.0 licenses on the open-source models in this list are accurate as of the publication date but should be re-verified before commercial deployment, since license terms can be updated with new model versions.

Evaluating embedding models for your RAG system

1. Start with MTEB-Retrieval rankings, not MTEB average
The MTEB-Retrieval subset is the predictive benchmark for dense passage retrieval performance, the task that underpins RAG, semantic search, and vector database lookups. As the rankings in this post show, the order changes meaningfully when you isolate retrieval from the overall average. voyage-3-large leads on retrieval despite ranking second on the aggregate; gte-Qwen2-7B-instruct drops from third to fourth on retrieval; and text-embedding-3-large closes the gap significantly with models that rank above it on the headline score. Filter the MTEB leaderboard by the Retrieval task category to get the correct sorted view before shortlisting candidates.
2. Identify whether a domain-specialized variant exists for your corpus type
Before committing to a general-purpose model, check whether a domain-specialized variant exists for your application area. Voyage AI maintains voyage-code-3 for code corpora, voyage-finance-2 for financial documents, and voyage-law-2 for legal text, and domain-specialized models of this kind can outperform general models on their respective domain benchmarks by as much as five to ten points. If your corpus is primarily code, technical documentation, legal filings, or financial reports, a domain-specialized model will likely outperform the general leaderboard leaders. The only reliable test is to run your own evaluation set as described in step four, but checking for domain variants before that step can save evaluation effort by narrowing the candidate list.
3. Run the API vs open-source TCO calculation at your expected monthly token volume
Before selecting a self-hosted model based on benchmark score alone, complete the total cost of ownership calculation at your realistic monthly token volume. The embeddings cost calculator can help with the API cost side. For the self-hosted side, get current GPU instance pricing from your cloud provider, estimate the number of GPUs required to serve your peak throughput at acceptable latency, and multiply by the number of hours per month. Add a 20–40% overhead factor for engineering time, monitoring, and operational margin. At 500 million tokens per month or less, API models almost always win on total cost at the prices listed above. Once volume reaches the billions of tokens per month, the calculation becomes sensitive to infrastructure assumptions and existing GPU utilization.
4. Build a 200–500 pair held-out evaluation set from your actual corpus
MTEB scores on academic datasets are a useful prior but not a substitute for measurement on your data. Sample 200–500 query-document pairs from your actual corpus, covering the range of query types and document formats your application will encounter in production. Include hard negatives, meaning documents that look plausibly relevant but are not the correct answer, to make the evaluation discriminating. Compute recall@5, recall@10, and mean reciprocal rank for each candidate model. Keep this evaluation set stable over time so you can compare results as models are updated or replaced. This is the single highest-value investment you can make in embedding model selection.
5. Run a monthly benchmark and upgrade when a 3+ point retrieval improvement is available
The open-source embedding landscape in particular moves quickly, with new models regularly achieving meaningful improvements over their predecessors. Establish a monthly cadence of re-running your held-out evaluation set against new entrants that have posted a 3 or more point improvement on MTEB-Retrieval over your current deployed model. A 3-point retrieval improvement typically translates to a meaningful improvement in end-to-end RAG answer quality, making it a reasonable upgrade threshold. Below that level, the operational cost of re-embedding your corpus, updating index schemas, and redeploying the model generally outweighs the quality benefit. Set a calendar reminder to check the MTEB leaderboard monthly and filter for models that have improved in the Retrieval category since your last review.

Frequently Asked Questions

What does MTEB actually measure?

MTEB (Massive Text Embedding Benchmark) measures the quality of text embeddings across more than 50 tasks in eight categories: retrieval, reranking, classification, clustering, semantic textual similarity, summarization, pair classification, and bitext mining. Each task uses a different dataset and metric — retrieval tasks use nDCG@10, classification uses accuracy, clustering uses V-measure, and so on. A model's MTEB average is the mean score across all evaluated tasks. Because the tasks are heterogeneous, the average reflects general embedding quality rather than performance on any single application type. This is why domain-specific evaluation on your own corpus is important before making final model selection decisions.
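
For readers unfamiliar with nDCG@10, the sketch below shows one common formulation: each retrieved document's relevance gain is discounted by the log of its rank, and the total is divided by the score of an ideal ordering. It uses linear gains, which is an assumption of this example; some implementations use an exponential gain instead, so absolute values may not match a given evaluation harness exactly.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: list[float], all_relevant_gains: list[float],
              k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance of each retrieved document, in retrieved order.
    all_relevant_gains: relevance of every relevant document in the corpus.
    """
    ideal = dcg_at_k(sorted(all_relevant_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Two relevant documents exist; the system returned them at ranks 2 and 4.
score = ndcg_at_k([0, 1, 0, 1], [1, 1], k=10)
print(f"nDCG@10 = {score:.3f}")
```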

Why does MTEB-Retrieval differ from MTEB-average?

MTEB-Retrieval isolates the subset of tasks that evaluate dense passage retrieval — finding relevant documents given a natural-language query. This task places different demands on an embedding space than classification or semantic similarity tasks. A good retrieval embedding must encode query intent in a way that aligns with how relevant documents are encoded, including handling vocabulary mismatch, multi-hop reasoning, and implicit topic overlap. Models that excel at classification tasks may organize their embedding space optimally for separating semantic categories, which does not always translate to retrieval performance. Training objectives, contrastive loss functions, and hard-negative mining strategies all influence retrieval quality specifically, and models differ in how much their training emphasized this task.

Should I use open-source or API embedding models?

The decision depends primarily on your monthly token volume and whether you already operate GPU infrastructure. For most applications below 100 million tokens per month, API models win on total cost of ownership because GPU infrastructure costs exceed API fees at that scale. For applications with existing GPU clusters, attaching an open-source model at marginal cost can be competitive at any volume. Qualitatively, the top open-source models (NV-Embed-v2, gte-Qwen2-7B-instruct) match or exceed the best API models on benchmark scores, but require significant engineering investment for deployment, monitoring, and maintenance. API models offer reliability, managed scaling, and zero infrastructure overhead at the cost of per-token fees and data residency considerations.

How often does the MTEB leaderboard change?

The MTEB leaderboard accepts continuous submissions and can change week to week as new models are submitted. Major model releases from NVIDIA, Voyage AI, Cohere, OpenAI, and large academic labs like BAAI have historically produced meaningful leaderboard reshuffles. The open-source tier tends to move faster than the API tier because researchers can publish new models without the operational constraints of a product release. For practical purposes, checking the leaderboard monthly is sufficient to avoid missing a significant new entrant. The leaderboard is available at huggingface.co/spaces/mteb/leaderboard and can be filtered by task category, language, model size, and license.

Does dimension count affect embedding quality?

Higher dimension counts allow a model to encode more nuanced semantic information, but the relationship between dimensions and quality is not linear, and diminishing returns set in well below the maximum dimensions used by modern models. Among the models on this leaderboard, NV-Embed-v2 uses 4096 dimensions and leads the overall benchmark, while voyage-3-large uses 2048 dimensions and leads on retrieval — suggesting that training quality and architecture matter more than dimension count at current scales. Higher dimensions have a direct cost: larger index storage, slower approximate nearest neighbor search, and higher memory requirements in the vector database. Some models, including OpenAI's text-embedding-3 series, support Matryoshka Representation Learning, which allows you to truncate dimensions post-embedding with modest quality loss, giving flexibility to trade storage cost for quality.

When should I use domain-specialized models?

Domain-specialized models are worth benchmarking when your corpus is heavily concentrated in a specific domain that has known specialized vocabulary, structure, or query patterns that differ from general web text. The clearest cases are code search (voyage-code-3), legal document retrieval (voyage-law-2), and financial document search (voyage-finance-2). For multilingual applications, Cohere embed-v4.0 is the strongest general option with documented multilingual performance. The practical test is straightforward: build your own evaluation set and compare recall@10 between a general model and the domain-specific candidate. If the improvement exceeds 3 percentage points, the domain-specialized model is likely worth the operational complexity of maintaining a separate embedding pipeline.

What is the cheapest good-quality embedding model?

For API models, jina-embeddings-v3 and OpenAI's text-embedding-3-small are the joint cheapest at approximately $0.02/1M tokens. text-embedding-3-small scores approximately 62.3 on MTEB average and approximately 55 on MTEB-Retrieval — lower than the leaders but adequate for many applications, particularly those where the downstream LLM can compensate for imperfect retrieval. Jina's model has a CC BY-NC 4.0 license, meaning commercial self-hosting requires a commercial agreement with Jina AI. For self-hosted open-source, mxbai-embed-large-v1 is a strong option under Apache 2.0 but is limited to 512-token maximum input length, which restricts its use with longer documents. If your application involves short queries against shorter passages, mxbai-embed-large-v1 at zero API cost can be a cost-efficient choice.

How do I compare embedding models on my own corpus?

Sample 200–500 query-document pairs from your actual data, covering the range of query types and document formats you expect in production. Include hard negatives (plausibly relevant but incorrect documents) alongside true positives. Embed all queries and documents using each candidate model, then for each query compute recall@5, recall@10, and mean reciprocal rank against the labeled relevance set. Aggregate across all queries for each model and compare. This process can be completed in a few hours with basic Python tooling and any vector similarity library. Keep your evaluation set static over time to enable apples-to-apples comparisons as new models are released. Re-run the benchmark monthly to avoid missing significant new entrants on the MTEB leaderboard that might outperform your current deployed model.
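
The retrieval step itself can be prototyped without a vector database. The sketch below ranks documents by cosine similarity in pure Python; the toy 2-dimensional vectors stand in for real model output, and the document ids are hypothetical. For a few hundred queries against a few thousand documents, brute-force search like this is usually fast enough.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]],
          k: int = 10) -> list[str]:
    """Ids of the k documents most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional vectors standing in for real embeddings.
docs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
retrieved = top_k([0.9, 0.1], docs, k=2)
print(retrieved)
```

The ranked id list produced for each query is the input that recall@k and mean reciprocal rank are computed from.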
