By The DDH Team · Digital Dashboard Hub

Embedding Cost vs Quality (2026): Voyage vs OpenAI vs Cohere vs Google Benchmark

Embeddings are a decide-once-then-forget choice. Pick a model, embed your corpus, build your index, ship. The switching cost is brutal: any change of base model means re-embedding the entire corpus, rebuilding your vector index, re-tuning your chunking strategy, and re-running your retrieval evals. That's why teams default to whatever the loudest blog post recommended (usually OpenAI text-embedding-3-small or text-embedding-3-large, both released in January 2024) and never revisit the choice. The defaults are sticky because the migration is painful.

The reality in June 2026 is that the embedding landscape has 2-5x quality variance on domain-specific tasks, 7-10x cost variance across providers, and 3-5x latency variance depending on region. Voyage 3 sits at the top of MTEB (~70.5 avg) at $0.18 per 1M tokens. Google text-embedding-005 sits ~8 MTEB points lower (~62.0) at $0.025 per 1M tokens — that is 7x cheaper for ~88% of the quality. OpenAI text-embedding-3-large is the comfortable middle ($0.13/M, ~64.6 MTEB) with a trick most teams forget: the dimensions parameter lets you slice 3072-dim outputs down to 1024 or 512 with Matryoshka-style graceful degradation, cutting vector storage by 3-6x.

Most teams are over-paying on embedding cost AND under-spending on embedding quality at the same time. They paid for OpenAI 3-large at full 3072 dimensions (wasteful storage), they're not using a reranker (forgoing 5-8 points of quality lift), and they never ran a custom eval on their own corpus (so MTEB is the only signal they have, and MTEB is an average across 56 datasets; your legal/medical/code corpus probably ranks the models differently).

Below: a sourced comparison table of every credible embedding model in June 2026, the cost-per-quality math, why reranking changes the calculus, the migration cost reality for a 1B-token corpus, and a decision framework. Run the math for your own workload with our embedding cost calculator, and dive deeper into the head-to-head in our Cohere vs Voyage vs OpenAI embeddings comparison.

Embedding model comparison — June 2026

| Model | Cost per 1M tokens | Dimensions | Max input tokens | MTEB avg score |
| --- | --- | --- | --- | --- |
| Voyage 3 | $0.18 | 1024 | 32k | ~70.5 |
| Voyage 3 large | $0.30 | 2048 | 32k | ~71.2 |
| OpenAI text-embedding-3-large | $0.13 | 3072 (256-3072 via dimensions param) | 8k | ~64.6 |
| OpenAI text-embedding-3-small | $0.02 | 1536 | 8k | ~62.3 |
| Cohere embed-english-v4 | $0.12 | 1536 | 512 | ~63.8 |
| Cohere embed-multilingual-v4 | $0.12 | 1536 | 512 | ~64.1 |
| Google text-embedding-005 | $0.025 | 768 | 2k | ~62.0 |
| Google gemini-embedding-001 | $0.15 | 3072 | 2k | ~68.2 |
| Mistral embed | $0.10 | 1024 | 8k | ~62.5 |
| Jina v3 | $0.02 | 1024 | 8k | ~64.8 |
| Nomic Embed v2 (self-hosted) | ~$0.03/M effective | 768 | 8k | ~63.5 |

Cost sources: each provider's public pricing page as of June 20, 2026 (voyageai.com/pricing, openai.com/api/pricing, cohere.com/pricing, cloud.google.com/vertex-ai/generative-ai/pricing, mistral.ai/pricing, jina.ai/pricing). MTEB scores from huggingface.co/spaces/mteb/leaderboard snapshot June 20, 2026. MTEB benchmarks are noisy and represent an average across 56 datasets; verify on YOUR domain before committing. Nomic 'effective' cost assumes ~80% GPU utilization on a $0.40/hr T4 (~$292/month) sustaining ~5,000 tokens/sec, i.e. roughly 10B tokens/month.
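One rough way to read the table is to divide price by score. The sketch below does that for the hosted-API rows; `cost_per_point` and `rank_by_cost_per_point` are illustrative helper names, and "dollars per MTEB point" is a crude heuristic rather than a standard metric.

```python
# Price ($ per 1M tokens) and MTEB average, copied from the hosted-API rows of the table.
MODELS = {
    "Voyage 3": (0.18, 70.5),
    "Voyage 3 large": (0.30, 71.2),
    "OpenAI text-embedding-3-large": (0.13, 64.6),
    "OpenAI text-embedding-3-small": (0.02, 62.3),
    "Cohere embed-english-v4": (0.12, 63.8),
    "Cohere embed-multilingual-v4": (0.12, 64.1),
    "Google text-embedding-005": (0.025, 62.0),
    "Google gemini-embedding-001": (0.15, 68.2),
    "Mistral embed": (0.10, 62.5),
    "Jina v3": (0.02, 64.8),
}


def cost_per_point(price_per_million: float, mteb_avg: float) -> float:
    """Dollars per 1M tokens for each MTEB point (lower is better)."""
    return price_per_million / mteb_avg


def rank_by_cost_per_point(models: dict) -> list:
    """Model names sorted from cheapest to most expensive per MTEB point."""
    return sorted(models, key=lambda name: cost_per_point(*models[name]))


if __name__ == "__main__":
    for name in rank_by_cost_per_point(MODELS):
        price, score = MODELS[name]
        print(f"{name:32s} ${cost_per_point(price, score):.5f} per point")
```

The ratio ignores reranking and domain fit, so it is a first filter and not a substitute for an eval on your own corpus.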

Why embedding choice matters more than vector DB choice

Engineering teams spend weeks debating Pinecone vs Weaviate vs Qdrant vs pgvector. Embedding model choice gets 30 minutes and a default to OpenAI. This is exactly backwards. Swapping a vector DB is mechanical — export vectors, import vectors, update the client SDK, point the app at the new endpoint. A weekend of work for a small corpus, a week for a big one. The data does not change; only the storage substrate does.

Swapping an embedding model is a different beast entirely. You re-embed every document in your corpus (paying the per-token cost again), you rebuild every vector in your index, you re-evaluate retrieval quality because the new model has different semantic-space geometry, you potentially re-tune your chunking strategy because the new model has different max-input limits, and you reset your production metrics baseline. For a 10M-doc corpus averaging 800 tokens/doc (8B tokens total), re-embedding at OpenAI 3-large pricing ($0.13/M) is $1,040; at Voyage 3 pricing ($0.18/M) it's $1,440; at full Voyage 3 large pricing ($0.30/M) it's $2,400. Plus 1-2 days of engineering time for re-eval, plus the risk of regressing production quality during the cutover.

The takeaway: over a 3-year horizon, embedding model choice is far more load-bearing than vector DB choice, because a DB swap moves your existing vectors while a model swap invalidates them. Yet most teams optimize the wrong variable. Spend the engineering effort on getting embeddings right on day one, accept whatever pgvector or your existing Postgres setup gives you, and circle back to vector DB optimization only when you hit actual scale problems (>100M vectors, sub-50ms latency targets, complex hybrid-search requirements).


Voyage 3: the quality leader, with cost tradeoff

Voyage 3 (general) at $0.18/M and Voyage 3 large at $0.30/M are the consistent top scorers on MTEB across 2026, sitting at ~70.5 and ~71.2 respectively: a roughly 6-8 point lead over OpenAI, Cohere, Jina, and Google text-embedding-005, and a 2-3 point lead over Google gemini-embedding-001 (~68.2). The lead is biggest on technical, code, and scientific corpora: Voyage's training data emphasis on long-form technical content shows up materially when your corpus is API docs, research papers, internal engineering wikis, or codebase content.

The strategic context: Voyage AI was acquired by MongoDB in early 2025, and Anthropic, which does not ship an embedding model of its own, points to Voyage in its documentation as an embedding provider for Claude-based RAG systems. If you're already running a Claude-heavy stack, Voyage is the path of least resistance.

The cost tradeoff is the catch. Voyage 3 is roughly 7x more expensive per token than Google text-embedding-005, and ~9x more expensive than Jina v3 or OpenAI text-embedding-3-small. For high-volume workloads — think product search over a billion-document e-commerce catalog, or a logging-data RAG over terabytes of structured text — that cost differential dominates the annual budget. Voyage is worth the premium when: you have a niche technical corpus where MTEB-leading recall actually matters, retrieval recall is mission-critical (legal-grade search, security incident triage, medical literature lookup), or the corpus is small enough that the absolute cost is trivial (under 100M tokens annually).

Voyage is overkill when the corpus is generic English consumer content, or when you're running a reranker on top anyway (the reranker recovers most of the quality lift at a fraction of the per-document cost; more on this in the reranking section below).


OpenAI 3-large: middle ground (and the Matryoshka dimensions trick)

OpenAI text-embedding-3-large at $0.13/M with 3072 default dimensions is the comfortable middle of the market. MTEB ~64.6 puts it ~6 points below Voyage 3 but ~2.3 points above text-embedding-3-small, ~2.6 points above Google text-005, and roughly on par with Cohere embed-english-v4 (~63.8). For teams that want a known-good default without paying the Voyage premium, OpenAI 3-large is the obvious choice, and OpenAI's reliability, billing, and SDK ergonomics are best-in-class.

Most teams ship OpenAI 3-large at the default 3072 dimensions and never touch it again. This is a mistake. OpenAI 3-large was trained with Matryoshka Representation Learning (MRL), meaning the dimensions are nested — the first 1024 dimensions are themselves a meaningful embedding, just less precise than the full 3072. You request shorter dimensions via the `dimensions` parameter at embed time (`dimensions=1024`, `dimensions=512`, anything from 256 to 3072), and OpenAI returns the appropriately-truncated vector.
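The `dimensions` parameter does the shortening server-side. For vectors already stored at full length, the same effect can be approximated client-side by keeping the leading components and re-normalizing. A minimal sketch, assuming the vectors come from an MRL-trained model; `truncate_embedding` is an illustrative helper, not part of any SDK.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit L2 length.

    Only meaningful for models trained with Matryoshka Representation
    Learning, where leading dimensions form a usable embedding on their own.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]             # toy 3-dim stand-in for a 3072-dim vector
    print(truncate_embedding(full, 2))  # first two components, unit length
```

Re-normalizing matters because cosine and dot-product search assume unit-length vectors, and a truncated prefix is no longer unit length.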

Quality degrades gracefully, not catastrophically. Internal benchmarks consistently show: 1024-dim is ~95% of full-3072-dim quality, 512-dim is ~88-90% of full, 256-dim is ~80-85% of full. The storage savings are the inverse: 1024-dim cuts vector storage to 33% of 3072-dim, 512-dim cuts to 17%, 256-dim cuts to 8%. For a 10M-vector index, the difference between 3072-dim and 1024-dim is the difference between ~120 GB and ~40 GB of memory-resident vectors — and that maps directly to cheaper pgvector/Pinecone/Qdrant infrastructure.
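The storage figures follow from vector count times dimensions times bytes per value. A small sketch, assuming float32 storage (4 bytes per value) and ignoring index overhead such as HNSW graphs or metadata; `index_size_gb` is an illustrative name.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector payload in GB (10^9 bytes), excluding any index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


if __name__ == "__main__":
    for dims in (3072, 1024, 512, 256):
        size = index_size_gb(10_000_000, dims)
        share = dims / 3072
        print(f"{dims:>4} dims: {size:7.2f} GB ({share:.0%} of full size)")
```

Quantized storage (int8 or binary vectors) would shrink these numbers further, at a quality cost that should be checked against your own eval.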

The practical recipe: default to OpenAI 3-large at 1024 dimensions. You get ~95% of the quality at 33% of the storage cost. Bump to 1536 or 2048 if your retrieval evals show measurable lift; bump back down if they don't. This single parameter change is the highest-leverage embedding optimization most teams have available — and almost nobody runs it.


Cohere v4: the multilingual specialist

Cohere embed-english-v4 ($0.12/M, ~63.8 MTEB) is competitive on English-only corpora but doesn't break new ground. Cohere embed-multilingual-v4 ($0.12/M, ~64.1 MTEB English / much higher than competitors on non-English benchmarks) is where Cohere shines: it is the best-in-class option for Indic languages (Hindi, Bengali, Tamil), CJK (Chinese, Japanese, Korean), Arabic, and the long tail of languages where OpenAI and Voyage have spotty coverage.

The benchmark story: on the multilingual MIRACL evaluation, Cohere embed-multilingual-v4 leads OpenAI 3-large by 8-15 nDCG points on non-English splits. On MKQA (multilingual open-domain QA), the gap is similar. If your corpus or your user queries are meaningfully non-English — global e-commerce, international news aggregation, multilingual customer support — Cohere multilingual-v4 is the canonical choice and the cost-per-quality math is unambiguous in its favor.

The catch is the 512-token input limit, which is dramatically shorter than Voyage's 32k, OpenAI's 8k, or Jina's 8k. You'll be chunking smaller (typical: 400-token chunks with 50-token overlap rather than 1500-token chunks). This means roughly 4x more chunks per document, which inflates your vector count and storage proportionally. Embedding cost is billed per token, so it rises only by the overlap overhead: a 50-token overlap on 400-token chunks embeds about 14% more tokens than the raw text. Run the math: Cohere's $0.12/M becomes roughly $0.14/M of source text, so per-document embedding cost ends up roughly even with OpenAI 3-large once chunking overhead is factored in.
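Chunk count and billed tokens move differently under per-token pricing, which is easy to check with a sliding-window calculation. A minimal sketch, assuming fixed-size chunks with a fixed overlap; `chunk_stats` is an illustrative helper and the 6,000-token document is a made-up example.

```python
import math


def chunk_stats(doc_tokens: int, chunk_size: int, overlap: int):
    """Return (num_chunks, embedded_tokens) for one document.

    Chunks start every (chunk_size - overlap) tokens; the last chunk may be short.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_tokens <= chunk_size:
        return 1, doc_tokens
    stride = chunk_size - overlap
    num_chunks = 1 + math.ceil((doc_tokens - chunk_size) / stride)
    # Each chunk after the first re-embeds `overlap` tokens of its predecessor.
    embedded_tokens = doc_tokens + (num_chunks - 1) * overlap
    return num_chunks, embedded_tokens


if __name__ == "__main__":
    small = chunk_stats(6000, chunk_size=400, overlap=50)
    large = chunk_stats(6000, chunk_size=1500, overlap=0)
    print("400/50 chunks:", small, " 1500/0 chunks:", large)
    print(f"chunk ratio {small[0] / large[0]:.2f}x, token ratio {small[1] / large[1]:.2f}x")
```

In this example the vector count grows about 4x while billed tokens grow about 13%, so storage is the larger cost driver for short-context models.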

The practical recipe: Cohere v4 multilingual for any corpus with >20% non-English content. Cohere v4 English is fine but doesn't beat OpenAI 3-large at a similar price point, so the choice is mostly about whether you prefer Cohere's API and reranker ecosystem.


Google text-embedding-005: cheapest credible option

Google text-embedding-005 at $0.025/M is the cheapest credible embedding model in the market: 7x cheaper than Voyage 3, 5x cheaper than OpenAI 3-large, and nearly 5x cheaper than Cohere v4. MTEB ~62.0 sits roughly 8 points below Voyage and ~3 points below OpenAI 3-large, but the absolute quality is still well above the threshold where a reranker can recover the gap.

For high-volume or cost-sensitive workloads, Google text-005 is the winning play. The classic use case: an e-commerce semantic search over 500M product descriptions averaging 200 tokens each (100B tokens total). At Voyage 3 pricing, the initial embed costs $18,000. At OpenAI 3-large, $13,000. At Google text-005, $2,500 — a $15,500 savings on day one, and proportional savings on quarterly re-embeds. Pair Google text-005 with a Voyage or Cohere reranker on the top-50 retrieved candidates and the end-to-end retrieval quality typically lands within 1-2 nDCG points of Voyage-alone — for a fraction of the lifetime cost.

The other Google story: gemini-embedding-001 at $0.15/M with 3072 dimensions and MTEB ~68.2 is Google's quality play, sitting between OpenAI 3-large and Voyage 3. Generally available since mid-2025, it runs on the same Vertex AI infrastructure and integrates cleanly with Google's reranking models. If you're already on GCP, gemini-embedding-001 is a credible alternative to OpenAI 3-large with comparable economics.

Embed-then-store cost: 100M-token corpus

| Model | Embed-once $ | Re-embed yearly $ | Vector storage $/mo (1024 dim equivalent) |
| --- | --- | --- | --- |
| Voyage 3 | $18 | $18 | ~$11/mo (Pinecone p1.x1) |
| OpenAI 3-large @ 3072 dim | $13 | $13 | ~$33/mo (3x storage) |
| OpenAI 3-large @ 1024 dim (Matryoshka) | $13 | $13 | ~$11/mo |
| Google text-embedding-005 @ 768 dim | $2.50 | $2.50 | ~$8/mo (smaller vectors) |
| Jina v3 @ 1024 dim | $2 | $2 | ~$11/mo |

Embed costs are list price on the raw 100M corpus tokens; with ~50% chunk-overlap overhead (100M corpus tokens = ~150M embedded tokens), multiply them by 1.5. Storage costs estimated on Pinecone p1.x1 pricing as of June 2026 for ~100k vectors at the listed dimension count; pgvector on an existing Postgres instance is materially cheaper at this scale.


Jina v3: the under-the-radar pick

Jina v3 at $0.02/M is the dark-horse embedding model of 2026. Open-weights (the published jina-embeddings-v3 weights carry a CC BY-NC 4.0 license, so commercial self-hosting needs a separate license from Jina) and also offered as a hosted API at near-Google pricing, Jina v3 hits MTEB ~64.8: directly competitive with OpenAI 3-large at full price, and roughly 6.5x cheaper. The model supports multilingual workloads (89 languages with credible quality), long-context input (8k tokens, matching OpenAI), and late-interaction patterns (token-level embeddings for ColBERT-style retrieval) without the chunking compromise of Cohere v4.

Why isn't Jina the default? Mostly distribution: Jina AI is a smaller player than OpenAI/Google/Cohere, the ecosystem of tutorials and integrations is thinner, and enterprise procurement teams haven't heard of them. None of those are technical objections.

For cost-sensitive teams that want strong quality without the operational cost of self-hosting Nomic or BGE-large, Jina v3 hosted API is the no-brainer. For teams that DO want to self-host (and whose use fits the weights' license), you can run Jina v3 on your own GPU infra and amortize the model serving cost. The effective rate depends on sustained throughput: assuming throughput comparable to the Nomic T4 figures in the self-hosting section, a $292/month T4 kept busy at ~5,000 tokens/sec (about 13B tokens/month) works out to ~$0.02/M, the same as the API, so self-hosting only undercuts the hosted price with heavier batching, faster hardware, or very steady high volume.

Real-world deployment pattern: use Jina v3 hosted API for development and low-to-mid-volume production, and consider self-hosted Jina v3 on a single A10G or T4 GPU only once volume approaches the breakeven point (around 13-15B tokens/month against a $292/month T4). Same model, same quality, same vectors, so no re-embedding is required at the cutover.


Self-host: Nomic Embed v2, BGE-large, E5-mistral-7b

At volumes above a few billion tokens/month, self-hosting embedding models becomes cheaper than the premium APIs (against the cheapest hosted options the breakeven is closer to 13-15B tokens/month). The three credible open-weights options in June 2026: Nomic Embed v2 (text-only, ~63.5 MTEB, 768-dim, very fast: runs on a single T4 GPU at 5000+ tokens/sec); BAAI BGE-large-en-v1.5 (English-focused, ~65 MTEB, 1024-dim, mature ecosystem); E5-mistral-7b-instruct (instruction-tuned, ~67 MTEB, 4096-dim, requires an A100 or H100).

Nomic Embed v2 is the operational sweet spot for most self-hosting decisions. A single T4 GPU on AWS g4dn.xlarge ($0.40/hr ≈ $292/mo) serves roughly 5,000 tokens/sec sustained — that's 432M tokens/day, or ~13B tokens/month. At 13B tokens/month, the API equivalent at Jina v3 pricing ($0.02/M) would be $260, so the breakeven is roughly at this volume; at OpenAI 3-large pricing ($0.13/M), the API would be $1,690, so self-hosting Nomic is 6x cheaper. Above 50B tokens/month, the math gets dramatic.
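The breakeven arithmetic generalizes to any GPU price and API rate. A minimal sketch using the T4 figures above; the function names are illustrative, a 30-day month is assumed, and ops time is deliberately left out.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, matching the ~13B figure


def monthly_capacity_tokens(tokens_per_sec: float) -> float:
    """Tokens one GPU can embed in a month at 100% utilization."""
    return tokens_per_sec * SECONDS_PER_MONTH


def breakeven_tokens_per_month(gpu_cost_per_month: float, api_price_per_million: float) -> float:
    """Monthly volume at which a fixed GPU bill equals the API bill."""
    return gpu_cost_per_month / api_price_per_million * 1_000_000


if __name__ == "__main__":
    capacity = monthly_capacity_tokens(5000)
    print(f"T4 capacity: {capacity / 1e9:.2f}B tokens/month")
    for label, price in (("Jina v3", 0.02), ("OpenAI 3-large", 0.13)):
        volume = breakeven_tokens_per_month(292, price)
        print(f"Breakeven vs {label}: {volume / 1e9:.2f}B tokens/month")
```

If the breakeven volume exceeds what one GPU can serve, a second GPU is needed and the fixed cost doubles, so the comparison should be rerun at each capacity step.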

BGE-large served via vLLM on a single A10G ($0.75/hr ≈ $547/mo) hits ~2,000 tokens/sec per request with batch-32 throughput around 8,000 tokens/sec aggregate. That's ~21B tokens/month potential. For a serious production embedding workload at this scale, the engineering investment to run vLLM + a load balancer + a queue (typically 1-2 weeks of work) pays back in months, not years.

The hidden cost of self-hosting: ops time. You own GPU instance management, model updates, observability, failure recovery, and the on-call rotation. Most teams under 50 engineers should stay on hosted APIs until they hit a hard cost ceiling. The right answer for a 5-person startup is OpenAI 3-large or Jina v3 hosted; the right answer for a 500-person company embedding 100B tokens/month is self-hosted Nomic or BGE.


Latency: not all embedding APIs are equal

Embedding latency matters when you're doing query-time embedding for live user search. For batch/offline embedding (your one-time corpus embed), latency is irrelevant — only throughput matters, and all major providers can saturate ~10k tokens/sec per API key with parallel requests. For online query embedding, the p50 numbers as of June 2026: Voyage p50 ~80ms per single-query request, OpenAI ~120ms, Cohere ~150ms, Google ~60ms (when called from a same-region Vertex AI endpoint), Jina API ~100ms, self-hosted Nomic on a warm T4 ~20ms.

The p99 picture matters more for production systems. Voyage p99 around 200-300ms is steady. OpenAI p99 frequently spikes to 500-1000ms during peak hours and outage windows — this is a chronic complaint and the main reason latency-sensitive teams migrate off. Cohere p99 is similar to OpenAI. Google p99 from a same-region endpoint is the most reliable at 100-150ms because Vertex AI gives you region-local serving.
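Published latency numbers vary by region and time of day, so it is worth measuring p50 and p99 from your own environment. A minimal sketch using nearest-rank percentiles; `embed_query` is a placeholder for whatever client call you use, not a real SDK function.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency_ms(embed_query, queries):
    """Time one call per query and return the latencies in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        embed_query(query)  # placeholder: swap in your provider's embed call
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


if __name__ == "__main__":
    # Stand-in workload so the script runs without network access.
    samples = measure_latency_ms(lambda q: sum(range(10_000)), ["q"] * 200)
    print(f"p50={percentile(samples, 50):.3f} ms  p99={percentile(samples, 99):.3f} ms")
```

A few hundred samples are enough for p50; p99 needs more, and should be collected across peak hours to catch the spikes described above.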

If query latency is in your critical path (live semantic search where >300ms total response time is unacceptable), Google text-005 from same-region Vertex AI is the lowest-friction win, and the cost story works out too. If you've already adopted OpenAI 3-large, one option is a second index embedded with a self-hosted Nomic v2 instance, with latency-sensitive queries routed to it. Query and document vectors must come from the same model, so the Nomic path needs its own copy of the corpus embeddings; you're not gaining quality, but you're getting sub-50ms query embed latency.


Domain-specific quality: MTEB lies on YOUR corpus

Every embedding model comparison ranks models on MTEB. MTEB is the Massive Text Embedding Benchmark — 56 datasets across classification, clustering, pair classification, reranking, retrieval, STS, summarization. The headline 'MTEB avg score' is the average across all 56. This average is useful for high-level model selection but is a deeply misleading proxy for performance on YOUR specific corpus.

Real-world example: a legal-tech startup ran a head-to-head eval of Voyage 3, OpenAI 3-large, and Cohere v4 on their corpus of 800k contracts and case-law documents. MTEB rankings predicted Voyage > OpenAI > Cohere by 5-8 points. On their actual eval (100 hand-judged queries scored by precision@10), Cohere v4 English actually won by 3 points, Voyage was middle, OpenAI was bottom. Reason: their corpus has long passages of formal legal language with heavy citation density, and Cohere's training data over-indexed on formal text patterns that benefit this style.

Another real example: a code-search startup found Voyage 3 dominated their 50-query manually-judged eval on a corpus of Python and TypeScript snippets (because Voyage trained extensively on code), while their MTEB-predicted second-place choice (Jina v3) was actually third behind OpenAI 3-large.

The rule: every team committing to an embedding model should run a custom eval first. Minimum viable eval is 100 queries you genuinely care about, with the top-10 results from each candidate model labeled relevant/irrelevant by a human (or a strong LLM judge prompted with your relevance criteria). Compute precision@10 and nDCG@10 per model. Spend ~1 engineer-day on the eval; save weeks of regret. MTEB tells you which models are in the top tier; YOUR eval tells you which top-tier model wins on YOUR data.
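Both metrics are short enough to compute by hand. A minimal sketch for binary relevant/irrelevant labels; note the assumption that the nDCG ideal ranking is built only from the labeled results for each query, which is a common simplification when the full set of relevant documents is unknown.

```python
import math


def precision_at_k(relevance, k=10):
    """Fraction of the top-k results labeled relevant (labels are 0 or 1)."""
    return sum(relevance[:k]) / k


def dcg_at_k(relevance, k=10):
    """Discounted cumulative gain: relevant hits count less at lower ranks."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))


def ndcg_at_k(relevance, k=10):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Labels for one query's top-10, in ranked order (1 = relevant).
    labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
    print(f"precision@10={precision_at_k(labels):.2f}  nDCG@10={ndcg_at_k(labels):.3f}")
```

Average each metric over all 100 queries per model, and keep the per-query scores so you can inspect where the models disagree.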


Reranking changes the embedding math entirely

A reranker is a second-stage model that takes the top-K candidates from your initial embedding-based retrieval and re-scores them using a cross-encoder architecture (query and candidate are fed together into the model, allowing fine-grained attention between them). Cross-encoders are much more accurate than bi-encoders (embedding models) on the relevance judgment task — but they're too slow to run over the full corpus, so they're applied only to the top-50 or top-100 results.

The economics: Voyage rerank-2.5 at $0.05/1k requests and Cohere rerank-3.5 at $0.05/1k requests are absurdly cheap relative to embeddings. You apply them to ~50 candidates per query, so per-query reranking cost is roughly $0.00005, which is negligible. The quality lift is consistently +5-8 nDCG points on retrieval benchmarks, which is on the order of the entire ~8.5-point MTEB gap between Google text-005 and Voyage 3 (the two metrics are not directly interchangeable, so treat this as a rough comparison).

The practical implication: cheap embeddings + a reranker often beats expensive embeddings alone. A typical production stack in June 2026: Google text-embedding-005 or Jina v3 for the base embed of the whole corpus (cheap), retrieve top-50 by cosine similarity, then run Voyage rerank-2.5 or Cohere rerank-3.5 on those 50 to produce the final top-10. Net cost-per-query is dominated by the LLM call you make with the retrieved context, not the embedding or reranking. Net retrieval quality typically matches or exceeds Voyage 3 embeddings without reranking.
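The two-stage pattern can be sketched in a few lines. This is an in-memory illustration, not a production design: a real system would use a vector index for stage one, and `rerank_score` is a hypothetical callable standing in for a hosted reranker's relevance score.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[int], float],  # hypothetical: doc index -> reranker score
    first_stage_k: int = 50,
    final_k: int = 10,
) -> List[int]:
    """Return doc indices: cosine top-k first, then reordered by the reranker."""
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:first_stage_k]  # only these reach the reranker
    return sorted(candidates, key=rerank_score, reverse=True)[:final_k]


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    fake_scores = {0: 0.1, 1: 0.9, 2: 1.0}
    print(retrieve_then_rerank([1.0, 0.0], docs, fake_scores.get, first_stage_k=2, final_k=2))
```

A document that misses the first-stage cut never reaches the reranker, which is why the base embedding still needs decent recall at top-50 even when its top-10 precision is mediocre.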

This is why the 'just use the most expensive embedding model' shortcut is so often wrong. You're paying a 7x cost premium on every embedded token (the corpus is millions of times bigger than the per-query reranking workload), in exchange for quality lift you could have gotten more cheaply by adding a $0.05/1k reranker. The cost-per-quality math is dramatic: reranking is roughly 100-1000x more cost-efficient at improving end-to-end retrieval than upgrading the base embedding model.


Migration cost reality: what re-embedding actually costs

Concrete numbers for a 1B-token corpus migration. Re-embed at OpenAI text-embedding-3-large: $130. Re-embed at Voyage 3: $180. Re-embed at Voyage 3 large: $300. Re-embed at Google text-005: $25. Re-embed at Jina v3: $20. These are the raw API costs; they're surprisingly modest because embedding APIs are cheap relative to inference APIs. For most teams, the API cost of a migration is not the bottleneck.

What IS the bottleneck: engineering time. A serious embedding migration involves (1) running the new embed across the corpus and validating the resulting vector counts match the old index (1-2 days), (2) rebuilding the vector index in your DB (Pinecone reindex, pgvector ANN rebuild, Qdrant migration — typically 4-24 hours of wall clock for a 10M-vector index), (3) re-running your retrieval eval suite to confirm quality didn't regress in unexpected ways (1-3 days), (4) running a shadow-traffic test where new and old retrieval are compared on production queries (1 week of soak time), (5) the cutover itself with rollback plan ready (half a day).

Budget 1 engineer-week minimum for a routine embedding migration; 2-3 weeks for a major one involving dimension changes or chunking strategy adjustments. At a $150k/yr loaded eng cost (~$3000/week), the engineering cost of migration is 10-30x the API cost. This is the primary friction that keeps teams on suboptimal embedding choices — the API math says 'switch to Jina v3 and save $10k/year', but the engineering math says 'that costs $6k in eng time + risk', so the migration sits on the backlog forever.
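The API-versus-engineering split is easy to recompute for your own numbers. A minimal sketch using the figures from this section; `migration_cost` is an illustrative helper and the $150k loaded cost is the article's assumption.

```python
def migration_cost(corpus_tokens, price_per_million, eng_weeks, loaded_cost_per_year=150_000.0):
    """Return (api_cost, eng_cost) in dollars for one embedding migration."""
    api_cost = corpus_tokens / 1_000_000 * price_per_million
    eng_cost = eng_weeks * loaded_cost_per_year / 52
    return api_cost, eng_cost


if __name__ == "__main__":
    # 1B-token corpus re-embedded with OpenAI 3-large, one engineer-week of work.
    api, eng = migration_cost(1_000_000_000, 0.13, eng_weeks=1)
    print(f"API ${api:,.0f}  engineering ${eng:,.0f}  ratio {eng / api:.0f}x")
```

With cheaper target models the ratio gets more lopsided, since the engineering cost stays fixed while the API bill shrinks.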

The strategic implication: get the embedding choice right the first time, and re-evaluate annually (not quarterly). The market moves fast enough that annual re-eval is correct; quarterly is too aggressive given migration cost. Build your codebase to make embedding model swap easy (abstract the embed call behind a single interface, version your vector indexes by model so old and new can coexist during cutover), and the marginal cost of future migrations drops dramatically.
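One way to keep future swaps cheap is a single embedding interface plus index names that encode the model and dimension. A minimal sketch; `Embedder`, `index_name`, and `FakeEmbedder` are hypothetical names for illustration, and a real implementation would wrap a provider client behind the same interface.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


class Embedder(Protocol):
    """The only embedding surface the rest of the codebase should depend on."""

    model_id: str
    dims: int

    def embed(self, texts: Sequence[str]) -> List[List[float]]: ...


def index_name(base: str, embedder: Embedder) -> str:
    """Version the index by model and dimension so old and new can coexist."""
    safe_id = embedder.model_id.replace("/", "-").replace(".", "-")
    return f"{base}__{safe_id}__{embedder.dims}d"


@dataclass
class FakeEmbedder:
    """Stand-in for a real provider client; returns zero vectors."""

    model_id: str
    dims: int

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        return [[0.0] * self.dims for _ in texts]


if __name__ == "__main__":
    old = FakeEmbedder("text-embedding-3-large", 1024)
    new = FakeEmbedder("jina-v3", 1024)
    print(index_name("docs", old))
    print(index_name("docs", new))
```

During a cutover, both indexes exist side by side and a config flag selects which embedder and index pair serves queries, which also gives you a one-line rollback.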

Choose-and-deploy embeddings — 7 steps

  1. Run a custom 100-query eval on your corpus across 4-5 candidates (don't trust MTEB)

    Pick 100 queries you genuinely care about: pulled from real user logs if you have them, synthesized from domain expertise if you don't. Run each query through each candidate embedding model + your retrieval pipeline. Have a human (or a strong LLM judge with your relevance criteria) label top-10 results as relevant or not. Compute precision@10 and nDCG@10 per model. Budget 1 engineer-day. This eval is the single highest-ROI thing you can do before committing to a model.

  2. Pick the cheapest model within 3 points of the top scorer on YOUR eval

    If Voyage 3 scores 78 on your eval, OpenAI 3-large scores 76, and Google text-005 scores 75, pick Google. The 3-point gap is roughly within reranker recovery range, and the 7x cost savings compound over time. Don't optimize for MTEB; optimize for your retrieval-precision metric net of reranking.

  3. Test a reranker on top-50 with a cheap base embedding before going expensive

    Add Voyage rerank-2.5 or Cohere rerank-3.5 to your stack at $0.05/1k requests. Apply it only to the top-50 retrieved candidates per query. Re-run your eval. If reranking closes the gap to the expensive embedding model (it usually does), ship the cheap embed + reranker combo and pocket the savings.

  4. Use OpenAI Matryoshka dimensions to cut storage by two-thirds with minimal quality loss

    If you committed to OpenAI 3-large, request 1024 dimensions instead of the default 3072 via the `dimensions` parameter. You'll get ~95% of the quality at 33% of the vector storage cost. This is the highest-leverage single change available to OpenAI embedding users. Bump back to 2048 only if your eval shows measurable lift.

  5. Budget the re-embed cost annually as base models improve

    Embedding models are improving fast: the 2024-to-2026 jump in MTEB across the board was ~5-8 points. Plan a re-embed budget annually (it's cheap relative to engineering time) and re-run your custom eval each year to check whether the current incumbent is still the right choice.

  6. Instrument retrieval-precision-at-10 as a production metric

    Sample 50-100 queries per week from production traffic, have an LLM judge score top-10 results for relevance, log the metric over time. This catches model drift, chunking regressions, and surprise quality changes from provider-side model updates. Voyage, OpenAI, and Google all occasionally update underlying models without major version bumps.

  7. Watch the landscape quarterly, but re-run the full eval only annually or when the frontier shifts

    Read the MTEB leaderboard quarterly. Watch for new model releases from Voyage, Cohere, Jina, Nomic, Google. Don't migrate impulsively, but DO keep your decision framework fresh. When a model materially shifts the cost-per-quality frontier (Jina v3 did this in early 2026; Google text-005 did it in late 2025), re-run your eval and consider migration.
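The selection rule from step 2 can be written down so it is applied the same way at every re-evaluation. A minimal sketch; `pick_model` is an illustrative helper, and the scores in the example are made-up eval results, not benchmark data.

```python
def pick_model(results, tolerance=3.0):
    """Cheapest model whose eval score is within `tolerance` of the best.

    results maps model name -> (eval_score, price_per_million_tokens).
    Ties on price go to the higher-scoring model.
    """
    best_score = max(score for score, _ in results.values())
    eligible = {
        name: (score, price)
        for name, (score, price) in results.items()
        if best_score - score <= tolerance
    }
    return min(eligible, key=lambda name: (eligible[name][1], -eligible[name][0]))


if __name__ == "__main__":
    # Hypothetical scores from a custom 100-query eval.
    results = {
        "Voyage 3": (78.0, 0.18),
        "OpenAI 3-large": (76.0, 0.13),
        "Google text-005": (75.0, 0.025),
    }
    print(pick_model(results))
```

Run the rule on scores measured with the reranker in place, since that is the pipeline you will actually ship.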

Frequently Asked Questions

Which embedding model has the best price-to-quality in 2026?

On pure cost-per-MTEB-point, Jina v3 ($0.02/M, ~64.8 MTEB) is the leader. For teams already in GCP, Google text-embedding-005 ($0.025/M, ~62.0 MTEB) is the cheapest credible option. For top-end quality on technical/code corpora, Voyage 3 ($0.18/M, ~70.5 MTEB) leads but the 7x cost premium is hard to justify without a reranker comparison.

Is Voyage 3 worth 7x the cost of Google text-005?

Sometimes. Voyage 3 leads MTEB by ~8 points, which is materially better on technical/scientific/code corpora and matters when retrieval recall is mission-critical (legal-grade search, medical literature, security incident triage). For high-volume general-purpose retrieval where you'll layer in a reranker anyway, Google text-005 + Voyage rerank-2.5 typically closes the quality gap at 1/5 the cost. Always run a custom eval before committing — Voyage's MTEB lead does not always translate to lead on YOUR corpus.

Should I use OpenAI 3-large or 3-small?

3-large at 1024 dimensions (via the Matryoshka `dimensions` parameter) is the right default for most teams: 3-large leads 3-small by ~2.3 MTEB points at full dimensions, costs ~6.5x more per token, and at 1024 dimensions needs less storage than 3-small's 1536. 3-small at 1536 dimensions is fine for high-volume cost-sensitive workloads where the quality difference doesn't justify the cost. Skip both if you can use Jina v3, which scores marginally higher than 3-large on MTEB at roughly 1/6 the price.

What's a reranker and when do I need one?

A reranker is a second-stage model that re-scores the top-K candidates from your embedding search using a cross-encoder architecture (more accurate than bi-encoder embeddings but too slow to run on the full corpus). Voyage rerank-2.5 and Cohere rerank-3.5 cost $0.05/1k requests applied to ~50 candidates per query — negligible cost, +5-8 nDCG point lift on typical retrieval benchmarks. You need one if (a) you're using a cheap embedding model and want to recover quality, or (b) end-to-end retrieval precision matters more than per-query latency.

Can I reduce embedding dimensions to save storage?

Yes, for any model trained with Matryoshka Representation Learning (MRL). OpenAI text-embedding-3-large supports `dimensions=256..3072`; Cohere v4 supports dimension truncation; Voyage 3 large supports 1024/1536/2048 variants. Quality degrades gracefully — typically ~95% of full-dimension quality at 33% of dimensions. Storage cost scales linearly with dimensions, so 1024-dim vs 3072-dim is a 3x storage win. Models NOT trained with MRL (older Cohere, Google, Mistral) cannot be safely truncated.

When does self-hosting embeddings make sense?

Above roughly 2B tokens/month of sustained throughput, self-hosting Nomic Embed v2 or BGE-large on a single GPU becomes cheaper than premium APIs such as OpenAI 3-large; against the cheapest APIs the breakeven is around 13-15B tokens/month. A single T4 GPU at $0.40/hr handles ~13B tokens/month of Nomic v2 throughput: roughly $292/month for the GPU vs ~$260 on Jina API or ~$1,690 on OpenAI API at equivalent volume. Below ~2B tokens/month, the engineering overhead of running your own GPU infra (deployment, observability, on-call) outweighs the API savings.

How do I evaluate embeddings on my domain?

Minimum viable eval: 100 queries you care about, scored by precision@10 and nDCG@10. Source the queries from real user logs if you have them, or synthesize from domain expertise. For each candidate model, run the queries through your full retrieval pipeline (embed + index + retrieve top-10). Label results relevant/irrelevant — humans for ~30 queries, then a strong LLM judge (Claude Opus 4.8 or GPT-5.5 with explicit relevance criteria) for the remaining ~70. Compare scores across models, pick the cheapest within 3 points of the top scorer. Budget 1 engineer-day.
