Skip to contentNew: Does ChatGPT recommend your brand? Free 60-second AI visibility check →
By The DDH Team · Digital Dashboard Hub

RAG Cost Per Query (2026): The Full Stack Breakdown

By The DDH Team at Digital Dashboard HubUpdated

Stop writing AI prompts from scratch.

Tell us your business + your task + your model. We write the prompt — perfectly tuned for ChatGPT, Claude, Grok, Gemini, Midjourney, or any model. Plus 500+ pre-built prompts in your library.

14 days, no card. Cancel in 2 clicks.

A single RAG query touches four priced services in sequence: the embedding model (to vectorize the user's question), the vector database (to retrieve relevant chunks), an optional reranker (to reorder retrieved results by relevance), and the LLM (to generate a grounded response from the retrieved context). Teams building RAG systems for the first time almost always underestimate the LLM layer and overestimate the retrieval layer. The LLM call typically accounts for 85-95% of total per-query cost.

As of June 2026, a typical production RAG query costs $0.015–$0.025 end-to-end at modest context lengths (3,000 input tokens to the LLM, 500 output tokens). The breakdown: ~$0.000001–0.000009 embedding, ~$0.0000083 vector read (Pinecone Serverless), ~$0.001 reranking (optional), and $0.013–0.021 LLM generation. At 1M queries/month, that is a $15,000–25,000 monthly bill — almost entirely driven by the LLM.

This page covers the query-side cost stack. For the upstream cost of building your vector index — what you paid to embed your corpus and store the vectors — see the vector DB cost calculator and the embeddings cost calculator. For the embedding model comparison that affects both corpus indexing and query-side embedding cost, see Cohere vs OpenAI embedding cost. For a worked RAG architecture guide, see our RAG architecture decision tree.

Digital Dashboard Hub

Writing good prompts for ONE AI is hard. Writing them for GPT-5, Claude, Gemini, Perplexity, Midjourney and 6 more is a full-time job. DDH's AI Prompt Builder writes once, runs everywhere — locked to your niche, voice, and brand tone.

Free 14 days, no card.

Cost per RAG query component — June 2026

Feature
Component
Typical cost per query
Provider example
Notes
Query embedding$0.000001–$0.000009OpenAI text-embedding-3-small @ $0.02/1M; Voyage 3-large @ $0.18/1M~50 tokens per query; negligible vs LLM layer
Vector DB read$0.0000083–$0.00005Pinecone Serverless @ $8.25/1M readsWeaviate/Qdrant cluster cost is amortized, not per-query
Reranker (optional)$0.001Cohere Rerank @ $1/1,000 queriesBoosts recall precision; cost is 50-100x the embedding step
LLM generation (dominates)$0.013–$0.070Sonnet 4.6 @ $3/$15 per M in/out; gpt-4.1 @ $2/$8 per M in/out3,000 input + 500 output tokens typical; context length drives the bill
Total (no reranker)$0.013–$0.021Sonnet 4.6 with Pinecone Serverless + OpenAI small embeddingLLM is 85-95% of total cost
Total (with Cohere reranker)$0.014–$0.022Add $0.001/query on top of aboveReranker raises total ~5-7%

Sources as of June 2026: OpenAI embeddings pricing (developers.openai.com/api/docs/pricing — text-embedding-3-small $0.02/1M, text-embedding-3-large $0.13/1M); Voyage AI embeddings (docs.voyageai.com/docs/pricing — voyage-3-large $0.18/1M); Pinecone Serverless reads ($8.25/1M read units, pinecone.io/pricing); Cohere Rerank pricing (cohere.com/pricing — $1/1,000 queries for Rerank v3 on production tier); Anthropic Claude Sonnet 4.6 pricing ($3/1M input, $15/1M output — verify at anthropic.com/pricing as model pricing changes frequently); gpt-4.1 pricing ($2/1M input, $8/1M output — verify at openai.com/pricing). Token counts used: 50-token query embedding, 3,000-token LLM input (query + retrieved context), 500-token LLM output. Actual costs vary with context length and model selection.

The RAG query cost formula

Every RAG query runs four operations in sequence. Here is the formula with each layer isolated:

``` per_query_cost = # Layer 1: embed the user query (query_tokens / 1_000_000) × embed_$/M # Layer 2: vector database read + vector_read_cost_per_query # Layer 3: reranker (optional) + (use_reranker ? rerank_$/query : 0) # Layer 4: LLM generation (this dominates) + (llm_input_tokens / 1_000_000) × llm_input_$/M + (llm_output_tokens / 1_000_000) × llm_output_$/M ```

The LLM input token count is the sum of: the system prompt (shared across queries), the user's question, and the retrieved context chunks. This is the key lever. A system prompt of 800 tokens + a 100-token question + 5 chunks of 400 tokens each = 2,900 input tokens. At Sonnet 4.6's $3/1M input rate, that is $0.0087 in input tokens alone — before output. Add 500 output tokens at $15/1M = $0.0075. Total LLM: $0.0162 per query.

The number of retrieved chunks is the most controllable cost lever after model selection. Going from top-10 to top-5 chunks cuts context by ~40% on a typical RAG, reducing the LLM input cost proportionally. Measure retrieval precision to find the minimum chunk count that maintains answer quality.


Worked example 1: 1,000 queries/month — prototype or internal tool

At 1,000 queries/month, the bill is negligible. This is a solo developer's internal doc search or a team knowledge base with light usage.

**Query embedding (OpenAI text-embedding-3-small, 50 tokens/query):** 1,000 × 50 tokens = 50,000 tokens = 0.05M. 0.05 × $0.02 = **$0.001/month**.

**Vector DB read (Pinecone Serverless):** 1,000 × $8.25/1M = **$0.00825/month**.

**LLM (Sonnet 4.6, 3,000 input + 500 output tokens):** Input: 1,000 × 3,000 / 1M × $3 = $9. Output: 1,000 × 500 / 1M × $15 = $7.50. LLM total: **$16.50/month**.

**Total:** ~$16.51/month. The LLM layer is 99.9% of the bill. At this scale, model selection is the only cost decision worth making.

**Cheaper alternative:** Switch to Claude Haiku 3.5 ($0.80/$4 per M in/out): Input: $2.40, Output: $2.00. Total LLM: $4.40. Full query cost: **$4.41/month**. At 1,000 queries/month, Haiku is often sufficient for retrieval-augmented question answering with clean retrieved context.


Worked example 2: 100,000 queries/month — production SaaS feature

100,000 queries/month is a live production RAG feature in a B2B SaaS product — a documentation assistant, a support-ticket deflection tool, a contract review aid.

**Query embedding (OpenAI text-embedding-3-small):** 100,000 × 50 / 1M × $0.02 = **$0.10/month**.

**Vector DB read (Pinecone Serverless):** 100,000 × $8.25/1M = **$0.83/month**.

**Reranker (Cohere Rerank, optional):** 100,000 × $1/1,000 = **$100/month**. Note: the reranker is now the second largest cost component at this volume — larger than the vector DB and embedding combined. Only include it if it measurably improves answer quality on your eval.

**LLM (Sonnet 4.6, 3,000 in + 500 out):** Input: 100,000 × 3,000 / 1M × $3 = $900. Output: 100,000 × 500 / 1M × $15 = $750. LLM total: **$1,650/month**.

**Total (with reranker):** $0.10 + $0.83 + $100 + $1,650 = **$1,750.93/month** (~$0.0175/query).

**Total (without reranker):** $0.10 + $0.83 + $1,650 = **$1,650.93/month** (~$0.0165/query).

At 100K queries/month, the LLM is still 94% of the bill. The Cohere reranker adds 6% cost for its quality lift — worth benchmarking against your eval before including it in production.


Worked example 3: 1,000,000 queries/month — high-volume production

1M queries/month is an enterprise-scale RAG deployment — a customer-facing AI assistant, a large internal knowledge management tool, a high-volume document processing pipeline.

**Query embedding (OpenAI text-embedding-3-small):** 1M × 50 / 1M × $0.02 = **$1.00/month**.

**Vector DB read (Pinecone Serverless):** 1M × $8.25/1M = **$8.25/month**.

**Reranker (Cohere Rerank):** 1M × $1/1,000 = **$1,000/month**.

**LLM (Sonnet 4.6, 3,000 in + 500 out):** Input: 1M × 3,000 / 1M × $3 = $9,000. Output: 1M × 500 / 1M × $15 = $7,500. LLM total: **$16,500/month**.

**Total (with reranker): $17,509/month** (~$0.0175/query).

**Total (without reranker): $16,509/month** (~$0.0165/query).

At this scale, the LLM cost is the only optimization lever that matters. Three paths to reduce it: (1) prompt caching for the shared system prompt and static context — cuts input cost by 75-90% on the cached portion; (2) switching to a cheaper model tier (gpt-4.1-mini at $0.40/$1.60 per M vs Sonnet 4.6 at $3/$15); (3) reducing retrieved context length from top-10 to top-5 chunks. Each of these is independent and compoundable.


Prompt caching: the 60-80% bill reduction

Prompt caching is the highest-leverage RAG cost optimization available in 2026. Both Anthropic and OpenAI offer it; the mechanics differ slightly.

**Anthropic Claude (prompt cache):** Cache write: 1.25x the standard input price. Cache read: 0.10x the standard input price — a 90% discount. If your system prompt + any static context totals 1,500 tokens and is shared across all queries, the first query writes it to cache at 1.25x; every subsequent query reads it at 0.10x.

``` Without caching (Sonnet 4.6, 3,000 input tokens per query at $3/1M): 1M queries × 3,000 tokens = 3B input tokens × $3/1M = $9,000/month With caching (1,500 tokens cached, 1,500 tokens uncached): Cache writes (first hit per cache TTL): ~$1,687 (1.25x rate, amortized) Cache reads: 1M queries × 1,500 cached tokens × $0.30/1M = $450 Uncached: 1M queries × 1,500 tokens × $3/1M = $4,500 Total input: ~$6,637 — 26% cheaper just from caching the system prompt. ```

If you can cache more aggressively — a large static knowledge base preamble of 4,000 tokens included in every request — the savings compound. At 4,000 tokens cached per 5,000-token prompt (80% cached): cache reads = 1M × 4,000 × $0.30/1M = $1,200; uncached = 1M × 1,000 × $3/1M = $3,000; total input = $4,200 vs $15,000 without caching — a 72% input cost reduction.

**OpenAI (automatic prompt caching):** OpenAI applies automatic prompt caching to the longest common prefix of requests. The cached portion is billed at 50% of the standard input rate (versus Anthropic's 10%). Less aggressive but zero configuration needed — it applies automatically to requests that share a common leading context.

Caching is the single most impactful RAG cost optimization. If your system prompt is more than 1,000 tokens, enable prompt caching today. See our Claude API cost calculator for the caching math on other Claude models.


Context length is the hidden cost multiplier

Teams routinely over-retrieve. A RAG system configured to return top-10 chunks of 400 tokens each is injecting 4,000 tokens of context per query. Drop that to top-5 and you cut context injection by half. At Sonnet 4.6's $3/1M, the input cost difference is $0.006/query — $6,000/month at 1M queries. That is a line-item saving worth one benchmark run.

``` Context injection cost by chunk configuration (Sonnet 4.6, $3/1M input): top-3 × 400 tokens = 1,200 context tokens → $0.0036/query top-5 × 400 tokens = 2,000 context tokens → $0.0060/query top-10 × 400 tokens = 4,000 context tokens → $0.0120/query top-20 × 400 tokens = 8,000 context tokens → $0.0240/query ```

The output token count is often under-estimated. An AI assistant that writes comprehensive 800-token answers costs 60% more in output than one giving 500-token answers. On Sonnet 4.6 at $15/1M output, the difference is $0.0045/query — $4,500/month at 1M queries. Add system-level output constraints (`max_tokens`, response format guidance) to control this.

For a worked guide to minimizing context length without degrading answer quality, see our RAG architecture decision tree.


Model selection: cost vs quality tradeoffs in 2026

The LLM model choice drives more of the RAG query cost than any other single decision. The spread between cheapest and most expensive tier is 100x:

**Budget tier** — Claude Haiku 3.5 ($0.80/$4 per M in/out) or gpt-4.1-mini ($0.40/$1.60 per M in/out). At 3,000 in + 500 out tokens: Haiku = $0.0024 + $0.002 = $0.0044/query. gpt-4.1-mini = $0.0012 + $0.0008 = $0.002/query. Use for: simple factual Q&A on clean structured context, support ticket deflection, FAQ retrieval where the answer is a direct lift from retrieved text.

**Mid tier** — Claude Sonnet 4.6 ($3/$15 per M) or gpt-4.1 ($2/$8 per M). At 3,000 in + 500 out: Sonnet = $0.009 + $0.0075 = $0.0165/query. gpt-4.1 = $0.006 + $0.004 = $0.010/query. Use for: multi-step reasoning over retrieved context, synthesis across multiple chunks, nuanced answer generation where hallucination risk is significant.

**Premium tier** — Claude Opus ($15/$75 per M) or equivalent. At 3,000 in + 500 out: $0.045 + $0.0375 = $0.0825/query. Use only when the use case demands it: complex legal/medical reasoning, multi-document synthesis in high-stakes decisions. At 1M queries/month this is an $82,500/month bill — typically reserved for low-volume high-stakes queries, not bulk workloads.

The production pattern for high-volume RAG: route simple queries (keyword-answerable, single-chunk retrieval) to the budget tier; route complex queries (multi-hop, ambiguous, cross-chunk synthesis) to the mid tier. A 70/30 split between Haiku and Sonnet cuts the LLM cost by ~50% versus all-Sonnet, with minimal quality regression on the simple-query segment.

Verify all model prices at anthropic.com/pricing and openai.com/pricing before finalizing any budget — both providers adjust pricing with new model generations.


The reranker decision: $0.001/query worth it?

A reranker takes the top-N retrieved chunks from vector search and scores them by semantic relevance to the specific query before passing to the LLM. Cohere Rerank v3 is $1/1,000 queries on the production tier = $0.001/query.

The business case: if your vector search returns top-10 chunks but only 3 are actually relevant, the LLM is spending tokens on 7 irrelevant chunks. A good reranker filters those out, cutting context length (and LLM cost) while improving answer precision. The reranker earns its $0.001 if it reduces average chunk count from 10 to 5 at 3,000 total context tokens — because the reduction saves $0.006 at Sonnet 4.6 rates, netting a $0.005 savings per query.

When reranking is worth it: high-recall, low-precision retrieval (dense vector search with many near-miss chunks); long context windows that are expensive to fill; use cases where answer precision is measured (RAG eval scores, user satisfaction CSAT, support deflection accuracy).

When reranking is not worth it: very clean, narrow corpora where vector search already returns highly precise results; budget-tier LLM usage where the per-query LLM cost is already $0.002-0.004 and the $0.001 reranker fee is a 25-50% surcharge; query volumes above 100K/month where the reranker bill exceeds $100/month and a retrieval precision audit might yield the same gains for free.

See the Pinecone vs Weaviate vs Qdrant comparison for vector search precision benchmarks by provider that inform the reranker-vs-no-reranker tradeoff.


At 1M queries per month: the full optimization roadmap

Baseline bill at 1M queries/month (Sonnet 4.6, top-10 chunks at 400 tokens each, no caching, no reranker):

``` Embedding: $1/month (negligible) Vector DB: $8/month (negligible) LLM input: 1M × 4,100 tokens × $3/1M = $12,300/month LLM output: 1M × 500 tokens × $15/1M = $7,500/month Total: ~$19,800/month ```

Optimization 1 — cut to top-5 chunks: LLM input drops to 2,100 tokens. Input = $6,300. Saves **$6,000/month**.

Optimization 2 — enable prompt caching (1,000-token system prompt): cache reads at $0.30/1M vs $3/1M on the shared portion. Saves ~$1,800/month on the system prompt tokens. Saves **~$1,800/month**.

Optimization 3 — query routing: send 60% of queries to Haiku 3.5 ($0.80/$4 per M). Haiku 60%: 600K × 2,100 in / 1M × $0.80 = $1,008; 600K × 500 out / 1M × $4 = $1,200. Sonnet 40%: 400K × 2,100 in / 1M × $3 = $2,520; 400K × 500 out / 1M × $15 = $3,000. Total LLM after routing: $7,728 vs $13,800. Saves **~$6,072/month**.

Combined after all three optimizations: ~**$5,736/month** vs original $19,800 — a **71% cost reduction** with no model degradation on the simple-query segment and improved precision on the complex-query segment.

The implementation order: prompt caching first (zero-code change on Anthropic, configuration only), then chunk reduction (benchmark retrieval quality before cutting), then query routing (requires classification layer, most engineering effort but highest dollar savings).

How to estimate your RAG query cost in 5 steps

  1. 1

    Count your monthly query volume

    Every user interaction that triggers retrieval is one RAG query. 10,000 active users at 2 queries/day = 600,000 queries/month. This number drives everything — start here before touching any model or provider decision.

  2. 2

    Measure your average context length

    Add up: system prompt tokens + user query tokens + retrieved chunk tokens (N chunks × average chunk size). This is your LLM input token count per query. In most RAG systems this is 2,000-6,000 tokens. Every 1,000 tokens at Sonnet 4.6 input rate = $3/1M = $0.003/query = $3,000/month at 1M queries.

  3. 3

    Price the LLM layer first

    LLM cost = (input_tokens / 1M × input_$/M) + (output_tokens / 1M × output_$/M). This is 85-95% of your total RAG bill. Pick the cheapest model that hits your quality bar on a 50-query held-out eval before assuming you need the premium tier.

  4. 4

    Add vector DB and embedding costs

    Query embedding: query_tokens × monthly_queries / 1M × embed_$/M. Typically under $2/month at most scales. Vector DB reads: depends on provider — see the vector DB cost calculator. Usually 1-5% of total cost.

  5. 5

    Apply prompt caching and measure savings

    Enable prompt caching on your LLM provider. Anthropic cache reads are 0.10x the standard input price — a 90% discount on cached tokens. If your system prompt is 1,000 tokens and you run 1M queries/month, caching saves ~$2,700/month at Sonnet 4.6 rates. Zero code change, configure in the API call.

Frequently Asked Questions

How much does one RAG query cost in 2026?

Typical range: $0.013–$0.025/query. Breakdown: query embedding ~$0.000001, vector DB read ~$0.0000083 (Pinecone Serverless), LLM generation $0.013–$0.021 (Sonnet 4.6, 3,000 input + 500 output tokens). The LLM layer is 85-95% of total cost. Cheaper with Haiku or gpt-4.1-mini: $0.002–0.005/query.

How much does RAG cost at 1 million queries per month?

At $0.018/query average (Sonnet 4.6, 3,000 in + 500 out, top-5 chunks, prompt caching): ~$18,000/month. Before optimization with top-10 chunks, no caching: ~$19,800/month. After full optimization (caching + chunk reduction + query routing): ~$5,700/month. The optimization levers are real and worth implementing at this volume.

What is the biggest cost in a RAG system?

The LLM generation call — consistently 85-95% of total per-query cost. The retrieval stack (embedding + vector DB read) is typically under 1% of the bill. This means model selection and context length are the only cost levers that materially matter. Optimize those first.

Does Anthropic prompt caching work for RAG?

Yes, and it is one of the best ROI cost optimizations for RAG. Cache write: 1.25x the standard input price. Cache read: 0.10x — a 90% discount. If your system prompt and any static context total 1,500 tokens, caching them cuts those token costs by 90% on every repeated query. Enable it via the cache_control parameter in the Anthropic API.

Should I use a reranker in my RAG pipeline?

It depends on your retrieval precision. Cohere Rerank is $0.001/query (Rerank v3 production tier). If reranking reduces your average retrieved chunks from 10 to 5, it saves ~$0.006/query in LLM context costs at Sonnet 4.6 rates — netting a $0.005 saving after the $0.001 reranker fee. Run a retrieval precision audit before adding a reranker; if your vector search already returns high-precision results, a reranker adds cost without quality gain.

How do I reduce RAG cost without degrading quality?

Three compoundable optimizations: (1) Enable prompt caching — zero code change, 90% discount on cached tokens in Anthropic; (2) Reduce retrieved chunk count — benchmark retrieval quality with top-3 vs top-5 vs top-10 chunks; (3) Route simple queries to a cheaper model tier — Haiku 3.5 at $0.80/$4 per M handles straightforward factual lookups at 1/8 the cost of Sonnet 4.6. Combined, these typically achieve a 60-70% cost reduction.

What LLM should I use for RAG in 2026?

Start with the mid tier: Claude Sonnet 4.6 ($3/$15 per M in/out) or gpt-4.1 ($2/$8 per M). Both handle multi-chunk synthesis reliably. Drop to Haiku 3.5 or gpt-4.1-mini for simple factual Q&A — they are 6-8x cheaper and sufficient for direct-lift answers from retrieved context. Only escalate to Opus-class models for complex legal/medical/financial reasoning where the quality gap is measurable on your eval.

Is the vector database or the embedding model the main cost in RAG?

Neither — the LLM generation call is. The embedding model costs a fraction of a cent per query (50 tokens at $0.02/1M = $0.000001). The vector DB read on Pinecone Serverless is $0.0000083/query. The LLM at 3,000 input + 500 output tokens on Sonnet 4.6 is $0.0165/query — 1,000-16,000x more than either retrieval component. Build your cost model around the LLM first.

Cut your RAG bill before you scale.

Better query prompts reduce retrieved context length and cut LLM input tokens per query. Our AI Prompt Generator writes efficient RAG query patterns — shorter, higher-precision queries that retrieve less noise. 14-day free trial, no card.

Browse all prompt tools →