The RAG query cost formula
Every RAG query runs four operations in sequence. Here is the formula with each layer isolated:
```
per_query_cost =
    # Layer 1: embed the user query
    (query_tokens / 1_000_000) × embed_$/M

    # Layer 2: vector database read
    + vector_read_cost_per_query

    # Layer 3: reranker (optional)
    + (use_reranker ? rerank_$/query : 0)

    # Layer 4: LLM generation (this dominates)
    + (llm_input_tokens / 1_000_000) × llm_input_$/M
    + (llm_output_tokens / 1_000_000) × llm_output_$/M
```
The LLM input token count is the sum of the system prompt (shared across queries), the user's question, and the retrieved context chunks. This is the key lever. A system prompt of 800 tokens + a 100-token question + 5 chunks of 400 tokens each = 2,900 input tokens. At Sonnet 4.6's $3/1M input rate, that is $0.0087 in input tokens alone, before any output. Adding 500 output tokens at $15/1M costs another $0.0075, for a total LLM cost of $0.0162 per query.
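The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch covering only the LLM generation layer; the function name `llm_cost` and its parameter names are illustrative, and the token counts and rates are the ones quoted in the example.

```python
def llm_cost(system_tokens, question_tokens, chunks, tokens_per_chunk,
             output_tokens, input_usd_per_m, output_usd_per_m):
    """Return (input_tokens, input_cost, output_cost) for one query."""
    # Input = system prompt + user question + all retrieved chunks.
    input_tokens = system_tokens + question_tokens + chunks * tokens_per_chunk
    # Rates are quoted per one million tokens.
    input_cost = input_tokens / 1_000_000 * input_usd_per_m
    output_cost = output_tokens / 1_000_000 * output_usd_per_m
    return input_tokens, input_cost, output_cost


# Figures from the worked example: 800 + 100 + 5 x 400 input tokens,
# 500 output tokens, $3/1M input, $15/1M output.
input_tokens, input_cost, output_cost = llm_cost(
    system_tokens=800, question_tokens=100, chunks=5, tokens_per_chunk=400,
    output_tokens=500, input_usd_per_m=3.0, output_usd_per_m=15.0,
)

print(f"input tokens: {input_tokens}")
print(f"input cost:   ${input_cost:.4f}")
print(f"output cost:  ${output_cost:.4f}")
print(f"total LLM:    ${input_cost + output_cost:.4f}")
```

Swapping in your own prompt sizes and the current rates for your model gives the per-query LLM cost; the embedding, vector read, and reranker layers would be added on top.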
The number of retrieved chunks is the most controllable cost lever after model selection. Going from top-10 to top-5 chunks halves the retrieved context; with the example figures above, total input tokens fall from 4,900 to 2,900, a cut of about 40%, and LLM input cost falls proportionally. Measure retrieval precision to find the minimum chunk count that maintains answer quality.
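To see how input cost scales with chunk count, the sketch below sweeps a few top-k values. It assumes the same illustrative figures used earlier (800-token system prompt, 100-token question, 400 tokens per chunk, $3/1M input); the function names and the choice of k values are hypothetical, and your own chunk sizes will shift the percentages.

```python
def input_tokens(top_k, system_tokens=800, question_tokens=100,
                 tokens_per_chunk=400):
    """Total LLM input tokens for a query that retrieves top_k chunks."""
    return system_tokens + question_tokens + top_k * tokens_per_chunk


def input_cost(top_k, usd_per_m=3.0):
    """LLM input cost in USD for one query at the given per-million rate."""
    return input_tokens(top_k) / 1_000_000 * usd_per_m


baseline = input_cost(10)  # compare every setting against top-10
for k in (10, 5, 3):
    cost = input_cost(k)
    saving = 1 - cost / baseline
    print(f"top-{k}: {input_tokens(k)} tokens, ${cost:.4f} input, "
          f"{saving:.0%} below top-10")
```

Because the system prompt and question are fixed overhead, halving the chunk count saves less than half of the input cost. The sweep only shows the cost side; whether a smaller k is acceptable still depends on the retrieval precision measurement.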