By The DDH Team · Digital Dashboard Hub

RAG Architecture Decision Tree 2026: Which Setup Fits Your Corpus, Query Type, and Budget


Retrieval-Augmented Generation has matured from a single technique into a family of architectures, each with meaningfully different cost profiles, latency characteristics, and engineering overhead. The most common mistake engineers make in 2026 is reaching for a managed vector database when their corpus has fewer than ten thousand documents, or deploying vanilla vector RAG when their queries require multi-hop reasoning across entities that span dozens of chunks. Choosing the wrong architecture does not just waste money — it produces worse answers than a simpler approach would have.

This guide is a branching decision framework, not a vendor pitch. It starts with the most consequential variable — corpus size — and works through query type, latency budget, and compliance requirements to arrive at a concrete recommendation. Every branch includes approximate cost estimates, which are sourced from publicly listed pricing as of mid-2026 and should be treated as rough anchors rather than guarantees, since cloud pricing changes frequently.

If you are building a RAG system prompt or tuning retrieval instructions, the RAG cost calculator can help you model per-query cost before you commit to an architecture. For a deeper comparison of specific vector databases, see Pinecone vs Weaviate vs Qdrant. This post focuses on the architecture decision itself — which retrieval pattern to use and why — rather than benchmarking individual products.


RAG architecture comparison 2026

| Architecture | Best corpus size | Latency | Build cost | Query cost | Compliance options |
|---|---|---|---|---|---|
| In-process (FAISS/Chroma) | <500k docs | <100ms p99 | $0 infra; engineer time | $0 at runtime | Self-managed; data never leaves host |
| pgvector | <2M docs | 50-300ms | $50-200/mo managed Postgres | Shared DB cost | BAA available (AWS/GCP); strong for HIPAA |
| Pinecone serverless | 10k-50M docs | 100-400ms | $0 setup; pay-as-you-go | $0.33/GB-month + read units | SOC 2 Type II; Enterprise HIPAA BAA |
| Qdrant Cloud | 10k-100M docs | 50-350ms | $25-100/mo standard clusters | Included in cluster cost | SOC 2; Enterprise self-hosted option |
| Hybrid BM25+Dense | 1M-10M docs | 300-800ms | Vector DB + BM25 engine (~$95/mo ES) | $0.01-0.05/query at scale | Inherits underlying DB compliance |
| GraphRAG | 100k-10M docs (analytical) | 500ms-3s | $200-500 per 1M tokens to build graph | $0.01-0.10/query runtime | Self-hosted open-source option available |
| Knowledge Graph (full) | 1M+ highly relational | 200ms-2s | $5k-50k build; ongoing curation | $0.005-0.05/query | Enterprise deployments; air-gap possible |

Cost estimates are approximate as of mid-2026 and based on publicly listed pricing. Actual costs vary significantly with query volume, document size, reranking model usage, and cluster configuration. Verify current pricing directly with each vendor before making commitments.

Why corpus size is the first branch

The most important variable in any RAG architecture decision is not which vector database has the best benchmark score — it is how many documents your system needs to search across. Corpus size determines whether retrieval is necessary at all, what kind of index structure is appropriate, and whether you need approximate or exact nearest-neighbor search. Getting this wrong early means either overengineering a simple use case or hitting hard performance ceilings as your corpus grows.

The reason corpus size comes first is that the economics change by orders of magnitude at each threshold. Below ten thousand documents, the corpus is often small enough to place in a long context window directly, or in a few pre-filtered batches, which means retrieval adds latency and complexity without adding accuracy. Between ten thousand and one million documents, a vanilla vector database with good chunking will handle the majority of factual lookup queries at acceptable cost. Above one million documents, sparse-dense hybrid search starts to outperform vanilla vector RAG because BM25-style exact term matching catches rare proper nouns, product codes, and identifiers that dense embeddings frequently smooth over.

Above ten million documents, the architecture decision becomes more nuanced. Sharding a vector database across multiple nodes is operationally complex and expensive. GraphRAG — which builds a knowledge graph from the corpus during an offline processing step — can be more cost-effective for analytical workloads at this scale because query-time retrieval becomes a graph traversal rather than an exhaustive approximate nearest-neighbor search across hundreds of millions of vectors. The tradeoff is a significant upfront build cost and a corpus that must be reprocessed when documents change substantially.


The less-than-10k stuffing case: skip RAG entirely

If your corpus contains fewer than ten thousand documents of typical length (say, a company's internal FAQ, a product's documentation set, or a legal contract database), you should seriously consider whether RAG is the right tool at all. Modern context windows run from 128k tokens (GPT-4o) through 200k tokens (Claude 3.5 Sonnet) to 1M tokens (Gemini 1.5 Pro). A 5,000-document knowledge base averaging 500 tokens per document totals roughly 2.5 million tokens. That does not fit in a single context, but it does fit in a small number of batches, and for many use cases a map-reduce summarization approach or a pre-filtered context stuffing approach will produce more accurate answers than approximate nearest-neighbor retrieval.

The argument for stuffing context is straightforward: retrieval introduces two failure modes that do not exist when you simply include everything. First, the embedding model may not rank the most relevant chunks highest, especially for rare terms or questions that require synthesizing information across many chunks. Second, chunking itself can split a coherent passage in a way that makes neither chunk individually useful. When you eliminate retrieval, you eliminate both failure modes at the cost of higher per-query token spend.

The practical threshold for context stuffing depends on your query volume and cost tolerance. At $3/million input tokens (approximate Claude Sonnet pricing), stuffing a 500k-token context on every query costs $1.50 per call — clearly impractical for high-volume applications. But for a low-volume internal tool with ten queries per day, that cost is entirely reasonable. The correct decision is to calculate your actual per-query token cost at your query volume and compare it to the engineering overhead of standing up and maintaining a vector retrieval pipeline. For small corpora with low query volume, stuffing almost always wins.
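The cost comparison above is simple enough to script. The sketch below reproduces the $1.50 figure and scales it to a monthly total; the function names and the 30-day month are illustrative assumptions, and output tokens and prompt caching are deliberately ignored.

```python
def stuffing_cost_per_query(context_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Input-token cost of sending the whole context on a single query."""
    return context_tokens / 1_000_000 * usd_per_million_input_tokens


def monthly_stuffing_cost(context_tokens: int, usd_per_million_input_tokens: float,
                          queries_per_day: int, days_per_month: int = 30) -> float:
    """Monthly input-token spend if every query carries the full context."""
    per_query = stuffing_cost_per_query(context_tokens, usd_per_million_input_tokens)
    return per_query * queries_per_day * days_per_month


# Figures from the paragraph above: a 500k-token context at $3 per million input tokens.
per_query = stuffing_cost_per_query(500_000, 3.0)
low_volume = monthly_stuffing_cost(500_000, 3.0, queries_per_day=10)
high_volume = monthly_stuffing_cost(500_000, 3.0, queries_per_day=10_000)

print(f"per query: ${per_query:.2f}")
print(f"10 queries/day: ${low_volume:,.2f} per month")
print(f"10,000 queries/day: ${high_volume:,.2f} per month")
```

Comparing the low-volume monthly figure against the engineering time needed to run a retrieval pipeline is the decision the paragraph describes.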


Vector DB RAG for the 10k-to-1M document range

For corpora between ten thousand and one million documents, vanilla vector database RAG is the right starting point for the majority of factual lookup workloads. The pattern is well-understood: chunk documents into segments of 256-512 tokens with meaningful overlap, embed them with a capable general-purpose embedding model, store the vectors in a vector database, and at query time embed the question and retrieve the top-k most similar chunks as context for the LLM. This architecture handles the vast majority of enterprise RAG use cases when implemented with attention to chunking; see chunking strategies 2026.
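The chunking step in that pattern can be sketched as a sliding window. This is a minimal illustration, not a recommended production chunker: the token list is assumed to come from whatever tokenizer your embedding model uses, and the 512/64 defaults are illustrative values within the range mentioned above.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Synthetic 1,200-token document standing in for real tokenizer output.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print([len(c) for c in chunks])
```

Each chunk would then be embedded and stored alongside its source document ID so retrieved chunks can be traced back.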

The choice of vector database within this range depends primarily on your latency budget, compliance requirements, and operational preferences. For teams that want zero infrastructure overhead, Pinecone serverless provides pay-as-you-go pricing with no cluster to manage; see Pinecone quota tiers for the details of their free and paid tiers. Qdrant Cloud offers a generous free tier and strong filtering capabilities that are particularly useful for multi-tenant applications where you need to restrict retrieval to documents belonging to a specific user or organization; see Qdrant Cloud quotas for current limits. If your application already runs on Postgres, pgvector adds vector search to your existing database with no additional service to manage, at the cost of some query performance relative to purpose-built vector databases.

Embedding model selection matters more than most teams realize. For general English-language corpora, OpenAI text-embedding-3-large and Cohere embed-v4.0 are competitive choices with strong benchmark performance. For code search specifically, Voyage AI's voyage-code-3 embedding model significantly outperforms general-purpose models on code retrieval tasks and is worth the additional cost for developer tools. For multilingual corpora spanning more than a handful of languages, Cohere embed-v4.0 with its support for over 100 languages is the practical default choice rather than attempting to run per-language embedding pipelines. A step-by-step implementation guide is available at Build RAG with Pinecone.


Hybrid BM25+dense search for the 1M-to-10M range

Once a corpus grows past roughly one million documents, vanilla dense vector RAG starts to show systematic weaknesses that matter in production. Dense embeddings compress meaning into a fixed-dimensional vector space, which means they excel at semantic similarity but perform poorly on exact-match retrieval for rare terms — product serial numbers, specific regulatory citations, named individuals with uncommon names, or technical identifiers. At scale, these failures accumulate and degrade user trust in the system. Hybrid search addresses this by running BM25 (or another sparse retrieval method) in parallel with dense vector retrieval and fusing the results before reranking.

The hybrid search pattern involves three components: a sparse index (typically BM25 via Elasticsearch, OpenSearch, or Qdrant's sparse vector support), a dense vector index, and a reranker that takes the combined candidate set and produces a final ranked list. The reranker — commonly Cohere Rerank or a cross-encoder model — is often the single biggest accuracy improvement available, because it applies a more expensive but more accurate model to a small candidate pool rather than the full corpus. Elasticsearch at its ~$95/month entry tier adds significant fixed cost, but for corpora at this scale the retrieval quality improvement typically justifies it. See hybrid search: BM25 plus dense for an implementation walkthrough.
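The fusion step that sits between the two retrievers and the reranker can be sketched with reciprocal rank fusion, one common way to combine ranked lists. The document IDs are hypothetical and the constant k=60 is a conventional default, not something this article prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidate lists from the sparse and dense retrievers.
bm25_hits = ["doc_sku", "doc_a", "doc_b"]
dense_hits = ["doc_a", "doc_c", "doc_sku"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents that appear in both lists rise to the top, and the fused list is what a reranker would then rescore.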

Latency is the main downside of hybrid search. Running two retrieval systems in parallel and then reranking adds roughly 200-400ms compared to a single vector lookup, pushing total query latency into the 500ms-800ms range for most deployments. This is acceptable for most enterprise applications but may be a constraint for real-time conversational interfaces where users expect near-instantaneous responses. If you need sub-500ms latency at the 1M+ scale, you should consider an in-process FAISS index with aggressive filtering to reduce the search space before lookup, accepting the operational overhead of maintaining a locally hosted index.


GraphRAG for analytical and multi-hop queries

GraphRAG is the right architecture when your queries require synthesizing information across multiple entities or relationships that are not co-located in any single document chunk. A question like "which suppliers have both a quality compliance issue and a contract renewal due this quarter" requires the retrieval system to traverse relationships between entities, not just find chunks similar to the question embedding. Vanilla vector RAG will typically retrieve the most similar chunks to the question surface text and miss the cross-entity synthesis required for a complete answer. GraphRAG vs Vector RAG covers this distinction in more depth.
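To make the supplier question concrete, here is a toy sketch of answering it by traversing relationships rather than matching text. The triples, relation names, and supplier IDs are all hypothetical; a real GraphRAG pipeline would extract them from the corpus with an LLM.

```python
# Hypothetical (subject, relation, object) triples.
TRIPLES = [
    ("supplier:acme", "has_issue", "issue:17"),
    ("issue:17", "issue_type", "quality_compliance"),
    ("supplier:acme", "has_contract", "contract:9"),
    ("contract:9", "renewal_quarter", "2026-Q3"),
    ("supplier:bolt", "has_issue", "issue:21"),
    ("issue:21", "issue_type", "late_delivery"),
    ("supplier:bolt", "has_contract", "contract:4"),
    ("contract:4", "renewal_quarter", "2026-Q3"),
    ("supplier:corex", "has_issue", "issue:30"),
    ("issue:30", "issue_type", "quality_compliance"),
    ("supplier:corex", "has_contract", "contract:2"),
    ("contract:2", "renewal_quarter", "2027-Q1"),
]


def build_index(triples):
    """Map subject -> relation -> set of objects."""
    index = {}
    for subject, relation, obj in triples:
        index.setdefault(subject, {}).setdefault(relation, set()).add(obj)
    return index


def neighbors(index, node, relation):
    """Objects reachable from `node` over `relation` (empty set if none)."""
    return index.get(node, {}).get(relation, set())


def suppliers_with_issue_and_renewal(index, issue_type, quarter):
    """Two-hop check per supplier: issue -> type, and contract -> renewal quarter."""
    matches = []
    for node in index:
        if not node.startswith("supplier:"):
            continue
        has_issue = any(issue_type in neighbors(index, issue, "issue_type")
                        for issue in neighbors(index, node, "has_issue"))
        renews = any(quarter in neighbors(index, contract, "renewal_quarter")
                     for contract in neighbors(index, node, "has_contract"))
        if has_issue and renews:
            matches.append(node)
    return sorted(matches)


index = build_index(TRIPLES)
result = suppliers_with_issue_and_renewal(index, "quality_compliance", "2026-Q3")
print(result)
```

The facts about each supplier live in separate triples, which mirrors how they would sit in separate chunks that a top-k similarity search is unlikely to retrieve together.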

Building a GraphRAG system requires an offline processing step that is substantially more expensive than simply embedding and indexing documents. Microsoft's open-source GraphRAG implementation, which established much of the 2024-2026 design vocabulary for this approach, uses an LLM to extract entities and relationships from the corpus and construct a knowledge graph. At rough estimates of $200-500 per million tokens of corpus processed, a corpus of 100 million tokens costs roughly $20,000-50,000 to build, and a ten-million-document corpus can cost considerably more. The graph must also be substantially reprocessed when the corpus changes. This makes GraphRAG a poor fit for frequently updated corpora unless you can partition the graph and incrementally update only changed portions.
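Because the build cost scales with tokens rather than documents, it helps to estimate it from a token count. The sketch below applies the $200-500 per million tokens range quoted above; the corpus sizes are hypothetical inputs, and the function name is illustrative.

```python
def graph_build_cost_range(corpus_tokens: int,
                           low_usd_per_million: float = 200.0,
                           high_usd_per_million: float = 500.0) -> tuple[float, float]:
    """Rough (low, high) LLM cost in USD to process a corpus into a graph."""
    millions = corpus_tokens / 1_000_000
    return millions * low_usd_per_million, millions * high_usd_per_million


# Hypothetical corpus sizes, expressed in tokens rather than documents.
for corpus_tokens in (10_000_000, 100_000_000):
    low, high = graph_build_cost_range(corpus_tokens)
    print(f"{corpus_tokens:,} tokens: ${low:,.0f} to ${high:,.0f}")
```

Measuring your average document length in tokens first is what turns a document count into a usable estimate.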

At query time, GraphRAG is surprisingly cost-effective. Because the knowledge graph pre-computes entity relationships, query-time retrieval is a graph traversal over a compact structure rather than an approximate nearest-neighbor search across millions of vectors. Runtime query costs of $0.01-0.10 per query are achievable depending on query complexity and the depth of graph traversal required. The key engineering decision is whether your analytical query mix justifies the upfront graph construction cost — if fewer than 20-30% of your queries require multi-hop reasoning, you are likely better off with hybrid search plus a capable LLM than investing in GraphRAG infrastructure.

The ASCII decision tree below summarizes the corpus size branches:

Corpus size?
├── <10k docs → Stuff context (no RAG needed)
├── 10k-1M → Vector DB RAG
│   ├── Latency <500ms → FAISS/Chroma local
│   └── Latency <2s → Pinecone/Qdrant Cloud
├── 1M-10M → Hybrid BM25 + Dense + Rerank
└── 10M+ → GraphRAG or Sharded Vector DB
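The same branches can be written as a small function, which is handy for sanity-checking a planned deployment. This is a sketch only: how to treat exact boundary values (for example, exactly one million documents) and what to return when no latency budget is given are assumptions, not part of the tree.

```python
from typing import Optional


def recommend_architecture(corpus_docs: int, latency_budget_ms: Optional[int] = None) -> str:
    """Walk the corpus-size branches of the decision tree."""
    if corpus_docs < 10_000:
        return "Stuff context (no RAG needed)"
    if corpus_docs < 1_000_000:
        # Assumption: only a budget below 500 ms forces a local index.
        if latency_budget_ms is not None and latency_budget_ms < 500:
            return "Vector DB RAG: FAISS/Chroma local"
        return "Vector DB RAG: Pinecone/Qdrant Cloud"
    if corpus_docs < 10_000_000:
        return "Hybrid BM25 + Dense + Rerank"
    return "GraphRAG or Sharded Vector DB"


print(recommend_architecture(5_000))
print(recommend_architecture(250_000, latency_budget_ms=300))
print(recommend_architecture(250_000, latency_budget_ms=1_500))
print(recommend_architecture(4_000_000))
print(recommend_architecture(40_000_000))
```

Query type and compliance constraints are not modeled here; they act as further filters on whatever this returns.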


Latency budget branches

After corpus size, latency budget is the most important architectural constraint. Sub-500ms end-to-end latency — including retrieval, context assembly, and LLM generation — is genuinely difficult to achieve unless you keep retrieval in-process. FAISS embedded directly in your application server, Chroma running on the same host, or pgvector on the same database server that handles your application data can all achieve vector retrieval in 10-50ms for corpora up to roughly 500,000 vectors. Beyond that scale, approximate nearest-neighbor search latency climbs and you need to consider either sharding or accepting slightly higher latency from a dedicated service.

For applications where latency under two seconds is acceptable — which covers the majority of enterprise productivity tools, document Q&A interfaces, and internal search products — managed serverless vector databases are the operationally sensible default. Pinecone serverless and Qdrant Cloud both consistently achieve p99 retrieval latency under 400ms for typical workloads, leaving a comfortable budget for LLM generation. The tradeoff versus in-process options is a network round-trip and a monthly infrastructure cost, but you gain managed replication, automatic scaling, and no index maintenance burden.

For batch workloads where latency is not a constraint — overnight document processing, scheduled analytics pipelines, or background knowledge base updates — almost any architecture is viable from a latency perspective. This is where GraphRAG becomes most cost-competitive: building the knowledge graph overnight means users get fast analytical answers during the day without the system needing to execute expensive real-time graph construction. Batch-tolerant workloads should optimize primarily for build cost and query accuracy rather than latency, which often points toward GraphRAG or a large sharded dense index with aggressive reranking.


Compliance branches: SOC 2, HIPAA, and air-gapped requirements

Compliance requirements can override cost and latency preferences entirely. If your application handles data that requires SOC 2 Type II attestation, your vector database options are narrowed to those with current SOC 2 reports: Pinecone at the Standard and Enterprise tiers, Weaviate Cloud Enterprise, and Qdrant Enterprise. All three publish their SOC 2 reports under NDA or via their trust portals. pgvector running on AWS RDS or Google Cloud SQL can inherit the cloud provider's SOC 2 attestation if your overall system architecture is scoped correctly, which makes it a cost-effective option for organizations already operating within a compliant cloud environment.

HIPAA requirements add a Business Associate Agreement (BAA) requirement to the compliance checklist. Pinecone Enterprise offers a BAA for qualifying healthcare customers. The operationally simplest path for HIPAA-covered workloads is pgvector running in your own AWS VPC with AWS signing the BAA — you already have the BAA for the rest of your AWS infrastructure, and adding RDS Postgres with pgvector does not require any additional vendor relationship. If your corpus is under two million documents and your latency budget is under 300ms, this is often the most straightforward HIPAA-compliant RAG architecture available.

Air-gapped or data sovereignty requirements — common in defense, certain financial services, and government contexts — effectively rule out all managed cloud services and require self-hosted deployments. The viable options in this category are pgvector on self-managed Postgres, Qdrant self-hosted on your own infrastructure, Milvus, and Weaviate open-source. All four are production-grade and actively maintained. The operational overhead of running a self-hosted vector database is not trivial — you are responsible for backups, upgrades, capacity planning, and availability — but for organizations with existing on-premises or private cloud infrastructure, the incremental burden is manageable. GraphRAG built on Microsoft's open-source implementation can also run fully self-hosted, making it viable for air-gapped analytical workloads.


Anti-patterns: wrong architecture choices and what they cost you

The most common anti-pattern in 2026 RAG deployments is deploying a managed vector database for a corpus that would be better served by context stuffing. Engineers familiar with vector databases often reach for them by default, even for small corpora where the retrieval quality is demonstrably worse than including the full context. If you have a 2,000-document product manual and your users ask questions that require synthesizing information from three or four different sections, retrieval with top-5 chunks will frequently miss one of the relevant sections and produce an incomplete answer. The fix is embarrassingly simple: include more context.

The second common anti-pattern is using vanilla vector RAG for corpora that contain lots of exact identifiers — order numbers, SKUs, regulatory citation codes, person names in diverse scripts. Dense embeddings were designed to capture semantic meaning and they do so at the cost of exact-match precision. A query for "invoice INV-2024-00847" will have degraded recall in a dense-only system because the embedding model does not treat that string as a unique identifier requiring exact match. Adding a BM25 component costs relatively little — Qdrant now supports sparse vectors natively, reducing the need for a separate Elasticsearch instance in many cases — and significantly improves recall for exact-term queries.
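A minimal BM25 scorer shows why sparse retrieval handles identifiers well: the invoice number is just another term, and only documents that contain it exactly earn its score. The documents are invented, the whitespace tokenizer is a simplifying assumption, and k1=1.5 and b=0.75 are commonly used defaults rather than tuned values.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Whitespace tokenizer that keeps identifiers such as INV-2024-00847 intact.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the standard BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "invoice INV-2024-00847 was paid in full",
    "invoice INV-2024-00848 is overdue",
    "payment terms for all invoices are net 30",
]
scores = bm25_scores("invoice INV-2024-00847", docs)
print([round(s, 3) for s in scores])
```

The near-identical invoice number in the second document contributes nothing, which is the exact-match behavior a dense-only system does not guarantee.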

The third anti-pattern is building GraphRAG for a corpus that changes frequently. GraphRAG's offline construction step is expensive and does not support incremental updates well in most current implementations. Teams that build GraphRAG for a corpus of support tickets, news articles, or product reviews that adds thousands of new documents per day quickly find themselves either running expensive daily rebuild jobs or serving stale graphs. For high-velocity corpora, hybrid BM25+dense with a capable reranker is almost always the better choice even for analytical queries, because the index can be updated incrementally at low cost. Reserve GraphRAG for corpora that are relatively stable or where the analytical query requirements genuinely cannot be met any other way.


Total cost of ownership across architectures

Comparing RAG architectures on monthly infrastructure cost alone is misleading because engineering time — both initial build and ongoing maintenance — often dominates the three-year total cost of ownership for lower-volume applications. A self-hosted FAISS or pgvector deployment might have zero monthly infrastructure cost, but requires someone to handle index rebuilds, capacity planning, version upgrades, and failure recovery. A managed service like Pinecone or Qdrant Cloud shifts those responsibilities to the vendor at a monthly fee that is often cheaper than the engineering time equivalent, particularly for teams without dedicated infrastructure engineers.

For a representative mid-size deployment — one million documents, 10,000 queries per day, a two-engineer team — approximate three-year TCO estimates run as follows: in-process FAISS on a dedicated server runs roughly $3,600-7,200 in server costs plus significant engineering time to maintain; pgvector on managed Postgres runs approximately $5,000-15,000 in database costs; Pinecone serverless at $0.33/GB-month plus query units runs approximately $8,000-20,000 depending on index size and query volume; Qdrant Cloud on standard clusters runs approximately $5,000-12,000; hybrid BM25+dense adds an Elasticsearch instance and roughly doubles infrastructure cost to $12,000-30,000; GraphRAG has the highest upfront cost at $20,000-50,000 for corpus processing but lower ongoing query costs. These estimates are rough and should be calculated for your specific volume using the RAG cost calculator.
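A three-year model that includes engineering time can be sketched in a few lines. Every input below is a hypothetical placeholder (monthly fees, maintenance hours, and the loaded hourly rate); substitute your own figures rather than reading these as estimates for any vendor.

```python
from dataclasses import dataclass


@dataclass
class TcoInputs:
    name: str
    monthly_infra_usd: float
    one_time_build_usd: float
    engineering_hours_per_month: float


def three_year_tco(option: TcoInputs, loaded_hourly_rate_usd: float, months: int = 36) -> float:
    """One-time build cost plus infrastructure and engineering time over the period."""
    infra = option.monthly_infra_usd * months
    engineering = option.engineering_hours_per_month * months * loaded_hourly_rate_usd
    return option.one_time_build_usd + infra + engineering


# Hypothetical inputs for a self-hosted and a managed option.
self_hosted = TcoInputs("self-hosted index", monthly_infra_usd=150.0,
                        one_time_build_usd=0.0, engineering_hours_per_month=10.0)
managed = TcoInputs("managed service", monthly_infra_usd=400.0,
                    one_time_build_usd=0.0, engineering_hours_per_month=2.0)

for option in (self_hosted, managed):
    total = three_year_tco(option, loaded_hourly_rate_usd=120.0)
    print(f"{option.name}: ${total:,.0f}")
```

With these placeholder inputs the option with the lower monthly bill ends up with the higher total, which is the effect the section describes for lower-volume applications.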

The compliance premium is real and worth quantifying separately. Moving from Qdrant Cloud standard to Qdrant Enterprise, or from Pinecone Standard to Pinecone Enterprise, typically adds $1,000-3,000 per month in baseline costs for the BAA, dedicated infrastructure, and enhanced SLA. For HIPAA-covered workloads, the cost of non-compliance — fines, breach notification, audit overhead — almost always justifies the compliance premium. Budget it explicitly in your architecture decision rather than treating it as a potential future line item.


How to verify and stay current with these recommendations

RAG architecture benchmarks and pricing change faster than almost any other area of the ML infrastructure stack. The cost estimates in this article are based on publicly available pricing as of mid-2026, but Pinecone, Qdrant, Weaviate, and other vendors revise pricing frequently — sometimes upward as serverless becomes standard, sometimes downward as competition intensifies. Before committing to an architecture based on cost, verify current pricing directly from each vendor's pricing page and run a 30-day pilot workload to get real query unit consumption numbers rather than relying on estimates.

Benchmark results for embedding models, rerankers, and end-to-end RAG pipeline accuracy are available from independent benchmarks and evaluation tools, including BEIR, MTEB, and the RAGAs framework. These benchmarks are not perfect, since they test on specific dataset distributions that may not match your corpus, but they provide a useful baseline for comparing embedding models and retrieval strategies. When evaluating a new architecture for your use case, treat published benchmarks as a starting point and build a small domain-specific evaluation set of 100-200 question-answer pairs drawn from your actual corpus and query distribution to validate performance before full deployment.
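A domain-specific evaluation set can be scored with a simple hit-rate metric: the share of questions for which at least one known-relevant document appears in the top k results. The sketch below uses a canned stand-in retriever and invented IDs; in practice `retrieve` would call your actual pipeline.

```python
from typing import Callable


def hit_rate_at_k(eval_set: list[tuple[str, set[str]]],
                  retrieve: Callable[[str], list[str]],
                  k: int = 5) -> float:
    """Fraction of questions with at least one relevant document in the top k."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for question, relevant_ids in eval_set:
        retrieved = retrieve(question)[:k]
        if any(doc_id in relevant_ids for doc_id in retrieved):
            hits += 1
    return hits / len(eval_set)


# Hypothetical evaluation set: (question, IDs of documents that answer it).
eval_set = [
    ("q1", {"d1"}),
    ("q2", {"d7"}),
    ("q3", {"d3", "d4"}),
    ("q4", {"d9"}),
]

# Canned rankings standing in for a real retriever.
canned = {
    "q1": ["d1", "d2"],
    "q2": ["d5", "d6", "d7"],
    "q3": ["d8", "d4"],
    "q4": ["d2", "d3"],
}


def stub_retrieve(question: str) -> list[str]:
    return canned.get(question, [])


print(hit_rate_at_k(eval_set, stub_retrieve, k=2))
print(hit_rate_at_k(eval_set, stub_retrieve, k=3))
```

Running the same evaluation set against two candidate architectures gives a like-for-like comparison on your own data.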

The open-source GraphRAG implementation from Microsoft and the LightRAG project from the University of Hong Kong are both actively developed and have released significant updates in 2025-2026. If you evaluated GraphRAG previously and found the construction cost prohibitive, the incremental update support has improved substantially and is worth re-evaluating. Similarly, pgvector's performance characteristics have improved with the addition of HNSW index support, which narrows the query latency gap with purpose-built vector databases for corpora under five million vectors. The architecture decision you made 18 months ago may no longer be optimal for your current scale.

Building your RAG architecture decision

  1. Audit corpus size and growth rate

    Count your current documents, measure average document length in tokens, and project how both will grow over the next 12 and 36 months. A corpus at 800,000 documents growing at 50,000 per month will cross the one-million threshold in four months, which may push you toward hybrid search now rather than migrating an already-deployed vanilla vector pipeline in six months. Document the current size, growth rate, update frequency (how often existing documents change), and whether documents are deleted or only added, since deletion handling differs significantly across vector database implementations.

  2. Characterize your query type distribution

    Sample 200-300 real or representative queries and classify each as: factual lookup (retrieving a specific fact from a known document), analytical or comparative (comparing options, summarizing across many documents), multi-hop (requiring synthesis of information from multiple documents about related entities), code search, or long-document QA. Estimate the percentage in each category. If more than 30% of queries are multi-hop or analytical, GraphRAG or a knowledge graph deserves serious consideration. If more than 20% involve code, a specialized code embedding model will outperform a general-purpose embedding model. This classification exercise also surfaces whether your expected retrieval accuracy baseline is realistic for the architecture you are considering.

  3. Set a concrete latency SLA

    Define a specific latency target — not "fast" but "p95 under 800ms end-to-end from user input to first token of response." Break that budget down: retrieval must complete in X ms, context assembly in Y ms, LLM time-to-first-token in Z ms. This forces a realistic assessment of what each component contributes. If your LLM provider delivers first token in 600ms under load, you have 200ms left for retrieval — which rules out hybrid search with reranking for many configurations and points you toward an in-process or low-latency managed option. Latency SLAs should be set based on user experience requirements, not adjusted to fit the architecture you prefer.

  4. Assess compliance requirements

    Determine whether your application handles data covered by HIPAA, SOC 2 audit requirements, GDPR data residency restrictions, or industry-specific frameworks like FedRAMP or PCI-DSS. For each applicable framework, identify the specific controls that your vector database vendor must satisfy — a SOC 2 Type II report, a BAA, data residency in a specific geographic region, or an air-gap deployment requirement. Map these requirements to the vendor options that satisfy all of them before evaluating on cost or performance. A vendor that fails a compliance requirement is not a viable option regardless of other advantages.

  5. Calculate 3-year TCO for your top two options

    Take the top two architectures that survive the corpus size, query type, latency, and compliance filters and build a 3-year total cost of ownership model. Include infrastructure cost (servers or managed service fees), embedding model cost (the one-time cost to embed your corpus plus ongoing cost to embed new documents), query-time cost (embedding the query plus retrieval cost plus reranker cost if applicable), LLM generation cost, and engineering time estimated in hours multiplied by a loaded hourly rate. The RAG cost calculator can help model the per-query component. Compare the two totals and weight them by your organization's cost sensitivity versus operational risk tolerance to make the final decision.
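The arithmetic behind the first three steps can be sketched as small helpers. The category names and the sample query counts are hypothetical; the thresholds (30% multi-hop or analytical, 20% code) and the worked numbers (800,000 documents, 800ms budget, 600ms first token) come from the steps above.

```python
import math


def months_until_threshold(current_docs: int, docs_added_per_month: int, threshold_docs: int) -> int:
    """Whole months until the corpus reaches the threshold at a constant growth rate."""
    if current_docs >= threshold_docs:
        return 0
    if docs_added_per_month <= 0:
        raise ValueError("corpus is not growing")
    return math.ceil((threshold_docs - current_docs) / docs_added_per_month)


def query_mix_flags(counts: dict[str, int]) -> dict[str, bool]:
    """Apply the 30% and 20% thresholds to a classified query sample."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no queries classified")
    multi_hop_or_analytical = counts.get("multi_hop", 0) + counts.get("analytical", 0)
    return {
        "consider_graphrag": multi_hop_or_analytical / total > 0.30,
        "consider_code_embeddings": counts.get("code_search", 0) / total > 0.20,
    }


def retrieval_budget_ms(end_to_end_ms: int, llm_first_token_ms: int, context_assembly_ms: int = 0) -> int:
    """Time left for retrieval after the other components take their share."""
    return end_to_end_ms - llm_first_token_ms - context_assembly_ms


# Hypothetical classified sample of 250 queries.
sample = {"factual": 140, "analytical": 40, "multi_hop": 45, "code_search": 15, "long_document_qa": 10}

print(months_until_threshold(800_000, 50_000, 1_000_000))
print(query_mix_flags(sample))
print(retrieval_budget_ms(800, 600))
```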

Frequently Asked Questions

When should I skip RAG entirely and just stuff context?

Skip RAG when your corpus has fewer than ten thousand documents and your query volume is low enough that the per-query token cost of including substantial context is acceptable. The key calculation is: estimate your total chunked corpus size in tokens, multiply by your LLM provider's input token cost, and compare that to the engineering overhead of building and maintaining a retrieval pipeline. For small corpora with complex queries that require synthesizing information from many sources, context stuffing consistently produces better answers than retrieval because it eliminates retrieval failures — chunks that are relevant but not ranked in the top-k. The practical threshold depends on your query volume, but teams doing fewer than a few thousand queries per day on corpora under 500,000 tokens total should run the stuffing cost calculation before assuming RAG is necessary.

What is the cheapest possible RAG setup that is actually production-grade?

For a corpus in the 10,000-500,000 document range with latency requirements under 500ms and no compliance constraints, an in-process FAISS index with a local embedding model or a low-cost hosted embedding API has zero ongoing infrastructure cost. The practical limitations are that you manage index serialization and rebuilds yourself, and FAISS does not support filtering or metadata queries natively in the same way purpose-built vector databases do. For teams comfortable with the operational overhead, FAISS for vector search plus metadata filtering in the same Postgres instance that runs your application data is a genuinely production-grade option with near-zero marginal infrastructure cost. If you want a managed service with a free tier, Qdrant Cloud's free tier supports up to 1GB of vector storage. At 1536 dimensions in float32, each vector takes about 6KB, so 1GB holds on the order of 160,000 raw vectors; reaching roughly 500,000 requires quantization or lower-dimensional embeddings. Either way, that is sufficient for many small production RAG applications.
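A back-of-envelope check on how many vectors fit in a given amount of storage is worth running before relying on any free tier. The sketch assumes raw vector storage only (no index structures or payloads) and treats 1GB as 10^9 bytes; both are simplifying assumptions.

```python
def vectors_per_gb(dimensions: int, bytes_per_component: int = 4, gb: float = 1.0) -> int:
    """Raw vectors that fit in `gb` gigabytes, ignoring index and payload overhead."""
    bytes_per_vector = dimensions * bytes_per_component
    return int(gb * 1_000_000_000 // bytes_per_vector)


# 1536-dimensional embeddings stored as float32 (4 bytes) versus int8 (1 byte).
float32_capacity = vectors_per_gb(1536, bytes_per_component=4)
int8_capacity = vectors_per_gb(1536, bytes_per_component=1)

print(f"float32: {float32_capacity:,} vectors per GB")
print(f"int8:    {int8_capacity:,} vectors per GB")
```

Real capacity will be lower once index overhead and stored metadata are included, so treat these as upper bounds.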

Is pgvector good enough for production RAG, or do I need a dedicated vector database?

pgvector with the HNSW index is production-grade for corpora up to roughly two to five million vectors, depending on dimensionality and query load. The realistic limitation is not accuracy but query latency under concurrent load: pgvector shares CPU, memory, and I/O with the rest of your Postgres workload, so a write-heavy application can degrade vector search latency in ways that a dedicated vector database avoids through workload isolation. For most enterprise applications with corpora under one million documents and moderate query volume, pgvector is entirely sufficient and eliminates the operational overhead of a separate service. The cases where pgvector specifically underperforms are high-concurrency vector search workloads and corpora requiring complex metadata filtering across high-cardinality fields, where purpose-built databases have more optimized execution paths.

What exactly is hybrid search and when does it matter?

Hybrid search combines two fundamentally different retrieval mechanisms: dense vector search, which finds documents semantically similar to the query embedding, and sparse retrieval (typically BM25), which finds documents containing the same terms as the query. The results from both are fused — usually via reciprocal rank fusion or a learned combination — and optionally reranked by a cross-encoder model. Hybrid search matters most when your corpus contains exact identifiers, rare proper nouns, or specialized terminology that dense embeddings may not represent precisely. It also consistently outperforms dense-only search on corpora with domain-specific vocabulary where the embedding model was not trained on in-domain data. The downside is added latency from running two retrieval systems and a reranker, and added operational complexity from maintaining both a vector index and a text search index.

What makes GraphRAG so expensive to build, and is the cost coming down?

GraphRAG construction is expensive primarily because it uses an LLM to extract entities and relationships from each document in the corpus, and LLM calls at scale are not cheap. Microsoft's reference implementation makes LLM calls over each document's text to identify entities and extract relationships, then makes further LLM calls to generate summaries for the communities detected in the resulting graph. The total LLM cost for a one-million-document corpus at current pricing runs approximately $200-500, and for ten million documents approaches $2,000-5,000. The cost is coming down as models get cheaper and more efficient extraction pipelines emerge, but it remains orders of magnitude more expensive than simple embedding indexing. Incremental update support is also improving: earlier implementations required full corpus reprocessing for any update, whereas newer implementations support graph patching for changed or added documents. If you evaluated GraphRAG based on 2024 cost benchmarks, a re-evaluation using current models and tooling is warranted.
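Because the build cost scales with document count, calls per document, and token price, it is worth modeling before committing. The sketch below is a back-of-the-envelope estimator, not a description of any particular GraphRAG implementation. Every default value is a placeholder assumption to be replaced with your own measurements and your provider's current pricing, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionCostModel:
    # All defaults are illustrative placeholders, not quoted prices or measurements.
    llm_calls_per_doc: float = 2.0
    input_tokens_per_call: int = 600
    output_tokens_per_call: int = 150
    usd_per_million_input_tokens: float = 0.10
    usd_per_million_output_tokens: float = 0.40

    def cost_per_doc(self) -> float:
        """LLM spend for one document under the assumed call and token counts."""
        input_cost = self.input_tokens_per_call * self.usd_per_million_input_tokens / 1e6
        output_cost = self.output_tokens_per_call * self.usd_per_million_output_tokens / 1e6
        return self.llm_calls_per_doc * (input_cost + output_cost)

    def corpus_cost(self, num_docs: int) -> float:
        """Total extraction spend; linear in corpus size for a full build."""
        return num_docs * self.cost_per_doc()


model = ExtractionCostModel()
for docs in (1_000_000, 10_000_000):
    print(f"{docs:,} docs: about ${model.corpus_cost(docs):,.0f}")
```

The model is linear, so a full rebuild of a corpus ten times larger costs ten times as much. Where incremental graph patching is available, pass only the number of changed or added documents to `corpus_cost` to estimate update spend.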

How do compliance requirements actually change the architecture decision?

Compliance requirements can eliminate entire categories of vendor from consideration regardless of their technical merits. HIPAA requires a Business Associate Agreement with every vendor that handles PHI: if a vector database vendor does not offer a BAA, it cannot be used for HIPAA-covered workloads, full stop. A SOC 2 Type II requirement rules out most open-source self-hosted deployments unless your organization has completed its own SOC 2 audit that scopes in the infrastructure running the vector database. Data residency requirements under GDPR or national data sovereignty laws may eliminate US-hosted managed services for European data, or vice versa. The practical advice is to determine compliance requirements before evaluating vendors, not after, because compliance constraints are binary eliminators, not tradeoffs.
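One way to keep compliance from being traded off against features is to encode it as a hard filter that runs before any ranking. The sketch below illustrates that ordering. The vendor names, attestation labels, regions, and scores are all invented for the example and say nothing about any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vendor:
    name: str
    attestations: frozenset[str]  # e.g. {"hipaa_baa", "soc2_type2"} (hypothetical labels)
    regions: frozenset[str]       # regions where data can be hosted
    quality_score: float          # whatever technical ranking you use downstream


def eligible_vendors(
    vendors: list[Vendor], required_attestations: frozenset[str], required_region: str
) -> list[Vendor]:
    """Hard-filter on compliance first; rank only the vendors that pass every constraint."""
    passing = [
        v
        for v in vendors
        if required_attestations <= v.attestations and required_region in v.regions
    ]
    return sorted(passing, key=lambda v: v.quality_score, reverse=True)


candidates = [
    Vendor("vendor_a", frozenset({"hipaa_baa", "soc2_type2"}), frozenset({"us", "eu"}), 0.71),
    Vendor("vendor_b", frozenset({"soc2_type2"}), frozenset({"us"}), 0.93),
    Vendor("vendor_c", frozenset({"hipaa_baa", "soc2_type2"}), frozenset({"us"}), 0.80),
]

shortlist = eligible_vendors(candidates, frozenset({"hipaa_baa"}), "eu")
print([v.name for v in shortlist])  # ['vendor_a']
```

The highest-scoring candidate never reaches the shortlist here because it lacks a required attestation, which is the sense in which compliance acts as an eliminator and not a weighting.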

How should I choose between Pinecone, Qdrant, and Weaviate for a vanilla vector RAG deployment?

For most teams starting a new vanilla vector RAG deployment in 2026, the decision between these three comes down to four questions. (1) Do you need a generous free tier for prototyping? Qdrant Cloud has the most permissive free tier. (2) Do you have strong Python ecosystem preferences and want the most LangChain/LlamaIndex integration documentation? Pinecone has the most extensive ecosystem documentation. (3) Do you need strong multi-tenancy with namespace isolation across thousands of customers? Pinecone's namespace model is operationally simpler at high tenant counts. (4) Might you need self-hosted deployment in the future? Qdrant and Weaviate both have production-grade open-source versions that match their cloud API. See Pinecone vs Weaviate vs Qdrant for a detailed feature comparison. Avoid selecting purely on benchmark scores: all three perform comparably for typical enterprise RAG workloads, and operational fit matters more than single-digit recall differences on benchmark datasets.

What embedding model should I use for a multilingual corpus?

For a corpus spanning more than a few languages, Cohere embed-v4.0 is the most practical default choice as of mid-2026. It supports over 100 languages with strong cross-lingual retrieval, meaning a query in English can retrieve relevant documents in Spanish, French, German, Japanese, or other supported languages without requiring translation. The alternative is to run per-language embedding pipelines with language-specific models, which can achieve marginally better monolingual retrieval quality but adds significant infrastructure complexity and fails on cross-lingual queries. OpenAI's text-embedding-3-large supports multiple languages but with noticeably degraded performance on non-English retrieval compared to Cohere's multilingual model. If your corpus is primarily English with occasional documents in one or two other languages, a general-purpose English embedding model is likely sufficient and more cost-effective.

Build better RAG system prompts with AI tools

Once you have selected your RAG architecture, the quality of your system prompt — how you instruct the LLM to use retrieved context, handle conflicting information, and acknowledge uncertainty — determines most of the remaining variance in answer quality. Use our prompt engineering tools to generate, test, and refine RAG system prompts for your specific architecture and query type.
