By The DDH Team · Digital Dashboard Hub

Chunking Strategies for RAG (2026): Fixed-Size vs Semantic vs Hierarchical, Benchmarked

By The DDH Team at Digital Dashboard Hub · Updated June 21, 2026

Chunking is the step in every RAG pipeline that most engineers treat as a configuration detail and most papers treat as a hyperparameter to sweep. Neither framing captures how much it matters. The chunking strategy you choose determines what the retriever sees, which determines what the LLM generates. A poorly chunked corpus can cut recall@10 by 15 percentage points compared to a well-tuned strategy on the same embedding model, the same vector database, and the same queries. That is a larger gap than most embedding model upgrades will buy you.

This article compares five chunking strategies (fixed-size at 512 tokens, recursive character splitting, semantic boundary detection, sliding window with overlap, and hierarchical parent-child indexing) across three corpus types: technical documentation, customer support Q&A, and legal contracts. The numbers are drawn from public evaluations published by LangChain, ChromaDB, MongoDB/Voyage, and academic literature through June 2026; we did not run the benchmarks ourselves. They are indicative, not definitive: your corpus, your embedding model, and your query distribution will shift every number. Treat them as a starting map, not a finish line.

The article also covers when to skip chunking entirely by embedding whole documents with a long-context model such as Cohere's 128k-window embed-v4, and identifies the one scenario where that trade-off makes sense. For background on the broader retrieval architecture, see RAG architecture decision tree 2026 and when RAG fails and fixes. Embedding model selection is a separate decision covered in the embedding model leaderboard 2026. Once you have chunking and embeddings sorted, the RAG cost calculator will tell you what the system costs per query at scale.

Chunking strategy comparison

Strategy	Recall@10 (tech docs)	Recall@10 (support)	Recall@10 (legal)	Complexity	Storage vs. baseline
Fixed-size 512 tokens	~61%	~58%	~52%	Low	1.0x (baseline)
Recursive character 512+50 overlap	~68%	~64%	~56%	Low	1.1x
Semantic boundary	~71%	~69%	~62%	High	0.9-1.1x
Sliding window 512/25% overlap	~67%	~63%	~59%	Low	1.2x
Parent-child 256 child / 1024 parent	~73%	~70%	~64%	Medium-High	1.3-1.5x
Document-level (Cohere embed-v4 128k)	N/A	N/A	~67%	Low (no chunking)	0.1x vectors

All recall@10 figures are indicative, derived from public evaluations as of June 2026 (LangChain docs, ChromaDB evals, MongoDB/Voyage blog benchmarks, academic BEIR-adjacent studies). Numbers vary significantly by corpus, embedding model, and query distribution. Verify against your own held-out eval set before committing to a strategy.

Why chunking matters more than most engineers think

Most RAG tutorials treat chunking as a two-line setup step: instantiate a text splitter, call split_documents, move on. This is understandable: the step is fast, the code is simple, and the pipeline keeps moving. What gets lost is that chunking defines the atomic unit of retrieval. Every downstream component (the embedding model, the vector index, the LLM context window) operates on whatever chunks you created. If a chunk cuts a concept in half, no embedding model can fix it. If a chunk runs 2,000 tokens but your embedding model was trained on 512-token sequences, the tail of the chunk will be truncated or underweighted, depending on how the model handles over-length input. These are not edge cases; they are systematic biases that degrade every query against that index.

The practical consequence is that chunking strategy interacts with corpus structure in ways that generalize poorly across domains. A 512-token fixed-size split works reasonably well on structured technical documentation where paragraphs happen to be short and self-contained. The same split is damaging on legal contracts where a defined term introduced in paragraph 3 is referenced throughout a 40-page document and splitting severs that dependency. Semantic chunking, which detects topic shift points using embedding similarity, can recover some of that structure — but at significant compute cost and with its own failure modes on highly repetitive corpora like dense FAQ pages.

Understanding this interaction is the core skill this article tries to develop. The benchmark numbers matter less than the mental model: chunk size and strategy are not global settings you optimize once; they are corpus-specific choices you make deliberately and then validate empirically. Every engineering team that runs a recall@10 evaluation against their actual corpus before deploying discovers something they did not expect.

Fixed-size strategy: the honest baseline

Fixed-size chunking splits a document into contiguous windows of exactly N tokens, with an optional overlap of M tokens between consecutive chunks. It requires no ML, no linguistic analysis, and no corpus-specific configuration beyond N and M. It is deterministic, fast, parallelizable, and easy to reason about. For these reasons it is the right starting point for any new RAG project: build the baseline first, then measure whether a more complex strategy actually improves recall on your data.
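The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for real tokenizer output (an assumption made to keep the example dependency-free).

```python
from typing import List, Sequence


def fixed_size_chunks(tokens: Sequence[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace "tokens" are a stand-in for a real tokenizer (assumption for illustration).
tokens = "a b c d e f g h i j".split()
for chunk in fixed_size_chunks(tokens, size=4, overlap=1):
    print(chunk)
```

With N=4 and M=1 the windows advance three tokens at a time, so each chunk begins with the last token of the previous one.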

The performance ceiling is the strategy's main limitation. On technical documentation in the BEIR-adjacent evaluations we surveyed, fixed-size 512-token chunks achieved recall@10 around 61%. On customer support Q&A corpora, where answers are often shorter than 512 tokens and a fixed window frequently includes part of a neighboring Q&A pair, recall dropped to around 58%. Legal contracts fared worst at approximately 52%, because critical definitions and cross-references span sections in ways that token counting has no way to detect.

The practical recommendation is to treat fixed-size as your first checkpoint, not your destination. Run it, measure recall@10 on a held-out set of 50-200 query/answer pairs, record the number, then try the next strategy. If your corpus happens to be pre-segmented into short, self-contained units (a FAQ where each entry is already a separate document, a product catalog where each row is independent), fixed-size may be the best strategy precisely because the corpus structure already does the semantic work. If you are working with dense long-form text, plan to iterate.

Recursive character splitter: the pragmatic default

The recursive character splitter, popularized by LangChain's RecursiveCharacterTextSplitter, applies a prioritized hierarchy of separators: paragraph break, newline, space, then individual character as a last resort. It tries each separator in order and only falls back to the next when the current separator would produce a chunk larger than the target size. The result is chunks that tend to respect natural document structure — paragraphs stay together, sentences are not split mid-thought — without requiring any ML inference.

In practice this strategy outperforms fixed-size on nearly every corpus type. The evaluations surveyed show recall@10 of approximately 68% on technical docs, 64% on support Q&A, and 56% on legal contracts when using a 512-token target with a 50-token overlap. The improvement over fixed-size is largest on documentation corpora that use consistent paragraph formatting, where the paragraph separator does most of the work. The gap over fixed-size narrows on corpora with inconsistent whitespace or raw OCR output where the hierarchy of separators loses its signal.

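The code to implement this in LangChain is straightforward: `from langchain.text_splitter import RecursiveCharacterTextSplitter; splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50, separators=["\n\n", "\n", " ", ""])`. Note that chunk_size and chunk_overlap are measured in characters by default, because the splitter's default length function is Python's len. To size chunks in tokens, build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=512, chunk_overlap=50) instead, which requires the tiktoken package. In recent LangChain releases the same class is also importable from the langchain_text_splitters package. For code corpora, LangChain provides language-aware variants (RecursiveCharacterTextSplitter.from_language) that add function and class boundary separators. This is almost always the right choice for mixed Python/documentation repositories. The pragmatic advice is: start with recursive character splitting at 512 tokens and 50-token overlap, establish your recall@10 baseline, and only move to a more expensive strategy if you have measured a specific gap you need to close. Most production RAG systems that are performing well enough are running some variant of this strategy.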
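To make the separator hierarchy concrete, here is a simplified sketch of the idea in plain Python. It is not LangChain's implementation: it measures length in characters, has no overlap, collapses repeated separators, and the name `recursive_split` is hypothetical.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", " ", "")  # paragraph, line, word, then single characters


def recursive_split(text: str, max_len: int, separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_len characters, preferring coarse separators."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    pieces: List[str] = []
    for part in text.split(sep):
        if not part:
            continue  # drop empty parts produced by repeated separators
        if len(part) <= max_len:
            pieces.append(part)
        else:
            # This part is still too long: fall back to the next, finer separator.
            pieces.extend(recursive_split(part, max_len, finer))
    # Greedily glue neighbouring pieces back together while they still fit.
    merged: List[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= max_len:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged


sample = "Para one is here.\n\nPara two.\n\nA much longer third paragraph that will not fit."
for chunk in recursive_split(sample, max_len=30):
    print(repr(chunk))
```

The two short paragraphs stay together in one chunk, while the long third paragraph is split at word boundaries rather than mid-word, which is the behavior the separator hierarchy is meant to produce.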

Semantic boundary chunking: highest quality, highest cost

Semantic chunking replaces token counting with embedding-based topic detection. The algorithm embeds consecutive sentences or short windows, computes cosine similarity between adjacent windows in a sliding pass, and identifies points where similarity drops sharply as topic shift boundaries. The document is then split at those boundaries rather than at fixed token counts. The intuition is correct and the results reflect it: semantic chunking achieves recall@10 of approximately 71% on technical docs, 69% on support Q&A, and 62% on legal contracts in the evaluations surveyed, the highest numbers among the flat (non-hierarchical) chunking strategies across all three corpus types.

The cost is real. Semantic chunking requires an embedding pass over the entire corpus at index time, which adds latency proportional to corpus size and inference cost. For a 10,000-document corpus this is a one-time cost that may be acceptable; for a corpus that is updated in real time with low-latency requirements, the embedding dependency on the indexing path becomes an operational concern. The strategy also has failure modes on corpora with low semantic variance — dense FAQ pages where every paragraph discusses the same topic, for example — where the similarity threshold produces chunks that are either too large (no splits detected) or too granular (every sentence splits because topics are all slightly different). Threshold tuning is required, and optimal thresholds are corpus-specific.

A minimal Python implementation using sentence-transformers looks like this: embed all sentences with a model such as all-MiniLM-L6-v2, compute cosine similarity between sentence i and sentence i+1, identify indices where similarity falls below a threshold (typically 0.4-0.6, tuned on your corpus), and split at those indices. `from sentence_transformers import SentenceTransformer, util; model = SentenceTransformer('all-MiniLM-L6-v2'); embeddings = model.encode(sentences); sims = [util.cos_sim(embeddings[i], embeddings[i+1]).item() for i in range(len(embeddings)-1)]; split_points = [i+1 for i, s in enumerate(sims) if s < threshold]`. The threshold is the critical hyperparameter: too low and you get huge chunks; too high and you fragment every paragraph. For corpora you have not analyzed before, start at 0.5 and tune. See the build RAG with Pinecone tutorial for a worked example of this implementation in a full pipeline.
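The splitting step can be separated from the embedding step, which makes it easy to test without downloading a model. The sketch below assumes the sentence embeddings have already been computed (for example with the sentence-transformers call described above) and are passed in as plain lists of floats. The function names and the 2-D toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: List[str],
                    embeddings: List[Sequence[float]],
                    threshold: float = 0.5) -> List[str]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    groups: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])  # similarity dropped: treat as a topic boundary
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]


# Toy 2-D vectors standing in for real sentence embeddings (assumption for illustration).
sentences = ["Install the CLI.", "Then run init.", "Billing is monthly.", "Invoices are emailed."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(sentences, embeddings, threshold=0.5))
```

Lowering the threshold in this sketch produces fewer, larger chunks and raising it produces more, smaller ones, which matches the tuning behavior described above.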

Sliding window with overlap: balancing boundary loss and storage

Sliding window chunking is fixed-size chunking with a deliberately chosen overlap between consecutive chunks. A chunk of 512 tokens with 25% overlap means chunk 2 starts 384 tokens into chunk 1, so the last 128 tokens of chunk 1 are repeated at the start of chunk 2. The motivation is straightforward: if a relevant passage straddles a chunk boundary, at least one of the two overlapping chunks will contain it intact. This directly addresses the boundary problem that makes fixed-size chunking underperform on corpora with dense cross-sentence dependencies.

The performance uplift is real but modest. Recall@10 on technical docs reaches approximately 67% at 25% overlap, compared to 61% for no-overlap fixed-size. On legal contracts, where long-range dependencies span more than a single boundary, the improvement is more pronounced: approximately 59% versus 52%. The cost is deterministic: at 25% overlap the stride drops from 512 to 384 tokens, so a long document produces up to 512/384 ≈ 1.33x as many vectors, while a document shorter than one chunk produces no extra vectors at all. The approximately 1.2x figure in the comparison table is consistent with a corpus that mixes long and short documents. Storage and query latency increase proportionally as the vector index grows. This is not free, but it is predictable and easy to budget.
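The vector-count overhead of a given overlap can be estimated from document lengths alone, before indexing anything. The sketch below assumes the final partial window of each document is kept as its own chunk. The function names are hypothetical and the document lengths are made up for illustration.

```python
import math
from typing import Iterable


def chunk_count(n_tokens: int, size: int, overlap: int = 0) -> int:
    """Number of windows needed to cover n_tokens with the given size and overlap."""
    if n_tokens <= 0:
        return 0
    if n_tokens <= size:
        return 1
    stride = size - overlap
    return 1 + math.ceil((n_tokens - size) / stride)


def vector_overhead(doc_lengths: Iterable[int], size: int, overlap: int) -> float:
    """Ratio of chunk counts with overlap to chunk counts without, across a corpus."""
    lengths = list(doc_lengths)
    base = sum(chunk_count(n, size) for n in lengths)
    if base == 0:
        raise ValueError("corpus has no tokens")
    return sum(chunk_count(n, size, overlap) for n in lengths) / base


# One very long document approaches the size/stride limit of 512/384.
print(round(vector_overhead([100_000], 512, 128), 2))    # ~1.33
# Documents that fit in a single chunk add no overhead at all.
print(vector_overhead([400] * 10, 512, 128))             # 1.0
```

Running this over your own length distribution gives a storage estimate for any overlap setting you are considering.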

The practical guidance is to use sliding window as an upgrade over fixed-size when you cannot afford the compute cost of semantic chunking but have measured that fixed-size is losing recall at chunk boundaries. The 10-25% overlap range covers most production use cases. Below 10%, the boundary coverage is minimal. Above 25%, storage costs accumulate quickly and retrieval may start surfacing near-duplicate chunks that confuse the LLM with slightly varying versions of the same passage. Some production systems add a deduplication step that drops chunks whose embedding cosine similarity to an existing chunk exceeds 0.98, which partially mitigates this at the cost of additional indexing complexity.
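The deduplication step mentioned above can be sketched as a greedy filter over chunk embeddings. This is an illustration under stated assumptions: embeddings are plain lists of floats, the comparison is quadratic in the number of chunks (a real index would use an approximate nearest-neighbor lookup instead), and the names `dedupe_chunks` and `max_sim` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedupe_chunks(chunks: List[str],
                  embeddings: List[Sequence[float]],
                  max_sim: float = 0.98) -> List[str]:
    """Keep a chunk only if it is not a near-duplicate of any chunk already kept."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= max_sim for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]


# Toy vectors: the second chunk is almost identical to the first and is dropped.
chunks = ["refund policy v1", "refund policy v1 (dup)", "shipping times"]
embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(dedupe_chunks(chunks, embeddings))
```

Because the filter keeps the first chunk it sees from each near-duplicate group, the order in which chunks are indexed determines which copy survives.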

Hierarchical / parent-child: the precision-context trade-off resolved

The fundamental tension in RAG chunking is that precision and context pull in opposite directions. Small chunks retrieve precisely — a 128-token chunk that exactly matches a query term will score high — but return too little context for the LLM to generate a complete answer. Large chunks provide rich context but their embeddings average over many topics, reducing the signal-to-noise ratio for any specific query. Hierarchical parent-child indexing resolves this tension by separating the retrieval unit from the context unit: index small child chunks (typically 128-256 tokens) for precise retrieval, but when a child chunk is retrieved, return its parent chunk (typically 512-1024 tokens) as the LLM context.

The results are the best chunk-based numbers across all three corpus types: approximately 73% recall@10 on technical docs, 70% on support Q&A, and 64% on legal contracts. The 64% on legal still trails document-level Cohere embedding at 67%, which points to the fundamental limitation of any chunk-based approach on corpora with very long-range dependencies, but for technical and support corpora, parent-child posts the highest recall in the table. The storage overhead runs 1.3-1.5x the baseline because smaller child chunks mean more vectors and the parent text must be stored alongside them. Only the child embeddings need to live in the hot vector index; parent text can be retrieved from a key-value store keyed by parent ID.

Implementation requires maintaining a mapping from child chunk IDs to parent chunk IDs at index time, and a two-step retrieval: vector search returns child chunk IDs, a lookup step fetches parent text. In LangChain this is the ParentDocumentRetriever abstraction: `from langchain.retrievers import ParentDocumentRetriever; from langchain.storage import InMemoryStore; from langchain.text_splitter import RecursiveCharacterTextSplitter; child_splitter = RecursiveCharacterTextSplitter(chunk_size=256); parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1024); retriever = ParentDocumentRetriever(vectorstore=vectorstore, docstore=InMemoryStore(), child_splitter=child_splitter, parent_splitter=parent_splitter)`. Here vectorstore is a vector store instance you have already created. The docstore holds parent text; the vectorstore holds child embeddings only. As with any RecursiveCharacterTextSplitter, these chunk_size values count characters unless the splitters are built with from_tiktoken_encoder. For production deployments, replace InMemoryStore with Redis or a relational store. This architecture is more operationally complex than a flat index, but the recall uplift over recursive character splitting (approximately 5 percentage points on technical docs) is consistent enough across evaluations to justify it for teams that have established a stable corpus and are optimizing a production system. For a fresh corpus where you are still learning the data, start simpler.
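The child-to-parent mapping and the two-step lookup can be illustrated without any framework. The sketch below is not how ParentDocumentRetriever is implemented internally: it uses fixed-size windows over whitespace tokens, stubs out the vector search by taking ranked child IDs as input, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def build_parent_child(tokens: Sequence[str],
                       parent_size: int = 1024,
                       child_size: int = 256) -> Tuple[Dict[int, str], List[Tuple[int, str]]]:
    """Return (parents keyed by ID, list of (parent_id, child_text)); a child's ID is its list index."""
    parents: Dict[int, str] = {}
    children: List[Tuple[int, str]] = []
    for pid, p_start in enumerate(range(0, len(tokens), parent_size)):
        parent_tokens = list(tokens[p_start:p_start + parent_size])
        parents[pid] = " ".join(parent_tokens)  # would live in the docstore
        for c_start in range(0, len(parent_tokens), child_size):
            child_text = " ".join(parent_tokens[c_start:c_start + child_size])
            children.append((pid, child_text))  # would be embedded into the vector index
    return parents, children


def parents_for_hits(hit_child_ids: Sequence[int],
                     children: List[Tuple[int, str]],
                     parents: Dict[int, str]) -> List[str]:
    """Map ranked child hits to parent texts, keeping rank order and dropping repeats."""
    seen = set()
    context: List[str] = []
    for cid in hit_child_ids:
        pid = children[cid][0]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context


tokens = [str(i) for i in range(10)]
parents, children = build_parent_child(tokens, parent_size=4, child_size=2)
# Pretend the vector search ranked child IDs 3, 2, 0 (stubbed for illustration).
print(parents_for_hits([3, 2, 0], children, parents))
```

Children 3 and 2 share a parent, so that parent is returned once. This deduplication is what keeps the LLM context from filling with repeated parent text when several sibling children match the same query.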

The document-level alternative: Cohere embed-v4 at 128k context

A strategy absent from most chunking comparisons is no chunking at all. With the release of embedding models that accept 128k-token inputs (Cohere's embed-v4 being the most prominent as of mid-2026), it is now technically feasible to embed entire documents and retrieve at document granularity. This eliminates all boundary problems, dependency breakage, and strategy selection overhead. For legal contracts, the benchmark shows document-level Cohere embedding at approximately 67% recall@10, ahead of every chunk-based strategy on that corpus, including parent-child at 64%. This is not a theoretical option. It is a competitive approach.

The trade-offs are real and domain-specific. Long documents produce dense, averaged embeddings that can lose precision on narrow queries: asking about a specific clause in a 40-page contract will return the correct document, but the LLM still needs to locate the clause within the full document text fed as context. If your LLM context window is large enough to hold the full document (Gemini 1.5 Pro at 1M tokens, Cohere Command R+ at 128k), this works well and the document-level approach is genuinely simpler to operate. If your documents are longer than the LLM's context window, you need a second-stage retrieval step anyway, at which point you have effectively re-introduced chunking.

The practical recommendation for legal and regulatory corpora specifically is to test document-level Cohere embedding before investing in a complex hierarchical chunking pipeline. If your average document is under 50k tokens and your LLM can accept the full document, document-level embedding plus full-document context may be the highest-recall, lowest-complexity option available. For technical documentation and support Q&A, where documents are shorter but queries are more targeted, the precision loss from averaging over a full document typically makes chunk-based approaches — particularly parent-child — more effective. See Cohere vs Voyage vs OpenAI embeddings for a full comparison of model capabilities at different context lengths.

Benchmark methodology and caveats

The recall@10 numbers presented in this article are drawn from public evaluations published through June 2026. Sources include the LangChain documentation benchmarks on the LangChain blog, ChromaDB's public chunking evaluation notebook, MongoDB's RAG evaluation series co-published with Voyage AI, and several academic papers evaluating BEIR datasets with RAG pipelines. Where multiple sources reported numbers for the same strategy and corpus type, we took a representative midpoint. No independent benchmarking was conducted by Digital Dashboard Hub for this article.

Several factors will shift these numbers when you apply them to your own corpus. First, the embedding model matters enormously. Most public evaluations use OpenAI text-embedding-3-small or -large, or Voyage AI's voyage-2. Numbers with a different model — especially one with a different training context window — will differ. Second, the query distribution matters. BEIR and BEIR-adjacent evaluations use specific query types; if your users ask very different questions, the ranking of strategies may change. Third, document preprocessing matters. OCR quality, HTML stripping, whitespace normalization, and header/footer removal all affect chunking quality in ways that benchmarks on clean academic datasets do not capture. Fourth, the chunk size numbers are specific to the parameter choices listed. Fixed 512 is not the same as fixed 256 or fixed 1024; the benchmark numbers would shift at different sizes.

The appropriate response to all of this uncertainty is empirical: build a held-out evaluation set of 50-200 (query, expected-answer-document) pairs representative of your actual user queries, implement two or three candidate strategies, compute recall@10 on your corpus, and make the strategy decision from your own numbers. The table and verdicts in this article are a prior to update, not a conclusion to accept. If you have not yet run a recall evaluation on your corpus, the RAG cost calculator can help you estimate the cost of running that evaluation at different corpus and query set sizes.
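The metric itself is simple to compute once you have ranked results per query. The sketch below assumes one expected source document per query, as in the evaluation set described above. The function name, the dictionary layout, and the example IDs are hypothetical.

```python
from typing import Dict, List


def recall_at_k(results: Dict[str, List[str]], expected: Dict[str, str], k: int = 10) -> float:
    """Fraction of queries whose expected document ID appears in the top-k retrieved IDs."""
    if not expected:
        return 0.0
    hits = sum(
        1 for query, doc_id in expected.items()
        if doc_id in results.get(query, [])[:k]
    )
    return hits / len(expected)


# Hypothetical eval data: query -> expected source document ID.
expected = {"reset password": "doc-7", "export csv": "doc-2", "api rate limit": "doc-9"}
# Hypothetical retriever output: query -> ranked document IDs.
results = {
    "reset password": ["doc-7", "doc-1"],
    "export csv": ["doc-4", "doc-5"],
    "api rate limit": ["doc-3", "doc-9"],
}
print(recall_at_k(results, expected, k=10))
```

If your index stores chunks rather than documents, map each retrieved chunk back to its source document ID before scoring, so that strategies with different chunk counts are compared on the same basis.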

Verdict matrix by corpus type

For technical documentation — API references, engineering wikis, product manuals — the benchmarks point to parent-child (256 child / 1024 parent) as the highest-recall strategy, with recursive character 512+50 as the pragmatic alternative when operational simplicity matters more than the last 5 percentage points. Technical documentation tends to have consistent paragraph structure that recursive character splitting handles well, and the query patterns (specific function names, configuration parameters, error codes) benefit from precise small-chunk retrieval. The parent-child architecture earns its complexity here.

For customer support Q&A corpora (helpdesk tickets, FAQ databases, community forum threads), parent-child and semantic chunking are effectively tied at approximately 70% and 69% recall@10 respectively, a gap well within the noise of these indicative figures. Semantic chunking is the natural fit here because support corpora have highly variable answer lengths (some answers are two sentences, others are 800 words), and forcing a fixed token window either splits answers across chunks or combines multiple unrelated answers into one chunk. Semantic boundary detection avoids both failure modes. If semantic chunking's compute cost is prohibitive, recursive character splitting at 512+50 is a reasonable fallback at 64%.

For legal contracts, regulatory filings, and compliance documents, document-level Cohere embedding at 128k context achieves the highest recall at approximately 67%, assuming documents fit within the context window and your LLM can accept full-document input. Where that is not the case, parent-child is the strongest chunk-based option in the table at approximately 64%, followed by semantic chunking at approximately 62%. Legal text's dense cross-referencing structure benefits from the longest possible context units, and any strategy that hands the LLM only sub-512-token pieces loses critical definitional context.

For code corpora, no benchmark is presented, but the consistent finding in the literature is that recursive character splitting with language-aware splitters (splitting on function and class definitions) outperforms generic token-based approaches. General English corpora (news articles, blog posts, Wikipedia) follow the pragmatic default: recursive 512+50 performs reliably and the complexity of more advanced strategies rarely yields measurable recall gains.

Sources and further reading

The benchmarks cited in this article draw from: LangChain's RAG from Scratch series and text splitter evaluation notebooks (langchain-ai.github.io, 2024-2025); ChromaDB's chunking evaluation repository (trychroma.com/blog, 2024); MongoDB and Voyage AI's RAG evaluation blog series (mongodb.com/developer, 2024-2025); the BEIR benchmark paper (Thakur et al., NeurIPS 2021) and subsequent RAG-specific evaluations that apply BEIR-style methodology; Cohere's embed-v4 technical documentation and evaluation results (cohere.com, 2025-2026). All numbers should be treated as indicative given the differences in embedding models, preprocessing, and query distributions used across these sources.

For implementation depth beyond what this article covers, the LangChain text splitter documentation provides the most complete reference for the recursive character and parent-child implementations. The sentence-transformers library documentation covers the embedding-based semantic chunking approach. For evaluation methodology, the BEIR paper itself is the best starting point for understanding what recall@10 measures and what its limitations are as a single-number summary of retrieval quality.

Related reading on this site: when RAG fails and fixes covers the seven root causes of RAG underperformance beyond chunking; build RAG with Pinecone provides a complete implementation walkthrough; RAG architecture decision tree 2026 covers the upstream decision of when to use RAG versus fine-tuning versus long-context prompting; and embedding model leaderboard 2026 covers the model selection decision that interacts directly with chunk size.

Picking and implementing the right chunking strategy

1. Characterize your corpus type
Before selecting a strategy, spend time with your actual data. Compute the distribution of document lengths in tokens, identify whether documents have consistent structural markers (headers, paragraph breaks, numbered sections), and determine whether concepts in your domain tend to be self-contained within paragraphs or span across sections. Technical documentation with consistent paragraph structure is fundamentally different from legal contracts with defined terms that cascade across 40 pages, and your chunking strategy should reflect that difference. If your corpus is heterogeneous — mixing long-form documents with short FAQ entries, for example — consider whether you need different strategies for different document types, or whether parent-child indexing can handle the variance in a single architecture.
2. Choose a starting strategy from the verdict matrix
Map your corpus type to the verdict matrix: technical docs start with parent-child (256/1024) or recursive 512+50; customer support Q&A starts with semantic chunking or recursive 512+50; legal contracts start with document-level Cohere embedding or semantic chunking with overlap; code starts with recursive + language-aware splitters; general English starts with recursive 512+50. If you are uncertain about your corpus type, default to recursive character 512+50 — it is the safest pragmatic choice with reliable performance across corpus types and low implementation cost. The goal of this step is to choose a starting point, not to finalize a strategy. You will measure and iterate.
3. Implement the strategy with the relevant code pattern
For recursive character splitting, use LangChain's RecursiveCharacterTextSplitter with chunk_size=512 and chunk_overlap=50. For semantic chunking, embed sentences with a lightweight model (all-MiniLM-L6-v2 is a reasonable default), compute pairwise similarities in a sliding window, and split at cosine similarity drops below your threshold. For parent-child, use LangChain's ParentDocumentRetriever with a 256-token child splitter and a 1024-token parent splitter, backed by a key-value docstore for parent retrieval. In all cases, preserve document metadata (source, section, page number) as chunk metadata at index time — you will need it for citation and debugging. Test your implementation on 20-30 representative documents before indexing the full corpus to catch preprocessing issues early.
4. Run recall@10 on a held-out evaluation set
Build an evaluation set of at least 50 (query, expected-source-document) pairs representative of your actual user queries. If you do not have real user queries yet, generate synthetic queries by asking an LLM to produce questions that are answerable from random passages in your corpus — this is imperfect but better than skipping evaluation. Run your chosen strategy, index the corpus, execute all queries, and measure what fraction of queries return the expected source document in the top 10 results. Record this number. Then run at least one alternative strategy and compare. A 5-point improvement in recall@10 on your specific corpus is worth the implementation cost of a more complex strategy; a 1-point improvement is probably not.
5. Tune chunk size and overlap based on measured results
If recall@10 is below expectations, the two most productive knobs to turn are chunk size and overlap. If queries are returning chunks that contain the right topic but not the specific answer, reduce chunk size — smaller chunks retrieve more precisely. If the LLM is generating incomplete answers because retrieved chunks lack context, increase the parent chunk size in a hierarchical architecture, or increase overlap in a sliding window approach. If precision is high but recall is low (relevant chunks exist but are not being retrieved), the problem is more likely the embedding model than the chunk size — see embedding model leaderboard 2026 for alternatives. Re-run your recall@10 evaluation after each tuning change and stop when diminishing returns set in. Rebuild the index when you change chunking strategy — there is no safe path to migrating an existing index to a new chunking scheme.
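The compare-and-decide rule from the steps above can be written down so that it is applied the same way after every tuning run. The sketch below takes recall@10 in percentage points per strategy and switches away from the baseline only when the measured gain clears a minimum. The function name and the 5-point default are assumptions for illustration; pick the cutoff that matches your own cost of added complexity.

```python
from typing import Dict


def pick_strategy(recall_points: Dict[str, float], baseline: str, min_gain: float = 5.0) -> str:
    """Return the best-scoring strategy if it beats the baseline by min_gain points, else the baseline."""
    if baseline not in recall_points:
        raise KeyError(f"baseline {baseline!r} has no measured score")
    best = max(recall_points, key=recall_points.get)
    gain = recall_points[best] - recall_points[baseline]
    return best if gain >= min_gain else baseline


# Scores are recall@10 in percentage points, measured on your own held-out set.
print(pick_strategy({"recursive": 68, "parent_child": 73}, baseline="recursive"))
print(pick_strategy({"recursive": 68, "sliding": 69}, baseline="recursive"))
```

Scores are kept in whole percentage points rather than fractions so that a gain of exactly 5 points compares cleanly against the cutoff.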

Frequently Asked Questions

What chunk size should I use for RAG?

The short answer is: 512 tokens with 50-token overlap as a starting point, then measure and adjust. The right chunk size depends on three factors: your embedding model's effective context window (most open models were trained on sequences under 512 tokens; forcing longer inputs degrades embedding quality), your average document's natural semantic unit length (if paragraphs in your corpus average 200 tokens, a 512-token chunk will span 2-3 paragraphs and blend their topics), and your average query length (short keyword-style queries benefit from small, precise chunks; longer conversational queries tolerate larger chunks). Run recall@10 at 256, 512, and 1024 tokens on your corpus before committing to a size. The optimal number is corpus-specific and varies more than most benchmarks suggest.

What is parent-child chunking and why does it outperform standard chunking?

Parent-child chunking is a hierarchical indexing architecture where small child chunks (typically 128-256 tokens) are embedded and stored in the vector index for precise retrieval, but when a child chunk is matched by a query, the system returns the larger parent chunk (typically 512-1024 tokens) as the context passed to the LLM. This resolves the fundamental tension between retrieval precision and generation quality: small chunks match queries precisely because their embeddings are not diluted by off-topic content, but the LLM receives enough surrounding context to generate a complete, coherent answer. The performance advantage in the benchmarks — approximately 5 percentage points over recursive character splitting on technical docs — reflects this structural advantage. The trade-off is operational complexity: you need to maintain a mapping from child IDs to parent text, a separate docstore, and a two-step retrieval pipeline.

Does adding overlap always improve recall?

Overlap reliably reduces boundary-cut losses, so it almost always improves recall relative to no-overlap fixed-size chunking. However, the improvement is bounded and comes with costs. Above approximately 25% overlap, the number of near-duplicate chunks in the index grows enough that retrieval may return multiple slightly-varying versions of the same passage, which can confuse the LLM and inflate storage costs. The improvement over no-overlap diminishes as overlap increases: going from 0% to 10% overlap typically yields the largest recall gain, and the marginal gain from 20% to 25% is much smaller. The 10-25% overlap range covers most production use cases. If you are seeing boundary losses and already using 25% overlap, the better next step is upgrading to a semantic or parent-child strategy rather than increasing overlap further.

Is semantic chunking worth the compute cost?

It depends on your corpus type and operational context. For customer support Q&A corpora with variable answer lengths, semantic chunking achieves the highest recall of any flat strategy (approximately 69%, within a point of parent-child), and the compute cost at index time is a one-time expense per document. If your corpus is updated infrequently and you are optimizing a production system that will run millions of queries, the index-time cost amortizes quickly. For corpora updated in near-real-time or pipelines where index latency matters, semantic chunking's embedding dependency on the indexing path becomes an operational concern. Semantic chunking also requires threshold tuning, which adds a calibration step to your setup process. The honest answer is: if you have measured a recall gap between recursive character splitting and your target, and your corpus is update-infrequent, semantic chunking is usually worth it. If you are still in early development, start with recursive character splitting and revisit.

How does chunking strategy affect retrieval latency?

Chunking strategy affects latency in two ways: through the number of vectors in the index and through the retrieval architecture. More chunks mean a larger index, which increases approximate nearest-neighbor search time. Most vector databases scale sub-linearly with index size using HNSW or similar structures, so the latency impact up to a few million vectors is typically under 5ms. The larger latency factor is architectural: parent-child retrieval requires a second lookup step after vector search to fetch parent text from a docstore, adding a round-trip that is typically 1-10ms depending on where the docstore lives. Semantic chunking does not affect query latency (the embedding computation happens at index time only). Sliding window with overlap increases index size by roughly 1 / (1 - overlap fraction), or about 1.1x to 1.3x across the 10-25% overlap range, which has a minor but measurable latency effect at large scale. For most production systems under 10 million documents, chunking strategy will not be the dominant latency factor: embedding the query and calling the LLM will be larger contributors.

What is BEIR and is it the right benchmark for my use case?

BEIR (Benchmarking Information Retrieval) is a heterogeneous benchmark suite introduced by Thakur et al. in 2021 that collects 18 diverse retrieval datasets spanning web search, scientific documents, financial filings, and more. It has become the standard benchmark for dense retrieval systems because it tests generalization across corpus types rather than performance on a single domain. Recall@10 on BEIR datasets means: of the documents judged relevant to a query, what fraction appear in the top 10 results, averaged over all queries? When each query has exactly one relevant document, this reduces to whether that document was retrieved in the top 10. BEIR is a useful comparison standard, but it is not a substitute for evaluating on your own corpus. BEIR datasets are clean, well-preprocessed, and drawn from domains that may differ substantially from your data. A strategy that ranks first on BEIR may not rank first on your proprietary internal documentation. Use BEIR numbers (including those in this article) as a prior, then build your own evaluation set and measure on your actual data.
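Measuring recall@10 on your own evaluation set takes very little code. The sketch below assumes you already have, for each query, the ranked list of retrieved document IDs and the set of IDs judged relevant; the function names are hypothetical. If your index stores chunks rather than documents, map chunk IDs back to document IDs before scoring.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results for one query."""
    if not relevant_ids:
        raise ValueError("query has no relevant documents")
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(runs, k=10):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        raise ValueError("no queries to evaluate")
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

Running `mean_recall_at_k` once per chunking strategy on the same query set gives you a like-for-like comparison on your actual data.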

Can I change my chunking strategy after the index is already built?

No, not without rebuilding the index. Chunking strategy, chunk size, and overlap are baked into the vectors stored in your vector database. An index built with fixed-size 512-token chunks cannot be retroactively re-chunked to parent-child without re-embedding every document. This is one of the most expensive mistakes in RAG production deployments: choosing a strategy quickly to ship a prototype and then needing to rebuild the entire index when the prototype proves inadequate. The practical mitigation is to make the strategy decision deliberately upfront: run at least two strategies on a representative sample of your corpus and measure recall@10 before building the full production index. Index rebuilds are operationally disruptive and, for large corpora with commercial embedding APIs, financially significant. If your corpus is large, consider whether a staged rollout (new strategy for new documents, old strategy maintained for existing documents) can bridge the gap while a background rebuild runs.
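One way to keep a staged rollout manageable is to record, alongside each document, a fingerprint of the settings its vectors were built with. The sketch below is one possible approach under stated assumptions, not a feature of any particular vector database: the function names, the config fields, and the idea of storing the fingerprint as per-document metadata are all hypothetical.

```python
import hashlib
import json


def chunking_fingerprint(strategy, chunk_size, overlap, embedding_model):
    """Stable short hash of every setting that is baked into the stored vectors."""
    config = {
        "strategy": strategy,
        "chunk_size": chunk_size,
        "overlap": overlap,
        "embedding_model": embedding_model,
    }
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def needs_rebuild(indexed_docs, current_fingerprint):
    """Return IDs of documents whose stored vectors came from a different config."""
    return [doc_id for doc_id, fp in indexed_docs.items() if fp != current_fingerprint]
```

A background rebuild job can then work through the list from `needs_rebuild` while new documents are indexed under the current fingerprint.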

How do I handle documents that mix content types, like a PDF with tables, code blocks, and prose?

Mixed-content documents are one of the harder chunking problems in practice and the one benchmarks handle least well. The general advice is to preprocess by content type before chunking: extract tables as structured key-value text or as separate documents with table-specific metadata; extract code blocks as separate chunks using a language-aware splitter that respects function/class boundaries; and apply standard recursive or semantic chunking to the prose sections. Lumping all content types into a single chunking pass typically produces low-quality chunks that mix partial table rows with prose sentences, degrading embedding quality for both. If your documents are ingested through a PDF pipeline, tools like unstructured.io provide content-type-aware extraction that separates tables, figures, headers, and prose before your chunking step. The indexing complexity is higher, but the recall improvement on mixed-content corpora is consistently meaningful.
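The preprocessing advice above amounts to routing each extracted element to a chunker that suits its type. The sketch below assumes an upstream extractor has already produced `(kind, payload)` pairs; that input shape, the three `kind` labels, and all function names are hypothetical and not the output format of any specific tool. The code splitter is deliberately crude and only recognizes top-level Python `def` and `class` lines.

```python
import re


def chunk_table(rows):
    """Turn each data row into 'header: value' text so a row embeds as one unit."""
    header, *body = rows
    return ["; ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in body]


def chunk_code(source):
    """Split Python source at top-level def/class lines (a crude language-aware split)."""
    parts = re.split(r"^(?=(?:def|class)\s)", source, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


def chunk_prose(text):
    """Split prose on blank lines; a real pipeline would use recursive or semantic chunking."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_elements(elements):
    """Route (kind, payload) elements to a type-specific chunker and tag each chunk."""
    handlers = {"table": chunk_table, "code": chunk_code, "prose": chunk_prose}
    chunks = []
    for kind, payload in elements:
        for text in handlers[kind](payload):
            chunks.append({"type": kind, "text": text})
    return chunks
```

Tagging each chunk with its type also lets you filter or weight results by content type at query time, for example preferring table chunks for questions about prices or limits.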
