Chunking strategy
Overview
Chunking strategy refers to the method by which source documents are segmented into smaller units—typically called chunks or passages—that can be individually embedded, indexed, and retrieved during retrieval-augmented generation (RAG) workflows. The choice of chunking strategy directly affects both the quality of retrieved context and the efficiency of semantic search operations in vector databases.
In RAG pipelines, source documents are often too large to fit into a model's context window in their entirety. Chunking addresses this constraint by creating discrete, retrievable units. The optimal chunk size, granularity, and segmentation method depend on the document type, the domain, the language model being used, and the downstream task. A poorly chosen strategy can produce semantically fragmented passages that harm retrieval precision and recall, or redundant passages that waste computational resources.
Chunking strategy is distinct from other RAG design decisions such as the embedding model, ranking mechanism, or prompt engineering technique. However, it is fundamental to the entire retrieval pipeline: chunks that are too small may lose the surrounding context needed to interpret them, while chunks that are too large may dilute relevance signals and increase storage cost or retrieval latency.
How it works
Chunking operates at the document preprocessing stage, before embedding and indexing. The process typically follows these steps:
- Document ingestion and preparation: Raw documents (PDF, HTML, Markdown, plain text) are loaded and normalized.
- Segmentation: The document is divided using one or more of the following strategies:
  - Fixed-size chunking: Dividing by a set number of tokens (e.g., 512 tokens) or characters (e.g., 1,000 characters), often with overlap to preserve context across chunk boundaries.
  - Semantic chunking: Splitting at sentence or paragraph boundaries, or using model-based methods to identify natural breakpoints where semantic coherence changes.
  - Structural chunking: Exploiting document structure (headers, sections, tables, lists) to preserve logical units.
  - Hybrid approaches: Combining multiple strategies, such as first splitting by sections, then by fixed size within each section.
- Overlap and padding: Many strategies introduce overlap (e.g., the last 50 tokens of one chunk may repeat as the first 50 tokens of the next) to maintain context continuity.
- Embedding and indexing: Each chunk is encoded into a dense vector using an embedding model and stored in a vector database.
- Retrieval: At query time, the query is embedded and compared against the chunk embeddings using a similarity metric; the top-k most similar chunks are retrieved and passed to the LLM.
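
The fixed-size strategy with overlap described in the steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `chunk_fixed` is hypothetical, and whitespace splitting stands in for a real tokenizer, so the counts here are words rather than model tokens.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of at most chunk_size tokens.

    Each window repeats the last `overlap` tokens of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting is a stand-in for a real tokenizer.
text = "one two three four five six seven eight nine ten"
tokens = text.split()
chunks = chunk_fixed(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 1, the window advances 3 tokens at a time, so the last token of each chunk reappears as the first token of the next.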
The effectiveness of a chunking strategy can be evaluated using retrieval precision and recall metrics: does the retriever return chunks that contain the answer to the query, and does it avoid returning irrelevant chunks? Poor chunking can lead to hallucinations or factual inconsistencies if the model receives fragmented or contradictory context.
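
As a rough sketch of that evaluation, precision and recall for a single query can be computed from the set of retrieved chunk IDs and a hand-labeled set of relevant chunk IDs. The function name and the chunk IDs below are hypothetical, and the relevance labels are assumed to come from manual annotation.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Return (precision, recall) for one query.

    precision = relevant chunks retrieved / all chunks retrieved
    recall    = relevant chunks retrieved / all relevant chunks
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the retriever returned three chunks,
# and annotators marked two chunks as containing the answer.
retrieved = ["c3", "c7", "c9"]
relevant = {"c3", "c4"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these values over a set of labeled queries gives one way to compare two chunking configurations on the same corpus.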
| Term | Distinction |
|---|---|
| Context window | Context window is the maximum sequence length a model can process in a single inference. Chunking is a preprocessing strategy that respects the context window constraint by dividing documents before retrieval. Chunking decisions are made *because* of context window limits, but the two are not the same. |
| Retrieval-augmented generation (RAG) | RAG is the end-to-end paradigm of retrieving external documents and conditioning model generation on them. Chunking is a component of RAG—specifically, the preprocessing step that enables efficient retrieval. RAG encompasses chunking, embedding, retrieval, and generation. |
| Semantic search | Semantic search is the retrieval mechanism that finds relevant chunks using embeddings and similarity metrics. Chunking determines *what* gets searched (the units of retrieval). Semantic search determines *how* to find relevant chunks among those units. |
| Tokenization | Tokenization breaks text into tokens (subword units) for model processing. Chunking breaks text into passages (typically many tokens in size) for document management and retrieval. Tokenization operates at a finer granularity and serves a different purpose. |
| Knowledge graphs | Knowledge graphs represent information as structured entities and relations. Chunking segments unstructured or semi-structured text into passages. Knowledge graphs offer alternative retrieval mechanisms but require explicit extraction, whereas chunking operates directly on raw text. |
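
The division of labor between chunking and semantic search can be illustrated with a small sketch: chunking supplies the units (here, a dictionary of chunk IDs), and semantic search ranks them by similarity to the query. The 3-dimensional vectors and chunk IDs are made up for illustration and stand in for real embeddings; cosine similarity is assumed as the metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the IDs of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors standing in for real chunk embeddings (assumption).
chunk_vecs = {
    "faq-reset-password": [0.9, 0.1, 0.0],
    "faq-billing": [0.0, 1.0, 0.0],
    "faq-login-help": [0.7, 0.0, 0.3],
}
query_vec = [1.0, 0.0, 0.0]
result = top_k(query_vec, chunk_vecs, k=2)
print(result)
```

A vector database performs the same ranking at scale, usually with approximate nearest-neighbor indexes rather than the exhaustive scan shown here.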
Examples
- Fixed-size chunking with overlap: A customer service knowledge base is divided into 512-token chunks with 100-token overlap. When a user queries "How do I reset my password?", the retriever returns the top 3 chunks from the relevant FAQ section. This approach is simple to implement and works well for homogeneous text, but may split questions or answers across boundaries if overlap is insufficient.
- Semantic chunking in research papers: A RAG system indexing academic papers uses a chunking strategy that splits documents at paragraph boundaries and ensures no chunk exceeds 1,024 tokens. This preserves logical flow within sections and reduces the chance of splitting a statement and its supporting evidence. A query about "transformer architecture" retrieves full paragraphs rather than mid-sentence fragments, improving both relevance and readability of citations.
- Structural chunking in documentation: A software documentation site chunks its API reference by splitting first at top-level sections (e.g., "Authentication", "Endpoints"), then subdividing each section by method or class. Each chunk includes the section header and relevant code examples. This ensures that chunks retrieved for "How do I authenticate?" include both the conceptual section and concrete examples, reducing the need for additional context gathering.
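
The structural and paragraph-based examples above can be combined in a short hybrid sketch: split at section headers first, then pack whole paragraphs into chunks up to a size limit, prefixing each chunk with its section header. The sketch assumes Markdown input with `#`-style headers and uses word counts as a stand-in for token counts; the function names are hypothetical.

```python
import re


def split_sections(markdown_text):
    """Return (header, body) pairs, splitting at lines that start with '#'."""
    sections = []
    header, body = "", []
    for line in markdown_text.splitlines():
        if re.match(r"#{1,6}\s", line):
            if header or any(l.strip() for l in body):
                sections.append((header, "\n".join(body).strip()))
            header, body = line.strip(), []
        else:
            body.append(line)
    if header or any(l.strip() for l in body):
        sections.append((header, "\n".join(body).strip()))
    return sections


def chunk_sections(markdown_text, max_words=200):
    """Pack whole paragraphs into chunks that never cross a section boundary.

    A chunk is closed when adding the next paragraph would exceed max_words.
    Header words are not counted, and a single paragraph longer than
    max_words is kept whole rather than split.
    """
    chunks = []
    for header, body in split_sections(markdown_text):
        current, count = [], 0
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            n = len(para.split())
            if current and count + n > max_words:
                chunks.append((header, "\n\n".join(current)))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append((header, "\n\n".join(current)))
    # Prefix each chunk with its section header so it stays self-describing.
    return [f"{h}\n\n{b}" if h else b for h, b in chunks]


doc = """# Authentication

Use an API key in the header.

Keys can be rotated at any time.

# Endpoints

GET /items returns all items.
"""
chunks = chunk_sections(doc, max_words=8)
for chunk in chunks:
    print(repr(chunk))
```

Because paragraphs are never split, chunk sizes vary; a production pipeline might fall back to fixed-size splitting for paragraphs that exceed the limit.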
See also
- Retrieval-augmented generation — the broader RAG framework in which chunking is applied
- Vector database — the storage and retrieval infrastructure for chunks and their embeddings
- Embeddings — the vector representation computed for each chunk
- Semantic search — the retrieval mechanism that finds relevant chunks
- Context window — the constraint that motivates chunking strategies
- Retrieval precision and recall — metrics for evaluating chunking effectiveness