Chunking Strategies for RAG: How to Split Your Documents
The way you split documents determines what your retrieval system can find. Learn the three main chunking strategies and how to choose the right one for your document types and query patterns.
Why Chunking Is the Foundation of Good RAG
Before a retrieval-augmented generation system can answer a single question, you need to break your source documents into chunks. These chunks become the units your vector database stores and retrieves. Get the chunking wrong and your RAG system returns irrelevant context no matter how good your embeddings or language model are.
The core tension is this: small chunks give you precise retrieval but can lose surrounding context. Large chunks preserve context but introduce noise that confuses the model and wastes your context window. Your goal is to find the sweet spot for your specific document types and query patterns.
The Three Main Chunking Strategies
Fixed-size chunking is the simplest approach. You split every document into chunks of N characters or tokens (the final chunk may be shorter), with an optional overlap between adjacent chunks. It is fast, predictable, and easy to implement. The downside is that it ignores document structure entirely, so a chunk might start mid-sentence or cut a key idea in half.
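To make the mechanics concrete, here is a minimal sketch of fixed-size chunking in plain Python. The function name `fixed_size_chunks` and the sample text are illustrative assumptions, and the sketch counts characters rather than tokens.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window shares `overlap` characters with the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "Chunking splits documents. Overlap repeats the tail of each chunk."
for chunk in fixed_size_chunks(sample, chunk_size=24, overlap=6):
    print(repr(chunk))
```

Printing the chunks with `repr` makes it easy to see where a window starts or ends in the middle of a word.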
Recursive character text splitting is a common default in production pipelines. You define a list of separators in order of preference: double newline, single newline, period, space, then empty string. The splitter first tries to break at paragraph boundaries and falls back to sentence boundaries, then word boundaries, only when a piece is still larger than the chunk size. This preserves semantic units far better than fixed-size splitting.
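The fallback logic can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation any particular library uses: the name `recursive_split` is hypothetical, overlap is omitted, and separators that fall on a chunk boundary are dropped rather than kept.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator, merge neighbours up to chunk_size,
    and recurse with the next separator on any piece that is still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: try the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "aaa bbb\n\nccc ddd eee"
print(recursive_split(text, chunk_size=8, separators=["\n\n", " ", ""]))
```

The first paragraph fits in one chunk, so it is never split on spaces. Only the second paragraph, which is too long, falls through to the finer separator.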
Semantic chunking is the most advanced strategy. Instead of splitting on character boundaries, you compute embeddings for each sentence and split where the embedding similarity drops below a threshold. This produces semantically coherent chunks but is slower and requires an embedding call during ingestion. Save it for corpora where paragraph structure is inconsistent or missing.
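The threshold rule can be sketched as follows. To keep the example self-contained, `toy_embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, and the threshold of 0.3 is an arbitrary value chosen for this toy data. In practice you would pass your own embedding function and tune the threshold on your corpus.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float, embed=toy_embed) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])  # similarity dropped: topic boundary
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


sentences = [
    "The API returns a JSON payload.",
    "The JSON payload contains the user id.",
    "Billing runs on the first day of each month.",
    "Each month the billing job emails an invoice.",
]
for chunk in semantic_chunks(sentences, threshold=0.3):
    print(chunk)
```

Comparing only adjacent sentences is one design choice among several. Other variants compare each sentence against a running average of the current chunk.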
Chunk Size and Overlap: Finding the Right Balance
For most RAG applications, start with a chunk size between 256 and 1024 tokens. Shorter documents like FAQ entries or product descriptions work well at 256 to 512 tokens. Long-form content like legal documents or technical manuals often benefits from 512 to 1024 tokens. Set your overlap to 10 to 15 percent of your chunk size to ensure key context near chunk boundaries is not lost.
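As a quick sanity check on these numbers, the arithmetic can be sketched like this. The helper name `chunk_plan` and the default ratio of 12.5 percent are assumptions picked from the middle of the range above.

```python
import math


def chunk_plan(doc_tokens: int, chunk_size: int, overlap_ratio: float = 0.125) -> dict:
    """Estimate overlap, window step, and chunk count for one document."""
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # new tokens covered by each extra chunk
    if doc_tokens <= 0:
        chunks = 0
    elif doc_tokens <= chunk_size:
        chunks = 1
    else:
        chunks = 1 + math.ceil((doc_tokens - chunk_size) / step)
    return {"overlap": overlap, "step": step, "chunks": chunks}


print(chunk_plan(doc_tokens=3000, chunk_size=512))
print(chunk_plan(doc_tokens=3000, chunk_size=1024))
```

Because each chunk after the first contributes only `step` new tokens, a higher overlap raises the chunk count and therefore the storage and embedding cost.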
Your retrieval quality also depends on how many chunks you retrieve per query, the top-k parameter. A smaller chunk size usually means you need a higher k to surface enough context. A larger chunk size lets you use a smaller k but risks including irrelevant material. Run offline evaluations with different combinations before committing to a configuration in production.
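One way to structure such an offline evaluation is a small grid over chunk size and k, scored by hit rate. Everything in this sketch is a placeholder: `retrieve` ranks by shared words as a hypothetical stand-in for vector search, and `DOC` and `EVAL_SET` are invented toy data that you would replace with your own corpus and labeled questions.

```python
import re
from itertools import product


def split_fixed(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Character windows of chunk_size, advancing by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    """Hypothetical stand-in for vector search: rank chunks by shared words."""
    q = words(query)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]


def hit_rate(text: str, eval_set: list[tuple[str, str]], chunk_size: int, k: int) -> float:
    """Fraction of questions whose answer text appears in a retrieved chunk."""
    chunks = split_fixed(text, chunk_size, overlap=chunk_size // 8)
    hits = sum(
        any(answer in chunk for chunk in retrieve(question, chunks, k))
        for question, answer in eval_set
    )
    return hits / len(eval_set)


DOC = (
    "Invoices are generated on the first day of each month. "
    "Refunds are processed within five business days. "
    "API keys can be rotated from the security settings page. "
    "Rate limits reset every sixty seconds for all plans. "
    "Support tickets receive a first response within one business day."
)
EVAL_SET = [
    ("When are invoices generated?", "first day of each month"),
    ("How long do refunds take?", "five business days"),
    ("How often do rate limits reset?", "sixty seconds"),
]

for chunk_size, k in product([40, 80, 160], [1, 3]):
    rate = hit_rate(DOC, EVAL_SET, chunk_size, k)
    print(f"chunk_size={chunk_size:>3} k={k} hit_rate={rate:.2f}")
```

Hit rate only checks whether the answer was retrieved at all. It says nothing about how much irrelevant text came along with it, so it is worth tracking the total retrieved length as well.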
Recursive Chunking in Practice
The code below shows a fixed-size baseline and the recursive approach side by side. Run them on the same document and compare. You will typically see recursive chunks ending at natural paragraph or sentence boundaries, while fixed chunks cut arbitrarily mid-thought.
```shell
pip install langchain-text-splitters
```

```python
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

with open("your_document.txt", encoding="utf-8") as f:
    raw_text = f.read()

# Note: chunk_size and chunk_overlap are measured in characters here,
# because both splitters use len() as their default length function.

# Fixed-size baseline: an empty separator splits purely by length
fixed_splitter = CharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separator="",
)
fixed_chunks = fixed_splitter.split_text(raw_text)
print(f"Fixed chunks: {len(fixed_chunks)}")

# Recursive splitting (recommended for production):
# separators are tried in order, from paragraphs down to single characters
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)
recursive_chunks = recursive_splitter.split_text(raw_text)
print(f"Recursive chunks: {len(recursive_chunks)}")
print(f"First chunk preview: {recursive_chunks[0][:200]}")
```
A chunk that ends mid-sentence often produces a misleading embedding because its meaning is incomplete. Recursive splitting reduces this problem by preferring natural language boundaries. Once you have your chunks, attach metadata like the source filename and page number before inserting them into your vector store so you can surface citations later.
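A minimal sketch of the metadata step, assuming you already have chunks grouped by page. The `ChunkRecord` fields and the `build_records` helper are hypothetical names, and the resulting dictionaries would still need to be mapped onto whatever schema your vector store expects.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChunkRecord:
    text: str
    source: str       # filename the chunk came from
    page: int         # page number within that file
    chunk_index: int  # position of the chunk on its page


def build_records(pages: dict[int, list[str]], source: str) -> list[dict]:
    """Flatten {page: [chunks]} into one metadata-tagged record per chunk."""
    records = []
    for page, chunks in sorted(pages.items()):
        for index, chunk in enumerate(chunks):
            record = ChunkRecord(text=chunk, source=source, page=page, chunk_index=index)
            records.append(asdict(record))
    return records


pages = {
    1: ["Chunking overview.", "Why overlap matters."],
    2: ["Choosing a chunk size."],
}
for record in build_records(pages, source="your_document.txt"):
    print(record)
```

Storing the page and chunk index alongside the text lets you cite the exact location of a retrieved passage and re-ingest a single page without touching the rest.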
Prompt
"I have a corpus of technical API documentation. Each page covers a single endpoint and runs about 2000 words. What chunk size and overlap would you recommend, and should I prepend the endpoint name to every chunk to improve retrieval quality?"
Aki Wijesundara