Semantic Search & Chunking
Optimizing how data is split and retrieved.
Chunking: The Most Underrated Step in RAG
How you split your documents into chunks has more impact on RAG quality than your choice of LLM or vector database. Bad chunking = irrelevant retrieval = wrong answers.
Chunking Strategies
- Fixed-size: Split every N tokens. Simple but breaks mid-sentence.
- Recursive Character: LangChain's default. Splits by paragraphs, then sentences, then characters.
- Semantic: Use embeddings to detect topic boundaries. Most accurate but slowest.
- Document-aware: Respect document structure (headers, sections). Best for structured content.
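To see why fixed-size splitting is the weakest of these, here is a minimal sketch in plain Python. Words stand in for tokens here as a simplification; a real pipeline would count tokens with the model's tokenizer.

```python
# Minimal fixed-size splitter sketch: words stand in for tokens
# (a real pipeline would use the model's actual tokenizer).

def fixed_size_chunks(text, chunk_size=8):
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

text = ("Machine learning is a subset of artificial intelligence. "
        "It focuses on building systems that learn from data.")

for chunk in fixed_size_chunks(text):
    print(repr(chunk))
# Note how a chunk boundary can land mid-sentence: the second chunk
# ends at "...that learn from", severing the sentence it belongs to.
```

Recursive and document-aware splitters exist precisely to avoid these arbitrary cut points by preferring paragraph and sentence boundaries.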
Chunk Overlap
Always use 10-20% overlap between chunks to avoid losing context at boundaries.
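The mechanics are simple: each window advances by the chunk size minus the overlap, so the tail of one chunk is repeated at the head of the next. A rough sketch (word-level, for illustration):

```python
# Sliding-window chunking with overlap (sketch): each chunk repeats
# the tail of the previous one, so sentences near a boundary appear
# in at least one chunk intact.

def chunks_with_overlap(words, chunk_size=10, overlap=2):
    step = chunk_size - overlap          # advance by size minus overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(30)]
chunks = chunks_with_overlap(words, chunk_size=10, overlap=2)  # 20% overlap

# The last 2 words of one chunk are the first 2 of the next.
print(chunks[0][-2:], chunks[1][:2])
# → ['w8', 'w9'] ['w8', 'w9']
```

LangChain's `chunk_overlap` parameter (used in the example below) implements the same idea at the character level.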
Optimal Chunk Size
There's no universal answer, but these general guidelines help:
- 256-512 tokens: For precise, fact-based Q&A
- 512-1024 tokens: For general knowledge retrieval
- 1024+ tokens: For summarization or when context is important
Code Example
Recursive splitting tries paragraph breaks first, then sentences, then words. The overlap ensures no context is lost at boundaries.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Machine learning is a subset of artificial intelligence.
It focuses on building systems that learn from data.

Deep learning is a subset of machine learning.
It uses neural networks with many layers.
Transformers are a type of deep learning architecture.
"""

# Recursive chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=40,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
    print(f"  Length: {len(chunk)} chars")
    print()
```
Use Cases
Processing large PDF documents for Q&A
Indexing codebases for code search
Building searchable knowledge bases from wikis
Processing legal contracts with section-aware chunking
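For the structured cases above (wikis, contracts), document-aware chunking can be as simple as splitting on headers so each chunk is a complete section rather than a fragment. A minimal sketch for markdown-style documents, assuming headers mark section boundaries:

```python
import re

# Document-aware chunking sketch: split a markdown document at
# headers so each chunk is one complete section.

def split_by_headers(markdown):
    # Split before each line that starts with '#' headers (zero-width
    # lookahead keeps the header line inside its own section).
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = """# Contract
## 1. Definitions
"Supplier" means the party providing services.
## 2. Payment Terms
Invoices are due within 30 days.
"""

for section in split_by_headers(doc):
    print(section.splitlines()[0])
```

Real contracts need a parser for their own numbering scheme ("Section 4.2(a)"), but the principle is the same: split where the document says a unit ends, not where a token counter does.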
Common Mistakes
Chunks too small = missing context. Chunks too large = irrelevant noise.
No overlap between chunks causes information loss at boundaries
Ignoring document structure — a table split across chunks becomes useless
Not benchmarking different chunk sizes on your actual queries
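A benchmark doesn't need to be elaborate: for each candidate chunk size, check whether the top-scoring chunk for a known query actually contains the answer. The sketch below uses keyword overlap as a stand-in for embedding similarity; a real benchmark would embed chunks and queries and rank by cosine similarity against your actual query set.

```python
# Chunk-size benchmark sketch. Keyword overlap stands in for
# embedding similarity here purely to keep the example self-contained.

def score(query, chunk):
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def best_chunk(query, chunks):
    return max(chunks, key=lambda ch: score(query, ch))

text = ("Deep learning uses neural networks with many layers. "
        "Transformers are a deep learning architecture. "
        "Attention lets models weigh input tokens differently.")

query = "what are transformers"
for size in (5, 12):
    words = text.split()
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    hit = "transformers" in best_chunk(query, chunks).lower()
    print(f"chunk_size={size}: answer retrieved = {hit}")
```

Run this loop over your real corpus and queries, and the right chunk size stops being a guess.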
Interview Insight
Relevance: High. Chunking makes or breaks RAG quality.