Semantic Search & Chunking

Optimizing how data is split and retrieved.

Chunking: The Most Underrated Step in RAG

Hybrid Retrieval Pipeline

Query → Semantic search (embeddings) + Keyword search (BM25) → Cross-encoder reranker → Top-K chunks → LLM
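A common way to merge the semantic and keyword result lists before reranking is reciprocal rank fusion. The sketch below is illustrative only: the document IDs and rankings are placeholders, and a real pipeline would feed the fused list into a cross-encoder reranker.

```python
# Sketch of reciprocal rank fusion (RRF): merge ranked lists from an
# embedding search and a BM25 search into one fused ranking.

def reciprocal_rank_fusion(rankings, k=60):
    """Combine multiple ranked lists of doc ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly in several lists accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc3", "doc1", "doc7"]   # hypothetical embedding results
keyword_hits = ["doc1", "doc5", "doc3"]    # hypothetical BM25 results

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # doc1 and doc3 appear in both lists, so they rise to the top
```

Documents that appear in both result lists outrank documents that appear in only one, which is exactly the behavior hybrid retrieval is after.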

How you split your documents into chunks has more impact on RAG quality than your choice of LLM or vector database. Bad chunking = irrelevant retrieval = wrong answers.

Chunking Strategies

  • Fixed-size: Split every N tokens. Simple but breaks mid-sentence.
  • Recursive Character: LangChain's default. Splits by paragraphs, then sentences, then characters.
  • Semantic: Use embeddings to detect topic boundaries. Most accurate but slowest.
  • Document-aware: Respect document structure (headers, sections). Best for structured content.
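The simplest strategy above, fixed-size splitting with overlap, can be sketched in a few lines. This splits on whitespace "tokens" for brevity; a production pipeline would count tokens with the embedding model's actual tokenizer.

```python
# Minimal sketch of fixed-size chunking with overlap. Each chunk advances
# by (chunk_size - overlap) tokens, so consecutive chunks share `overlap`
# tokens at their boundary.

def fixed_size_chunks(text, chunk_size=128, overlap=16):
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(300))
chunks = fixed_size_chunks(doc, chunk_size=128, overlap=16)
print(len(chunks))           # 3 chunks: starts at token 0, 112, 224
print(chunks[1].split()[0])  # word112
```

Note how the second chunk starts 112 tokens in, not 128: the 16-token overlap is what prevents a sentence straddling the boundary from being lost.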

Chunk Overlap

Always use 10-20% overlap between chunks to avoid losing context at boundaries.
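That overlap is easy to verify: the tail of one chunk should reappear verbatim at the head of the next. The character-based slicing below is a simplification for illustration (40 chars of overlap on 200-char chunks is the 20% end of the range).

```python
# Quick check that adjacent chunks actually share text at their boundary.

def sliding_chunks(text, size=200, overlap=40):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "abcdefghij" * 100  # 1000 characters of dummy text
chunks = sliding_chunks(text)
print(chunks[0][-40:] == chunks[1][:40])  # True: the 40-char overlap is preserved
```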

Optimal Chunk Size

There's no universal answer, but general guidelines:

  • 256-512 tokens: For precise, fact-based Q&A
  • 512-1024 tokens: For general knowledge retrieval
  • 1024+ tokens: For summarization or when context is important

Code Example

Recursive splitting tries paragraph breaks first, then sentences, then words. The overlap ensures no context is lost at boundaries.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Machine learning is a subset of artificial intelligence.
It focuses on building systems that learn from data.

Deep learning is a subset of machine learning.
It uses neural networks with many layers.
Transformers are a type of deep learning architecture.
"""

# Recursive chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=40,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
    print(f"  Length: {len(chunk)} chars")
    print()
```

Use Cases

  • Processing large PDF documents for Q&A
  • Indexing codebases for code search
  • Building searchable knowledge bases from wikis
  • Processing legal contracts with section-aware chunking

Common Mistakes

  • Chunks too small = missing context; chunks too large = irrelevant noise
  • No overlap between chunks causes information loss at boundaries
  • Ignoring document structure — a table split across chunks becomes useless
  • Not benchmarking different chunk sizes on your actual queries
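The last mistake is the easiest to fix: grid-search chunk sizes against a small labeled query set. The skeleton below uses naive substring matching as a stand-in retriever, and the corpus and queries are purely illustrative; in practice you would swap in your real retriever and measure recall@k on actual queries.

```python
# Toy benchmark skeleton: compare chunk sizes by whether the phrase each
# query needs survives intact inside a single chunk.

def chunk(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def recall_at_1(chunks, queries):
    """Fraction of queries whose expected phrase appears whole in some chunk."""
    hits = sum(any(q in c for c in chunks) for q in queries)
    return hits / len(queries)

corpus = "the quick brown fox jumps over the lazy dog " * 20
queries = ["quick brown fox", "lazy dog the"]  # placeholder "gold" phrases

for size in (4, 16, 64):
    score = recall_at_1(chunk(corpus, size), queries)
    print(f"chunk_size={size}: recall={score:.2f}")
```

Even a crude harness like this surfaces the core trade-off: small chunks sever phrases that span boundaries, while large chunks dilute each match with unrelated text.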

Interview Insight

Relevance: High - makes or breaks RAG quality
