Semantic Search & Chunking
Optimizing how data is split and retrieved.
Chunking: The Most Underrated Step in RAG
How you split your documents into chunks has more impact on RAG quality than your choice of LLM or vector database. Bad chunking = irrelevant retrieval = wrong answers.
Chunking Strategies
- Fixed-size: Split every N tokens. Simple but breaks mid-sentence.
- Recursive Character: LangChain's default. Splits by paragraphs, then sentences, then characters.
- Semantic: Use embeddings to detect topic boundaries. Most accurate but slowest.
- Document-aware: Respect document structure (headers, sections). Best for structured content.
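To see why fixed-size splitting is the weakest of these, here is a minimal sketch in plain Python. Words stand in for tokens here as a simplification; a real pipeline would count tokens with the model's tokenizer.

```python
# Minimal fixed-size splitter sketch: words stand in for tokens
# (a real pipeline would use the model's actual tokenizer).

def fixed_size_chunks(text, chunk_size=8):
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

text = ("Machine learning is a subset of artificial intelligence. "
        "It focuses on building systems that learn from data.")

for chunk in fixed_size_chunks(text):
    print(repr(chunk))
# Note how a chunk boundary can land mid-sentence: the second chunk
# ends at "...that learn from", severing the sentence it belongs to.
```

Recursive and document-aware splitters exist precisely to avoid these arbitrary cut points by preferring paragraph and sentence boundaries.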
Chunk Overlap
Always use 10-20% overlap between chunks to avoid losing context at boundaries.
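The mechanics are simple: each window advances by the chunk size minus the overlap, so the tail of one chunk is repeated at the head of the next. A rough sketch (word-level, for illustration):

```python
# Sliding-window chunking with overlap (sketch): each chunk repeats
# the tail of the previous one, so sentences near a boundary appear
# in at least one chunk intact.

def chunks_with_overlap(words, chunk_size=10, overlap=2):
    step = chunk_size - overlap          # advance by size minus overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(30)]
chunks = chunks_with_overlap(words, chunk_size=10, overlap=2)  # 20% overlap

# The last 2 words of one chunk are the first 2 of the next.
print(chunks[0][-2:], chunks[1][:2])
# → ['w8', 'w9'] ['w8', 'w9']
```

LangChain's `chunk_overlap` parameter (used in the example below) implements the same idea at the character level.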
Optimal Chunk Size
There's no universal answer, but these general guidelines help:
- 256-512 tokens: For precise, fact-based Q&A
- 512-1024 tokens: For general knowledge retrieval
- 1024+ tokens: For summarization or when context is important
Code Example
Recursive splitting tries paragraph breaks first, then sentences, then words. The overlap ensures no context is lost at boundaries.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
Machine learning is a subset of artificial intelligence.
It focuses on building systems that learn from data.

Deep learning is a subset of machine learning.
It uses neural networks with many layers.
Transformers are a type of deep learning architecture.
"""

# Recursive chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=40,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
    print(f"  Length: {len(chunk)} chars")
    print()
```
Use Cases
Processing large PDF documents for Q&A
Indexing codebases for code search
Building searchable knowledge bases from wikis
Processing legal contracts with section-aware chunking
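For the structured cases above (wikis, contracts), document-aware chunking can be as simple as splitting on headers so each chunk is a complete section rather than a fragment. A minimal sketch for markdown-style documents, assuming headers mark section boundaries:

```python
import re

# Document-aware chunking sketch: split a markdown document at
# headers so each chunk is one complete section.

def split_by_headers(markdown):
    # Split before each line that starts with '#' headers (zero-width
    # lookahead keeps the header line inside its own section).
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = """# Contract
## 1. Definitions
"Supplier" means the party providing services.
## 2. Payment Terms
Invoices are due within 30 days.
"""

for section in split_by_headers(doc):
    print(section.splitlines()[0])
```

Real contracts need a parser for their own numbering scheme ("Section 4.2(a)"), but the principle is the same: split where the document says a unit ends, not where a token counter does.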
Common Mistakes
Chunks too small = missing context. Chunks too large = irrelevant noise.
No overlap between chunks causes information loss at boundaries
Ignoring document structure — a table split across chunks becomes useless
Not benchmarking different chunk sizes on your actual queries
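A benchmark doesn't need to be elaborate: for each candidate chunk size, check whether the top-scoring chunk for a known query actually contains the answer. The sketch below uses keyword overlap as a stand-in for embedding similarity; a real benchmark would embed chunks and queries and rank by cosine similarity against your actual query set.

```python
# Chunk-size benchmark sketch. Keyword overlap stands in for
# embedding similarity here purely to keep the example self-contained.

def score(query, chunk):
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def best_chunk(query, chunks):
    return max(chunks, key=lambda ch: score(query, ch))

text = ("Deep learning uses neural networks with many layers. "
        "Transformers are a deep learning architecture. "
        "Attention lets models weigh input tokens differently.")

query = "what are transformers"
for size in (5, 12):
    words = text.split()
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    hit = "transformers" in best_chunk(query, chunks).lower()
    print(f"chunk_size={size}: answer retrieved = {hit}")
```

Run this loop over your real corpus and queries, and the right chunk size stops being a guess.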
Interview Insight
Relevance: High. Chunking makes or breaks RAG quality.