Fine-tuning vs. RAG
When to train your model and when to retrieve context.
The Two Approaches to Customizing LLMs
When you need an LLM to work with your specific data, you have two main strategies. Choosing the wrong one can waste months of effort and hundreds of thousands of dollars.
Fine-tuning
Updating the model's weights on your specific dataset. The model "learns" your patterns permanently.
- Best for: Changing the model's behavior, style, or format. Teaching it domain-specific jargon.
- Cost: High upfront (training compute), low per-query.
- Data staleness: knowledge is frozen at training time; the model doesn't learn about anything that changes afterward.
- Examples: Making GPT respond in your brand's tone, teaching medical terminology.
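As a rough sketch of what fine-tuning data looks like (the brand name and replies here are hypothetical), chat fine-tuning APIs such as OpenAI's expect a JSONL file of example conversations demonstrating the target tone:

```python
import json

# Each training example is a chat transcript demonstrating the desired
# behavior: here, a hypothetical brand tone for a support assistant.
examples = [
    {"messages": [
        {"role": "system", "content": "You are Acme's support assistant."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant",
         "content": "Happy to help! Could you share your order number?"},
    ]},
]

def to_jsonl(records: list[dict]) -> str:
    """Serialize examples in the one-JSON-object-per-line (JSONL)
    format that chat fine-tuning endpoints expect."""
    return "\n".join(json.dumps(r) for r in records)

training_data = to_jsonl(examples)
# The file is then uploaded and a training job launched, e.g. with the
# OpenAI SDK: client.files.create(..., purpose="fine-tune") followed by
# client.fine_tuning.jobs.create(training_file=..., model=...).
```

Note that the examples teach style and format, not facts: a few hundred transcripts can shift tone reliably, but they won't make the model memorize your knowledge base.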
RAG (Retrieval-Augmented Generation)
Keep the base model unchanged. Instead, retrieve relevant documents at query time and inject them into the prompt as context.
- Best for: Knowledge-intensive tasks where data changes frequently.
- Cost: Low upfront, slightly higher per-query (retrieval + longer prompts).
- Data freshness: always current; just update your document store.
- Examples: Customer support bots, internal documentation search, legal research.
RAG Architecture Diagram
Decision Framework
| Factor | Fine-tuning | RAG |
|---|---|---|
| Data changes frequently | ❌ Bad fit | ✅ Great fit |
| Need specific output format | ✅ Great fit | ⚠️ Possible |
| Limited budget | ❌ Expensive | ✅ Cheaper |
| Need factual accuracy | ⚠️ Hallucination risk | ✅ Grounded |
| Latency-critical | ✅ Direct inference | ⚠️ Retrieval adds latency |
Code Example
A basic RAG pipeline: embed the query, retrieve similar documents from Pinecone, then pass them as context to GPT-4o.
```python
# RAG pattern: Query -> Retrieve -> Augment -> Generate
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="your-key")
index = pc.Index("documents")

def rag_query(user_question: str) -> str:
    # 1. Embed the question
    embedding = client.embeddings.create(
        input=user_question,
        model="text-embedding-3-small"
    ).data[0].embedding

    # 2. Retrieve the most similar documents
    results = index.query(vector=embedding, top_k=3, include_metadata=True)
    context = "\n".join(m["metadata"]["text"] for m in results["matches"])

    # 3. Augment the prompt with the retrieved context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": user_question}
        ]
    )
    return response.choices[0].message.content
```
Use Cases
- Internal knowledge base chatbots (RAG)
- Brand voice customization (Fine-tuning)
- Legal document analysis with ever-changing regulations (RAG)
- Code generation in a proprietary framework (Fine-tuning)
Common Mistakes
- Fine-tuning when RAG would work: fine-tuning doesn't reliably add new factual knowledge
- Using RAG without a proper chunking strategy: garbage in, garbage out
- Not evaluating retrieval quality separately from generation quality
- Choosing based on hype rather than actual requirements analysis
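On the chunking point above, a minimal sketch of one common approach: a sliding character window with overlap, so sentences that fall on a chunk boundary still appear intact in at least one chunk. (Real pipelines often split on token or sentence boundaries instead; the sizes here are illustrative.)

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size windows for embedding.

    Consecutive chunks share `overlap` characters so content near a
    boundary is retrievable from at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Whatever strategy you choose, inspect the chunks your retriever actually returns; poorly split documents are a common root cause of bad RAG answers.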
Interview Insight
Relevance: High - critical architecture decision