Fine-tuning vs. RAG

When to train your model and when to retrieve context.

The Two Approaches to Customizing LLMs

At a glance: if you don't need external knowledge, fine-tune for style, format, or domain; if you need real-time data with sources, use RAG; if you need both, combine RAG with fine-tuning.

When you need an LLM to work with your specific data, you have two main strategies. Choosing wrong can waste months and hundreds of thousands of dollars.

Fine-tuning

Updating the model's weights on your specific dataset. The model "learns" your patterns permanently.

  • Best for: Changing the model's behavior, style, or format. Teaching it domain-specific jargon.
  • Cost: High upfront (training compute), low per-query.
  • Data staleness: Frozen at training time — doesn't know about changes after training.
  • Examples: Making GPT respond in your brand's tone, teaching medical terminology.
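Fine-tuning starts from a dataset of example conversations demonstrating the behavior you want. A minimal sketch of preparing training data in the JSONL chat format used by the OpenAI fine-tuning API (the brand, questions, and answers below are illustrative):

```python
import json

# Each training example is one conversation showing the tone/format
# we want the model to internalize.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are Acme's support assistant. Be concise and upbeat."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Easy fix! Head to Settings > Security and hit 'Reset password'."},
        ]
    },
    # ...in practice, hundreds more examples like this
]

def to_jsonl(examples: list[dict]) -> str:
    """Serialize examples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(e) for e in examples)

jsonl = to_jsonl(examples)
```

The resulting file is uploaded and referenced when creating a fine-tuning job; the key point is that you are teaching patterns by example, not injecting facts at query time.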

RAG (Retrieval-Augmented Generation)

Keep the base model unchanged. Instead, retrieve relevant documents at query time and inject them into the prompt as context.

  • Best for: Knowledge-intensive tasks where data changes frequently.
  • Cost: Low upfront, slightly higher per-query (retrieval + longer prompts).
  • Data freshness: Always up-to-date — just update your document store.
  • Examples: Customer support bots, internal documentation search, legal research.
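The retrieval step can be illustrated without any external service. A toy sketch ranking documents by cosine similarity over term-count vectors (a stand-in for real learned embeddings and a vector index such as Pinecone or FAISS; the documents are made up):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase term counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by similarity to the query and return the top-k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "How to reset your password in account settings",
    "Quarterly revenue report for finance team",
    "Password policy: minimum length and rotation rules",
]
top = retrieve("forgot my password", docs)
```

Here `top` contains the two password-related documents; the finance document is filtered out before the LLM ever sees it, which is the whole point of the retrieval stage.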

RAG Architecture Diagram

User Query → Vector DB (finds top-K docs) → Prompt Augmentation (query + context) → LLM → Final Answer

Decision Framework

Factor                      | Fine-tuning             | RAG
----------------------------|-------------------------|--------------------------
Data changes frequently     | ❌ Bad fit              | ✅ Great fit
Need specific output format | ✅ Great fit            | ⚠️ Possible
Limited budget              | ❌ Expensive            | ✅ Cheaper
Need factual accuracy       | ⚠️ Hallucination risk   | ✅ Grounded
Latency-critical            | ✅ Direct inference     | ⚠️ Retrieval adds latency
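The budget row can be made concrete with back-of-envelope arithmetic: fine-tuning pays a fixed training cost but keeps prompts short, while RAG pays nothing upfront but adds retrieved context tokens to every query. All figures below are illustrative assumptions, not vendor pricing:

```python
# Illustrative, made-up numbers -- not real vendor pricing.
TRAINING_COST = 5_000.00     # one-off fine-tuning run ($)
FT_COST_PER_QUERY = 0.002    # short prompt, no retrieved context ($/query)
RAG_COST_PER_QUERY = 0.010   # retrieval + longer augmented prompt ($/query)

def total_cost(queries: int, upfront: float, per_query: float) -> float:
    """Total spend after a given number of queries."""
    return upfront + queries * per_query

# Break-even point: query volume at which fine-tuning's upfront
# cost is fully amortized by its cheaper per-query cost.
break_even = TRAINING_COST / (RAG_COST_PER_QUERY - FT_COST_PER_QUERY)
print(f"break-even at {break_even:,.0f} queries")
```

Under these assumed numbers, fine-tuning only becomes cheaper after several hundred thousand queries; below that volume, RAG wins on cost alone.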

Code Example

A basic RAG pipeline: embed the query, retrieve similar documents from Pinecone, then pass them as context to GPT-4.

python
# RAG Pattern: Query -> Retrieve -> Augment -> Generate
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="your-key")
index = pc.Index("documents")

def rag_query(user_question: str) -> str:
    # 1. Embed the question
    embedding = client.embeddings.create(
        input=user_question,
        model="text-embedding-3-small"
    ).data[0].embedding

    # 2. Retrieve relevant documents
    results = index.query(vector=embedding, top_k=3, include_metadata=True)
    context = "\n".join(m["metadata"]["text"] for m in results["matches"])

    # 3. Augment the prompt with context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": user_question}
        ]
    )
    return response.choices[0].message.content
Use Cases

  • Internal knowledge base chatbots (RAG)
  • Brand voice customization (Fine-tuning)
  • Legal document analysis with ever-changing regulations (RAG)
  • Code generation in a proprietary framework (Fine-tuning)

Common Mistakes

  • Fine-tuning when RAG would work: fine-tuning doesn't reliably add new factual knowledge
  • Using RAG without a proper chunking strategy: garbage in, garbage out
  • Not evaluating retrieval quality separately from generation quality
  • Choosing based on hype rather than actual requirements analysis
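On the chunking point: a common baseline is fixed-size chunks with overlap, so that a fact spanning a chunk boundary appears intact in at least one chunk. A minimal sketch (the sizes are illustrative; production systems often chunk on sentence or section boundaries instead):

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

doc = "x" * 500
parts = chunk(doc, size=200, overlap=50)
```

With a 500-character document, 200-character chunks, and 50 characters of overlap, this yields three chunks whose 50-character overlap regions cover each boundary; tune both numbers against your retrieval evaluation rather than guessing.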

Interview Insight

Relevance: High (critical architecture decision)
