DPO & RLHF: Aligning LLMs to Human Preferences

Direct Preference Optimization, reward modeling, and rejection sampling.

Fine-Tuning Makes Models Capable. Alignment Makes Them Useful.

DPO vs RLHF

RLHF (complex): human labels → reward model → PPO training loop. Unstable, expensive, and requires multiple models in memory.
DPO (simple): preference pairs (chosen vs. rejected) → direct optimization of the policy. One trained model, stable, cheaper.

A model fine-tuned on code will write code. But will it refuse to write malware? Will it explain its reasoning? Will it prefer concise answers over verbose ones? Alignment post-training (RLHF or DPO) is what separates raw capability from commercially deployable behavior.

RLHF: The Hard Way

RLHF (Reinforcement Learning from Human Feedback) trains a separate Reward Model (RM) on human preference data, then uses PPO (Proximal Policy Optimization) to optimize the LLM's policy against the RM's score. This means keeping several models in memory during training (the policy, a frozen reference copy, the reward model, and PPO's value network), which is extremely expensive and notoriously unstable. This is how GPT-4 was aligned.
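At the heart of the RM stage is a Bradley-Terry pairwise loss: the reward model is trained so the chosen response scores above the rejected one. A minimal scalar-level sketch (a real RM scores full token sequences with a learned scalar head; the numbers here are illustrative only):

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair:
    -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair (chosen scored higher) gives a small loss,
# a mis-ordered pair gives a large one:
low = reward_model_loss(2.0, -1.0)
high = reward_model_loss(-1.0, 2.0)
print(low < high)  # True
```

Minimizing this loss over many pairs pushes the RM to assign higher scalar rewards to responses humans preferred; PPO then optimizes the policy against those scores.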

DPO: The Elegant Shortcut

DPO (Direct Preference Optimization) was a 2023 breakthrough that eliminates the reward model entirely. Instead of training a separate RM, DPO trains the LLM directly on preference pairs (chosen response vs. rejected response) with a closed-form classification loss. The key insight: the optimal RLHF policy can be written in closed form in terms of the reward, so the preference loss can be expressed directly over the policy's log-probabilities. No reward model, no RL loop.
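Concretely, the DPO loss for one pair depends only on the policy's and the frozen reference's log-probabilities of the chosen and rejected responses. A minimal sketch with scalar log-probs (a real implementation sums per-token log-probs from both models over each response):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the difference of policy-vs-reference log-prob ratios for the
    chosen and rejected responses."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already favors the chosen response more than the
# reference does, the margin is positive and the loss drops below log(2):
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Note how beta scales the margin before the sigmoid: a larger beta makes the implicit KL penalty stronger, which is exactly the knob exposed as `beta` in the training code below.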

DPO (and its variants) is now the default alignment method at many labs: simpler to implement, more stable to train, and comparable in quality to PPO-based RLHF for instruction following.

Synthetic Preference Data with Constitutional AI

Anthropic's Constitutional AI approach eliminates expensive human labelers for preference data. A "helpfulness" model generates a response. A "critique" model evaluates it against a set of principles (the constitution). The critique model rewrites the response to better satisfy the principles. The original vs. revised response becomes a (rejected, chosen) preference pair. This generates millions of alignment pairs automatically.
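The generate-critique-revise loop above can be sketched as follows. The `generate` and `critique_and_revise` functions here are hypothetical stand-ins for real model calls; only the pipeline shape (original becomes "rejected", revision becomes "chosen") is the point:

```python
CONSTITUTION = [
    "Be helpful and direct.",
    "Refuse requests that could cause harm.",
]

def generate(prompt: str) -> str:
    # Stand-in for the helpfulness model's response
    return f"Draft answer to: {prompt}"

def critique_and_revise(response: str, principles: list[str]) -> str:
    # Stand-in for the critique model: evaluate against the
    # constitution, then rewrite to better satisfy it
    return response + " [revised to satisfy the constitution]"

def make_preference_pair(prompt: str) -> dict:
    original = generate(prompt)
    revised = critique_and_revise(original, CONSTITUTION)
    # The revision is labeled "chosen", the original "rejected"
    return {"prompt": prompt, "chosen": revised, "rejected": original}

pair = make_preference_pair("How do I secure my server?")
print(pair["rejected"] != pair["chosen"])  # True
```

Run over a large prompt set, this loop emits preference pairs in exactly the (prompt, chosen, rejected) format the DPO code below consumes, with no human labeler in the loop.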

Code Example

DPO training pipeline. The beta parameter controls the KL divergence penalty — too high and the model won't learn preferences, too low and it diverges from the base model. 0.1 is the standard starting point.

```python
# pip install trl transformers peft datasets
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from datasets import Dataset

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DPO preference dataset format: (prompt, chosen, rejected)
# "chosen" = the better response, "rejected" = the worse response
preference_data = {
    "prompt": [
        "Explain quantum computing to a software engineer.",
        "How do I center a div in CSS?"
    ],
    "chosen": [
        "Quantum computing uses qubits which can be in superposition of 0 and 1 simultaneously, enabling algorithms like Shor's that solve problems exponentially faster than classical computers for specific tasks.",
        "Use flexbox: parent { display: flex; justify-content: center; align-items: center; }"
    ],
    "rejected": [
        "Quantum computing is a type of computation that leverages quantum mechanical phenomena such as superposition and entanglement to perform computations. Unlike classical computers...",
        "There are many ways to center a div. You can use margin, flexbox, grid, position, and many other CSS properties depending on your specific use case..."
    ]
}

dataset = Dataset.from_dict(preference_data)

# Apply LoRA to reduce training cost
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# DPO training — no reward model needed
dpo_config = DPOConfig(
    output_dir="./dpo-llama3-aligned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    beta=0.1,  # KL penalty strength: prevents drifting too far from the reference
    learning_rate=5e-6,  # much lower LR than SFT — you're nudging, not rewriting
)

trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in TRL < 0.12
    peft_config=lora_config,
)

trainer.train()
# The model now learns to prefer concise, direct answers over verbose ones
```

Use Cases

Aligning a code model to prefer documented, tested functions over undocumented ones
Teaching a customer support model to prefer empathetic tone over robotic responses
Removing harmful behaviors from a base model without full RLHF infrastructure

Common Mistakes

Using DPO without SFT warmup — DPO on a raw base model produces unstable training. Always SFT first.
Setting beta too low — the model will overfit to the preference data and forget its base capabilities
Using synthetic AI-generated preference pairs without human review — judge models tend to prefer AI-written text, which reinforces AI-sounding patterns over natural human style

Interview Insight

Relevance: High. Core alignment technique used by OpenAI, Anthropic, and Meta.
