DPO & RLHF: Aligning LLMs to Human Preferences
Direct Preference Optimization, reward modeling, and rejection sampling.
Fine-Tuning Makes Models Capable. Alignment Makes Them Useful.
DPO vs RLHF
A model fine-tuned on code will write code. But will it refuse to write malware? Will it explain its reasoning? Will it prefer concise answers over verbose ones? Alignment post-training (RLHF or DPO) is what separates raw capability from commercially deployable behavior.
RLHF: The Hard Way
RLHF (Reinforcement Learning from Human Feedback) first trains a separate Reward Model (RM) on human preference data, then uses PPO (Proximal Policy Optimization) to optimize the LLM's policy against the RM's score. This means keeping several models in memory during training (the policy, a frozen reference copy, the reward model, and PPO's value/critic), which makes it extremely expensive and notoriously unstable. This is how InstructGPT, ChatGPT, and GPT-4 were aligned.
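The reward model at the heart of this pipeline is typically trained with a pairwise Bradley-Terry loss: the RM should score the preferred response higher than the rejected one. A minimal sketch (tensor names are mine, not from any particular library):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Train the RM so chosen responses score higher than rejected ones:
    # loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar rewards for two preference pairs
r_c = torch.tensor([2.0, 1.5])
r_r = torch.tensor([0.5, 1.0])
loss = bradley_terry_loss(r_c, r_r)
```

The loss shrinks as the margin between chosen and rejected scores grows, which is exactly the behavior PPO later exploits.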
DPO: The Elegant Shortcut
DPO (Direct Preference Optimization) was a 2023 breakthrough that eliminates the reward model entirely. Instead of training a separate RM, DPO directly trains the LLM using preference pairs (chosen response vs. rejected response) with a closed-form classification loss. The math shows that the optimal policy can be expressed directly in terms of the preference data — no RL loop required.
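That closed-form loss is short enough to write out directly. A minimal sketch of the per-pair DPO objective (names like `pi_logp_chosen` are my placeholders for the summed token log-probabilities of a response under the policy and the frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward of a response = beta * (log pi - log pi_ref).
    # DPO is a logistic loss asking: chosen reward > rejected reward.
    chosen_margin = pi_logp_chosen - ref_logp_chosen
    rejected_margin = pi_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy numbers: the policy already slightly prefers the chosen response
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-6.0]),
                torch.tensor([-6.0]), torch.tensor([-5.0]))
```

Note there is no sampling and no reward model call: both log-probabilities come from ordinary forward passes, which is why DPO trains like supervised fine-tuning.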
DPO has become the default alignment method at many labs and across the open-source community. It is simpler to implement, more stable to train, and achieves quality comparable to PPO-based RLHF for instruction following.
Synthetic Preference Data with Constitutional AI
Anthropic's Constitutional AI approach replaces expensive human labeling of preference data with AI feedback (often called RLAIF). A "helpfulness" model generates a response. A "critique" model evaluates it against a set of written principles (the constitution) and rewrites it to better satisfy them. The original vs. revised response becomes a (rejected, chosen) preference pair. This generates millions of alignment pairs automatically.
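The critique-and-revise loop above can be sketched as a plain function. Here `generate` is a hypothetical stand-in for whatever LLM call you use, and the prompt templates are illustrative, not Anthropic's actual ones:

```python
def make_preference_pair(prompt, generate, constitution):
    # 1. Draft a response with the helpfulness model
    original = generate(prompt)
    # 2. Critique it against the constitution's principles
    critique = generate(
        f"Critique this response against these principles:\n{constitution}\n\n"
        f"Prompt: {prompt}\nResponse: {original}"
    )
    # 3. Rewrite the response to address the critique
    revised = generate(
        f"Rewrite the response to fix the issues raised in the critique.\n"
        f"Critique: {critique}\nResponse: {original}"
    )
    # Original vs. revised becomes a DPO-ready (rejected, chosen) pair
    return {"prompt": prompt, "chosen": revised, "rejected": original}
```

Run over a large prompt set, this produces exactly the `(prompt, chosen, rejected)` triples the DPO pipeline below consumes.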
Code Example
DPO training pipeline. The beta parameter controls the KL divergence penalty — too high and the model won't learn preferences, too low and it diverges from the base model. 0.1 is the standard starting point.
# pip install trl transformers peft
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from datasets import Dataset

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# DPO preference dataset format: (prompt, chosen, rejected)
# "chosen" = the better response, "rejected" = the worse response
preference_data = {
    "prompt": [
        "Explain quantum computing to a software engineer.",
        "How do I center a div in CSS?"
    ],
    "chosen": [
        "Quantum computing uses qubits which can be in superposition of 0 and 1 simultaneously, enabling algorithms like Shor's that solve problems exponentially faster than classical computers for specific tasks.",
        "Use flexbox: parent { display: flex; justify-content: center; align-items: center; }"
    ],
    "rejected": [
        "Quantum computing is a type of computation that leverages quantum mechanical phenomena such as superposition and entanglement to perform computations. Unlike classical computers...",
        "There are many ways to center a div. You can use margin, flexbox, grid, position, and many other CSS properties depending on your specific use case..."
    ]
}

dataset = Dataset.from_dict(preference_data)

# Apply LoRA to reduce training cost
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# DPO Training — no reward model needed
dpo_config = DPOConfig(
    output_dir="./dpo-llama3-aligned",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    beta=0.1,            # KL penalty strength - prevents the model from diverging too far from base
    learning_rate=5e-6,  # Much lower LR than SFT — you're nudging, not rewriting
)

trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
)

trainer.train()
# The model now learns to prefer concise, direct answers over verbose ones
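A quick post-training sanity check is whether the policy now assigns higher total log-probability to chosen responses than to rejected ones. A small helper for that, written against raw logits so it works with any causal LM (the function name and shape conventions are my own):

```python
import torch

def response_logprob(logits, input_ids, resp_start):
    """Sum of log-probs of input_ids[resp_start:] given their prefixes.

    logits: (1, T, V) from a causal LM; input_ids: (1, T);
    resp_start: index where the response tokens begin."""
    logps = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for positions 1..T-1
    targets = input_ids[:, 1:]                         # tokens actually observed there
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, resp_start - 1:].sum().item()
```

Feed it the logits for prompt+chosen and prompt+rejected; after successful DPO training, the chosen score should come out higher on held-out pairs.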
Relevance
High - Core alignment technique used by OpenAI, Anthropic, and Meta.