QLoRA & Parameter-Efficient Fine-Tuning
Low-rank adaptation, 4-bit quantization training, and Hugging Face PEFT.
Full Fine-Tuning Is Dead for Most Use Cases
QLoRA Architecture
Full fine-tuning of a 70-billion-parameter model updates every weight simultaneously. With bf16 weights and gradients plus fp32 Adam optimizer states (roughly 12 bytes per parameter), that means over 800GB of GPU VRAM before you even count activations — a $200,000-class infrastructure problem. LoRA sidesteps this with a mathematical insight: the weight updates learned during fine-tuning have low intrinsic rank. Instead of updating a large weight matrix W, you freeze it and learn two small matrices A and B such that W' = W + AB. You typically train well under 1% of the total parameters.
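To make the savings concrete, here is a back-of-envelope sketch; the 4096x4096 projection shape and r=16 are illustrative values, not figures from the text above.

# Parameters trained for a single 4096x4096 projection matrix (illustrative shape)
d, k, r = 4096, 4096, 16
full_update = d * k                # full fine-tuning touches every weight: 16,777,216
lora_update = r * (d + k)          # LoRA trains only A (d x r) and B (r x k): 131,072
print(lora_update / full_update)   # 0.0078125 -> under 1% of this matrix's weights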
QLoRA: 4-bit Quantization + LoRA
QLoRA goes further: it quantizes the frozen base-model weights to 4-bit NF4 (NormalFloat4, near lossless for normally distributed weights) and trains the LoRA adapters in bfloat16, backpropagating gradients through the quantized base. This makes it possible to fine-tune a Llama 3 70B model on a single 80GB A100 — roughly a 10x VRAM reduction versus full fine-tuning.
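A rough way to see why this fits: the frozen base weights dominate memory, and 4-bit storage quarters them relative to bf16. The sketch below counts only the base weights and ignores activations, the LoRA adapters and their optimizer states, and quantization constants.

# Rough VRAM estimate for the frozen 70B base weights alone
params = 70e9
print(params * 2   / 1e9, "GB in bf16")    # ~140 GB -> does not fit a single 80GB GPU
print(params * 0.5 / 1e9, "GB in 4-bit")   # ~35 GB  -> fits, leaving headroom for training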
LoRA Hyperparameters That Actually Matter
- r (rank): Size of the low-rank decomposition. r=8 is a reasonable default for general tasks; r=64 for heavier domain adaptation. Higher r means more trainable parameters and more expressivity, but also more VRAM and a greater risk of overfitting.
- alpha: Scaling factor, typically set to 2*r. The adapter output is scaled by alpha/r, so alpha controls the magnitude of the LoRA update relative to the frozen weights.
- target_modules: Which layers receive LoRA adapters. Applying them to all attention projections (q, k, v, o) plus the gate, up, and down MLP projections usually gives the best results; the sketch after this list shows one way to discover these module names.
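One way to find sensible target_modules is to list the model's linear-layer names directly. This is a minimal sketch, assuming a model has already been loaded as in the code example below; the printed names are typical for Llama-style architectures.

# List the linear-layer name suffixes that LoRA could target
linear_names = set()
for name, module in model.named_modules():
    if "Linear" in type(module).__name__:       # covers nn.Linear and bitsandbytes Linear4bit
        linear_names.add(name.split(".")[-1])   # keep only the suffix, e.g. "q_proj"
print(sorted(linear_names - {"lm_head"}))       # the output head is usually left out of LoRA
# e.g. ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']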
Code Example
Complete QLoRA fine-tuning pipeline. The key point: only a small fraction of the parameters (well under 1% here) is trainable, yet quality typically comes close to full fine-tuning. merge_and_unload() folds the adapters back into the base weights, so there is no LoRA overhead at inference time.
# pip install transformers peft bitsandbytes trl datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Step 1: Load the model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                # NormalFloat4 - near lossless
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,           # Quantize the quantization constants too (saves ~0.4 bits/param)
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Step 2: Define the LoRA configuration
lora_config = LoraConfig(
    r=16,                      # Rank (balance between expressivity and VRAM)
    lora_alpha=32,             # Scaling (typically 2*r)
    lora_dropout=0.05,         # Regularization
    target_modules=[           # Apply LoRA to all attention and MLP projection layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (exact figures vary with rank, model, and PEFT version):
# trainable params: ~42M || all params: ~8B || trainable%: well under 1
# Only a small fraction of the parameters is trained!

# Step 3: Train with SFTTrainer
dataset = load_dataset("your-org/domain-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./qlora-llama3-8b",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # Effective batch size = 16
        learning_rate=2e-4,
        bf16=True,                       # Match the bnb_4bit_compute_dtype (bfloat16)
        logging_steps=10,
        save_steps=100,
    )
)
trainer.train()

# Merge the LoRA weights back into the base model for deployment
model = model.merge_and_unload()   # returns the merged model; the adapters are folded into the base weights
model.save_pretrained("./final-model")
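For deployment there is also an alternative to merging: keep the adapter separate and attach it to a freshly loaded full-precision base at load time. A minimal sketch, continuing from the script above and assuming the adapter weights were saved to ./qlora-llama3-8b:

# Alternative deployment path: attach the saved adapter to a full-precision base model
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./qlora-llama3-8b")   # directory holding the saved adapter weights
model = model.merge_and_unload()   # optional: fold the adapter in for adapter-free inference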
Relevance
High - Core skill for adapting open-source models to domain tasks.