QLoRA & Parameter-Efficient Fine-Tuning

Low-rank adaptation, 4-bit quantization training, and Hugging Face PEFT.

Full Fine-Tuning is Dead For Most Use Cases

QLoRA Architecture

[Diagram: QLoRA architecture. A frozen base model (7B–70B params, weights W quantized to NF4, never updated) is augmented with trainable LoRA adapters: an r × d matrix A and a d × r matrix B, together 0.1%–1% of total parameters. Only A and B are trained.]

Full fine-tuning updates all 70 billion parameters simultaneously. With Adam, the two fp32 optimizer states alone cost ~8 bytes per parameter, roughly 560GB of GPU VRAM before counting weights, gradients, and activations: a $200,000 infrastructure problem. LoRA sidesteps this entirely with a mathematical insight: weight updates during fine-tuning have low intrinsic rank. Instead of updating W (a large d × d matrix), you learn two tiny matrices B (d × r) and A (r × d) such that W' = W + BA. You only train ~0.1%–1% of the total parameters.
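The low-rank update can be sketched in a few lines of NumPy. The dimensions below are illustrative, not tied to any particular model:

```python
import numpy as np

d, k, r = 4096, 4096, 16           # layer dims and LoRA rank (illustrative)

W = np.random.randn(d, k)          # frozen pretrained weight, never updated
B = np.zeros((d, r))               # B starts at zero, so W' == W at step 0
A = np.random.randn(r, k) * 0.01   # small random init

W_prime = W + B @ A                # effective weight used in the forward pass

full_params = d * k                # 16,777,216 entries in W
lora_params = d * r + r * k        # 131,072 entries in B and A combined
print(f"{lora_params / full_params:.2%}")  # ~0.78% of the full matrix
```

Because B is initialized to zero, training starts from exactly the pretrained behavior; the gradient only ever flows into A and B.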

QLoRA: 4-bit Quantization + LoRA

QLoRA goes further: it quantizes the frozen base model weights to the 4-bit NF4 format (nearly lossless in practice) and trains the LoRA adapters in bfloat16. This makes it possible to fine-tune a Llama 3 70B model on a single 80GB A100, roughly a 10x VRAM reduction versus full fine-tuning.
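A back-of-the-envelope calculation shows where the savings come from. The 14 bytes/param figure for the adapter's training state is a rough mixed-precision-Adam estimate (bf16 weights plus fp32 gradients and two fp32 optimizer states), and the 0.5% adapter fraction is an assumption:

```python
params = 70e9                          # 70B-parameter base model

bf16_gb = params * 2 / 1e9             # 2 bytes/param  -> weights in bf16
nf4_gb = params * 0.5 / 1e9            # 4 bits/param   -> weights in NF4

# LoRA adapters assumed at ~0.5% of params, trained with full optimizer state
adapter_params = params * 0.005
adapter_gb = adapter_params * 14 / 1e9

print(f"bf16 weights: {bf16_gb:.0f} GB")      # 140 GB
print(f"NF4 weights:  {nf4_gb:.0f} GB")       # 35 GB
print(f"adapter state: {adapter_gb:.1f} GB")  # ~4.9 GB
```

The quantized base plus adapter training state (~40GB here, before activations and KV caches) is what lets a 70B model fit on one 80GB card.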

LoRA Hyperparameters That Actually Matter

  • r (rank): Size of the low-rank decomposition. r=8 for general tasks, r=64 for complex domain adaptation. Higher r = more parameters, more expressivity, more overfitting risk.
  • alpha: Scaling factor (typically set to 2*r). Controls the magnitude of LoRA weight updates.
  • target_modules: Which layers to apply LoRA to. Apply it to all attention projections (q/k/v/o) plus the MLP gate, up, and down projections for best results.
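To see how r drives adapter size, you can count the parameters directly. The projection shapes below are Llama-3-8B-like (32 layers, hidden size 4096, GQA k/v at 1024, MLP intermediate 14336) and are stated as assumptions for illustration:

```python
# Assumed Llama-3-8B-style projection shapes (d_in, d_out) per layer
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024),
    "v_proj": (4096, 1024), "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
layers = 32

def lora_params(r: int) -> int:
    # each adapted (d_in, d_out) matrix adds r*d_in + r*d_out parameters
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes.values())

for r in (8, 16, 64):
    print(f"r={r:<3} -> {lora_params(r):,} trainable params")
```

Note that the count scales linearly in r: doubling the rank doubles the adapter size, which is why starting small and increasing only when needed is the usual advice.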

Code Example

Complete QLoRA fine-tuning pipeline. The key insight: only a fraction of a percent of the parameters are trainable, yet model quality is nearly equivalent to full fine-tuning. merge_and_unload() folds the adapters back into the base weights, removing LoRA overhead at inference time.

python
# pip install transformers peft bitsandbytes trl datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Step 1: Load model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4 - nearly lossless
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # Quantize the quantization constants too (saves ~0.4 bits/param)
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Step 2: Define LoRA Configuration
lora_config = LoraConfig(
    r=16,                    # Rank (balance between expressivity and VRAM)
    lora_alpha=32,           # Scaling (typically 2*r)
    lora_dropout=0.05,       # Regularization
    target_modules=[         # Apply LoRA to all key projection layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || total params: 8,030,261,248
# Only ~0.5% of parameters are trained!

# Step 3: Train with SFTTrainer
dataset = load_dataset("your-org/domain-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./qlora-llama3-8b",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,   # Effective batch = 16
        learning_rate=2e-4,
        bf16=True,                       # Match the bf16 compute dtype above
        logging_steps=10,
        save_steps=100,
    )
)
trainer.train()

# Merge LoRA weights back into the base model for deployment
# (merge_and_unload returns the merged model)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./final-model")

Use Cases

  • Fine-tuning Llama/Mistral to follow your company's specific JSON output format reliably
  • Domain adaptation for medical/legal/finance, where generic LLMs lack the vocabulary and reasoning patterns
  • Teaching a model your proprietary coding style or internal DSL

Common Mistakes

  • Using r=256 for a simple classification task: massive overfitting and unnecessary VRAM usage. Start with r=8 or r=16.
  • Not applying LoRA to gate_proj/up_proj/down_proj (the FFN layers): these layers are crucial for injecting factual knowledge.
  • Forgetting to call merge_and_unload() before deployment: serving with unmerged PEFT adapters adds latency overhead.

Interview Relevance

High: parameter-efficient fine-tuning is a core skill for adapting open-source models to domain tasks.
