How Transformer Models Work & KV Cache
Deep dive into KV Caching, RoPE, and core LLM architecture mechanics for system builders.
Beyond Basic Attention
Transformer Architecture
While the original 2017 Transformer paper laid the groundwork, modern production models use heavily optimized variants. To engineer performant systems, you must understand their architectural bottlenecks.
1. The KV Cache Bottleneck
During auto-regressive decoding, the model generates one token at a time. Recomputing the Key (K) and Value (V) matrices for all previous tokens at every step would be prohibitively expensive, so the KV Cache stores these computed vectors in GPU VRAM and reuses them at each step.
The Engineering Impact: The size of the KV cache grows linearly with context length. A 128k context window can consume over 30GB of VRAM just for the cache (not the weights). This is why providers charge heavily for long context windows, and why techniques like Grouped-Query Attention (GQA) were invented: they compress the KV cache by sharing keys and values across multiple query heads.
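The sizing arithmetic is easy to sketch. The helper below is illustrative (not any library's API), and the configuration is a hypothetical 70B-class model (80 layers, 64 attention heads of dimension 128, fp16); it shows how sharing KV heads under GQA shrinks the cache:

```python
# Back-of-the-envelope KV cache sizing.
# Per token, each layer stores one K and one V vector per KV head.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch_size=1):
    # Factor of 2 accounts for storing both K and V
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch_size

# Hypothetical 70B-class config at a 128k-token context, fp16 (2 bytes/value)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=131_072)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8,  head_dim=128, seq_len=131_072)

print(f"MHA cache: {mha / 2**30:.0f} GiB")  # 320 GiB
print(f"GQA cache: {gqa / 2**30:.0f} GiB")  # 40 GiB -- 8x smaller
```

With 8 KV heads instead of 64, the cache shrinks by exactly the head-sharing factor, which is why GQA is now near-universal in long-context models.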
2. Rotary Position Embeddings (RoPE)
The original sinusoidal positional encodings extrapolate poorly to sequence lengths beyond those seen in training. RoPE instead encodes absolute position by rotating the query and key vectors by position-dependent angles, which makes attention scores depend only on the relative distance between tokens. This property lets models handle 100k+ contexts gracefully.
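A minimal sketch of the rotation (the `rope` helper is hypothetical and illustrative, not a production implementation): each consecutive pair of dimensions is rotated by an angle proportional to position, and the demo checks that the resulting dot product depends only on the offset between positions, not their absolute values.

```python
import torch

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq                       # one angle per dim pair
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(64), torch.randn(64)
# The attention score depends only on the relative offset (here, 2):
s_near = rope(q, 3) @ rope(k, 1)
s_far  = rope(q, 103) @ rope(k, 101)
print(torch.allclose(s_near, s_far, atol=1e-3))  # True
```

This relative-position property is what sinusoidal encodings added to the residual stream could not guarantee, and it is the basis for the frequency-scaling tricks used to stretch trained models to longer contexts.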
Core Modern Optimizations
- RMSNorm (Root Mean Square Normalization): Replaced LayerNorm because it drops the mean-centering step and bias term, cutting normalization compute (reported speedups on the order of 10-20%) with no measurable quality loss. Pre-normalization is now standard.
- SwiGLU Activations: Replaced ReLU in the Feed-Forward Network. The extra projection acts as a continuous gating mechanism, which consistently improves quality at a given parameter budget in empirical scaling studies.
- FlashAttention-2: Restructures the attention computation into tiles that fit in the GPU's fast on-chip SRAM, avoiding materializing the full attention matrix in slow HBM. This memory efficiency is a key enabler of today's very long (100k+ token) context windows.
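To make the RMSNorm bullet concrete, here is a minimal sketch (an illustrative module, not the exact code of any particular model): it rescales features by their root-mean-square only, with no mean subtraction and no bias.

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(d_model))  # learned gain
        self.eps = eps

    def forward(self, x):
        # Unlike LayerNorm: no mean subtraction and no bias term.
        # Just divide each feature vector by its root-mean-square.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

norm = RMSNorm(d_model=16)
x = torch.randn(4, 16) * 5.0   # arbitrary scale
y = norm(x)
print(y.pow(2).mean(dim=-1).sqrt())  # each row's RMS is ~1.0
```

Dropping the mean and bias removes one full reduction pass over the hidden dimension, which is where the reported speedups come from.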
Code Example
A simplified implementation of the SwiGLU FFN used in modern Llama-style models. Notice how two separate projections (w1 and w2) form the gating mechanism: their outputs are multiplied element-wise before the final down projection (w3).
import torch

# A simplified demonstration of the SwiGLU activation:
# SwiGLU(x) = Swish(x @ W1) * (x @ W2)

def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

class SwiGLU(torch.nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff, bias=False)  # Gate projection
        self.w2 = torch.nn.Linear(d_model, d_ff, bias=False)  # Up projection
        self.w3 = torch.nn.Linear(d_ff, d_model, bias=False)  # Down projection

    def forward(self, x):
        # The gating mechanism
        gate = swish(self.w1(x))
        up_proj = self.w2(x)
        # Element-wise multiplication, then down projection
        return self.w3(gate * up_proj)

# Usage simulation
hidden_states = torch.randn(2, 4096)    # Batch size 2, 4096-dim hidden states
ffn = SwiGLU(d_model=4096, d_ff=11008)  # Common Llama expansion dimension
output = ffn(hidden_states)
print(f"SwiGLU Output Shape: {output.shape}")  # torch.Size([2, 4096])
Relevance
High - Expected foundational knowledge for AI engineering roles.