How Transformer Models Work & KV Cache

Deep dive into KV Caching, RoPE, and core LLM architecture mechanics for system builders.

Beyond Basic Attention

Transformer Architecture

Input Tokens → Embeddings + Positional Encoding → Multi-Head Self-Attention (Q·K·V) → KV Cache (stores K, V for speed) → Feed-Forward Network + LayerNorm + Residual → Output Logits → Next Token (×N layers)

While the original 2017 Transformer paper laid the groundwork, modern production models use heavily optimized variants. To engineer performant systems, you must understand their architectural bottlenecks.

1. The KV Cache Bottleneck

During auto-regressive decoding, the model generates one token at a time. Recomputing the Key (K) and Value (V) matrices for every previous token at each step would be prohibitively expensive, so the KV cache stores the already-computed K and V vectors in GPU VRAM and reuses them for each new token.

The Engineering Impact: The KV cache grows linearly with context length. A 128k-token context window can consume over 30GB of VRAM for the cache alone (not counting the weights). This is why providers charge heavily for long context windows, and why techniques like Grouped-Query Attention (GQA) were invented: GQA compresses the KV cache by sharing a small set of key/value heads across multiple query heads.
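The arithmetic is worth sketching. The following back-of-the-envelope calculator uses a hypothetical 70B-class configuration (80 layers, 128-dim heads, 64 query heads, fp16 cache); the numbers are illustrative assumptions, not any vendor's published config:

```python
# Back-of-the-envelope KV cache size estimate (illustrative config, not a
# specific vendor's published model).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    # Factor of 2 for K and V; fp16/bf16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# MHA: every one of the 64 query heads has its own K/V head
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=128_000)
# GQA: 8 KV heads shared across the 64 query heads -> 8x smaller cache
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)

print(f"MHA cache: {mha / 1e9:.1f} GB")  # MHA cache: 335.5 GB
print(f"GQA cache: {gqa / 1e9:.1f} GB")  # GQA cache: 41.9 GB
```

Even with GQA's 8x compression, a single 128k-token request still holds tens of gigabytes of cache, which is why long-context serving is priced the way it is.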

2. Rotary Position Embeddings (RoPE)

The original sinusoidal positional encodings extrapolate poorly to sequence lengths beyond those seen in training. RoPE encodes absolute position by rotating the query and key vectors, and the rotation is constructed so that the attention score between two tokens depends only on their relative distance. Combined with context-extension techniques built on top of it, this lets models handle 100k+ token contexts gracefully.
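A minimal sketch of the rotation, using the half-split (GPT-NeoX-style) layout rather than interleaved pairs; production implementations precompute the cos/sin tables and apply this per attention head:

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies: theta_i = base^(-i / half)
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each (x1_i, x2_i) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Relative-position property: the score between a query at position m and a
# key at position n depends only on the offset n - m, not on m and n themselves.
torch.manual_seed(0)
q_vec, k_vec = torch.randn(64), torch.randn(64)
rq = rope(q_vec.repeat(8, 1))  # the same query vector placed at positions 0..7
rk = rope(k_vec.repeat(8, 1))
print(torch.dot(rq[2], rk[5]))  # offset 3
print(torch.dot(rq[4], rk[7]))  # offset 3 -> identical score
```

Because rotations preserve vector norms, RoPE changes only the angles between queries and keys, never their magnitudes.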

Core Modern Optimizations

(Diagram: a Transformer encoder layer: Multi-Head Attention → Add & Norm → Feed-Forward Network → Add & Norm)
  • RMSNorm (Root Mean Square Normalization): Replaced LayerNorm in most modern LLMs because it drops the mean-centering step and bias, cutting normalization compute (commonly cited at 10-20% savings) without losing quality. Pre-normalization (normalizing before each sub-layer) is now standard.
  • SwiGLU Activations: Replaced ReLU in the feed-forward network. The extra projection acts as a learned, continuous gating mechanism, and it consistently improves empirical scaling performance.
  • FlashAttention-2: Restructures the attention computation into tiles that fit in the GPU's ultra-fast SRAM, so the full attention matrix is never materialized in slow HBM. This reduction in memory traffic is a key enabler of 1M+ token context windows.
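The RMSNorm point above fits in a few lines. This is a minimal illustrative sketch (not a production implementation): note there is no mean subtraction and no bias, only a rescale by the root-mean-square plus a learned gain:

```python
import torch

class RMSNorm(torch.nn.Module):
    """Minimal RMSNorm: rescale by the root-mean-square; no mean-centering."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x):
        # Unlike LayerNorm, no mean is computed or subtracted here
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

x = torch.randn(2, 4096)
y = RMSNorm(4096)(x)
print(y.shape)  # torch.Size([2, 4096]); each row now has RMS ~ 1
```

Skipping the mean statistics is where the compute savings come from; empirically the mean-centering turned out to be unnecessary for quality.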

Code Example

Implementation of the SwiGLU FFN used in modern LLMs such as Llama 3. Notice how two separate projections (w1 and w2) create a gating mechanism: their outputs are multiplied element-wise before the final down projection (w3).

```python
import torch

# A simplified demonstration of SwiGLU activation
# SwiGLU(x) = Swish(xW_1) * (xW_2)

def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

class SwiGLU(torch.nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff, bias=False)  # Gate projection
        self.w2 = torch.nn.Linear(d_model, d_ff, bias=False)  # Up projection
        self.w3 = torch.nn.Linear(d_ff, d_model, bias=False)  # Down projection

    def forward(self, x):
        # The gating mechanism
        gate = swish(self.w1(x))
        up_proj = self.w2(x)
        # Element-wise multiplication, then down projection
        return self.w3(gate * up_proj)

# Usage simulation
hidden_states = torch.randn(2, 4096)    # Batch size 2, 4k hidden dimension
ffn = SwiGLU(d_model=4096, d_ff=11008)  # Common Llama expansion ratio
output = ffn(hidden_states)
print(f"SwiGLU Output Shape: {output.shape}")
```

Use Cases

Understanding how GPT-4, Claude, and Gemini work internally
Debugging attention patterns in fine-tuned models
Choosing the right architecture (encoder vs decoder) for your use case
Optimizing context window utilization

Common Mistakes

Confusing encoder-only (BERT) with decoder-only (GPT) architectures: they serve very different purposes
Ignoring the scaling factor in attention (√d_k): without it, large dot products saturate the softmax, which leads to vanishing gradients at large dimensions
Assuming more attention heads always means better performance: beyond a point, extra heads add compute without improving quality
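The √d_k mistake above is easy to demonstrate numerically. Dot products of d_k-dimensional random unit-variance vectors have standard deviation ≈ √d_k, which pushes softmax inputs into its saturated region; dividing by √d_k restores roughly unit variance (illustrative sketch):

```python
import torch

torch.manual_seed(0)
d_k = 4096
q = torch.randn(1000, d_k)  # 1000 sampled query vectors
k = torch.randn(1000, d_k)  # 1000 sampled key vectors
scores = (q * k).sum(dim=-1)        # raw, unscaled dot products

print(scores.std())                 # ~ sqrt(d_k) = 64: softmax saturates
print((scores / d_k ** 0.5).std())  # ~ 1 after scaling: gradients stay healthy
```

A softmax over logits with standard deviation ~64 puts nearly all its mass on one entry, and the gradient through the other entries collapses toward zero; the √d_k divisor is what keeps attention trainable at large head dimensions.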

Interview Insight

Relevance

High - Expected foundational knowledge for AI engineering roles.
