Tokenization & Embeddings
How computers represent language numerically.
Why Tokenization Matters
Tokenization Pipeline
LLMs don't process text — they process tokens. Tokenization is the bridge between human language and the mathematical operations inside a neural network. Your choice of tokenizer directly impacts model quality, cost, and context window efficiency.
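The bridge can be sketched in a few lines. This is a toy illustration with a hypothetical four-entry vocabulary, not a real tokenizer; production tokenizers split text into subwords rather than whitespace-separated words.

```python
# Toy text -> token ID pipeline (hypothetical vocabulary, whitespace "tokenizer").
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

def encode(text: str) -> list[int]:
    # Whitespace splitting stands in for a real subword tokenizer here.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(encode("The cat sat"))  # [0, 1, 2]
print(encode("dog"))          # [3] -- out-of-vocabulary maps to <unk>
```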
Common Tokenization Algorithms
- BPE (Byte Pair Encoding): Used by GPT models. Iteratively merges the most frequent pair of bytes/characters into a single token. Good balance of vocabulary size and coverage.
- WordPiece: Used by BERT. Similar to BPE but uses likelihood instead of frequency for merging decisions.
- SentencePiece: Language-agnostic tokenizer that works directly on raw text (no pre-tokenization). Used by LLaMA and T5.
- tiktoken: OpenAI's fast BPE implementation. The standard for GPT-3.5/GPT-4 token counting.
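To make BPE's "merge the most frequent pair" step concrete, here is a minimal sketch of one training iteration. The corpus format (words pre-split into characters, mapped to frequencies) follows the classic BPE paper's presentation; real implementations like tiktoken add byte-level handling and many optimizations.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return a most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, new_sym = {}, "".join(pair)
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus: word (as a tuple of characters) -> frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)   # picks a most frequent adjacent pair
corpus = merge_pair(corpus, pair)   # e.g. ('l', 'o') becomes the symbol 'lo'
print(pair, corpus)
```

WordPiece follows the same loop but scores candidate merges by how much they increase the training data's likelihood rather than by raw pair frequency.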
Token Embeddings
After tokenization, each token ID is mapped to a dense vector (embedding) through a lookup table. These embeddings are the actual inputs to the Transformer. They encode semantic meaning — similar words have similar vectors.
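The lookup table is literally a matrix indexed by token ID. A minimal sketch with NumPy and made-up sizes (a real model's table is learned during training, not random):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 8                 # hypothetical vocabulary and embedding sizes
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = [17, 42, 256]                     # output of a tokenizer
embeddings = embedding_table[token_ids]       # row lookup: one vector per token
print(embeddings.shape)                       # (3, 8)
```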
Why This Matters for Cost
OpenAI and other providers charge per token, so a prompt that tokenizes inefficiently costs more for the same content. For example, the word "tokenization" might be 1 token or 3 tokens depending on the tokenizer.
Code Example
Using tiktoken to count and inspect tokens. Useful for estimating API costs before making calls.
import tiktoken

# GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "AI Engineering is the future of software development."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

# Cost estimation (GPT-4 pricing: $0.03 per 1K input tokens)
cost_per_1k = 0.03
estimated_cost = (len(tokens) / 1000) * cost_per_1k
print(f"Estimated cost: ${estimated_cost:.6f}")
Relevance
Medium - Affects cost and quality