Tokenization & Embeddings

How computers represent language numerically.

Why Tokenization Matters

Tokenization Pipeline

Raw text ("Hello world") → Tokenizer (BPE / SentencePiece) → Token IDs ([15496, 995]) → Embeddings (768-dimensional vectors)

LLMs don't process text — they process tokens. Tokenization is the bridge between human language and the mathematical operations inside a neural network. Your choice of tokenizer directly impacts model quality, cost, and context window efficiency.

Common Tokenization Algorithms

  • BPE (Byte Pair Encoding): Used by GPT models. Iteratively merges the most frequent pair of bytes/characters into a single token. Good balance of vocabulary size and coverage.
  • WordPiece: Used by BERT. Similar to BPE but uses likelihood instead of frequency for merging decisions.
  • SentencePiece: Language-agnostic tokenizer that works directly on raw text (no pre-tokenization). Used by LLaMA and T5.
  • tiktoken: OpenAI's fast BPE implementation. The standard for GPT-3.5/GPT-4 token counting.
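To make the BPE idea concrete, here is a minimal sketch of the merge loop (a toy illustration of the algorithm, not any library's actual implementation): count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent pairs in the symbol sequence; return the most frequent."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters; repeated merges build a subword vocabulary.
symbols = list("low lower lowest")
for _ in range(3):
    symbols = merge_pair(symbols, most_frequent_pair(symbols))
print(symbols)  # the frequent subword "low" emerges as a single token
```

After three merges the common prefix "low" has become one symbol, which is exactly why BPE compresses frequent words into single tokens while rare words stay split into pieces.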

Token Embeddings

After tokenization, each token ID is mapped to a dense vector (embedding) through a lookup table. These embeddings are the actual inputs to the Transformer. They encode semantic meaning — similar words have similar vectors.
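The lookup itself is just row indexing into a matrix. A minimal sketch with a toy vocabulary and dimensions (randomly initialized, as embeddings are before training updates them):

```python
import random

random.seed(0)

VOCAB_SIZE, EMBED_DIM = 8, 4  # toy sizes; e.g. GPT-2 uses ~50K tokens x 768 dims

# The embedding table: one learnable row vector per token ID.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Map each token ID to its row in the table (the Transformer's input)."""
    return [embedding_table[t] for t in token_ids]

vectors = embed([3, 5, 3])
print(len(vectors), len(vectors[0]))   # 3 vectors, each EMBED_DIM wide
print(vectors[0] == vectors[2])        # True: same ID, same embedding
```

Note that identical token IDs always map to identical vectors; context-dependent meaning is added later by the Transformer's attention layers, not by this lookup.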

Why This Matters for Cost

OpenAI and other providers charge per token. A poorly tokenized prompt wastes money. For example, the word "tokenization" might be 1 token or 3 tokens depending on the tokenizer.

Code Example

Using tiktoken to count and inspect tokens. Useful for estimating API costs before making calls.

python
import tiktoken

# GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "AI Engineering is the future of software development."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

# Cost estimation (GPT-4 pricing: $0.03 per 1K input tokens)
cost_per_1k = 0.03
estimated_cost = (len(tokens) / 1000) * cost_per_1k
print(f"Estimated cost: ${estimated_cost:.6f}")

Use Cases

  • Estimating API costs before production deployment
  • Optimizing prompts to fit within context windows
  • Understanding why certain languages use more tokens
  • Building custom tokenizers for domain-specific applications
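For instance, fitting a prompt into a context window means truncating in token space and decoding back, not cutting characters. A small sketch of that pattern, demonstrated with a stand-in whitespace tokenizer (in practice you would pass tiktoken's `enc.encode` / `enc.decode`):

```python
def truncate_to_budget(text, encode, decode, max_tokens):
    """Trim text to at most max_tokens tokens, cutting in token space."""
    tokens = encode(text)
    return decode(tokens[:max_tokens])

# Toy whitespace "tokenizer" for demonstration only;
# swap in a real tokenizer's encode/decode functions.
toy_encode = lambda s: s.split()
toy_decode = lambda toks: " ".join(toks)

print(truncate_to_budget("one two three four five", toy_encode, toy_decode, 3))
# one two three
```

Cutting in token space guarantees the result fits the budget exactly; cutting by characters can over- or undershoot because token lengths vary.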

Common Mistakes

  • Assuming 1 word = 1 token; on average, 1 token ≈ 0.75 words for English text
  • Not accounting for special tokens (<BOS>, <EOS>) that consume context window space
  • Ignoring that code typically uses more tokens per line than natural language
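The 0.75-words-per-token rule of thumb supports a quick back-of-the-envelope estimate when no tokenizer is at hand. A sketch (a rough heuristic for English prose only; the price constant is illustrative and should come from your provider's current pricing):

```python
def rough_token_estimate(text, words_per_token=0.75):
    """Estimate token count from word count (English prose heuristic only)."""
    return round(len(text.split()) / words_per_token)

def rough_cost(text, price_per_1k=0.03):
    """Ballpark input cost in dollars at an illustrative per-1K-token price."""
    return rough_token_estimate(text) * price_per_1k / 1000

prompt = "Summarize the attached report in three bullet points."
print(rough_token_estimate(prompt))  # 11 (8 words / 0.75)
```

For anything that matters in production, count real tokens with the model's actual tokenizer; this heuristic breaks down badly on code and on non-English text.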

Interview Insight

Relevance: Medium (affects cost and quality)
