Tokenization & Embeddings

How computers represent language numerically.

Why Tokenization Matters

Tokenization Pipeline

Raw text ("Hello world") → Tokenizer (BPE / SentencePiece) → Token IDs ([15496, 995]) → Embeddings (768-dimensional vectors)

LLMs don't process text — they process tokens. Tokenization is the bridge between human language and the mathematical operations inside a neural network. Your choice of tokenizer directly impacts model quality, cost, and context window efficiency.

Common Tokenization Algorithms

  • BPE (Byte Pair Encoding): Used by GPT models. Iteratively merges the most frequent pair of bytes/characters into a single token. Good balance of vocabulary size and coverage.
  • WordPiece: Used by BERT. Similar to BPE but uses likelihood instead of frequency for merging decisions.
  • SentencePiece: Language-agnostic tokenizer that works directly on raw text (no pre-tokenization). Used by LLaMA and T5.
  • tiktoken: OpenAI's fast BPE implementation. The standard for GPT-3.5/GPT-4 token counting.
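To make the BPE idea concrete, here is a minimal sketch of the merge loop (a toy illustration of the algorithm, not any library's actual implementation): count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent pairs in the symbol sequence; return the most frequent."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters; repeated merges build a subword vocabulary.
symbols = list("low lower lowest")
for _ in range(3):
    symbols = merge_pair(symbols, most_frequent_pair(symbols))
print(symbols)  # the frequent subword "low" emerges as a single token
```

After three merges the common prefix "low" has become one symbol, which is exactly why BPE compresses frequent words into single tokens while rare words stay split into pieces.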

Token Embeddings

After tokenization, each token ID is mapped to a dense vector (embedding) through a lookup table. These embeddings are the actual inputs to the Transformer. They encode semantic meaning — similar words have similar vectors.
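The lookup itself is just row indexing into a matrix. A minimal sketch with a toy vocabulary and dimensions (randomly initialized, as embeddings are before training updates them):

```python
import random

random.seed(0)

VOCAB_SIZE, EMBED_DIM = 8, 4  # toy sizes; e.g. GPT-2 uses ~50K tokens x 768 dims

# The embedding table: one learnable row vector per token ID.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Map each token ID to its row in the table (the Transformer's input)."""
    return [embedding_table[t] for t in token_ids]

vectors = embed([3, 5, 3])
print(len(vectors), len(vectors[0]))   # 3 vectors, each EMBED_DIM wide
print(vectors[0] == vectors[2])        # True: same ID, same embedding
```

Note that identical token IDs always map to identical vectors; context-dependent meaning is added later by the Transformer's attention layers, not by this lookup.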

Why This Matters for Cost

OpenAI and other providers charge per token. A poorly tokenized prompt wastes money. For example, the word "tokenization" might be 1 token or 3 tokens depending on the tokenizer.

Code Example

Using tiktoken to count and inspect tokens. Useful for estimating API costs before making calls.

python
import tiktoken

# GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "AI Engineering is the future of software development."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

# Cost estimation (GPT-4 pricing: $0.03 per 1K input tokens)
cost_per_1k = 0.03
estimated_cost = (len(tokens) / 1000) * cost_per_1k
print(f"Estimated cost: ${estimated_cost:.6f}")

Use Cases

  • Estimating API costs before production deployment
  • Optimizing prompts to fit within context windows
  • Understanding why certain languages use more tokens
  • Building custom tokenizers for domain-specific applications
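For instance, fitting a prompt into a context window means truncating in token space and decoding back, not cutting characters. A small sketch of that pattern, demonstrated with a stand-in whitespace tokenizer (in practice you would pass tiktoken's `enc.encode` / `enc.decode`):

```python
def truncate_to_budget(text, encode, decode, max_tokens):
    """Trim text to at most max_tokens tokens, cutting in token space."""
    tokens = encode(text)
    return decode(tokens[:max_tokens])

# Toy whitespace "tokenizer" for demonstration only;
# swap in a real tokenizer's encode/decode functions.
toy_encode = lambda s: s.split()
toy_decode = lambda toks: " ".join(toks)

print(truncate_to_budget("one two three four five", toy_encode, toy_decode, 3))
# one two three
```

Cutting in token space guarantees the result fits the budget exactly; cutting by characters can over- or undershoot because token lengths vary.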

Common Mistakes

  • Assuming 1 word = 1 token; on average, 1 token ≈ 0.75 words for English text
  • Not accounting for special tokens (<BOS>, <EOS>) that consume context window space
  • Ignoring that code typically uses more tokens per line than natural language
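The 0.75-words-per-token rule of thumb supports a quick back-of-the-envelope estimate when no tokenizer is at hand. A sketch (a rough heuristic for English prose only; the price constant is illustrative and should come from your provider's current pricing):

```python
def rough_token_estimate(text, words_per_token=0.75):
    """Estimate token count from word count (English prose heuristic only)."""
    return round(len(text.split()) / words_per_token)

def rough_cost(text, price_per_1k=0.03):
    """Ballpark input cost in dollars at an illustrative per-1K-token price."""
    return rough_token_estimate(text) * price_per_1k / 1000

prompt = "Summarize the attached report in three bullet points."
print(rough_token_estimate(prompt))  # 11 (8 words / 0.75)
```

For anything that matters in production, count real tokens with the model's actual tokenizer; this heuristic breaks down badly on code and on non-English text.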

Interview Insight

Relevance: Medium (affects cost and quality)
