PHASE 1 · FOUNDATIONS OF INTELLIGENCE
08 / 78

Tokens and discrete representation

Lesson 8. Phase 1: Foundations of intelligence. ~25 min read + cards + retrieval. Durability tier 1 (bedrock).

🧵
Memory palace · Bench · station 8
The spool of solder. Continuous-feeling input gets broken into discrete units that are joined back into larger structures downstream. The engineered granularity beneath an apparently smooth system.
Core idea. LLMs do not operate on words or meaning directly; they operate on token sequences, and the tokenisation scheme shapes what patterns become easy or difficult for the model to learn.

Why this lesson exists

Human language is the input. Tensors are what the model can compute on. The gap between the two is what tokenisation closes.

A model that operated on raw text bytes would spend most of its capacity learning what a space character means and how letters combine into words. A model that operated on entire words would need a vocabulary of millions and would fall apart the moment a novel name appeared. The compromise is to chop the input into a moderate vocabulary of discrete units, called tokens. Modern LLMs operate on token sequences. Everything they appear to know is downstream of the integer IDs the tokeniser hands them.

This is one of the most consequential representation choices in modern AI, and it gets surprisingly little airtime in popular accounts. It quietly determines what the model is fluent at, what each query costs, and where the system breaks.

Tokenisation as quantisation

A token is a discrete unit drawn from a fixed vocabulary. The tokeniser is the function that maps a raw byte stream to a sequence of token IDs. The vocabulary is the lookup table from ID back to the underlying byte sequence.

Three families. Character tokenisation: each character is its own token. Vocabulary is tiny (a few hundred entries for alphabetic scripts, though thousands for scripts such as Chinese). Sequences are long; "tokenisation" is 12 tokens. The model has to assemble every higher-level structure from characters inside its layers, which costs depth and compute.

Word tokenisation: each whitespace-delimited word is a token. Vocabulary is enormous (hundreds of thousands of distinct words for English; far more for inflected languages). Sequences are short. Generalisation to unseen words is poor; a novel name, a typo, or a compound that wasn't in training falls out of the vocabulary entirely.

Subword tokenisation: the middle path that won. The vocabulary contains common whole words ("the", "and", "model") plus common subword pieces ("ation", "ing"). Rare words decompose into multiple subword tokens; common words remain whole. Vocabulary is bounded (typically 32K to 200K). Sequences are moderate. Generalisation to unseen words is good because the subwords cover them.
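
As a rough illustration of the first two families, the sketch below tokenises the example sentence from Figure 8.1 at character and word level in plain Python. The function names are illustrative, and splitting on whitespace alone is a simplifying assumption (real word tokenisers also separate punctuation).

```python
def char_tokenise(text: str) -> list[str]:
    """Character tokenisation: every character, spaces included, is a token."""
    return list(text)


def word_tokenise(text: str) -> list[str]:
    """Word tokenisation: split on whitespace only (a simplification)."""
    return text.split()


sentence = "Tokenisation shapes everything."
char_tokens = char_tokenise(sentence)
word_tokens = word_tokenise(sentence)

print(len(char_tokens))  # 31
print(len(word_tokens))  # 3

# A closed word vocabulary has no entry for a word it never saw in training.
word_vocab = set(word_tokens)
print("Suetonius" in word_vocab)  # False
```

The character version never fails on a novel word but pays for it in length; the word version is short but has nothing to say about "Suetonius".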

The dominant algorithm is byte-pair encoding (BPE). Start with characters (or raw bytes, in byte-level variants). Repeatedly merge the most-frequent adjacent pair into a new token, until the vocabulary reaches the target size. The result is data-driven: common patterns get their own tokens; rare patterns stay decomposed. The tokeniser is a learned artifact in its own right.
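
The merge loop described above can be sketched in a few lines. This is a toy trainer under stated assumptions, not a production tokeniser: it works on a list of words rather than a byte stream, counts pairs within words only, and breaks ties by taking the first pair encountered. The function name `train_bpe` is illustrative.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


toy_corpus = ["hug", "hug", "pug", "hugs"]
learned = train_bpe(toy_corpus, num_merges=2)
print(learned)  # [('u', 'g'), ('h', 'ug')]
```

On this toy corpus the pair ("u", "g") occurs four times and is merged first; the next most frequent pair is then ("h", "ug"). Each merge adds one entry to the vocabulary, which is how the target size is reached.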

Compression and tokenisation

L2 framed prediction and compression as two views of the same operation. L7 added that representation is the third view. Tokenisation is the fourth view, applied to the input layer.

A good tokeniser is a compressed code for the input distribution. Common words appear often and get short codes (one token each). Rare words appear rarely and get longer codes (multiple tokens). This is Huffman coding's logic at the level of natural language. The tokeniser is doing what zip and gzip do, but with units that are usually meaningful at the language level.

The compression ratio matters operationally. A modern English tokeniser averages roughly one token per 3-4 characters. A tokeniser that needed one token per character would produce sequences 3-4× longer for the same content, and compute cost would grow at least in proportion to that length (attention grows faster, as the context-window economics section below explains).
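
The ratio is easy to measure for any tokenisation. A minimal sketch, using the five subword pieces shown in Figure 8.1 as the token list; the helper name is illustrative, and a single short sentence will not match a corpus-wide average.

```python
def chars_per_token(tokens: list[str]) -> float:
    """Average number of characters carried by each token."""
    return sum(len(token) for token in tokens) / len(tokens)


# Subword split of "Tokenisation shapes everything." as drawn in Figure 8.1.
subword_tokens = ["Token", "isation", " shapes", " everything", "."]
# The same text at character level: one token per character.
char_tokens = list("".join(subword_tokens))

print(chars_per_token(subword_tokens))          # 6.2
print(chars_per_token(char_tokens))             # 1.0
print(len(char_tokens) / len(subword_tokens))   # 6.2
```

The last line is the sequence-length penalty of the character scheme for this sentence: the higher the characters-per-token figure, the shorter the sequence the model has to process.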

Token frequency and the long tail

Token frequency follows a Zipfian distribution. "The" and "of" appear in nearly every English sentence; the 50,000th most common token appears once in millions of words. Vocabularies are sized around this curve. The top 32K tokens cover most of common usage; everything else fragments.
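
To get a feel for the curve, the sketch below computes how much of all usage the top-ranked items cover under an idealised Zipf law, where frequency is proportional to 1 / rank. The exponent of 1 and the figure of one million distinct items are assumptions chosen for illustration, not measurements of any real corpus.

```python
def zipf_coverage(top_k: int, num_items: int, exponent: float = 1.0) -> float:
    """Share of all occurrences covered by the top_k ranks under ideal Zipf."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    return sum(weights[:top_k]) / sum(weights)


# Assumed: one million distinct items, of which the top 32K get their own token.
coverage = zipf_coverage(top_k=32_000, num_items=1_000_000)
print(round(coverage, 2))  # roughly 0.76 under these assumptions
```

Under these assumptions about 3% of the distinct items account for roughly three quarters of all occurrences, which is why a bounded vocabulary covers common usage well and why the remainder has to fragment.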

"Tokenisation" is one token in a modern tokeniser. "Antidisestablishmentarianism" is several. A common name like "John" is one token; a rare name like "Suetonius" might be 4 or 5. The model is fluent on the common ground and gets choppy in the tail.

Arithmetic exposes this directly. The number 12 is usually one token. The number 1234 might tokenise as ["12", "34"] or ["1", "234"] or ["123", "4"] depending on the tokeniser. The model that sees "1234" as two tokens has to learn arithmetic over compounds of those tokens, which is harder than over consistent single-digit tokens. Many early "LLMs can't do arithmetic" headlines were really "LLMs can't do arithmetic in the tokenisation their training corpus produced". Character-level fine-tuning and number-aware tokenisers close most of the gap.
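
The dependence on the vocabulary is easy to reproduce. The sketch below uses greedy longest-match splitting over toy vocabularies; this is a simplification (BPE applies learned merges in order rather than longest-match), and the vocabularies are hypothetical.

```python
def greedy_tokenise(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that fits."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


digits = set("0123456789")
vocab_a = digits | {"12", "34"}   # hypothetical: two-digit chunks learned
vocab_b = digits | {"123"}        # hypothetical: a three-digit chunk learned

print(greedy_tokenise("1234", vocab_a))  # ['12', '34']
print(greedy_tokenise("1234", vocab_b))  # ['123', '4']
print(greedy_tokenise("1234", digits))   # ['1', '2', '3', '4']
```

Only the digit-only vocabulary gives every number the same place-value layout; the other two hand the model chunks whose boundaries shift with the digits involved.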

Multilingual and code tokenisation

Tokenisers trained on English-heavy corpora are efficient for English and markedly less efficient for most other languages. Chinese and Japanese, which use no spaces and have thousands of characters, end up at roughly one token per character or worse, much closer to character tokenisation than to English subword efficiency. A Chinese paragraph that conveys the same content as an English paragraph may need 3-5× as many tokens. The same query in English and Chinese, served by the same API, costs different amounts because the Chinese version uses more tokens. Multilingual model families now ship tokenisers trained on more representative corpora, but the asymmetry persists.
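
One way to see the worst case is to count UTF-8 bytes, since a byte-level tokeniser with no learned merges for a script falls back to one token per byte. The sketch below is a proxy under that assumption, not a measurement of any real tokeniser; the sample strings are arbitrary greetings.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Bytes per character in UTF-8: the token cost under pure byte fallback."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "english": "hello",
    "chinese": "你好",
    "korean": "안녕하세요",
}

ratios = {name: utf8_bytes_per_char(text) for name, text in samples.items()}
for name, ratio in ratios.items():
    print(name, ratio)
# english 1.0
# chinese 3.0
# korean 3.0
```

English starts at one byte per character and then gets compressed further by learned merges; CJK and Hangul characters start at three bytes each and only improve if the vocabulary contains merges for them.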

Source code has different statistical structure from natural language. Whitespace is meaningful in Python; brackets and operators cluster densely. Identifier names follow camelCase or snake_case patterns that English-trained tokenisers don't capture well. A function name like compute_partial_gradients might fragment into 5 tokens. Code-specialist models retrain the tokeniser on code corpora and get visibly better token efficiency on source files. Emoji and rare unicode produce their own quirks: some tokenisers represent each emoji as one token; others fall back to multiple UTF-8 bytes.

Context-window economics

Token count is the operational currency of LLM inference. Attention scales roughly quadratically with the number of tokens, so a 4K-token prompt costs about 4× the attention compute of a 2K prompt, and an 8K prompt costs about 16× that of the 2K prompt.
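
The quadratic rule reduces to one line of arithmetic. A minimal sketch, assuming pure quadratic scaling and ignoring the parts of the model that scale linearly; the function name is illustrative.

```python
def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Attention compute relative to a baseline, assuming pure quadratic scaling."""
    return (tokens / baseline_tokens) ** 2


print(relative_attention_cost(4096, 2048))  # 4.0
print(relative_attention_cost(8192, 2048))  # 16.0
```

Doubling the sequence quadruples the attention term, so a tokeniser that halves the token count for the same content cuts that term to a quarter.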

The KV cache (the per-token state the model holds during generation) grows linearly with sequence length. For a 70B-parameter model, the cache might be roughly 0.5-2 MB per token; a 32K context window means many gigabytes of VRAM committed just to remembered state. Serving a model with a million-token context window is more a memory engineering problem than a compute one.
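
A back-of-envelope version of that estimate: each token stores one key vector and one value vector per layer per KV head. The configuration below (80 layers, 16 KV heads, head dimension 128, 2-byte values) is a hypothetical chosen to land inside the range quoted above; it does not describe any particular model.

```python
def kv_cache_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache added by one token."""
    # Leading 2 = one key vector plus one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, for illustration only.
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=16, head_dim=128)
context_tokens = 32_768
total = per_token * context_tokens

print(per_token / 2**20)  # 0.625 (MiB per token)
print(total / 2**30)      # 20.0 (GiB for a full 32K context)
```

The formula is linear in every factor, which is why reducing the number of KV heads shrinks the cache in direct proportion.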

Inference throughput is bounded by both. A model with cheaper tokens (fewer tokens for the same content) serves more queries per second on the same hardware. Tokeniser quality is therefore a direct lever on deployment cost. Models that compress a domain (English text, Python code) into fewer tokens than competitors don't just feel faster; they cost less to serve at scale.

Hardware and memory

The embedding table is the most-touched data structure in the model: vocabulary size times embedding dimension parameters, sitting in VRAM, hit per token on every forward pass. Larger vocabularies cost more memory but produce shorter sequences. The trade-off is real and is tuned per use case.
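
The memory side of that trade-off is simple multiplication. The sketch below compares the two ends of the 32K to 200K vocabulary range mentioned earlier; the embedding dimension of 8,192 and 2-byte parameters are assumptions for illustration.

```python
def embedding_table_bytes(
    vocab_size: int, embed_dim: int, bytes_per_param: int = 2
) -> int:
    """Memory held by the embedding table: one row of embed_dim values per token."""
    return vocab_size * embed_dim * bytes_per_param


EMBED_DIM = 8_192  # assumed width, not taken from any specific model

small = embedding_table_bytes(32_000, EMBED_DIM)
large = embedding_table_bytes(200_000, EMBED_DIM)

print(small)          # 524288000
print(large)          # 3276800000
print(large / small)  # 6.25
```

The larger table costs several gigabytes more, and whether that is worth paying depends on how much shorter the resulting sequences are for the text the model actually serves.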

Token IDs are integers; embedding lookups are scattered reads, one of the operations modern accelerators have to handle well. Vendor compilers spend real engineering effort on making embedding gathers fast. The KV cache is the dominant inference-time memory cost and grows with token count. Architectural choices about grouped-query, multi-query, or full multi-head attention are largely choices about how aggressively to compress the KV cache so longer sequences fit. Those treatments come later in the course; here the point is that tokenisation choice and KV cache architecture both push on the same constraint: how much memory each token in the sequence eats.

Failure modes

Fragmented rare words. A novel name or technical term that breaks into many tokens has poorer representational coherence than a name that's a single token. The model has to compose meaning across multiple positions; the resulting embeddings can be unstable.

Arithmetic instability. Numbers tokenised inconsistently across lengths produce arithmetic that works reliably only at the lengths whose tokenisation the model saw enough of in training.

Token-boundary weirdness. Asking a model to reverse a string, count letters, or do letter-level edits stumbles because the model can't see individual characters; it sees only token IDs. "How many R's are in strawberry?" is hard because the model has no direct character access.
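
The gap between the string view and the token view can be made concrete. In the sketch below the split into "straw" and "berry" and the IDs 101 and 202 are hypothetical; real tokenisers may split the word differently.

```python
# Hypothetical vocabulary entries and IDs, for illustration only.
vocab = {"straw": 101, "berry": 202}
id_to_text = {token_id: text for text, token_id in vocab.items()}

word = "strawberry"
token_ids = [vocab["straw"], vocab["berry"]]

# String view: counting letters is trivial.
print(word.count("r"))  # 3

# Model view: two integers, with no letters anywhere in them.
print(token_ids)  # [101, 202]

# The letters only reappear after decoding through the vocabulary table.
decoded = "".join(id_to_text[token_id] for token_id in token_ids)
print(decoded.count("r"))  # 3
```

The model receives only the second view. Any letter-level fact has to be learned as an association with the ID itself, since the decoding step happens outside the network.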

Prompt-injection edge cases. Some attacks rely on tokenisation quirks: an input crafted so an instruction gets broken across token boundaries can sometimes bypass a filter that checks at the string level but not at the token level. Tokenisation is part of the security surface.

Multilingual collapse. A model trained primarily on English-tokenised text performs visibly worse in languages whose tokenisation in that tokeniser is inefficient. The training run spent most of its capacity on token sequences whose statistics don't match the deployment text.

Three views of tokenisation

Figure 8.1 makes the trade-offs visible. The top panel shows the same sentence under 3 schemes side by side, with token counts. The middle panel shows the Zipfian frequency curve and how rare items fragment. The bottom panel shows what happens to compute and memory as the sequence length grows.

[Figure 8.1, three panels. Panel 1 · same sentence, three tokenisers: the input "Tokenisation shapes everything." is 31 tokens at character level (tiny vocab; very long sequences), 3 tokens at word level (huge vocab; fails on novel words), and 5 tokens under subword BPE ("Token", "isation", "·shapes", "·everything", "."; bounded vocab; generalises). Panel 2 · token frequency (Zipfian): frequency against rank on a log axis, running from "the" at rank ~1 out to rare names such as "Suetonius" in the long tail; common items get short codes, rare items fragment. Panel 3 · context-window economics: cost against sequence length in tokens (2K to 128K), with attention compute quadratic and KV cache and embedding lookups linear.]
FIG 8.1. Three views of tokenisation. Top: same sentence under character, word, and subword schemes. Token counts differ by an order of magnitude with very different downstream consequences. Middle: token frequency is Zipfian; common items get single tokens, rare items fragment, and arithmetic tokenises unstably across number lengths. Bottom: cost over sequence length. Attention is quadratic; KV cache and embedding lookups are linear. Token count is the operational currency of LLM inference.

The L1 to L7 view

In L1's loop, tokens are the form of the input the system actually computes on. In L2's terms, the tokeniser is a compression of natural language into a code where common items get short representations. In L3's terms, generalisation across the long tail depends on whether the tokeniser's subwords cover the variation in deployment text. In L4's terms, scaling shifts which capabilities are achievable for a given token budget. In L5's terms, the tokeniser plus the next-token loss is the entire learning signal of self-supervised LLM training. In L6's terms, the sequence of generated tokens is the trajectory the model produces. In L7's terms, tokenisation is the first layer of representation, before any learned embedding.

The takeaway

LLMs do not operate on words. They operate on sequences of token IDs from a fixed vocabulary. The tokeniser shapes what's easy to learn, what's expensive to serve, where the model is fluent, and where it quietly fails.

The spool of solder unspools in discrete segments and gets joined back into larger structures downstream. Same shape in the model: continuous-feeling language goes in, discrete units come out, the model computes on the units, and the continuity reappears only at the surface.

Retrieval practice

Write your answer first. Then reveal. Don't peek. Getting it wrong is how the memory forms.

L8 Compare character, word, and subword tokenisation on (i) capability (what the model can learn from a fixed training corpus), and (ii) compute cost (training and inference). Use the same example throughout: training a model on a Python codebase that contains identifier names like compute_partial_gradients and string literals containing English prose.
Character tokenisation. Capability: the model has to assemble all higher-level structure (subwords, words, identifiers, syntax patterns) from character sequences inside its layers. With enough capacity and training it can, but it spends a lot of depth doing what a tokeniser could have done up front. Generalisation to unseen identifiers is excellent because every identifier is just a character sequence. Compute cost: brutal. Sequences are roughly 3-4× longer than their subword equivalents (at about 3-4 characters per token), and attention scales quadratically with length, so training and inference both pay heavily. The KV cache grows correspondingly.
Word tokenisation. Capability: the model sees each whole identifier as one token, which is fluent for the identifiers it saw in training but disastrous for novel ones. compute_partial_gradients is one token if it appeared often enough; otherwise it falls out of the vocabulary entirely. Generalisation to unseen identifiers is poor. Compute cost: shortest sequences and fastest inference per query, but the vocabulary needed to cover real code is huge and the embedding table dominates memory.
Subword (BPE) tokenisation. Capability: balanced. Common code patterns ("def ", "return", "self.") become single tokens; identifiers decompose into reusable subwords ("compute", "partial", "gradients", "_"). The model learns identifier structure from the subwords and can compose meaning across novel compounds. English string literals tokenise efficiently because BPE trained on a code corpus has seen English mixed in. Compute cost: moderate sequence length, bounded vocabulary, well matched to GPU memory budgets. This is why subword tokenisation became standard for code and natural language alike.
The general lesson: the tokeniser is the first layer of representation, and the choice determines what the model can learn easily, how much it costs to serve, and where the long tail of failures will land.
L8 An LLM is serving English customers well but performing poorly on Korean inputs despite having Korean in its training corpus. Without invoking attention internals or training procedure, sketch 3 mechanistic explanations rooted in tokenisation, and describe what you'd measure to discriminate between them.
Three candidates. (1) Tokeniser inefficiency on Korean. The vocabulary is dominated by English-frequent subwords, so Korean text fragments into far more tokens than equivalent English. Each Korean character may consume 2-3 tokens of UTF-8 bytes if the tokeniser didn't include Korean subwords. Consequence: the model sees Korean sequences that are 3-5× longer than English for the same content, each token carries less meaning, and the learned representations of those fragments are noisier. (2) Training-corpus token distribution mismatch. Even if the tokeniser nominally covers Korean, the training corpus may have been English-heavy by token count, so the model saw Korean subwords much less often. The optimiser sharpens predictions where it has signal; Korean tokens received fewer gradient updates because they appeared less often. The model is fluent on the part of the token vocabulary the corpus emphasised. (3) Context-window pressure for Korean. Because Korean is more token-dense, the same effective context (a paragraph of meaning) consumes more of the context window. Long Korean documents truncate earlier than English documents of the same character length, so the model has less context to condition on. To discriminate: (i) tokenise a held-out Korean test set and compare tokens per character against English on the same corpus type; if Korean is much higher, the tokeniser is the immediate problem. (ii) Sample token frequencies from the model's training corpus (or a proxy) and compare per-language total token counts; if Korean tokens are 10× rarer, the training-distribution explanation is dominant. (iii) Compare Korean performance when inputs are truncated at the same token budget as the equivalent English input against performance when the Korean input is allowed enough tokens to carry the same content; if the gap closes substantially once content is equalised, context-window pressure is contributing.
The fix follows from the cause: retrain or extend the tokeniser on Korean-heavy data, rebalance the training corpus, or give Korean inputs a larger effective token budget.
↳ L9 (Forward interleave to L9, embeddings intuition.) Token IDs are integers. The model can't compute usefully on integers directly; it converts each token ID into a vector via an embedding table. Without using any maths, sketch why this conversion is necessary, what kinds of structure embeddings can capture that raw token IDs can't, and what would change if you tried to use token IDs directly as the model's input.
Token IDs are arbitrary labels: token 1273 and token 1274 might mean completely unrelated things (the assignment is determined by where they fell in vocabulary construction, not by meaning). A model fed raw token IDs as numbers would inherit a spurious ordering in which 1273 looks close to 1274 and far from 50,000; a model that treats them as pure categorical labels sees every pair of tokens as equally unrelated. Either way, the optimiser cannot exploit the fact that some tokens are operationally similar to others. The embedding table maps each token ID to a high-dimensional continuous vector. The vectors are learned jointly with the rest of the model, and the geometry of the resulting embedding space encodes relationships: tokens that appear in similar contexts (and therefore are operationally similar for prediction) end up at similar points. This is what makes the downstream computation work. Attention can compute similarities between token embeddings; MLPs can learn functions over the features those vectors encode; the whole architecture is built on the assumption that the input is a continuous representation, not a categorical ID. If you tried to use token IDs directly, you'd either (i) one-hot encode them, which blows the input up to vocabulary size and, once multiplied by a learned weight matrix, amounts to an embedding lookup done wastefully, or (ii) feed the integers as numbers and impose a meaningless ordering, forcing the model to memorise per-ID behaviour. Neither works at scale. Embeddings are the next step: continuous representations whose geometry the optimiser is allowed to shape. L9 picks this up directly.

Next station

Lesson 9 sits at the compass on the bench (station 9), where token IDs become dense vectors and direction in high-dimensional space starts carrying operational meaning.