Human language is the input. Tensors are what the model can compute on. The gap between the two is what tokenisation closes.
A model that operated on raw text bytes would spend most of its capacity learning what a space character means and how letters combine into words. A model that operated on entire words would need a vocabulary of millions and would fall apart the moment a novel name appeared. The compromise is to chop the input into a moderate vocabulary of discrete units, called tokens. Modern LLMs operate on token sequences. Everything they appear to know is downstream of the integer IDs the tokeniser hands them.
This is the most consequential representation choice in modern AI and gets surprisingly little airtime in popular accounts. It quietly determines what the model is fluent at, what each query costs, and where the system breaks.
A token is a discrete unit drawn from a fixed vocabulary. The tokeniser is the function that maps a raw byte stream to a sequence of token IDs. The vocabulary is the lookup table from ID back to the underlying byte sequence.
Three families. Character tokenisation: each character is its own token. Vocabulary is tiny (a few hundred for most scripts). Sequences are long; "tokenisation" is 12 tokens. The model has to assemble every higher-level structure from characters inside its layers, which costs depth and compute.
Word tokenisation: each whitespace-delimited word is a token. Vocabulary is enormous (hundreds of thousands of distinct words for English; far more for inflected languages). Sequences are short. Generalisation to unseen words is poor; a novel name, a typo, or a compound that wasn't in training falls out of the vocabulary entirely.
Subword tokenisation: the middle path that won. The vocabulary contains common whole words ("the", "and", "model") plus common subword pieces ("ation", "ing"). Rare words decompose into multiple subword tokens; common words remain whole. Vocabulary is bounded (typically 32K to 200K). Sequences are moderate. Generalisation to unseen words is good because the subwords cover them.
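The three families can be compared side by side on a short string. The sketch below is illustrative only: `TOY_VOCAB` is a hand-picked, hypothetical vocabulary, and greedy longest-match splitting stands in for whatever segmentation rule a real tokeniser uses.

```python
def char_tokens(text: str) -> list[str]:
    """Character tokenisation: every character (spaces included) is a token."""
    return list(text)


def word_tokens(text: str) -> list[str]:
    """Word tokenisation: split on whitespace."""
    return text.split()


# Hypothetical toy vocabulary, chosen by hand for this example.
TOY_VOCAB = {"the", "model", "token", "is", "ation", "ing"}


def subword_tokens(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Greedy longest-match split of each word, falling back to single characters."""
    out = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest remaining piece first, shrinking until a match.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    out.append(piece)
                    i = j
                    break
    return out


sentence = "the model tokenisation"
print(len(char_tokens(sentence)), char_tokens("tokenisation"))
print(len(word_tokens(sentence)), word_tokens(sentence))
print(len(subword_tokens(sentence)), subword_tokens(sentence))
```

With this toy vocabulary the rare word decomposes into pieces while the common words stay whole, which is the behaviour described above.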
The dominant algorithm is byte-pair encoding (BPE). Start with characters. Repeatedly merge the most-frequent adjacent pair into a new token, until the vocabulary reaches the target size. The result is data-driven: common patterns get their own tokens; rare patterns stay decomposed. The tokeniser is a learned artifact in its own right.
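The merge loop just described can be sketched in a few lines. This is a toy trainer under simplifying assumptions: it works on characters rather than bytes, treats each word independently, and breaks frequency ties by first occurrence. The function names and the tiny corpus are made up for illustration; production tokenisers add pre-tokenisation, byte fallback, and special tokens.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(toy_words, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

Note that "lowest" never appears in the toy corpus, yet it encodes into two learned pieces; that is the generalisation property subword vocabularies are chosen for.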
L2 framed prediction and compression as two views of the same operation. L7 added that representation is the third view. Tokenisation is the fourth view, applied to the input layer.
A good tokeniser is a compressed code for the input distribution. Common words appear often and get short codes (one token each). Rare words appear rarely and get longer codes (multiple tokens). This is Huffman coding's logic at the level of natural language. The tokeniser is doing what zip and gzip do, but with units that are usually meaningful at the language level.
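To make the Huffman analogy concrete, the sketch below builds Huffman code lengths for a handful of words. The word counts are invented for illustration, and the analogy is loose: a tokeniser emits tokens rather than bits, and BPE is not an optimal prefix code. The point is only that frequent items end up with shorter codes.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs: dict[str, int]) -> dict[str, int]:
    """Return the Huffman code length (in bits) for each symbol."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, depths1 = heapq.heappop(heap)
        f2, _, depths2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Made-up counts: two very common words, one mid-frequency, one rare.
toy_counts = {"the": 50, "of": 25, "model": 15, "Suetonius": 10}
lengths = huffman_code_lengths(toy_counts)
print(lengths)
```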
The compression ratio matters operationally. A modern English tokeniser compresses into roughly 1 token per 3-4 characters. A tokeniser that needed 1 token per character would produce sequences 3-4× longer for the same content, and the compute cost would scale with that length.
Token frequency follows a Zipfian distribution. "The" and "of" appear in nearly every English sentence; the 50,000th most common token appears once in millions of words. Vocabularies are sized around this curve. The top 32K tokens cover most of common usage; everything else fragments.
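A rough feel for how vocabulary size interacts with the Zipf curve comes from an idealised model in which the frequency of the item at rank r is proportional to 1/r. The sketch below assumes that idealised law with exponent 1; real corpora deviate from it, so the numbers are indicative only.

```python
def zipf_coverage(top_k: int, n_types: int, exponent: float = 1.0) -> float:
    """Fraction of all occurrences covered by the top_k ranks under an ideal Zipf law."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_types + 1)]
    return sum(weights[:top_k]) / sum(weights)


# Under this toy model, the top 10% of types covers well over half of all occurrences.
print(round(zipf_coverage(10, 100), 3))
print(round(zipf_coverage(32_000, 1_000_000), 3))
```

The shape is what matters: each additional block of vocabulary buys less coverage than the one before it, which is why vocabularies are capped and the tail is left to fragment.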
"Tokenisation" is one token in a modern tokeniser. "Antidisestablishmentarianism" is several. A common name like "John" is one token; a rare name like "Suetonius" might be 4 or 5. The model is fluent on the common ground and gets choppy in the tail.
Arithmetic exposes this directly. The number 12 is usually one token. The number 1234 might tokenise as ["12", "34"] or ["1", "234"] or ["123", "4"] depending on the tokeniser. The model that sees "1234" as two tokens has to learn arithmetic over compounds of those tokens, which is harder than over consistent single-digit tokens. Many early "LLMs can't do arithmetic" headlines were really "LLMs can't do arithmetic in the tokenisation their training corpus produced". Character-level fine-tuning and number-aware tokenisers close most of the gap.
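The different splits of 1234 mentioned above can be reproduced with two simple chunking rules. These helpers are hypothetical illustrations of how a number might be segmented; they are not taken from any particular tokeniser.

```python
def chunk_digits_left(number: str, size: int) -> list[str]:
    """Group digits into chunks of `size`, starting from the left."""
    return [number[i:i + size] for i in range(0, len(number), size)]


def chunk_digits_right(number: str, size: int) -> list[str]:
    """Group digits into chunks of `size`, aligned to the right (like thousands separators)."""
    head = len(number) % size
    chunks = [number[:head]] if head else []
    chunks += [number[i:i + size] for i in range(head, len(number), size)]
    return chunks


print(chunk_digits_left("1234", 2))   # ['12', '34']
print(chunk_digits_left("1234", 3))   # ['123', '4']
print(chunk_digits_right("1234", 3))  # ['1', '234']
print(chunk_digits_left("1234", 1))   # ['1', '2', '3', '4']
```

Only the single-digit scheme gives every number the same structure regardless of length, which is one reason digit-level splitting tends to make arithmetic easier to learn.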
Tokenisers trained on English-heavy corpora are great for English and inefficient for everything else. Chinese and Japanese, which use no spaces and have thousands of characters, end up with character-per-token efficiency much closer to character tokenisation. A Chinese paragraph that conveys the same content as an English paragraph may need 3-5× as many tokens. The same query in English and Chinese, served by the same API, costs different amounts because the Chinese version uses more tokens. Multilingual model families now ship tokenisers trained on more representative corpora, but the asymmetry persists.
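One mechanical source of the asymmetry is visible at the byte level. Assuming a byte-level tokeniser that has learned few or no merges for a script, the token count approaches the UTF-8 byte count. The sketch below measures only that worst case; it does not model any real tokeniser's merges.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


english = "hello"
chinese = "\u4f60\u597d"  # the two-character greeting "ni hao"

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(chinese))  # 3.0
```

ASCII text starts at one byte per character before any merging, while common CJK characters start at three, so a tokeniser has to spend more of its merge budget on those scripts just to reach parity.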
Source code has different statistical structure from natural language. Whitespace is meaningful in Python; brackets and operators cluster densely. Identifier names follow camelCase or snake_case patterns that English-trained tokenisers don't capture well. A function name like compute_partial_gradients might fragment into 5 tokens. Code-specialist models retrain the tokeniser on code corpora and get visibly better token efficiency on source files. Emoji and rare unicode produce their own quirks: some tokenisers represent each emoji as one token; others fall back to multiple UTF-8 bytes.
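The structure in identifiers and emoji can be made visible with a short sketch. `split_identifier` is a hypothetical pre-tokenisation heuristic written for this example, not a description of how any particular code tokeniser works; the emoji line just counts UTF-8 bytes, which is the worst-case length under byte fallback.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split a snake_case or camelCase identifier into its word-like parts."""
    parts = []
    for chunk in name.split("_"):
        # Acronym runs, capitalised or lowercase words, and digit runs.
        parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
    return parts


print(split_identifier("compute_partial_gradients"))
print(split_identifier("computePartialGradients"))
print(split_identifier("parseHTTPResponse"))

emoji = "\U0001F642"  # slightly smiling face
print(len(emoji), len(emoji.encode("utf-8")))  # 1 code point, 4 bytes
```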
Token count is the operational currency of LLM inference. Attention compute scales roughly quadratically with the number of tokens, so a 4K-token prompt costs about 4× the attention compute of a 2K prompt, and an 8K prompt costs about 16× that of the 2K prompt.
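The quadratic relationship is easy to check numerically. The helper below is a back-of-envelope sketch that assumes pure n² scaling of attention and ignores constant factors and the linear-cost parts of the model.

```python
def attention_cost_ratio(n_tokens: int, baseline_tokens: int) -> float:
    """Relative attention compute versus a baseline length, assuming O(n^2) scaling."""
    return (n_tokens / baseline_tokens) ** 2


print(attention_cost_ratio(4096, 2048))  # 4.0
print(attention_cost_ratio(8192, 2048))  # 16.0
```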
The KV cache (the per-token state the model holds during generation) grows linearly with sequence length. For a 70B-parameter model, the cache might be roughly 0.5-2 MB per token; a 32K context window means many gigabytes of VRAM committed just to remembered state. Serving a model with a million-token context window is more a memory engineering problem than a compute one.
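The per-token figure follows from a simple product: two tensors (keys and values) per layer, each with one vector per KV head. The sketch below uses illustrative shapes that are assumptions for a 70B-class model (80 layers, head dimension 128, 16-bit values, and either 64 or 8 KV heads), not a published configuration.

```python
def kv_cache_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


MIB = 2 ** 20
GIB = 2 ** 30

# Assumed shapes: full multi-head attention versus a grouped-query variant.
full_mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
grouped = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

print(full_mha / MIB, grouped / MIB)     # 2.5 and 0.3125 MiB per token
context = 32 * 1024
print(full_mha * context / GIB)          # 80.0 GiB at 32K tokens
print(grouped * context / GIB)           # 10.0 GiB at 32K tokens
```

Under these assumptions the two variants bracket the rough per-token range quoted above, and either one commits gigabytes at a 32K context.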
Inference throughput is bounded by both. A model with cheaper tokens (fewer tokens for the same content) serves more queries per second on the same hardware. Tokeniser quality is therefore a direct lever on deployment cost. Models that compress a domain (English text, Python code) into fewer tokens than competitors don't just feel faster; they cost less to serve at scale.
The embedding table is one of the most frequently read data structures in the model: it holds vocabulary size × embedding dimension parameters, sits in VRAM, and is read once per input token on every forward pass. Larger vocabularies cost more memory but produce shorter sequences. The trade-off is real and is tuned per use case.
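The memory side of that trade-off is a single multiplication, and the lookup itself is a row gather. The sketch below assumes an embedding dimension of 4096 and 16-bit parameters purely for illustration, and uses plain Python lists in place of accelerator tensors.

```python
def embedding_table_bytes(vocab_size: int, embed_dim: int, bytes_per_param: int = 2) -> int:
    """Memory footprint of a vocab_size x embed_dim embedding table."""
    return vocab_size * embed_dim * bytes_per_param


def embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Gather one row per token ID (a scattered read into the table)."""
    return [table[i] for i in token_ids]


small = embedding_table_bytes(32_000, 4096)    # assumed 32K vocabulary
large = embedding_table_bytes(200_000, 4096)   # assumed 200K vocabulary
print(small / 2 ** 20, large / 2 ** 20)        # sizes in MiB

toy_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
vectors = embed([2, 0, 2], toy_table)
print(vectors)
```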
Token IDs are integers; embedding lookups are scattered reads, one of the operations modern accelerators have to handle well. Vendor compilers spend real engineering effort on making embedding gathers fast. The KV cache is the dominant inference-time memory cost and grows with token count. Architectural choices about grouped-query, multi-query, or full multi-head attention are largely choices about how aggressively to compress the KV cache so longer sequences fit. Those treatments come later in the course; here the point is that tokenisation choice and KV cache architecture both push on the same constraint: how much memory does each token in the sequence eat.
Fragmented rare words. A novel name or technical term that breaks into many tokens has poorer representational coherence than a name that's a single token. The model has to compose meaning across multiple positions; the resulting embeddings can be unstable.
Arithmetic instability. Numbers tokenised inconsistently across lengths produce arithmetic that works reliably only at the lengths whose tokenisation the model saw enough of in training.
Token-boundary weirdness. Asking a model to reverse a string, count letters, or do letter-level edits stumbles because the model can't see individual characters; it sees only token IDs. "How many R's are in strawberry?" is hard because the model has no direct character access.
Prompt-injection edge cases. Some attacks rely on tokenisation quirks: an input crafted so an instruction gets broken across token boundaries can sometimes bypass a filter that checks at the string level but not at the token level. Tokenisation is part of the security surface.
Multilingual collapse. A model trained primarily on English-tokenised text performs visibly worse in languages whose tokenisation in that tokeniser is inefficient. The training run spent most of its capacity on token sequences whose statistics don't match the deployment text.
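The token-boundary failure in particular is easy to see once the input is written the way the model receives it. In the sketch below the split of "strawberry" and the integer IDs are hypothetical; a real tokeniser may split the word differently.

```python
# Hypothetical pieces and IDs, invented for this example.
PIECE_TO_ID = {"str": 101, "aw": 102, "berry": 103}
ID_TO_PIECE = {i: p for p, i in PIECE_TO_ID.items()}


def encode_toy(pieces: list[str]) -> list[int]:
    """Map subword pieces to their integer IDs."""
    return [PIECE_TO_ID[p] for p in pieces]


def decode_toy(ids: list[int]) -> str:
    """Map integer IDs back to text via the vocabulary lookup table."""
    return "".join(ID_TO_PIECE[i] for i in ids)


token_ids = encode_toy(["str", "aw", "berry"])
print(token_ids)  # what the model sees: three integers, no letters

# Counting letters needs the decoded string, which the model never receives directly.
letter_count = decode_toy(token_ids).count("r")
print(letter_count)
```

A program with access to the vocabulary table can decode and count trivially. The model has to have learned, from training data alone, which letters sit behind each ID.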
Figure 8.1 makes the trade-offs visible. The top panel shows the same sentence under the three tokenisation schemes side by side, with token counts. The middle panel shows the Zipfian frequency curve and how rare items fragment. The bottom panel shows what happens to compute and memory as the sequence length grows.
In L1's loop, tokens are the form of the input the system actually computes on. In L2's terms, the tokeniser is a compression of natural language into a code where common items get short representations. In L3's terms, generalisation across the long tail depends on whether the tokeniser's subwords cover the variation in deployment text. In L4's terms, scaling shifts which capabilities are achievable for a given token budget. In L5's terms, the tokeniser plus the next-token loss is the entire learning signal of self-supervised LLM training. In L6's terms, the sequence of generated tokens is the trajectory the model produces. In L7's terms, tokenisation is the first layer of representation, before any learned embedding.
LLMs do not operate on words. They operate on sequences of token IDs from a fixed vocabulary. The tokeniser shapes what's easy to learn, what's expensive to serve, where the model is fluent, and where it quietly fails.
The spool of solder unspools in discrete segments and gets joined back into larger structures downstream. Same shape in the model: continuous-feeling language goes in, discrete units come out, the model computes on the units, and the continuity reappears only at the surface.
Lesson 9 sits at the compass on the bench (station 9), where token IDs become dense vectors and direction in high-dimensional space starts carrying operational meaning.