Build Track · B2

Tokenizer explorer

Build a minimal byte-pair-encoding tokenizer by hand, the same idea L8 describes, and watch text turn into tokens. You train a merge table on a small text sample, encode and decode with it, and tabulate characters-per-token across English, a non-English language, and code. You write the tokenizing code yourself.

Phase: 1, after L8 Time: ~2.5 to 3 hours Tier: 1 (any laptop, CPU) Tooling: pure Python, standard library only Status: optional (depth-by-choice)

where this sits B2 is the Build Track's second hands-on build, attached to L8 (Tokens and discrete representation). It is optional and not required to continue the course. Do it now, do it later, or skip it and carry on with the lessons.

before you start B2 is pure Python from the standard library, so there is nothing to install for the main build. If running a Python script is new, the Tooling references cover it: Python setup, Running Python, and Reading errors. New to Python itself but you program in another language? Read Python basics once and keep the Python cheatsheet open. You only need Installing packages if you take an optional extension (plotting, or comparing against a production tokenizer). They are optional setup help, not part of the build.

Summary

You build a small byte-pair-encoding (BPE) tokenizer by hand and use it to watch text become tokens. You train a merge table on a small text sample by repeatedly counting adjacent symbol pairs and merging the most frequent, then encode and decode text with it. Then you explore: tokenize English, a non-English language, and some code, and tabulate token count and characters-per-token for each. The point is to see, with no library doing the work, that a tokenizer is a frequency-driven merge table, that tokens are subwords rather than words, and that a tokenizer trained on one kind of text over-segments everything else.

Learning goals

See text become a list of integer token IDs, the model's actual input unit.
Watch BPE build subwords: count adjacent pairs, merge the most frequent, repeat.
Encode and decode with your own tokenizer and confirm it round-trips.
Measure characters-per-token and watch it rise as the vocabulary grows.
See the non-English penalty as a property of the training corpus, not the language, by tokenizing other-language text with your English-trained tokenizer.

Prerequisites

L8 read, so byte-pair encoding and quantising input into discrete units are familiar.
Comfort with Python strings, lists, and dicts (see the Tooling references linked above). No prior NLP needed. Depth-by-choice: skipping this does not block conceptual progress.

Estimated time

About 2.5 to 3 hours: roughly 1 hour on the train loop (pair counting and merging), 1 hour on encode and decode and the round-trip, and the rest on the exploration table. If it runs well past that, stop and re-read the L8 section on byte-pair encoding.

Deliverables

tokenizer.py: the BPE train loop, encode, decode, and the reporting that tabulates token count and characters-per-token across several sample texts.
One report artefact: a printed or saved table of characters-per-token per sample, plus the list of merges learned.
README.md: what you built, the merge rule in your own words, the gotchas you hit, and the numbers you saw (vocabulary size, characters-per-token for English versus non-English).

Suggested file structure

builds/B2/
  tokenizer.py      # train (pair counts + merges) + encode + decode + report
  sample.txt        # a few KB of text to train on (your own notes are fine)
  report.txt        # the characters-per-token table (or print to console)
  README.md         # what you built, the merge rule, gotchas, numbers

One file is plenty. A separate sample.txt keeps the corpus out of the code.

Step-by-step instructions

Get a small corpus. A few kilobytes of plain text (your own notes, a public-domain paragraph). Read it with open(path, encoding="utf-8") so non-English characters load cleanly.
Start at characters. Represent the text as a list of single-character symbols. This is your most granular tokenization and the baseline to beat.
Count adjacent pairs. Write a function that returns a count for every adjacent pair of symbols in the list.
Merge the most frequent pair. Pick the highest-count pair, define a new symbol that is the two joined, and rewrite the list replacing every adjacent occurrence with that new symbol.
Repeat to train. Loop the count-and-merge steps a fixed number of times (say 50 to 300 merges), recording each merge in order. The recorded merges are your learned tokenizer.
Encode. To tokenize new text, start from characters and apply the recorded merges in the order they were learned. Map the resulting symbols to integer IDs with a vocabulary dict.
Decode. Map IDs back to symbols and concatenate. Confirm decode(encode(text)) == text.
Explore. Tokenize several samples (English, a non-English language, a snippet of code). For each, print the character count, the token count, and characters-per-token.
Read the result. Confirm characters-per-token rises as you add merges, and is lower for the non-English sample. Look at the merge list and find recognisable subwords forming.
Write the README. Explain the merge rule and why the non-English text costs more tokens.

Starter skeleton

Two functions carry the mechanism and are left for you to write. Everything else is scaffolding. Writing get_pairs and merge yourself is the milestone.

# B2: a minimal byte-pair-encoding tokenizer, by hand

def get_pairs(tokens):
    # tokens: a list of current symbols (start as single characters)
    # return {(a, b): count} over every ADJACENT pair
    # TODO (you write this): count pairs with zip(tokens, tokens[1:])
    ...

def merge(tokens, pair):
    # replace every adjacent occurrence of `pair` with the joined symbol
    # TODO (you write this): build a new list, joining pair[0] + pair[1]
    ...

def train(text, num_merges):
    tokens = list(text)                   # start at characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pairs(tokens)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        tokens = merge(tokens, best)
        merges.append(best)
    return merges

def encode(text, merges):
    tokens = list(text)
    for pair in merges:                   # apply merges in the order learned
        tokens = merge(tokens, pair)
    return tokens                         # map to integer IDs via a vocab dict

# build a vocab so tokens become integer IDs, and decoding can reverse it
def build_vocab(merges, text):
    symbols = sorted(set(list(text)) | {a + b for a, b in merges})
    stoi = {s: i for i, s in enumerate(symbols)}
    itos = {i: s for s, i in stoi.items()}
    return stoi, itos

# report: token count and characters-per-token for several samples
SAMPLES = {
    "english": "the cat sat on the mat ...",
    "non_english": "le chat ... (a non-English sample)",
    "code": "def f(x): return x * 2",
}
# for each sample: tokens = encode(text, merges); print len(text), len(tokens), len(text)/len(tokens)

Train on your corpus, then explore:

with open("sample.txt", encoding="utf-8") as f:
    corpus = f.read()
merges = train(corpus, num_merges=200)

Expected output

After training, a report across the samples looks roughly like this. Your exact numbers, merges, and samples will differ with the corpus and merge count; this is one plausible result, not a target to match:

sample        chars   tokens   chars/token
english        420      150        2.8
non_english    300      260        1.2
code           260      170        1.5

The merge list shows common fragments forming, for example a space-led " t", then " th", then " the". Characters-per-token starts near 1.0 (pure characters) and rises as merges accumulate, with diminishing gains. The non-English sample stays nearer 1.0, because the English-trained merges rarely apply to it: the penalty, made visible. And decode(encode(text)) returns the original text exactly.

Validation criteria

Assess against the Build Track Validation Standard. The bar is understanding, not a converged number.

COMPLETE The tokenizer trains, decode(encode(text)) == text, the report shows characters-per-token rising with more merges and lower for the non-English sample, and you can explain the merge rule (count pairs, merge the most frequent, repeat) and why an English-trained tokenizer over-segments other languages. The train loop, encode, and decode are your own code.

RUNS-NOT-UNDERSTOOD It runs, but you cannot yet explain why merging the most frequent pair builds useful subwords, why merges must be applied in training order at encode time, or why the non-English penalty is about the corpus rather than the language. Re-read L8 and trace one merge by hand. Do not mark COMPLETE.

TOOL-LOCKED It works only because a production tokenizer library (such as tiktoken or sentencepiece) did the tokenizing. The milestone is to build the mechanism, so reframe it to your own BPE before marking complete. A library belongs only in the optional comparison.

INCOMPLETE Unfinished, or the round-trip fails and the bug is not found yet. A valid resting state for a depth-by-choice track. Come back to it.

Common pitfalls

These are conceptual traps, distinct from code symptoms.

Encode order. At encode time, merges must be applied in the order they were learned, not by re-picking the most frequent each time. A different order gives different tokens.
Token is not word. Tokens are subwords and often span or split words. Expecting word-level tokens misreads the output.
The penalty is about the corpus, not the language. Your tokenizer penalises whatever it was not trained on. Train on code instead and English would pay the penalty.
More merges always better. Gains diminish; the vocabulary grows while characters-per-token flattens. That curve is part of the lesson.
Reversibility needs a kept mapping. Decoding only works if the symbol-to-id mapping stays consistent with training.

Troubleshooting

These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.

Symptom	Likely cause
`decode(encode(x)) != x`	Merges applied inconsistently, or the id-to-symbol map is out of sync with training. Print the token list at each stage.
characters-per-token stuck near 1.0	Too few merges, a corpus too small to have frequent pairs, or `get_pairs` not counting adjacency correctly.
non-English text shows mojibake or errors	The file was not read as UTF-8. Use `open(path, encoding="utf-8")` and print `repr(text)` to inspect.
`max(pairs, key=pairs.get)` raises	The pair dict is empty (a single-symbol sequence). Guard for the no-pairs case before taking the max.
the sequence gets corrupted on merge	Walking the list with a moving index and mutating in place. Build a fresh list instead.
`KeyError` during encode	A character never seen in training. Decide a fallback (keep unknown characters as themselves).

Optional extensions

Byte-level start. Encode the text to UTF-8 bytes first, so any input is representable, and watch how the penalty shifts.
Compare to a production tokenizer. Run the same samples through a library such as tiktoken or sentencepiece and tabulate token counts beside yours. This is the only place a library appears, and it stays optional and vendor-neutral. It needs a package install (see Installing packages).
Pre-tokenize on whitespace and punctuation before BPE and observe cleaner merges.
Plot characters-per-token against number of merges (matplotlib) and watch diminishing returns.
Train two tokenizers on two corpora (English and code) and tokenize a mixed text with each.

why this build exists L8 says messy input has to be quantised into discrete units, and that byte-pair encoding is how. B2 makes that physical: you watch characters merge into subwords by frequency, watch text become a list of integer IDs, and watch a tokenizer trained on one distribution over-segment another. It dissolves the "the model reads text" magic, because the model reads token IDs your tokenizer produced, and the tokenizer is a frequency-driven merge table you built by hand. It is also the build that supplies the unit every later build consumes: embeddings index these IDs, and the transformer is fed them.

← L8 · Tokens Syllabus Continue to L9 →