Build a minimal byte-pair-encoding tokenizer by hand, the same idea L8 describes, and watch text turn into tokens. You train a merge table on a small text sample, encode and decode with it, and tabulate characters-per-token across English, a non-English language, and code. You write the tokenizing code yourself.
You build a small byte-pair-encoding (BPE) tokenizer by hand and use it to watch text become tokens. You train a merge table on a small text sample by repeatedly counting adjacent symbol pairs and merging the most frequent, then encode and decode text with it. Then you explore: tokenize English, a non-English language, and some code, and tabulate token count and characters-per-token for each. The point is to see, with no library doing the work, that a tokenizer is a frequency-driven merge table, that tokens are subwords rather than words, and that a tokenizer trained on one kind of text over-segments everything else.
About 2.5 to 3 hours: roughly 1 hour on the train loop (pair counting and merging), 1 hour on encode and decode and the round-trip, and the rest on the exploration table. If it runs well past that, stop and re-read the L8 section on byte-pair encoding.
tokenizer.py: the BPE train loop, encode, decode, and the reporting that tabulates token count and characters-per-token across several sample texts.README.md: what you built, the merge rule in your own words, the gotchas you hit, and the numbers you saw (vocabulary size, characters-per-token for English versus non-English).builds/B2/
tokenizer.py # train (pair counts + merges) + encode + decode + report
sample.txt # a few KB of text to train on (your own notes are fine)
report.txt # the characters-per-token table (or print to console)
README.md # what you built, the merge rule, gotchas, numbers
One file is plenty. A separate sample.txt keeps the corpus out of the code.
open(path, encoding="utf-8") so non-English characters load cleanly.decode(encode(text)) == text.Two functions carry the mechanism and are left for you to write. Everything else is scaffolding. Writing get_pairs and merge yourself is the milestone.
# B2: a minimal byte-pair-encoding tokenizer, by hand
def get_pairs(tokens):
# tokens: a list of current symbols (start as single characters)
# return {(a, b): count} over every ADJACENT pair
# TODO (you write this): count pairs with zip(tokens, tokens[1:])
...
def merge(tokens, pair):
# replace every adjacent occurrence of `pair` with the joined symbol
# TODO (you write this): build a new list, joining pair[0] + pair[1]
...
def train(text, num_merges):
tokens = list(text) # start at characters
merges = []
for _ in range(num_merges):
pairs = get_pairs(tokens)
if not pairs:
break
best = max(pairs, key=pairs.get) # most frequent adjacent pair
tokens = merge(tokens, best)
merges.append(best)
return merges
def encode(text, merges):
tokens = list(text)
for pair in merges: # apply merges in the order learned
tokens = merge(tokens, pair)
return tokens # map to integer IDs via a vocab dict
# build a vocab so tokens become integer IDs, and decoding can reverse it
def build_vocab(merges, text):
symbols = sorted(set(list(text)) | {a + b for a, b in merges})
stoi = {s: i for i, s in enumerate(symbols)}
itos = {i: s for s, i in stoi.items()}
return stoi, itos
# report: token count and characters-per-token for several samples
SAMPLES = {
"english": "the cat sat on the mat ...",
"non_english": "le chat ... (a non-English sample)",
"code": "def f(x): return x * 2",
}
# for each sample: tokens = encode(text, merges); print len(text), len(tokens), len(text)/len(tokens)
Train on your corpus, then explore:
with open("sample.txt", encoding="utf-8") as f:
corpus = f.read()
merges = train(corpus, num_merges=200)
After training, a report across the samples looks roughly like this. Your exact numbers, merges, and samples will differ with the corpus and merge count; this is one plausible result, not a target to match:
sample chars tokens chars/token
english 420 150 2.8
non_english 300 260 1.2
code 260 170 1.5
The merge list shows common fragments forming, for example a space-led " t", then " th", then " the". Characters-per-token starts near 1.0 (pure characters) and rises as merges accumulate, with diminishing gains. The non-English sample stays nearer 1.0, because the English-trained merges rarely apply to it: the penalty, made visible. And decode(encode(text)) returns the original text exactly.
Assess against the Build Track Validation Standard. The bar is understanding, not a converged number.
decode(encode(text)) == text, the report shows characters-per-token rising with more merges and lower for the non-English sample, and you can explain the merge rule (count pairs, merge the most frequent, repeat) and why an English-trained tokenizer over-segments other languages. The train loop, encode, and decode are your own code.
These are conceptual traps, distinct from code symptoms.
These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.
| Symptom | Likely cause |
|---|---|
decode(encode(x)) != x | Merges applied inconsistently, or the id-to-symbol map is out of sync with training. Print the token list at each stage. |
| characters-per-token stuck near 1.0 | Too few merges, a corpus too small to have frequent pairs, or get_pairs not counting adjacency correctly. |
| non-English text shows mojibake or errors | The file was not read as UTF-8. Use open(path, encoding="utf-8") and print repr(text) to inspect. |
max(pairs, key=pairs.get) raises | The pair dict is empty (a single-symbol sequence). Guard for the no-pairs case before taking the max. |
| the sequence gets corrupted on merge | Walking the list with a moving index and mutating in place. Build a fresh list instead. |
KeyError during encode | A character never seen in training. Decide a fallback (keep unknown characters as themselves). |