Build Track · B3

Vector playground

Load a set of pretrained word vectors as plain data, then build the geometry on them by hand: cosine similarity, nearest neighbours, analogy arithmetic, and a 2D scatter you can look at. The vectors are the input. The operations are yours. By the end, "meaning as direction" from L9 is a matrix of floats and a handful of dot products.

Phase: 2, after L12 · Time: ~2 to 2.5 hours · Tier: 1 (any laptop, CPU) · Tooling: numpy and matplotlib · Status: optional (depth-by-choice)

Where this sits. B3 attaches to L12 (Distance, similarity, and semantic geometry) because the mechanism-first version needs vectors (L11) plus the dot product and cosine similarity (L12). It is optional: you can continue without it. The lesson it makes physical is L9 (embeddings as direction); the maths it leans on is L11 and L12.

Before you start. Unlike B2, B3 needs packages: Python with numpy and matplotlib. If you do not already have them, see Installing packages. New to running a script? Python setup and Running Python cover it, and Reading errors helps when something throws. The numpy and plotting one-liners you need are in Python basics and the Python cheatsheet.

The one input you need. A plain-text word-vector file with one word per line, followed by that word's numbers: `word 0.12 -0.44 0.91 ...`. GloVe or fastText files are fine examples, not requirements. Any local vector file in that shape works, so if you already have one you can complete B3 with no download and no network. Cap loading to a manageable number of rows (say 5,000 to 50,000) so brute-force search stays fast.
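To make that shape concrete, here is a minimal sketch that parses three lines into a word list and a small matrix. The words and numbers are invented toy values, not taken from any real GloVe or fastText file, and `parse_line` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

# Toy lines in the "word num num num ..." shape. The values are invented
# for illustration and do not come from any real vectors file.
toy_lines = [
    "cat 0.12 -0.44 0.91",
    "dog 0.10 -0.40 0.88",
    "car -0.75 0.30 0.05",
]

def parse_line(line):
    """Split one line into (word, vector). Assumes whitespace-separated fields."""
    parts = line.split()
    return parts[0], np.array([float(x) for x in parts[1:]])

words, vectors = zip(*(parse_line(line) for line in toy_lines))
M = np.vstack(vectors)   # one row per word

print(words)     # ('cat', 'dog', 'car')
print(M.shape)   # (3, 3)
```

Each word owns one row of `M`, which is the whole data model B3 works with.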

Summary

You load pretrained word vectors as data and build the embedding operations by hand in numpy. Write cosine similarity, write brute-force nearest-neighbour search, reproduce analogy arithmetic, and project the vectors to 2D to look at the clusters. You do not train embeddings here (that is later in the course) and you do not call a library to do the similarity or the search. The vectors are someone else's learned geometry; the point of B3 is to operate on that geometry yourself and watch semantic structure fall out of pure arithmetic.

Learning goals

Feel that an embedding is an N by D matrix of floats, and a word's "meaning" is one row.
Compute cosine similarity by hand and see why it, not raw Euclidean distance, is the workhorse in high dimensions.
Build brute-force nearest-neighbour search and watch semantic neighbours appear from geometry alone.
Reproduce analogy arithmetic (a minus b plus c) and see where it holds and where it does not.
Project high-dimensional vectors to 2D, read the clusters, and understand the projection is lossy.

Prerequisites

L9 (embeddings as direction), L11 (vectors: direction and magnitude), L12 (dot product, cosine, Euclidean distance). That prerequisite list is why B3 unlocks after L12, not after L9.
B1 and B2 for numpy and file-reading comfort. Depth-by-choice: skipping this does not block conceptual progress.

Estimated time

About 2 to 2.5 hours for the core: roughly half an hour loading and inspecting the vectors, an hour on cosine and nearest neighbours, the rest on analogy and the 2D scatter. The by-hand PCA or k-means extensions push it past 3 hours.

Deliverables

vectors.py: load plus the geometry operations, plus a short exploration that prints neighbour lists and similarity numbers.
One plot artefact: a 2D scatter with a few labelled clusters.
README.md: what you built, the cosine formula in your own words, the neighbour lists you got, and where analogy arithmetic broke.

Suggested file structure

builds/B3/
  vectors.py        # load + cosine + nearest_neighbours + analogy + project-to-2D
  glove.txt         # the vectors as data (any GloVe/fastText-style file, or a slice)
  clusters.png      # the 2D scatter
  README.md         # formula, neighbour lists, where analogy held or failed

One script is plenty. Keep the vectors file separate so the code stays generic to "load an N by D matrix plus labels".

Step-by-step instructions

Get a vectors file. Any plain word-vector file (one word then its numbers per line). Read it with encoding="utf-8" into a numpy matrix plus a word-to-row-index dict. Cap the rows so brute force stays fast.
Inspect it. Print the matrix shape and a couple of rows to confirm you have an N by D array of floats. Pick two words you expect to be similar and two you expect to be unrelated, so you have pairs to try your cosine on once it is written.
Write cosine similarity. The dot product of two vectors divided by the product of their norms. Sanity-check that a vector with itself scores 1.0.
Write nearest neighbours. Score the query against every row, sort, return the top k. Run it for "king", "dog", "paris" and read the lists.
Analogy arithmetic. On top of the two primitives: king - man + woman, then nearest neighbours, excluding the input words when you read the result.
Project to 2D. Scatter-plot a few dozen words from 3 or 4 obvious categories (animals, cities, verbs). Label the points.
Compare metrics. Rank the same queries by cosine and by raw Euclidean distance. Note where they agree and disagree.
Look at the distribution. Histogram pairwise cosine similarities across a sample. Connect the shape to L9's failure modes (collapse, hubness).
Write the README. Explain why cosine normalises out magnitude and why a 2D picture of 300-dimensional data is lossy.
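For the "compare metrics" step, a toy case shows how the two rankings can disagree. This is a minimal sketch on invented 2D vectors, not on real word vectors. The inline `cos` and `euclid` helpers exist only for this comparison; the step itself is meant to run on your own cosine_similarity from vectors.py.

```python
import numpy as np

# Toy 2D vectors, invented for illustration.
query = np.array([1.0, 0.0])
same_direction_long = np.array([10.0, 0.0])     # same direction, 10x the length
nearby_other_direction = np.array([1.0, 1.0])   # close in space, 45 degrees off

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclid(a, b):
    return float(np.linalg.norm(a - b))

candidates = {
    "same_direction_long": same_direction_long,
    "nearby_other_direction": nearby_other_direction,
}
for name, v in candidates.items():
    print(f"{name:24s} cosine={cos(query, v):.3f}  euclidean={euclid(query, v):.3f}")
```

Cosine prefers the long vector because it points the same way as the query. Euclidean prefers the short one because it sits closer in space. That disagreement is what to look for in your real rankings.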

Starter skeleton

Two functions carry the geometry and are left for you to write. Everything else is scaffolding. Writing cosine_similarity and nearest_neighbours yourself is the milestone.

import numpy as np

def load_vectors(path, max_rows=20000):
    # treat the file as plain data: each line is  word num num num ...
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
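        for line in f:
            if len(rows) >= max_rows:       # cap counts real vector rows only
                break
            parts = line.split()
            if len(parts) <= 2:
                continue                    # blank line, or a "count dim" header (fastText .vec style)
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:]])
    M = np.array(rows)                      # shape (N, D)
    index = {w: i for i, w in enumerate(words)}   # word -> row number in M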
    return M, words, index

def cosine_similarity(a, b):
    # TODO (you write this): dot(a, b) / (norm(a) * norm(b))
    ...

def nearest_neighbours(query, M, k=5):
    # query: a single (D,) vector. M: the (N, D) matrix.
    # TODO (you write this): score query against every row of M, return the top-k row indices
    ...

# --- scaffolding below: the analogy wrapper and the 2D projection ---

def analogy(a, b, c, M, words, index, k=5):
    v = M[index[a]] - M[index[b]] + M[index[c]]     # king - man + woman
    return nearest_neighbours(v, M, k + 3)          # 3 spare slots: a, b, c usually rank high; drop them when you read it

def project_2d(M):
    Mc = M - M.mean(axis=0)                          # center first
    U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:2].T                             # top-2 components, shape (N, 2)
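The project_2d scaffolding returns coordinates, not a picture. One way to turn them into the labelled scatter is sketched below, assuming matplotlib is installed. `plot_labelled` and the word list in the usage comment are hypothetical names chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")            # render straight to a file; no display needed
import matplotlib.pyplot as plt
import numpy as np

def plot_labelled(points, labels, out_path="clusters.png"):
    """Scatter (N, 2) points and write each label next to its point."""
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(points[:, 0], points[:, 1], s=12)
    for (x, y), label in zip(points, labels):
        ax.annotate(label, (x, y), fontsize=8)
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path

# Usage with the skeleton's names (the word list is a made-up example):
# chosen = ["cat", "dog", "paris", "london", "run", "walk"]
# sub = M[[index[w] for w in chosen]]          # project only the rows you plot
# plot_labelled(project_2d(sub), chosen)
```

Projecting only the chosen rows keeps the SVD small, but the axes then describe that subset rather than the whole vocabulary. Projecting all of M and plotting a subset is the other option.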

Then explore. Reading these lists is most of the learning:

[words[i] for i in nearest_neighbours(M[index["king"]], M, k=5)]
# -> king first (the query matches itself), then e.g. queen, prince, monarch, throne
#    (example; depends on your file)

[words[i] for i in analogy("king", "man", "woman", M, words, index)]
# -> queen near the top (often not rank 1, which is the honest result)
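Since nearest_neighbours returns row indices, a small reader helper saves repeating the lookup and handles the "exclude the input words" rule in one place. A minimal sketch follows; `read_neighbours` is a hypothetical helper name, and the toy ranking below is invented so the snippet runs on its own.

```python
def read_neighbours(indices, words, exclude=()):
    """Map row indices to words, dropping any words in `exclude`."""
    skip = set(exclude)
    return [words[i] for i in indices if words[i] not in skip]

# Toy stand-ins so the helper runs on its own; in vectors.py you would pass
# the real `words` list and the indices returned by nearest_neighbours.
toy_words = ["king", "man", "woman", "queen", "prince"]
toy_indices = [0, 3, 2, 4, 1]   # pretend ranking, best first

print(read_neighbours(toy_indices, toy_words, exclude=("king", "man", "woman")))
# ['queen', 'prince']
```

Excluding words shortens the list, so request a few more neighbours than you want to read.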

Expected output

Your exact words and numbers depend on the vectors file; these are illustrative, not targets:

Neighbours. nearest_neighbours for "king" returns royalty and noble words near the top; for "dog" it returns animals; for "paris" it returns cities and France-related words.
Analogy. king - man + woman puts "queen" in the top few, often not at rank 1. That honest result is the point.
Scatter. The 2D plot shows visible category blobs, with some overlap where the projection lost structure.
Metrics. Cosine and Euclidean rankings mostly agree but diverge for words whose vectors have unusually large or small norms.
Histogram. Pairwise similarities spread out rather than spiking at 1.0. A spike would signal embedding collapse.
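For the histogram, one possible approach is to sample rows, normalise them, and read the upper triangle of the similarity matrix. The sketch below uses random Gaussian rows as stand-in data, not real word vectors, so its shape says nothing about your file. `pairwise_cosine_sample` is a hypothetical helper name.

```python
import numpy as np

def pairwise_cosine_sample(M, sample_size=500, seed=0):
    """Cosine similarity for every distinct pair in a random sample of rows."""
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(M))
    rows = M[rng.choice(len(M), size=n, replace=False)]
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)   # assumes no all-zero rows
    sims = unit @ unit.T
    upper = np.triu_indices(n, k=1)      # skip the diagonal and mirrored pairs
    return sims[upper]

# Stand-in data: random Gaussian rows, NOT real word vectors.
M_demo = np.random.default_rng(1).normal(size=(200, 50))
sims = pairwise_cosine_sample(M_demo, sample_size=100)

# A text histogram keeps this numpy-only.
counts, edges = np.histogram(sims, bins=10, range=(-1.0, 1.0))
for count, left in zip(counts, edges[:-1]):
    bar = "#" * int(count * 60 // max(int(counts.max()), 1))
    print(f"{left:+.1f} {bar}")
```

On your real matrix, pass M in place of `M_demo`. The same `sims` array can go to matplotlib's `plt.hist` if you want the plot as an image.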

Validation criteria

Assess against the Build Track Validation Standard. The bar is understanding, not a leaderboard score.

COMPLETE The two primitives are your own code, neighbour lists are sensible, analogy returns a plausible top-k, the scatter shows clusters, and you can explain why cosine normalises out magnitude and why a 2D view of high-dimensional data is lossy.

RUNS-NOT-UNDERSTOOD It runs, but you cannot yet say why cosine beats Euclidean here, or you read the 2D clusters as exact truth. Re-read L12 and trace one cosine score by hand. Do not mark COMPLETE.

TOOL-LOCKED The similarity, the neighbour search, or the projection came from sklearn or faiss rather than your own numpy. The milestone is to build the geometry, so rewrite it in your own code before marking complete. Those libraries belong only in the optional extensions.

INCOMPLETE Unfinished, or the neighbour lists are nonsense and the bug is not found yet. A valid resting state for a depth-by-choice track. Come back to it.

Common pitfalls

These are conceptual traps, distinct from code symptoms.

Forgetting to normalise. Skip the norms and "cosine" is just a dot product, so frequent, large-magnitude words dominate every neighbour list.
Over-reading the 2D plot. The scatter threw away almost every dimension. Treat clusters as a hint, not ground truth.
Expecting analogy at rank 1. Directions are approximate. "Queen" in the top five is a good result.
Conflating cosine and Euclidean. They rank differently when magnitudes vary. Knowing when each applies is the lesson.
Thinking you trained this. The embedding is pretrained data. Its geometry came from someone else's objective, not from anything you did in B3.
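To put a number on "the scatter threw away almost every dimension", you can measure how much variance the top two components keep. A minimal sketch, assuming the same center-then-SVD approach as the project_2d scaffolding; `variance_kept` is a hypothetical helper name and the demo matrix is random stand-in data, not real word vectors.

```python
import numpy as np

def variance_kept(M, n_components=2):
    """Fraction of total variance captured by the top principal components."""
    Mc = M - M.mean(axis=0)                    # center, as project_2d does
    S = np.linalg.svd(Mc, compute_uv=False)    # singular values, largest first
    var = S ** 2                               # variance along each component (up to a constant)
    return float(var[:n_components].sum() / var.sum())

# Stand-in data: random Gaussian rows, NOT real word vectors.
M_demo = np.random.default_rng(0).normal(size=(500, 300))
print(f"top-2 components keep {variance_kept(M_demo):.1%} of the variance")
```

Run it on your own M and quote the figure in the README next to the scatter. A small fraction means the plot is a rough hint, which is what this pitfall is about.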

Troubleshooting

These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.

Symptom	Likely cause
self-similarity is not 1.0	The norm is in the wrong place. Check `dot(a, a) / (norm(a) * norm(a))` equals 1.0.
neighbour lists are full of rare junk	The file has a header line or punctuation rows. Skip the header, or filter to alphabetic words.
`KeyError` on a query word	The word is not in your capped vocabulary. Check membership in the index dict first.
every similarity is near 1.0	The vectors were not mean-centered before the distribution check, or you loaded duplicated rows. Plot a histogram to see.
the scatter is one blob	You projected before centering, or plotted the wrong two columns. Center, then take the top-2 components.
neighbour search is slow	A Python loop over rows. Vectorise it: score all rows at once with `M @ q`.
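Several of these symptoms trace back to the cosine function, so it helps to check it against properties any correct cosine must satisfy. The sketch below tests a function you pass in and does not supply one. `check_cosine` is a hypothetical helper name, and `unnormalised` is a deliberately broken stand-in used to show a failing run.

```python
import numpy as np

def check_cosine(cosine_fn, tol=1e-9):
    """Return the names of the properties a candidate cosine function fails."""
    a = np.array([3.0, 4.0])
    b = np.array([-4.0, 3.0])            # perpendicular to a
    checks = {
        "self-similarity is 1": abs(cosine_fn(a, a) - 1.0) < tol,
        "scaling a vector changes nothing": abs(cosine_fn(a, 5 * a) - 1.0) < tol,
        "perpendicular vectors score 0": abs(cosine_fn(a, b)) < tol,
        "opposite vectors score -1": abs(cosine_fn(a, -a) + 1.0) < tol,
    }
    return [name for name, ok in checks.items() if not ok]

# A deliberately broken stand-in: a bare dot product with the norms forgotten.
def unnormalised(a, b):
    return float(a @ b)

print(check_cosine(unnormalised))
# ['self-similarity is 1', 'scaling a vector changes nothing', 'opposite vectors score -1']
```

Pass your own cosine_similarity instead. An empty list means all four properties hold.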

Optional extensions

PCA by hand. The projection scaffolding already uses an SVD. Write the PCA properly and explain what the top components capture.
k-means by hand. Assign each vector to its nearest centroid, recompute centroids, repeat. Pure numpy, no sklearn.
t-SNE for a nicer 2D picture. A black-box embedding of the embedding, so it stays optional. Use it to compare against your PCA scatter.
Sentence embeddings via a library such as sentence-transformers. The only place a heavy framework appears, clearly optional, and it needs an install (see Installing packages).
Semantic-search teaser. Embed a handful of sentences and retrieve by cosine. This is the bridge to B15 and L60; keep it a teaser, do not build out an index here.
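For the k-means extension, the assign, recompute, repeat loop can be sketched in a few lines of numpy. This is one possible version under stated assumptions: random rows as starting centroids, squared Euclidean distance, and a fixed iteration count rather than a convergence test. The function name, parameters, and demo blobs are all chosen for this illustration.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every row of X (squared Euclidean)."""
    # shape (N, k); fine for a few thousand rows, memory-hungry beyond that
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iters=20, seed=0):
    """Plain k-means: assign to nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # copies of random rows
    for _ in range(n_iters):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):             # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return assign(X, centroids), centroids

# Demo on two invented, well-separated blobs (not word vectors).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 5))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 5))
labels, centroids = kmeans(np.vstack([blob_a, blob_b]), k=2)
print(labels)
```

On word vectors you may want to normalise each row to unit length first, so the clustering follows direction rather than magnitude.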

Why this build exists. L9 says meaning becomes direction and retrieval becomes a geometry problem. L12 gives you the operations. B3 makes both literal: you hold the N by D matrix, write the dot product that scores similarity, and watch semantic neighbours appear from arithmetic alone. It turns "the model understands words" into "the model does dot products on a learned matrix", which is the most useful mental model for the rest of the course. It also sets the pattern the next builds reuse: load or generate data, operate on it by hand, plot it, and read what the plot says.
