Build Track · B3

Vector playground

Load a set of pretrained word vectors as plain data, then build the geometry on them by hand: cosine similarity, nearest neighbours, analogy arithmetic, and a 2D scatter you can look at. The vectors are the input. The operations are yours. By the end, "meaning as direction" from L9 is a matrix of floats and a handful of dot products.

Phase: 2, after L12 · Time: ~2 to 2.5 hours · Tier: 1 (any laptop, CPU) · Tooling: numpy and matplotlib · Status: optional (depth-by-choice)
Where this sits: B3 attaches to L12 (Distance, similarity, and semantic geometry), because the mechanism-first version needs vectors (L11), the dot product, and cosine similarity (L12). It is optional and not required to continue. The lesson it makes physical is L9 (embeddings as direction); the maths it leans on is L11 and L12.
Before you start: Unlike B2, B3 needs packages: Python with numpy and matplotlib. If you do not already have them, see Installing packages. New to running a script? Python setup and Running Python cover it, and Reading errors helps when something throws. The numpy and plotting one-liners you need are in Python basics and the Python cheatsheet.
The one input you need: A plain-text word-vector file, one word per line followed by its numbers: word 0.12 -0.44 0.91 .... GloVe or fastText files are fine examples, not requirements. Any local vector file in that shape works, so if you already have one you can complete B3 with no download and no network. Cap loading to a manageable number of rows (say 5,000 to 50,000) so brute-force search stays fast.
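If you want to see the file shape before touching a real download, here is a minimal sketch that writes a three-row toy file and reads it back. The words, the numbers, and the name toy_vectors.txt are made up for illustration; a real file has the same layout with many more rows and columns.

```python
import os
import tempfile

# Three made-up rows in the "word num num num ..." shape (values are illustrative only).
toy_text = "cat 0.12 -0.44 0.91\ndog 0.10 -0.40 0.88\ncar -0.70 0.33 0.05\n"

# Hypothetical file name, written to a temporary folder so nothing is left behind.
path = os.path.join(tempfile.mkdtemp(), "toy_vectors.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(toy_text)

words, rows = [], []
with open(path, encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        words.append(parts[0])                      # first token is the word
        rows.append([float(x) for x in parts[1:]])  # the rest are its numbers

print(words)         # the three words, in file order
print(len(rows[0]))  # how many numbers each word carries (D)
```

Every row should carry the same count of numbers. If one does not, that line is probably a header or a malformed entry.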

Summary

You load pretrained word vectors as data and build the embedding operations by hand in numpy. Write cosine similarity, write brute-force nearest-neighbour search, reproduce analogy arithmetic, and project the vectors to 2D to look at the clusters. You do not train embeddings here (that is later in the course) and you do not call a library to do the similarity or the search. The vectors are someone else's learned geometry; the point of B3 is to operate on that geometry yourself and watch semantic structure fall out of pure arithmetic.

Learning goals

Prerequisites

Estimated time

About 2 to 2.5 hours for the core: roughly half an hour loading and inspecting the vectors, an hour on cosine and nearest neighbours, the rest on analogy and the 2D scatter. The by-hand PCA or k-means extensions push it past 3 hours.

Deliverables

Suggested file structure

builds/B3/
  vectors.py        # load + cosine + nearest_neighbours + analogy + project-to-2D
  glove.txt         # the vectors as data (any GloVe/fastText-style file, or a slice)
  clusters.png      # the 2D scatter
  README.md         # formula, neighbour lists, where analogy held or failed

One script is plenty. Keep the vectors file separate so the code stays generic to "load an N by D matrix plus labels".

Step-by-step instructions

  1. Get a vectors file. Any plain word-vector file (one word then its numbers per line). Read it with encoding="utf-8" into a numpy matrix plus a word to row-index dict. Cap the rows so brute force stays fast.
  2. Inspect it. Pick two words you expect to be similar and two you do not. Print the matrix shape and a couple of rows to confirm you have an N by D array of floats.
  3. Write cosine similarity. The dot product of two vectors divided by the product of their norms. Sanity-check that a vector with itself scores 1.0.
  4. Write nearest neighbours. Score the query against every row, sort, return the top k. Run it for "king", "dog", "paris" and read the lists.
  5. Analogy arithmetic. Build it on top of the two primitives: compute king - man + woman, run nearest neighbours on the result, and exclude the input words when you read the list.
  6. Project to 2D. Scatter-plot a few dozen words from 3 or 4 obvious categories (animals, cities, verbs). Label the points.
  7. Compare metrics. Rank the same queries by cosine and by raw Euclidean distance. Note where they agree and disagree.
  8. Look at the distribution. Histogram pairwise cosine similarities across a sample. Connect the shape to L9's failure modes (collapse, hubness).
  9. Write the README. Explain why cosine normalises out magnitude and why a 2D picture of 300-dimensional data is lossy.
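For step 8, one possible shape for the distribution check is sketched below. It uses random vectors as a stand-in for a sample of your matrix, so the numbers mean nothing semantically; the point is the mechanics of collecting each pair once and binning the scores. The sample size, dimension, and bin count are arbitrary choices, not course requirements.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 50))   # stand-in for ~200 rows sampled from your matrix

# Scale each row to unit length, so a plain dot product between rows is their cosine.
unit = sample / np.linalg.norm(sample, axis=1, keepdims=True)
sims = unit @ unit.T                  # (200, 200) matrix of pairwise cosines

# Keep each pair once and drop the diagonal (every vector scores 1.0 with itself).
i, j = np.triu_indices(len(sample), k=1)
pair_sims = sims[i, j]

# Bin the scores and print a rough text histogram, one '#' per 100 pairs.
counts, edges = np.histogram(pair_sims, bins=20, range=(-1.0, 1.0))
for left_edge, count in zip(edges[:-1], counts):
    print(f"{left_edge:+.1f}  {'#' * int(count // 100)}")
```

Random vectors pile up around zero. On real word vectors the shape will differ, and a histogram crowded towards 1.0 is the kind of picture to connect to the collapse failure mode from L9.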

Starter skeleton

Two functions carry the geometry and are left for you to write. Everything else is scaffolding. Writing cosine_similarity and nearest_neighbours yourself is the milestone.

import numpy as np

def load_vectors(path, max_rows=20000):
    # treat the file as plain data: each line is  word num num num ...
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_rows:
                break
            parts = line.split()
            if len(parts) < 3:                      # blank line, or a "count dim" header (fastText style)
                continue
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:]])
    M = np.array(rows)                      # shape (N, D)
    index = {w: i for i, w in enumerate(words)}
    return M, words, index

def cosine_similarity(a, b):
    # TODO (you write this): dot(a, b) / (norm(a) * norm(b))
    ...

def nearest_neighbours(query, M, k=5):
    # query: a single (D,) vector. M: the (N, D) matrix.
    # TODO (you write this): score query against every row of M, return the top-k row indices
    ...

# --- scaffolding below: the analogy wrapper and the 2D projection ---

def analogy(a, b, c, M, words, index, k=5):
    v = M[index[a]] - M[index[b]] + M[index[c]]     # king - man + woman
    return nearest_neighbours(v, M, k)              # exclude a, b, c when you read it

def project_2d(M):
    Mc = M - M.mean(axis=0)                          # center first
    U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:2].T                             # top-2 components, shape (N, 2)

Then explore. Reading these lists is most of the learning:

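[words[i] for i in nearest_neighbours(M[index["king"]], M, k=5)]    # row indices -> words
# -> queen, prince, monarch, throne, kingdom   (example; depends on your file;
#    "king" itself usually ranks first unless your function skips the query row)

[words[i] for i in analogy("king", "man", "woman", M, words, index)]
# -> queen near the top (often not rank 1, which is the honest result)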
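To turn the projection into a labelled picture, and to put a number on how lossy it is, something like the following sketch may help. It runs on made-up clustered data rather than real word vectors, the labels w0, w1, ... are hypothetical placeholders for words, and the Agg backend is chosen only so the script saves a file without opening a window. The projection lines repeat what project_2d does so the block runs on its own.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                            # render to a file, no window needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in data: three made-up "categories" of 10 points each in 50 dimensions.
centres = rng.normal(size=(3, 50)) * 3.0
M_toy = np.vstack([c + rng.normal(size=(10, 50)) for c in centres])
labels = [f"w{i}" for i in range(len(M_toy))]    # hypothetical word labels

Mc = M_toy - M_toy.mean(axis=0)                  # center first, as in project_2d
U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
P = Mc @ Vt[:2].T                                # (30, 2) coordinates to plot

# Squared singular values are proportional to the variance along each component,
# so this ratio is the share of total variance the 2D view keeps.
kept = (S[:2] ** 2).sum() / (S ** 2).sum()
print(f"variance kept by the 2D view: {kept:.0%}")

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(P[:, 0], P[:, 1], s=12)
for (x, y), name in zip(P, labels):
    ax.annotate(name, (x, y), fontsize=8)        # label each point
fig.savefig("clusters.png", dpi=150)
```

The toy data is built to cluster cleanly, so its kept share is flattering. On real vectors with hundreds of dimensions the share is likely to be much smaller, which is the concrete sense in which the picture is lossy.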

Expected output

Your exact words and numbers depend on the vectors file; these are illustrative, not targets:

Validation criteria

Assess against the Build Track Validation Standard. The bar is understanding, not a leaderboard score.

COMPLETE: The two primitives are your own code, neighbour lists are sensible, analogy returns a plausible top-k, the scatter shows clusters, and you can explain why cosine normalises out magnitude and why a 2D view of high-dimensional data is lossy.
RUNS-NOT-UNDERSTOOD: It runs, but you cannot yet say why cosine beats Euclidean here, or you read the 2D clusters as exact truth. Re-read L12 and trace one cosine score by hand. Do not mark COMPLETE.
TOOL-LOCKED: The similarity, the neighbour search, or the projection came from sklearn or faiss rather than your numpy. The milestone is to build the geometry, so rewrite that part in your own code before marking complete. Those libraries belong only in the optional extensions.
INCOMPLETE: Unfinished, or the neighbour lists are nonsense and the bug is not found yet. A valid resting state for a depth-by-choice track. Come back to it.
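If "trace one cosine score by hand" feels abstract, here is a small worked case with numbers you can check on paper. The three vectors are invented for the purpose: b points the same way as a but is twice as long, and c has the same length as a but a slightly different direction. Treat it as a reference to compare your own function against, not as a replacement for writing it.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])    # same direction as a, twice the length
c = np.array([4.0, 3.0])    # same length as a, slightly different direction

def trace(u, v):
    """Return (cosine, Euclidean distance) for two vectors, step by step."""
    dot = float(u @ v)
    norm_u = float(np.linalg.norm(u))
    norm_v = float(np.linalg.norm(v))
    return dot / (norm_u * norm_v), float(np.linalg.norm(u - v))

cos_ab, euc_ab = trace(a, b)   # dot 50, norms 5 and 10: cosine 50/50, distance 5
cos_ac, euc_ac = trace(a, c)   # dot 24, norms 5 and 5:  cosine 24/25, distance sqrt(2)
print(cos_ab, euc_ab)
print(cos_ac, euc_ac)
```

Cosine ranks b as the closer match to a, because dividing by both norms cancels the difference in length. Euclidean distance ranks c closer, because it counts that length difference as distance.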

Common pitfalls

These are conceptual traps, distinct from code symptoms.

Troubleshooting

These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.

| Symptom | Likely cause |
| --- | --- |
| self-similarity is not 1.0 | The norm is in the wrong place. Check dot(a, a) / (norm(a) * norm(a)) equals 1.0. |
| neighbour lists are full of rare junk | The file has a header line or punctuation rows. Skip the header, or filter to alphabetic words. |
| KeyError on a query word | The word is not in your capped vocabulary. Check membership in the index dict first. |
| every similarity is near 1.0 | Vectors not centered for the distribution check, or you loaded duplicated rows. Print a histogram to see. |
| the scatter is one blob | You projected before centering, or plotted the wrong two columns. Center, then take the top-2 components. |
| neighbour search is slow | A Python loop over rows. Vectorise it: score all rows at once with M @ q. |
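On the slow-search symptom: the sketch below checks, on random stand-in data, that one matrix-vector product gives the same scores as a Python loop of separate dot products. The names M_toy and q and the sizes are placeholders, not part of the skeleton.

```python
import numpy as np

rng = np.random.default_rng(0)
M_toy = rng.normal(size=(1000, 50))   # stand-in for the (N, D) matrix
q = rng.normal(size=50)               # stand-in for one (D,) query vector

looped = np.array([np.dot(row, q) for row in M_toy])   # N separate dot products
vectorised = M_toy @ q                                   # all N at once, shape (N,)

print(np.allclose(looped, vectorised))   # the two routes should agree
```

These are raw dot products. Turning them into cosine scores still needs the norms, and that step stays yours to write.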

Optional extensions

Why this build exists: L9 says meaning becomes direction and retrieval becomes a geometry problem. L12 gives you the operations. B3 makes both literal: you hold the N by D matrix, write the dot product that scores similarity, and watch semantic neighbours appear from arithmetic alone. It turns "the model understands words" into "the model does dot products on a learned matrix", which is the most useful mental model for the rest of the course. It also sets the pattern the next builds reuse: load or generate data, operate on it by hand, plot it, and read what the plot says.