Load a set of pretrained word vectors as plain data, then build the geometry on them by hand: cosine similarity, nearest neighbours, analogy arithmetic, and a 2D scatter you can look at. The vectors are the input. The operations are yours. By the end, "meaning as direction" from L9 is a matrix of floats and a handful of dot products.
numpy and matplotlib. If you do not already have them, see Installing packages. New to running a script? Python setup and Running Python cover it, and Reading errors helps when something throws. The numpy and plotting one-liners you need are in Python basics and the Python cheatsheet.
word 0.12 -0.44 0.91 .... GloVe or fastText files are fine examples, not requirements. Any local vector file in that shape works, so if you already have one you can complete B3 with no download and no network. Cap loading to a manageable number of rows (say 5,000 to 50,000) so brute-force search stays fast.
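If the file you have is large, one option is to cut a smaller slice once and load that. The sketch below is a hedged illustration, not part of the build: `slice_vectors` is a hypothetical helper name. It assumes whitespace-separated rows with at least two dimensions, and it skips a leading two-integer header line of the kind fastText `.vec` files carry.

```python
import os
import tempfile


def slice_vectors(src, dst, n=20000):
    """Copy the first n data rows of src into dst; return how many were written."""
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if written >= n:
                break
            parts = line.split()
            # Assumption: a header, if present, is "row_count dimension" on the first line.
            if written == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            if len(parts) < 3:
                continue  # blank or malformed line (expects word + at least 2 numbers)
            fout.write(" ".join(parts) + "\n")
            written += 1
    return written


# Tiny demo on a throwaway file: one header line, then three 2-D rows.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "big.vec")
    dst = os.path.join(tmp, "glove.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("3 2\nking 0.1 0.2\nqueen 0.3 0.4\nman 0.5 0.6\n")
    kept = slice_vectors(src, dst, n=2)
    with open(dst, encoding="utf-8") as f:
        sliced = f.read().splitlines()
```

Vector files are often sorted by word frequency, so the first rows tend to be the common words; check that this holds for your file before relying on it.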
You load pretrained word vectors as data and build the embedding operations by hand in numpy. Write cosine similarity, write brute-force nearest-neighbour search, reproduce analogy arithmetic, and project the vectors to 2D to look at the clusters. You do not train embeddings here (that is later in the course) and you do not call a library to do the similarity or the search. The vectors are someone else's learned geometry; the point of B3 is to operate on that geometry yourself and watch semantic structure fall out of pure arithmetic.
About 2 to 2.5 hours for the core: roughly half an hour loading and inspecting the vectors, an hour on cosine and nearest neighbours, the rest on analogy and the 2D scatter. The by-hand PCA or k-means extensions push it past 3 hours.
vectors.py: load plus the geometry operations, plus a short exploration that prints neighbour lists and similarity numbers.
README.md: what you built, the cosine formula in your own words, the neighbour lists you got, and where analogy arithmetic broke.

builds/B3/
  vectors.py    # load + cosine + nearest_neighbours + analogy + project-to-2D
  glove.txt     # the vectors as data (any GloVe/fastText-style file, or a slice)
  clusters.png  # the 2D scatter
  README.md     # formula, neighbour lists, where analogy held or failed
One script is plenty. Keep the vectors file separate so the code stays generic to "load an N by D matrix plus labels".
Load the vectors file with encoding="utf-8" into a numpy matrix plus a word to row-index dict. Cap the rows so brute force stays fast.
Compute king - man + woman, then take its nearest neighbours, excluding the input words when you read the result.

Two functions carry the geometry and are left for you to write. Everything else is scaffolding. Writing cosine_similarity and nearest_neighbours yourself is the milestone.
import numpy as np


def load_vectors(path, max_rows=20000):
    # treat the file as plain data: each line is word num num num ...
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_rows:
                break
            parts = line.split()
            words.append(parts[0])
            rows.append([float(x) for x in parts[1:]])
    M = np.array(rows)  # shape (N, D)
    index = {w: i for i, w in enumerate(words)}
    return M, words, index


def cosine_similarity(a, b):
    # TODO (you write this): dot(a, b) / (norm(a) * norm(b))
    ...


def nearest_neighbours(query, M, k=5):
    # query: a single (D,) vector. M: the (N, D) matrix.
    # TODO (you write this): score query against every row of M, return the top-k row indices
    ...


# --- scaffolding below: the analogy wrapper and the 2D projection ---

def analogy(a, b, c, M, words, index, k=5):
    v = M[index[a]] - M[index[b]] + M[index[c]]  # king - man + woman
    return nearest_neighbours(v, M, k)  # exclude a, b, c when you read it


def project_2d(M):
    Mc = M - M.mean(axis=0)  # center first
    U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:2].T  # top-2 components, shape (N, 2)
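The scaffolding stops at the projection, so here is one possible way to turn it into clusters.png. This is a sketch under assumptions: `save_scatter` and the `picks` argument are hypothetical names, `project_2d` is repeated from the scaffolding only so the block runs on its own, and the demo uses random stand-in vectors, not real embeddings.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend: save to file without opening a window
import matplotlib.pyplot as plt
import numpy as np


def project_2d(M):
    # Same projection as the scaffolding: center, SVD, keep the top-2 components.
    Mc = M - M.mean(axis=0)
    U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:2].T


def save_scatter(M, index, picks, path="clusters.png"):
    """Project all of M to 2D, then plot and label only the words in picks."""
    P = project_2d(M)
    shown = [w for w in picks if w in index]  # silently skip out-of-vocabulary words
    fig, ax = plt.subplots(figsize=(8, 6))
    for w in shown:
        x, y = P[index[w]]
        ax.scatter(x, y, s=12)
        ax.annotate(w, (x, y), fontsize=8)
    ax.set_title("Top-2 principal components")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return shown


# Demo with random stand-in vectors, written to a temporary folder.
rng = np.random.default_rng(0)
demo_words = ["king", "queen", "dog", "cat", "paris", "london"]
demo_M = rng.normal(size=(6, 5))
demo_index = {w: i for i, w in enumerate(demo_words)}
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "clusters.png")
    shown = save_scatter(demo_M, demo_index, ["king", "dog", "zebra"], out)
    saved = os.path.exists(out)
```

Plotting a few dozen hand-picked words from two or three themes usually reads better than plotting every row, because the labels stay legible.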
Then explore. Reading these lists is most of the learning:
nearest_neighbours(M[index["king"]], M, k=5)
# -> row indices; read them as words[i], e.g. queen, prince, monarch, throne, kingdom (example; depends on your file)
analogy("king", "man", "woman", M, words, index)
# -> row indices again; queen near the top once mapped to words (often not rank 1, which is the honest result)
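Reading the result means mapping indices back to words and dropping the inputs. A minimal sketch, assuming nearest_neighbours returns row indices ordered best first, as its stub describes; `readable` is a hypothetical helper name, and the demo indices are made up rather than taken from a real file.

```python
def readable(indices, words, exclude=()):
    """Map row indices to words, dropping any word listed in exclude."""
    skip = set(exclude)
    return [words[i] for i in indices if words[i] not in skip]


# Made-up example: pretend a search returned these row indices, best first.
demo_words = ["king", "man", "woman", "queen", "prince"]
demo_hits = [0, 3, 2, 4]
result = readable(demo_hits, demo_words, exclude=("king", "man", "woman"))
```

Because the inputs are removed after the search, asking for a few extra neighbours (for example k + 3) leaves you with at least k words to read.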
Your exact words and numbers depend on the vectors file; these are illustrative, not targets:
nearest_neighbours for "king" returns royalty and noble words near the top; for "dog" it returns animals; for "paris" it returns cities and France-related words.
king - man + woman puts "queen" in the top few, often not at rank 1. That honest result is the point.

Assess against the Build Track Validation Standard. The bar is understanding, not a leaderboard score.
sklearn or faiss rather than your numpy. The milestone is to build the geometry, so reframe it to your own code before marking complete. Those libraries belong only in the optional extensions.
These are conceptual traps, distinct from code symptoms.
These are code symptoms and their likely causes, distinct from the conceptual pitfalls above.
| Symptom | Likely cause |
|---|---|
| self-similarity is not 1.0 | The norm is in the wrong place. Check dot(a, a) / (norm(a) * norm(a)) equals 1.0. |
| neighbour lists are full of rare junk | The file has a header line or punctuation rows. Skip the header, or filter to alphabetic words. |
| KeyError on a query word | The word is not in your capped vocabulary. Check membership in the index dict first. |
| every similarity is near 1.0 | Vectors not centered for the distribution check, or you loaded duplicated rows. Print a histogram to see. |
| the scatter is one blob | You projected before centering, or plotted the wrong two columns. Center, then take the top-2 components. |
| neighbour search is slow | A Python loop over rows. Vectorise it: score all rows at once with M @ q. |
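Two of the rows above can be turned into small guards. The sketch below is illustrative only: `check_cosine` and `lookup` are hypothetical names, and `check_cosine` takes your own function as an argument, so it does not implement the similarity for you.

```python
import numpy as np


def check_cosine(cos):
    """Raise AssertionError if cos breaks a basic property of cosine similarity."""
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([-2.0, 1.0, 0.0])          # dot(a, b) == 0, so b is orthogonal to a
    assert np.isclose(cos(a, a), 1.0)       # self-similarity
    assert np.isclose(cos(a, b), 0.0)       # orthogonal vectors
    assert np.isclose(cos(a, -a), -1.0)     # opposite direction
    assert np.isclose(cos(a, 10 * a), 1.0)  # length must not matter
    return True


def lookup(word, M, index):
    """Return the row for word, or None if it is outside the capped vocabulary."""
    i = index.get(word)
    return None if i is None else M[i]


# Demo on a two-word toy vocabulary.
toy_M = np.eye(2)
toy_index = {"king": 0, "queen": 1}
hit = lookup("king", toy_M, toy_index)
miss = lookup("zebra", toy_M, toy_index)
```

Once your own function exists, `check_cosine(cosine_similarity)` is one line to run before you trust any neighbour list.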
sklearn.