Embeddings & vectors
What is an embedding, and how does it turn meaning into geometry?
An embedding is a list of numbers, a vector, that represents text as a point in a high-dimensional space where the geometry encodes meaning, so similar items land close together. The numbers are learned, not assigned, so each dimension is a latent feature rather than a human-named label.
An embedding is a list of numbers, a vector, that represents a piece of text as a point in a high-dimensional space, arranged so that the geometry encodes meaning. OpenAI’s definition is about as direct as it gets: “an embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness.”[1] Items with similar meaning land close together; unrelated items land far apart.
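To make "distance measures relatedness" concrete, here is a minimal sketch that compares vectors with cosine similarity. Cosine similarity is one common choice of measure, and the quoted definition does not prescribe it. The four-dimensional vectors and the `toy_embeddings` name are made up for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors, invented for this example (not output from any model).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "kitten": [0.8, 0.9, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9, 0.8],
}

sim_related = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_unrelated = cosine_similarity(toy_embeddings["cat"], toy_embeddings["invoice"])

print(f"cat vs kitten:  {sim_related:.3f}")
print(f"cat vs invoice: {sim_unrelated:.3f}")
```

With these toy values the "cat" and "kitten" vectors point in nearly the same direction, so their score is close to 1, while "cat" and "invoice" score much lower. That is the sense in which nearby points stand for related meanings.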
The idea grew out of a simple observation: words used in similar contexts tend to mean similar things, so they should get similar vectors. word2vec showed in 2013 that you could learn such vectors cheaply at scale, and that the resulting space captured real structure, both grammatical and semantic.[2] The numbers are learned rather than assigned, so each dimension is a latent feature, not a label a human picked.
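The observation that similar contexts imply similar meanings can be illustrated with a simple count-based stand-in. This is not word2vec, which learns dense vectors by training a small neural model. It only counts which words share a sentence, which is enough to show the principle. The corpus and the helper names (`context_vectors`, `sparse_cosine`) are hypothetical and chosen for this sketch.

```python
import math
from collections import Counter


def context_vectors(sentences):
    """Map each word to a Counter of the other words in its sentences."""
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            neighbours = words[:i] + words[i + 1:]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def sparse_cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Tiny made-up corpus: "cat" and "kitten" appear in the same kinds of sentences.
corpus = [
    "the cat drinks milk",
    "the kitten drinks milk",
    "the cat chases mice",
    "the kitten chases mice",
    "the bank approves loans",
    "the bank charges fees",
]

vectors = context_vectors(corpus)
cat_kitten = sparse_cosine(vectors["cat"], vectors["kitten"])
cat_bank = sparse_cosine(vectors["cat"], vectors["bank"])

print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs bank:   {cat_bank:.3f}")
```

Because "cat" and "kitten" share their contexts, their count vectors score higher than "cat" and "bank", which overlap only through the word "the". Learned embeddings replace these raw counts with dense vectors whose dimensions are latent features, but they rest on the same signal.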
References
[1] Embeddings guide. OpenAI.
[2] Efficient Estimation of Word Representations in Vector Space (word2vec). Mikolov et al., arXiv.