Query, Key, Value — Physea Wiki

Each token is projected into three vectors: a query (what it is looking for), a key (what it offers as a match), and a value (the content passed along). A token's output is a weighted sum of values, weighted by how well its query matches each key.

Attention is the mechanism that lets a model decide, for each token, which other tokens to draw information from. It is the core operation of the transformer, and understanding it explains much of how modern language models work.

Each token is projected, through learned weight matrices, into three vectors. The query represents what the token is looking for. The key represents what a token offers as a match. The value is the content actually passed along. The original paper defines attention as “mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.”^[1]
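The projection step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names `X`, `W_q`, `W_k`, `W_v` and the sizes are assumptions chosen for the example, and random matrices stand in for weights that a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): 4 tokens, embedding width 8, projection width 3.
n_tokens, d_model, d_k = 4, 8, 3

# Token embeddings, one row per token (random placeholders).
X = rng.normal(size=(n_tokens, d_model))

# Random stand-ins for the learned weight matrices.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Each token (row of X) gets its own query, key, and value vector.
Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token offers as a match
V = X @ W_v  # the content each token passes along
```

Every row of `Q`, `K`, and `V` belongs to one token, so all three matrices have one row per token.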

The output for a token is a weighted sum of the values, where, in the authors’ words, “the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”^[1] A strong query-key match gives that value a large weight, so the token pulls in information from the positions most relevant to it.
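The weighted sum can be sketched as follows, assuming the scaled dot-product compatibility function used in the cited paper: softmax(QKᵀ / √d_k) V. The function name `attention` and the toy matrices are illustrative assumptions, and the sketch omits masking, batching, and multiple heads.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Compatibility of every query with every key.
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    # Normalize so each token's weights sum to 1.
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V, weights

# Toy data (assumed): query 0 matches key 0 strongly, query 1 matches key 1.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[10.0, 0.0], [0.0, 10.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])

output, weights = attention(Q, K, V)
```

In this toy case each query matches exactly one key strongly, so nearly all of the weight falls on that key's value and each output row lands close to the corresponding row of `V`.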

References

[1] Vaswani et al. (Google), “Attention Is All You Need,” arXiv:1706.03762.
