Attention
Why do tokens need to look at each other?
Meaning is contextual. Resolving a pronoun, tracking subject-verb agreement, or disambiguating a word all need information from elsewhere in the sentence. Self-attention is, in the paper’s definition, “an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.”[1] Because every token attends to every other directly, distant words influence each other in a single step.
The canonical example comes from the transformer’s own authors: in “The animal didn’t cross the street because it was too tired,” the word “it” should attend to “animal,” the noun it refers to, and attention learns to route that connection.[2]
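To make the "every token attends to every other" idea concrete, here is a minimal sketch of scaled dot-product self-attention, softmax(QK^T / sqrt(d_k))V, in plain Python. It rests on simplifying assumptions: the queries, keys, and values are all the raw input vectors (a real transformer applies learned projections and uses multiple heads), and the three 2-dimensional "embeddings" are made-up numbers chosen so that "it" sits close to "animal". The names `softmax` and `self_attention` are illustrative, not taken from the paper or any library.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    Assumption: no learned projection matrices and a single head.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Compare this position's query against every position's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # The new representation is a weighted average of all value vectors.
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights


# Hypothetical toy embeddings, hand-picked for illustration only.
tokens = ["animal", "street", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = self_attention(x)
for tok, w in zip(tokens, weights):
    print(tok, [round(v, 3) for v in w])
```

With these hand-picked vectors, the row for "it" puts more weight on "animal" than on "street", and it does so in a single step regardless of how far apart the two words are. In a trained model the same routing comes from learned parameters rather than from chosen inputs.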
References
- [1] Attention Is All You Need (arXiv:1706.03762), Vaswani et al., Google
- [2] Transformer: A Novel Neural Network Architecture for Language Understanding, Google Research Blog