Self-attention — Physea Wiki

Self-attention is the transformer's key move: for each word, the model weighs every other word by how relevant it is and blends them in. Because there is no left-to-right dependency, the whole sequence is processed at once, and that parallelism is what let models train fast enough on GPUs to scale.

Because every word is weighed against every other word directly, a word’s meaning can be shaped by any other word, no matter how far away. (The attention page covers the mechanism in detail.)

Self-attention lets every token look at every other token at once. The word "it" attends most to the noun it refers to.
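The weigh-and-blend step can be sketched in a few lines of plain Python. This is a simplified illustration, not the full transformer layer: it assumes each word vector acts as its own query, key, and value, whereas real models first apply learned projections. The three two-dimensional embeddings are hand-picked, hypothetical values chosen so that "it" lies close to "cat".

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Simplified self-attention over a list of equal-length vectors.

    Assumption: each vector is used directly as query, key, and value.
    Real transformers apply learned linear projections first.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Relevance of every word to this one: scaled dot product.
        scores = [dot(query, key) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        # Blend: weighted average of all word vectors.
        blended = [sum(w * v[j] for w, v in zip(weights, vectors))
                   for j in range(d)]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings for a three-word sequence.
words = ["cat", "sat", "it"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(embeddings)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row holds one word's attention weights over the whole sequence. With these toy vectors, the largest weight in the row for "it" falls on "cat", because their dot product is the highest.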

Processing the whole sequence at once is the practical breakthrough: it is what let models train fast enough on GPUs to grow to their current size. On the WMT 2014 English-to-German translation benchmark, the original transformer beat the best prior recurrent and convolutional models while being, in the authors’ words, “more parallelizable and requiring significantly less time to train.”^[1]
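The missing left-to-right dependency can be checked on a toy case. In the sketch below, `attend` computes one position's output from the inputs alone, so positions can be visited in any order (or all at once) with identical results. `running_state` is a hypothetical stand-in for a recurrent model, included only for contrast; it is not an actual RNN with learned weights. As in any simplified sketch of this kind, the learned query, key, and value projections are assumed away.

```python
import math

def attend(i, vectors):
    """Output for position i, computed from the inputs alone.

    Nothing here depends on the output of any other position.
    """
    d = len(vectors[0])
    scores = [sum(q * k for q, k in zip(vectors[i], key)) / math.sqrt(d)
              for key in vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, vectors))
            for j in range(d)]

def running_state(vectors):
    """Toy recurrence for contrast: step t needs the result of step t - 1."""
    state = [0.0] * len(vectors[0])
    states = []
    for v in vectors:
        state = [math.tanh(s + x) for s, x in zip(state, v)]
        states.append(state)
    return states

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
positions = range(len(vectors))

in_order = [attend(i, vectors) for i in positions]
# Visit positions back to front, then restore the original order.
back_to_front = [attend(i, vectors) for i in reversed(positions)][::-1]
assert in_order == back_to_front  # no position waited on another

recurrent = running_state(vectors)  # must be computed strictly in order
```

Because each call to `attend` is independent, a GPU can evaluate all positions simultaneously as one batched matrix operation, while the loop in `running_state` cannot start step t until step t - 1 has finished.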

References

[1] Vaswani et al. (Google), “Attention Is All You Need,” arXiv:1706.03762.

