The transformer

What is self-attention, and why is it the transformer's key move?

Self-attention is the transformer's key move: for each word, the model weighs every other word by how relevant it is and blends them in. Because there is no left-to-right dependency, the whole sequence is processed at once, and that parallelism is what let models train fast enough on GPUs to scale.

Last updated 2026-06-15 · Physea Labs

Concretely, each word (more precisely, each token) scores every word in the sequence, itself included, for relevance. Those scores become weights, and the word's new representation is the weighted blend. A word’s meaning can therefore be shaped directly by any other word, no matter how far away. (The attention page covers the mechanism in detail.)

Figure: the sentence “The animal didn’t cross the street because it was tired”, with the strongest attention drawn from the word “it”.
Self-attention lets every token look at every other token at once. The word "it" attends most to the noun it refers to.
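
The weigh-and-blend step can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention, not a production implementation. The function names and the three toy tokens are hypothetical, and the query, key and value vectors are made-up numbers chosen so that "it" lands on "animal". In a trained transformer those vectors come from learned projections of the token embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    Each argument is a list with one vector per token.
    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j.
    """
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Relevance of every token (including this one) to the query.
        row = softmax([dot(q, k) / scale for k in keys])
        # Blend the value vectors using those weights.
        blended = [sum(w * v[d] for w, v in zip(row, values)) for d in range(width)]
        weights.append(row)
        outputs.append(blended)
    return outputs, weights

# Toy, hand-picked vectors (assumptions for illustration only).
tokens = ["animal", "street", "it"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(queries, keys, values)
it_row = weights[tokens.index("it")]
strongest = tokens[it_row.index(max(it_row))]
print(strongest)  # animal
```

The query for "it" points in the same direction as the key for "animal", so that pair gets the largest score and the largest weight. Distance in the sentence plays no part in the score.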

Because there is no left-to-right dependency between positions, as there is in a recurrent network, the whole sequence can be processed at once during training. That parallelism is the practical breakthrough: it is what let models train fast enough on GPUs to grow to their current size. On the WMT 2014 English-to-German translation benchmark, the original transformer beat the best previously reported recurrent and convolutional models while being, in the authors’ words, “more parallelizable and requiring significantly less time to train.”[1]
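
The difference in dependency structure can be shown with a small sketch. The `recurrent_states` function below is a hypothetical stand-in for a recurrent model, not a real RNN, and `attend` is a simplified attention step that reuses the input vectors as queries, keys and values. Both names and all numbers are assumptions for illustration. The point is only that each attention output reads the fixed inputs and nothing else, so the positions can be computed in any order, or all at once.

```python
import math

def attend(i, queries, keys, values):
    """Attention output for position i.

    Reads only the fixed inputs, never another position's output.
    """
    scale = math.sqrt(len(keys[0]))
    scores = [sum(a * b for a, b in zip(queries[i], k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    width = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(width)]

def recurrent_states(inputs, decay=0.5):
    """Toy recurrence: each state needs the previous one, so steps must run in order."""
    state, states = 0.0, []
    for x in inputs:
        state = decay * state + x
        states.append(state)
    return states

# Toy input vectors, reused as queries, keys and values (an assumption).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
n = len(x)

# Positions computed left to right.
in_order = [attend(i, x, x, x) for i in range(n)]

# Positions computed in a scrambled order, then put back in place.
scrambled = {i: attend(i, x, x, x) for i in (2, 0, 3, 1)}
any_order = [scrambled[i] for i in range(n)]

print(in_order == any_order)  # True
print(recurrent_states([1.0, 1.0, 1.0]))  # [1.0, 1.5, 1.75]
```

On a GPU this independence is what allows all positions to be computed together as a few large matrix multiplications, whereas the recurrence has to finish step t before step t + 1 can start.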

References

  1. Attention Is All You Need (arXiv:1706.03762) — Vaswani et al., Google