Why it changed everything

Because the Transformer trained in parallel rather than step by step, researchers could build much larger models on much more text. That ability to scale, more than any single trick, set off the modern era of language models.

A faster way to train translation models does not sound like a turning point. The reason it became one is subtle. Older designs read text one step at a time, which meant the computer mostly waited on itself: step two could not start until step one finished. The Transformer’s attention looks at the whole input together, so the math spreads across many processors at once.^[1] That single property removed the main bottleneck on size.
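The contrast can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: `recurrent_states`, `attend`, and `self_attention` are hypothetical names, the simple recurrence stands in for older step-by-step designs, and the attention here uses the input vectors directly instead of the learned query, key, and value projections a real Transformer applies.

```python
import math

def recurrent_states(inputs, decay=0.5):
    """Toy recurrence: each state depends on the previous one, so steps must run in order."""
    state = 0.0
    states = []
    for x in inputs:
        state = decay * state + x  # cannot start until the previous step has finished
        states.append(state)
    return states

def attend(position, vectors):
    """Scaled dot-product self-attention output for a single position.

    Simplification: the input vectors serve directly as queries, keys, and values.
    """
    query = vectors[position]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[d] for w, value in zip(weights, vectors))
        for d in range(len(query))
    ]

def self_attention(vectors, order=None):
    """Compute every position's output; the order of the loop does not matter."""
    order = range(len(vectors)) if order is None else order
    outputs = [None] * len(vectors)
    for i in order:
        outputs[i] = attend(i, vectors)  # reads only the fixed inputs, never another output
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
forward = self_attention(tokens)
backward = self_attention(tokens, order=[2, 1, 0])
print(forward == backward)  # True: positions are independent of one another
print(recurrent_states([1.0, 1.0, 1.0]))  # [1.0, 1.5, 1.75]
```

Computing the attention outputs in reverse order gives the same result, which is why each position could in principle be handed to a different processor. The recurrence offers no such freedom, because every step consumes the state produced by the one before it.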

When a design trains in parallel, you can throw far more computing power and far more text at it without training time becoming impractical. Researchers could now build models with many more internal settings, trained on a far larger slice of the written web. The main lever for progress was no longer a clever new architecture. It was scale.

The shift

Before the Transformer, progress often meant a smarter model design. After it, much of the progress came from making the same design bigger and feeding it more data. The papers that followed, including BERT and GPT, were built on this foundation.

The original paper was about translation, and its authors could not have spelled out everything that would come next. But the architecture they described turned out to be general. Nearly every well-known language model since has been a Transformer of one kind or another.

References

[1] Vaswani et al., "Attention Is All You Need," arXiv, 2017.

