Attention Is All You Need

A 2017 paper called "Attention Is All You Need" introduced the Transformer, a network built only on attention mechanisms. It dropped the recurrence used by earlier models and trained faster while setting new translation records.

In June 2017, a team of researchers published a paper with a memorable title: “Attention Is All You Need.”^[1] It introduced a network design they called the Transformer. The idea that made it new was in the title. Most leading earlier models were recurrent: they read text in order, each step feeding the next. The Transformer threw that out and was, in the authors’ words, “based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”^[1]

Attention is a way for the model to look at every word in a sentence at once and decide which other words matter for understanding each one. Because it does not have to step through text in sequence, the work can be split across many processors at the same time. The authors reported that their models were superior in quality “while being more parallelizable and requiring significantly less time to train.”^[1]
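To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the core operation in the paper, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names and the toy word vectors are invented for this example, and the same vectors are reused as queries, keys, and values, whereas the real model derives each from learned projections and runs several attention heads side by side.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    Each argument is a list of vectors, one per word.
    Returns (outputs, weights), with one row per word in each.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this word against every word, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional vectors standing in for a three-word sentence.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words, words, words)
```

Each row of weights says how much one word draws on every word in the sentence. No pass through the loop depends on another, which is why the same computation can be handed to many processors at once; practical implementations express it as a few matrix multiplications instead of a Python loop.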

The proof was in translation. On a standard English-to-German task the Transformer reached 28.4 BLEU, a common translation quality score, beating the prior best by more than two points. On English-to-French it set a new single-model record of 41.8 BLEU after training for three and a half days on eight GPUs.^[1] Better results and faster training from a simpler design were a combination the field had not seen before.

References

[1] Vaswani et al., “Attention Is All You Need,” arXiv, 2017.
