The transformer
What came before the transformer, and why was it slow?
The transformer is the neural network architecture behind essentially every modern large language model. Before it, recurrent neural networks read text one word at a time, which made training slow and caused early words to fade; the transformer drops recurrence so every word can look at every other word at once.
The architecture was introduced in the 2017 paper “Attention Is All You Need,” whose eight Google authors proposed a design “based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”[1]
Before transformers, the standard way to process language was the recurrent neural network (RNN), which reads a sentence one word at a time, carrying a running memory forward. That design has two problems. It is slow to train, because each step has to wait for the one before it, so the work on a single sentence cannot be done in parallel. And it tends to forget: information about early words weakens as it passes through many steps. The transformer’s answer is to drop recurrence and let every word look at every other word directly, in a single parallel operation.[2]
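The contrast can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `rnn_read` and `self_attention`, the `decay` parameter, and the toy two-dimensional word vectors are all hypothetical, and the attention step uses the input vectors directly as queries, keys, and values, whereas a real transformer applies learned projections and multiple heads.

```python
import math


def rnn_read(embeddings, decay=0.5):
    """Toy recurrence: a running memory updated one word at a time."""
    state = [0.0] * len(embeddings[0])
    for vec in embeddings:
        # Step t needs the state from step t-1, so the loop is inherently serial.
        # Each update also shrinks whatever earlier words contributed.
        state = [decay * s + (1 - decay) * x for s, x in zip(state, vec)]
    return state


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with no learned weights (an assumption)."""
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        # Each output row depends only on the inputs, never on another output row,
        # so all rows could be computed at the same time.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, embeddings)) for i in range(dim)]
        )
    return outputs


words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up vectors for three words
print(rnn_read(words))
print(self_attention(words))
```

In `rnn_read`, the first word's share of the final memory is halved at every later step, which is the fading effect described above. In `self_attention`, the first word is weighed directly against every other word regardless of distance, and no row waits on another.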
References
- [1] Attention Is All You Need (arXiv:1706.03762), Vaswani et al., Google
- [2] Transformer (deep learning architecture), Wikipedia