The Transformer moment
How did BERT and GPT build on the Transformer?
In 2018, two papers took the Transformer in different directions. BERT reads text in both directions to understand it, while GPT reads left to right to generate it. Both are pre-trained on large amounts of unlabeled text, then adapted to specific tasks.
The Transformer was a design, not a finished product. In 2018 two papers showed what to do with it, and they pointed in different directions. Both shared one move: train the model on a large pile of plain, unlabeled text first, then adjust it for the job you actually want.
BERT, from a Google team, stands for Bidirectional Encoder Representations from Transformers. It reads a sentence in both directions at once, “by jointly conditioning on both left and right context in all layers.”[1] It learns by playing a fill-in-the-blank game: hide some words and make the model guess them from what surrounds them.[1] Once pre-trained, BERT “can be fine-tuned with just one additional output layer” for tasks like question answering, and it pushed the GLUE benchmark score to 80.5%, an absolute improvement of 7.7 percentage points.[1] BERT is built to understand text.
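The fill-in-the-blank game can be sketched in a few lines of Python. This is a simplified illustration, not BERT's actual code: the function name `mask_tokens`, the whitespace tokenization, and the fixed seed are assumptions made for the example. The 15% default follows the masking rate reported in the BERT paper; the paper's further step of sometimes swapping in a random word or leaving the chosen word unchanged is left out here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens (hypothetical helper).

    Returns (masked_tokens, targets), where targets maps each hidden
    position to the original word the model would have to guess.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    # Hide at least one token so short sentences still give a target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the answer to the blank
        masked[pos] = MASK          # the blank the model sees
    return masked, targets


tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens)
print(masked)   # the sentence with one word replaced by [MASK]
print(targets)  # {position: hidden word}
```

A model trained this way sees the whole masked sentence, with words on both sides of each blank, and is scored only on the hidden positions.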
GPT, from OpenAI, takes the other path. Its paper, “Improving Language Understanding by Generative Pre-Training,” describes “generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.”[2] It reads left to right and predicts the next word, which makes it good at producing text. That paper reported improving the state of the art on 9 of the 12 tasks it studied.[2] The line of work that grew from it became the chat models most people know today.
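Next-word prediction turns any sentence into training examples without anyone labeling it. A minimal sketch, assuming whitespace-split words and a hypothetical helper named `next_word_examples` (real models work on subword tokens and learn from all positions in one pass):

```python
def next_word_examples(tokens):
    """Build (context, target) pairs for left-to-right prediction.

    Each target word is paired only with the words to its left,
    so the model never sees what comes after the word it predicts.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()
for context, target in next_word_examples(tokens):
    print(context, "->", target)
# first line printed: ['the'] -> cat
```

Because each example only looks leftward, the same model can generate text at run time: predict a word, append it to the context, and predict again.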
The founding papers
- Attention Is All You Need (2017)
The paper that introduced the Transformer architecture.
- BERT (2018)
Bidirectional pre-training for understanding text.
- GPT (2018)
Generative pre-training for producing text.
References
- [1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Devlin et al., 2018, arXiv.
- [2] Improving Language Understanding by Generative Pre-Training. Radford et al., 2018, OpenAI.