The transformer
What are the three families of transformer models?
The original transformer was an encoder-decoder built for translation. Later models specialized the design into three shapes:[1]
- Encoder-only (such as BERT): bidirectional, tuned for understanding text.
- Decoder-only (such as GPT): causal, so each token attends only to itself and earlier tokens; tuned for generating text one token at a time.
- Encoder-decoder (such as T5): the full original design, for translation-style tasks that map one sequence to another.
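Today’s generative chatbots are overwhelmingly decoder-only transformers. They use causally masked self-attention, so each position attends only to itself and the tokens before it, never to later ones. That restriction is what makes next-token prediction work: the model cannot see the token it is being asked to predict.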
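The difference between bidirectional (encoder-style) and causally masked (decoder-style) attention can be illustrated with a small sketch. This is not how any particular model implements it; it is a minimal pure-Python illustration under stated assumptions. The function name `attention_weights` and the toy score matrix are hypothetical, and real models compute scores from learned query and key vectors rather than using fixed values.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores (illustrative helper).

    scores[i][j] is the raw score of query position i against key position j.
    With causal=True, position i may only attend to positions j <= i.
    """
    weights = []
    for i, row in enumerate(scores):
        # Key positions this query position is allowed to look at.
        visible = [j for j in range(len(row)) if not causal or j <= i]
        # Subtract the largest visible score for numerical stability.
        peak = max(row[j] for j in visible)
        exps = {j: math.exp(row[j] - peak) for j in visible}
        total = sum(exps.values())
        # Masked (future) positions receive exactly zero weight.
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights


# Assumed toy input: equal raw scores for a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]

encoder_style = attention_weights(scores, causal=False)  # bidirectional
decoder_style = attention_weights(scores, causal=True)   # causally masked

for row in decoder_style:
    print([round(w, 2) for w in row])
```

With equal scores, every row of `encoder_style` spreads its weight evenly over all four tokens, while `decoder_style` prints a lower-triangular pattern: the first token attends only to itself, the second splits its weight between the first two tokens, and so on. The zeros above the diagonal are the masked future positions.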
References
- [1] Transformer (deep learning architecture), Wikipedia