Three families — Physea Wiki

The original transformer was an encoder-decoder built for translation. Later models specialized the design into three shapes:^[1]

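Encoder-only (such as BERT): bidirectional, so each token attends to tokens on both sides; tuned for understanding text.
Decoder-only (such as GPT): causal, so each token sees only itself and earlier tokens; tuned for generating text one token at a time.
Encoder-decoder (the full original design): a bidirectional encoder reads the input and a causal decoder writes the output; suited to translation-style tasks.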
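
One way to see the difference between the three shapes is to look at which positions are allowed to attend to which. The sketch below is illustrative only: the helper names (`visibility_mask`, `render`, `FAMILIES`) and the sequence lengths are assumptions made for this example, not part of any library. It prints an `x` where a query position (row) may attend to a key position (column).

```python
def visibility_mask(n_queries, n_keys, causal):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    return [[(j <= i) if causal else True for j in range(n_keys)]
            for i in range(n_queries)]

# Hypothetical lengths, chosen only to keep the printout small.
SRC_LEN, TGT_LEN = 4, 3

FAMILIES = {
    # Bidirectional: every token sees every other token.
    "encoder-only": {"self": visibility_mask(SRC_LEN, SRC_LEN, causal=False)},
    # Causal: each token sees itself and earlier tokens only.
    "decoder-only": {"self": visibility_mask(TGT_LEN, TGT_LEN, causal=True)},
    # Full original: bidirectional encoder, causal decoder, and
    # cross-attention from every output position to the whole input.
    "encoder-decoder": {
        "encoder_self": visibility_mask(SRC_LEN, SRC_LEN, causal=False),
        "decoder_self": visibility_mask(TGT_LEN, TGT_LEN, causal=True),
        "cross": visibility_mask(TGT_LEN, SRC_LEN, causal=False),
    },
}

def render(mask):
    """Draw a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)

if __name__ == "__main__":
    for family, masks in FAMILIES.items():
        for name, mask in masks.items():
            print(f"{family} / {name}")
            print(render(mask))
```

The decoder masks come out lower-triangular, while the encoder and cross-attention masks are fully filled in.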

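Today’s generative chatbots are overwhelmingly decoder-only transformers. They use causally masked self-attention, so each position attends only to itself and the tokens before it. Because no position can see what comes next, the model can be trained to predict the next token at every position and can then generate text one token at a time.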
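
The effect of the mask can be checked with a small sketch. This is a simplified, assumed setup rather than a real model: `causal_self_attention` is a hypothetical helper that uses the token vectors directly as queries, keys, and values, with no learned projections and a single head. Restricting each position to indices `0..i` is equivalent to setting the scores of later positions to negative infinity before the softmax.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head attention over token vectors x, with a causal mask.

    Assumption for illustration: x serves as queries, keys, and values.
    """
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Position i scores only positions 0..i; later ones are masked out.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        out.append([sum(w * x[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out

# Two sequences that differ only in their last token.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = tokens_a[:2] + [[5.0, -3.0]]

out_a = causal_self_attention(tokens_a)
out_b = causal_self_attention(tokens_b)

if __name__ == "__main__":
    # Outputs at the first two positions do not depend on the third token.
    print(out_a[:2] == out_b[:2])
    print(out_a[2] == out_b[2])
```

Changing the last token leaves the outputs at earlier positions untouched, which is the property that lets a decoder-only model score every next-token prediction in a sequence in one pass during training.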

References

1. Transformer (deep learning architecture), Wikipedia.
