The transformer
What is a transformer block, and what is inside it?
A transformer is not one big thing; it is one block stacked many times. Each block has a multi-head self-attention layer and a small feed-forward network, each wrapped in a residual connection and layer normalization. A positional encoding, added once to the input rather than inside each block, tells the model the word order.
In more detail, each block has two main parts: a multi-head self-attention layer and a small position-wise feed-forward network, which is applied to every position independently. Each part is wrapped in a residual connection (the input is added back to the output) followed by layer normalization, which keeps training stable through a deep stack. The original model stacked six such blocks in its encoder and six more in its decoder, where each block adds a third part that attends to the encoder output.[1]
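The wrap-and-stack structure described above can be sketched in plain Python. This is a toy illustration, not the original implementation, and it rests on loud assumptions: attention is single-head with no learned projections, a bare ReLU stands in for the feed-forward network, layer normalization has no learned gain or bias, and the same weightless block is reused six times, whereas a real model gives each block its own parameters. All function names are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    # Rescale one position's vector to zero mean and unit variance.
    # Assumption: no learned gain or bias, which a real layer would have.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization:
    # LayerNorm(x + sublayer(x)), computed position by position.
    out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(xi, oi)])
            for xi, oi in zip(x, out)]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_self_attention(x):
    # Single-head scaled dot-product attention with x used directly as
    # queries, keys, and values (assumption: no learned projections).
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out

def toy_feed_forward(x):
    # Position-wise: the same function is applied to each position on its own.
    # Assumption: a bare ReLU stands in for the two learned linear layers.
    return [[max(0.0, v) for v in xi] for xi in x]

def transformer_block(x):
    # Two sublayers, each wrapped in residual + layer norm.
    x = add_and_norm(x, toy_self_attention)
    x = add_and_norm(x, toy_feed_forward)
    return x

def stack(x, num_blocks=6):
    # The "one block stacked many times" idea: output of one block feeds the next.
    for _ in range(num_blocks):
        x = transformer_block(x)
    return x

# Three positions, each a 4-dimensional vector (made-up numbers).
tokens = [[1.0, 0.0, -1.0, 2.0],
          [0.5, 1.5, -0.5, 0.0],
          [2.0, -1.0, 0.0, 1.0]]
output = stack(tokens)
print(len(output), len(output[0]))  # same shape as the input: 3 4
```

Because every block maps a sequence of vectors to a sequence of the same shape, blocks can be chained to any depth, and the residual path gives each layer's input a direct route to its output.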
Two more pieces make it complete. Multi-head attention runs several attention computations in parallel, so different heads can track different relationships at once, such as syntax in one head and which noun a pronoun refers to in another.[2] Because attention on its own treats the input as a set with no inherent order, a positional encoding is added to each word's embedding before the first block, so the model knows that “dog bites man” differs from “man bites dog.”[3]
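Both ideas can be checked with a small experiment, sketched here in plain Python under stated assumptions. The embeddings in `EMBED` are invented for illustration, each head simply attends over its own slice of the vector (real models use learned per-head projections), and the positional encoding is the sinusoidal form from the original paper. The point is only to show that, without positions, "dog" gets the same output vector wherever it appears, and that adding positions changes this.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x):
    # Scaled dot-product self-attention with x used as queries, keys, and values.
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out

def multi_head_attention(x, num_heads):
    # Split each vector into num_heads slices, attend within each slice
    # independently, then concatenate the per-head results.
    d = len(x[0])
    assert d % num_heads == 0
    size = d // num_heads
    heads = [attention([row[h * size:(h + 1) * size] for row in x])
             for h in range(num_heads)]
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: sine on even dimensions, cosine on odd ones,
    # with each sin/cos pair sharing one frequency.
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Hypothetical 4-dimensional embeddings, invented for this example.
EMBED = {"dog": [1.0, 0.0, 0.0, 0.0],
         "bites": [0.0, 1.0, 0.0, 0.0],
         "man": [0.0, 0.0, 1.0, 0.0]}

def encode(words, use_positions):
    x = [list(EMBED[w]) for w in words]
    if use_positions:
        pe = positional_encoding(len(words), 4)
        x = [[a + b for a, b in zip(row, p)] for row, p in zip(x, pe)]
    return multi_head_attention(x, num_heads=2)

def max_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

# "dog" is at index 0 in the first sentence and index 2 in the second.
plain_a = encode(["dog", "bites", "man"], use_positions=False)
plain_b = encode(["man", "bites", "dog"], use_positions=False)
posed_a = encode(["dog", "bites", "man"], use_positions=True)
posed_b = encode(["man", "bites", "dog"], use_positions=True)

print("without positions:", max_diff(plain_a[0], plain_b[2]))  # zero up to rounding
print("with positions:   ", max_diff(posed_a[0], posed_b[2]))  # clearly nonzero
```

Without the encoding, reordering the words only reorders the output vectors, so the two sentences are indistinguishable as sets. Once each position contributes its own pattern, the vector for "dog" depends on where it sits.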
References
- 11.7. The Transformer Architecture — Dive into Deep Learning
- Attention Is All You Need (arXiv:1706.03762) — Vaswani et al., Google
- Transformer (deep learning architecture) — Wikipedia