The block, repeated — Physea Wiki


A transformer is not one big thing; it is one block stacked many times. Each block has two main parts: a multi-head self-attention layer and a small position-wise feed-forward network. Each part is wrapped in a residual connection (the input is added back to the output) followed by layer normalization; together these keep training stable through a deep stack. The original model stacked six such blocks in its encoder and six in its decoder, where each decoder block adds a second attention layer that reads the encoder's output.^[1]
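The structure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the weights are random rather than trained, biases and the learnable scale and shift of layer normalization are left out, and all names (`transformer_block`, `make_params`, the sizes `d_model`, `d_ff`, `n_heads`) are hypothetical choices made for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Rescale each position's vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def attention_head(q, k, v):
    """Scaled dot-product attention for one head."""
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)

def multi_head_attention(x, p, n_heads):
    """Project x, attend separately within each slice of features, then merge."""
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    d_head = len(q[0]) // n_heads
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention_head([r[cols] for r in q],
                                    [r[cols] for r in k],
                                    [r[cols] for r in v]))
    # Concatenate the heads along the feature axis, then mix them with wo.
    concat = [[val for head in heads for val in head[i]] for i in range(len(x))]
    return matmul(concat, p["wo"])

def feed_forward(x, p):
    """Position-wise network: linear, ReLU, linear, applied to every row alike."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, p["w1"])]
    return matmul(hidden, p["w2"])

def add_and_norm(x, sublayer_out):
    """Residual connection (add the input back), then layer normalization."""
    summed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, sublayer_out)]
    return layer_norm(summed)

def transformer_block(x, p, n_heads):
    x = add_and_norm(x, multi_head_attention(x, p, n_heads))
    x = add_and_norm(x, feed_forward(x, p))
    return x

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_params(rng, d_model, d_ff):
    params = {name: random_matrix(rng, d_model, d_model)
              for name in ("wq", "wk", "wv", "wo")}
    params["w1"] = random_matrix(rng, d_model, d_ff)
    params["w2"] = random_matrix(rng, d_ff, d_model)
    return params

rng = random.Random(0)
seq_len, d_model, d_ff, n_heads, n_blocks = 4, 8, 32, 2, 6
stack = [make_params(rng, d_model, d_ff) for _ in range(n_blocks)]  # six blocks, separate weights
out = random_matrix(rng, seq_len, d_model)  # stand-in for embedded input tokens
for params in stack:
    out = transformer_block(out, params, n_heads)
print(len(out), len(out[0]))  # the shape is the same going in and coming out
```

Because every block maps a `seq_len` by `d_model` matrix to another matrix of the same shape, blocks can be stacked to any depth without extra glue code.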

A transformer is one block repeated N times: attention, then a feed-forward network, each wrapped in add-and-normalize (a residual skip followed by layer normalization).

Two more pieces make it complete. Multi-head attention runs several attention computations in parallel, so different heads can track different relationships at once, such as syntax in one head and which noun a pronoun refers to in another.^[2] And because attention treats the input as a set with no inherent order, a positional encoding is added to each word's embedding so the model knows that “dog bites man” differs from “man bites dog.”^[3]
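The order-blindness of attention can be checked with a small experiment. The sketch below rests on several assumptions that are not from this page: the one-hot word vectors in `EMBED` are made up, the attention uses identity projections (queries, keys and values are all the input itself), and the positional encoding is the sinusoidal scheme from the original paper, which is only one of several possible choices.

```python
import math

# Hypothetical one-hot embeddings, chosen only to keep the example readable.
EMBED = {"dog": [1.0, 0.0, 0.0, 0.0],
         "bites": [0.0, 1.0, 0.0, 0.0],
         "man": [0.0, 0.0, 1.0, 0.0]}

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    row = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return row

def embed(words, use_positions):
    rows = []
    for pos, word in enumerate(words):
        vec = list(EMBED[word])
        if use_positions:
            pe = positional_encoding(pos, len(vec))
            vec = [a + b for a, b in zip(vec, pe)]
        rows.append(vec)
    return rows

def attend(x):
    """Self-attention with identity projections (no learned weights)."""
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, x))
                    for j in range(len(q))])
    return out

def output_for(word, words, use_positions):
    """Attention output at the position where `word` occurs."""
    return attend(embed(words, use_positions))[words.index(word)]

def max_gap(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without positions, "dog" gets the same output in both sentences.
gap_without = max_gap(output_for("dog", a, False), output_for("dog", b, False))
# With positions added, the two outputs for "dog" differ.
gap_with = max_gap(output_for("dog", a, True), output_for("dog", b, True))
print(gap_without, gap_with)
```

The first gap is zero up to floating-point rounding, because reordering the words only reorders the terms in each weighted sum. The second gap is clearly nonzero, which is the signal later layers can use to tell the two sentences apart.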

References

[1] 11.7. The Transformer Architecture, Dive into Deep Learning
[2] Attention Is All You Need (arXiv:1706.03762), Vaswani et al., Google
[3] Transformer (deep learning architecture), Wikipedia

