
The transformer

What is a transformer block, and what is inside it?

A transformer is not one big thing; it is one block stacked many times. Each block has a multi-head self-attention layer and a small feed-forward network, each wrapped in a residual connection and layer normalization. A positional encoding, added once to the input, tells the model the word order.

Last updated 2026-06-15 · Physea Labs

Every block in the stack has two main parts: a multi-head self-attention layer and a small position-wise feed-forward network. Each part is wrapped in a residual connection (the part's input is added back to its output) followed by layer normalization; together these keep training stable through a deep stack. The original model stacked six such blocks in its encoder; its decoder stacked six more, each with an additional attention layer over the encoder's output.[1]

[Figure: Input + positional encoding → one block, repeated ×N (multi-head self-attention → add & normalize → feed-forward network → add & normalize, with a residual skip around each part) → to the next block → output]
A transformer is one block repeated N times: attention, then a feed-forward network, each wrapped in add-and-normalize, with a residual skip around each.
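
The block structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the parameter dictionary, the sizes (16 features, 4 heads, 6 blocks), and the random weights are all assumptions made for the example, and the layer normalization omits the learnable gain and bias that trained models usually include.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    # (Simplification: no learnable gain or bias.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, p, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                 # (heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["wo"]

def feed_forward(x, p):
    # Position-wise: the same two-layer network is applied to every position.
    hidden = np.maximum(0.0, x @ p["w1"] + p["b1"])  # ReLU
    return hidden @ p["w2"] + p["b2"]

def transformer_block(x, p, num_heads):
    # Each part: residual add, then layer normalization.
    x = layer_norm(x + multi_head_self_attention(x, p, num_heads))
    x = layer_norm(x + feed_forward(x, p))
    return x

def init_block_params(rng, d_model, d_ff, scale=0.1):
    # Hypothetical random initialization, for illustration only.
    shapes = {
        "wq": (d_model, d_model), "wk": (d_model, d_model),
        "wv": (d_model, d_model), "wo": (d_model, d_model),
        "w1": (d_model, d_ff), "w2": (d_ff, d_model),
    }
    params = {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}
    params["b1"] = np.zeros(d_ff)
    params["b2"] = np.zeros(d_model)
    return params

def run_stack(x, blocks, num_heads):
    # The same block structure, repeated once per parameter set.
    for p in blocks:
        x = transformer_block(x, p, num_heads)
    return x

rng = np.random.default_rng(0)
blocks = [init_block_params(rng, d_model=16, d_ff=64) for _ in range(6)]
x = rng.normal(size=(5, 16))  # 5 tokens, assumed already embedded with positions added
y = run_stack(x, blocks, num_heads=4)
print(y.shape)  # (5, 16)
```

Because every block maps a (tokens × features) array to another array of the same shape, blocks can be stacked to any depth without changing the surrounding code.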

Two more pieces make it complete. Multi-head attention runs several attention computations in parallel, so different heads can track different relationships at once, such as syntax in one head and which noun a pronoun refers to in another.[2] And because attention treats the input as a set with no inherent order, a positional encoding is added to each word's embedding so the model knows that “dog bites man” differs from “man bites dog.”[3]
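
The ordering point can be checked with a small NumPy sketch. It assumes the fixed sinusoidal encoding from the original paper (learned position embeddings are a common alternative), a single attention head with identity projections to keep the focus on order, and made-up random word vectors; the helper names are hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Even feature indices get sin, odd indices get cos (assumes d_model is even).
    pos = np.arange(seq_len)[:, None]
    even_idx = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, even_idx / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(x):
    # Single head, identity projections: queries, keys, and values are all x.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
# Made-up 8-dimensional word vectors, for illustration only.
embed = {w: rng.normal(size=8) for w in ["dog", "bites", "man"]}

def encode(words, use_positions):
    x = np.stack([embed[w] for w in words])
    if use_positions:
        x = x + sinusoidal_positions(len(words), x.shape[-1])
    return x

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Compare what attention computes for the middle word "bites" in each sentence.
no_pos_a = self_attention(encode(a, use_positions=False))[1]
no_pos_b = self_attention(encode(b, use_positions=False))[1]
pos_a = self_attention(encode(a, use_positions=True))[1]
pos_b = self_attention(encode(b, use_positions=True))[1]

same_without_positions = np.allclose(no_pos_a, no_pos_b)
same_with_positions = np.allclose(pos_a, pos_b)
print(same_without_positions, same_with_positions)
```

Without positions, "bites" attends over the same unordered set of vectors in both sentences, so the two results match; once positions are added, "dog" at position 0 and "dog" at position 2 are different vectors, and the results diverge.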

References

  1. 11.7. The Transformer Architecture — Dive into Deep Learning
  2. Attention Is All You Need (arXiv:1706.03762) — Vaswani et al., Google
  3. Transformer (deep learning architecture) — Wikipedia