PhyseaWiki How AI actually works Papers physea.ai →

Attention

What is multi-head attention?

Rather than attend once, the model runs several attention computations in parallel. Each head has its own learned projections, so heads can specialize (grammar in one, pronoun reference in another) and combine into a richer representation.

Last updated 2026-06-15 · Physea Labs

Rather than attend once, the model runs several attention computations in parallel. “Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions,” the paper notes, adding that “with a single attention head, averaging inhibits this.”[1] The original model used eight heads. Different heads can specialize, one tracking grammar, another tracking which noun a pronoun points to, and their outputs are combined into a richer representation than any single pattern could give.

References

  1. Attention Is All You Need (arXiv:1706.03762) — Vaswani et al., Google