Multi-Head Attention — Physea Wiki

Rather than attend once, the model runs several attention computations in parallel. Each head has its own learned projections, so heads can specialize (grammar in one, pronoun reference in another) and combine into a richer representation.

The paper that introduced the design motivates it this way: “Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions,” adding that “with a single attention head, averaging inhibits this.”^[1] The original model used eight heads. Heads can specialize, for example one tracking grammar and another tracking which noun a pronoun points to. Their outputs are concatenated and passed through a final learned projection, giving a richer representation than any single attention pattern could.
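The mechanics can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the helper names (`matmul`, `attention`, `multi_head_attention`, `random_matrix`), the toy sizes (4 tokens, model width 16), and the random weights standing in for learned projections are all assumptions made for this example. Only the head count of eight comes from the text above.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(rows) for _ in range(cols)]
            for _ in range(rows)]


def multi_head_attention(x, heads, w_o):
    """Run every head on x, concatenate the results, and mix them with w_o.

    heads is a list of (W_q, W_k, W_v) triples, one per head.
    """
    outputs = []
    for w_q, w_k, w_v in heads:
        head_out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        outputs.append(head_out)
    # Join the per-head vectors for each token along the feature dimension.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)


rng = random.Random(0)
seq_len, d_model, num_heads = 4, 16, 8   # toy sizes; eight heads as in the text
d_head = d_model // num_heads            # each head works in a smaller subspace

x = random_matrix(seq_len, d_model, rng)  # stand-in token representations
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(num_heads * d_head, d_model, rng)

y = multi_head_attention(x, heads, w_o)   # one d_model-wide vector per token
```

Because each head has its own query, key, and value projections, each computes a different set of attention weights over the same tokens. Splitting the model width across heads (here 16 / 8 = 2 dimensions per head) is one common way to keep the total cost close to that of a single full-width head.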

References

[1] Vaswani et al. (Google), “Attention Is All You Need,” arXiv:1706.03762.
