PhyseaWiki How AI actually works Papers physea.ai →

Attention

Why does attention scale the dot product before softmax?

To score a query against a key, the transformer takes their dot product, divides by the square root of the key dimension, and applies softmax. The division keeps the softmax out of low-gradient regions, so learning stays stable.

Last updated 2026-06-15 · Physea Labs

To score how well a query matches a key, the transformer takes their dot product, divides by the square root of the key’s dimension, and applies a softmax to turn the scores into weights that sum to one. The division is not a detail. The paper explains that “for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients,” so the scaling keeps learning stable.[1]

References

  1. Attention Is All You Need (full text) — ar5iv / arXiv