Scaling and Softmax — Physea Wiki

To score a query against a key, the transformer takes their dot product, divides by the square root of the key dimension, and applies softmax. The division keeps the softmax out of low-gradient regions, so learning stays stable.

To score how well a query matches a key, the transformer takes their dot product, divides by the square root of the key’s dimension, and applies a softmax to turn the scores into weights that sum to one. The division is not a detail. The paper explains that “for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients,” so the scaling keeps learning stable.^[1]

References

Attention Is All You Need (full text) — ar5iv / arXiv

Why does attention scale the dot product before softmax?

References