Attention
Why does attention scale the dot product before softmax?
To score a query against a key, the transformer takes their dot product, divides by the square root of the key dimension, and applies softmax. The division keeps the softmax out of low-gradient regions, so learning stays stable.
To score how well a query matches a key, the transformer takes their dot product, divides by the square root of the key’s dimension, and applies a softmax to turn the scores into weights that sum to one. The division is not a detail. The paper explains that “for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients,” so the scaling keeps learning stable.[1]
References
- Attention Is All You Need (full text) — ar5iv / arXiv