Attention
Where did attention come from?
Attention began in 2014 as a fix for translation: instead of compressing a source sentence into one fixed-length vector, the model could soft-search the source for the parts relevant to each target word. That soft-search is the direct ancestor of modern attention.
Attention did not start out as a whole architecture. Bahdanau, Cho, and Bengio introduced it in 2014 to fix a specific limitation in translation: an RNN encoder had to compress an entire source sentence into one fixed-length vector, however long the sentence was. They “conjecture[d] that the use of a fixed-length vector is a bottleneck” and let the model instead “automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word.”[1] That soft-search is the direct ancestor of modern attention; the explicit query-key-value framing came later, with the transformer.
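As a rough illustration of what soft-search means in practice, here is a minimal sketch of the additive scoring form used in that paper: each source position j gets a score v · tanh(W s + U h_j) against the current decoder state s, a softmax turns the scores into weights, and the context is the weighted average of the encoder vectors. The function names, the toy sizes, and the random parameters are assumptions made for illustration (in a real model W, U, and v are learned); this is not the authors' code.

```python
import math
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(state, annotations, W, U, v):
    """Score every source position against the decoder state, then average.

    state       -- current decoder state, length d_s
    annotations -- one encoder vector per source word, each of length d_h
    W, U, v     -- parameters (learned in a real model, random in this sketch)
    """
    Ws = matvec(W, state)  # computed once, shared across source positions
    scores = []
    for h in annotations:
        Uh = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # one weight per source position, summing to 1
    # The context vector replaces the single fixed-length summary: it is
    # recomputed for every target word from all source positions.
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations))
               for k in range(dim)]
    return weights, context

def demo(seed=0):
    """Run the sketch on random toy data (sizes chosen arbitrarily)."""
    rng = random.Random(seed)
    d_s, d_h, d_a, n_src = 4, 3, 5, 6

    def rand(n):
        return [rng.uniform(-1, 1) for _ in range(n)]

    W = [rand(d_s) for _ in range(d_a)]
    U = [rand(d_h) for _ in range(d_a)]
    v = rand(d_a)
    state = rand(d_s)
    annotations = [rand(d_h) for _ in range(n_src)]
    return additive_attention(state, annotations, W, U, v)
```

Calling `weights, context = demo()` returns six weights, one per toy source position, and a three-dimensional context vector. Because the weights are recomputed from the decoder state at every step, the model can draw on different source words for different target words instead of relying on one fixed summary.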
References
- [1] Neural Machine Translation by Jointly Learning to Align and Translate. Bahdanau, Cho, Bengio. arXiv, 2014.