Why Limits Exist — Physea Wiki

Self-attention has every token look at every other token, so its cost rises with the square of the length. Doubling the context roughly quadruples the work, which is why windows have a ceiling.

The limit comes from how attention works. In a transformer, self-attention has every token compare itself to every other token in the sequence. The original transformer paper gives the cost of a self-attention layer as on the order of n squared times d, where n is the sequence length and d is the representation size.^[1] That n-squared term is the catch: the work grows with the square of the length, not in proportion to it.
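To see where the n-squared-times-d figure comes from, here is a minimal sketch that computes every pairwise score as a plain dot product and counts the multiplications along the way. The function name and the toy input are hypothetical, chosen only for illustration. A real attention layer also applies learned projections, scaling, and a softmax, which are left out here.

```python
def count_score_multiplications(tokens):
    """Compute all pairwise dot-product scores and count the multiplications used."""
    multiplications = 0
    scores = []
    for query in tokens:                  # n tokens acting as queries
        row = []
        for key in tokens:                # each compared against all n tokens
            dot = 0.0
            for q, k in zip(query, key):  # d multiplications per comparison
                dot += q * k
                multiplications += 1
            row.append(dot)
        scores.append(row)
    return scores, multiplications


# Hypothetical toy input: n = 4 tokens, each a vector of size d = 3.
tokens = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
scores, multiplications = count_score_multiplications(tokens)
print(len(scores), len(scores[0]), multiplications)  # 4 4 48
```

The two outer loops each run n times and the inner loop runs d times, so the count is n times n times d: 4 x 4 x 3 = 48 for this input.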

In practical terms, doubling the number of tokens in the window roughly quadruples the attention computation. The memory needed to hold the pairwise comparisons rises the same way when they are stored in full, because there is one score for every pair of tokens. A window twice as long is not twice as costly to run; it is closer to four times. This is why model makers set a ceiling on the window rather than allowing any length, and why pushing windows into the hundreds of thousands or millions of tokens has taken real engineering effort.
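The doubling arithmetic can be checked with a short sketch. It assumes the order-of-magnitude cost n squared times d and a full n-by-n matrix of scores. The function names and the value of d are illustrative assumptions, not figures for any particular model.

```python
def attention_cost(n, d):
    """Order-of-magnitude cost of one self-attention layer: n^2 * d."""
    return n * n * d


def score_matrix_entries(n):
    """Entries in the full n-by-n matrix of pairwise scores."""
    return n * n


d = 512  # assumed representation size, for illustration only
for n in (1_000, 2_000, 4_000):
    # Ratio of the cost at twice the length to the cost at this length.
    ratio = attention_cost(2 * n, d) / attention_cost(n, d)
    print(n, attention_cost(n, d), score_matrix_entries(n), ratio)
```

The ratio is 4.0 at every length, since (2n) squared is 4 times n squared whatever d is.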

The same paper notes that self-attention is faster than older recurrent layers “when the sequence length n is smaller than the representation dimensionality d, which is most often the case.”^[1] The paper lists a recurrent layer at on the order of n times d squared, so the two costs are equal when n equals d. Once sequences grow longer than d, that advantage is gone and the n-squared cost dominates. Reducing this cost is the focus of a large body of follow-up work on more efficient attention.
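The comparison with recurrent layers can be sketched in the same way. It uses the per-layer costs from the paper's comparison table: n squared times d for self-attention and n times d squared for a recurrent layer. The function names and the chosen values of n and d are illustrative assumptions, and constant factors and parallelism are ignored.

```python
def self_attention_cost(n, d):
    """Order-of-magnitude per-layer cost of self-attention: n^2 * d."""
    return n * n * d


def recurrent_cost(n, d):
    """Order-of-magnitude per-layer cost of a recurrent layer: n * d^2."""
    return n * d * d


d = 512  # assumed representation size, for illustration only
for n in (128, 512, 4096):
    # The ratio simplifies to n / d: below 1 favors self-attention.
    ratio = self_attention_cost(n, d) / recurrent_cost(n, d)
    print(n, ratio)
```

With d fixed at 512, the ratio is 0.25 at n = 128, 1.0 at n = 512, and 8.0 at n = 4096, so the crossover sits at n equal to d.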

References

1. Attention Is All You Need (arXiv:1706.03762), Vaswani et al., Google

