Long Context, Lost Middle

Long-context models now hold a million tokens or more, but a bigger window is not used evenly. Models tend to find information at the start and end of the context better than information buried in the middle.

Context windows have grown substantially. Google DeepMind’s Gemini 1.5 report describes models “capable of recalling and reasoning over fine-grained information from millions of tokens of context” and reports “near-perfect recall (>99%) up to at least 10M tokens” on a retrieval test where a single fact is hidden in a very long input.^[1] So under the right test conditions, a model can find a needle in a haystack of millions of tokens.

Fitting inside the window does not guarantee equal use, though. The study “Lost in the Middle” examined how models handle information placed at different spots in a long input and found that “performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts.”^[2] The authors note this held “even for explicitly long-context models.”^[2] A document can be inside the window and still get less attention because of where it sits.
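One way to probe this position sensitivity yourself is to place the same fact at different depths in otherwise identical filler text and compare the model's answers. The sketch below only builds the prompts. The function name `build_haystack`, the filler sentences, and the needle are all hypothetical, and this is not the harness used in either cited paper.

```python
def build_haystack(filler_sentences, needle, depth):
    """Return one long string with `needle` inserted at a fractional depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last,
    and depth=0.5 puts it roughly in the middle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


# Hypothetical filler and needle, purely for illustration.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."

# Same content, three positions: start, middle, end.
prompts = {
    depth: build_haystack(filler, needle, depth)
    for depth in (0.0, 0.5, 1.0)
}
```

Sending each prompt with the same question (here, asking for the secret code) and comparing accuracy by depth gives a rough picture of whether the middle is handled worse than the edges for a given model.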

Anthropic’s documentation makes a related point about quantity: “more context isn’t automatically better,” and “as token count grows, accuracy and recall degrade, a phenomenon known as context rot.”^[3] The practical takeaway is that what you put in the window, and where, matters as much as how much fits. Putting the most important material near the start or end, and trimming what the model does not need, tends to help.
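The placement and trimming advice can be sketched as a small reordering step. This is a minimal illustration, assuming you already have a relevance score for each document from some retriever or ranker. The function name `arrange_for_edges`, the `keep` parameter, and the scores are hypothetical, and whether this ordering helps a particular model is something to measure rather than assume.

```python
def arrange_for_edges(scored_docs, keep=None):
    """Order documents so the most relevant sit at the start and end.

    scored_docs: list of (document, relevance_score) pairs.
    keep: if given, drop everything but the `keep` highest-scoring documents.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    if keep is not None:
        ranked = ranked[:keep]  # trim what the model is least likely to need

    front, back = [], []
    for position, (doc, _score) in enumerate(ranked):
        # Alternate: best goes to the front, second best to the back, and so on.
        (front if position % 2 == 0 else back).append(doc)

    # Reverse the back half so relevance rises again toward the end.
    return front + back[::-1]


# Hypothetical documents and scores, purely for illustration.
docs = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.7), ("e", 0.3)]
ordered = arrange_for_edges(docs)
trimmed = arrange_for_edges(docs, keep=3)
```

With these scores, the highest-scoring document lands first, the second highest lands last, and the lowest-scoring document ends up in the middle, which is where the cited study suggests it matters least.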

References

[1] Google DeepMind. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530.
[2] Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. arXiv:2307.03172.
[3] Anthropic. Context windows (Claude API docs).
