Context & pricing
What is a context window, in practice?
A context window is the model's working memory — every token it can reference at once, including the response it is generating. It is measured in tokens, and more is not automatically better.
The context window is all the text a model can reference when it answers — its working memory for one request. Anthropic’s documentation describes it as “all the text a language model can reference when generating a response, including the response itself.”[1] That last clause matters: the answer the model is writing counts against the same budget as everything you sent in.
It is measured in tokens, the same unit you are billed in. In a back-and-forth conversation, the window fills up turn by turn. Each request carries the whole history so far plus the new message as input, and the model’s reply becomes part of the input for the next turn.[1] A long conversation therefore grows the window steadily, and you pay to re-send that accumulated history every time.
Bigger windows are not automatically better. As the token count climbs, accuracy and recall tend to fall off — a pattern the docs call “context rot.”[1] A model with room for a million tokens still does its best work when the window holds what is relevant and little else.
When a request would run past the limit, the behavior depends on the model. On newer models the request is accepted and generation simply stops if it reaches the ceiling, reporting a “model_context_window_exceeded” stop reason; older models reject the request up front instead.[1] Either way, the fix is the same: send less, or summarize the older parts of the conversation so the window stays focused.
References
- Context windows — Anthropic