Sampling controls — Physea Wiki

Once the model has a probability distribution, it can take the single most likely token (greedy) or sample from the distribution. Temperature, top-k, and top-p are the knobs that control how adventurous that sampling is.

Now the model has a probability distribution over the next token. The simplest choice is to always take the single most likely token. That is greedy decoding. It is repeatable and safe, but on long open-ended writing it tends to get bland and loop on itself. The nucleus-sampling paper found that “using likelihood as a decoding objective leads to text that is bland and strangely repetitive.”^[2]

The alternative is sampling: treat the distribution as odds and draw a token at random, so a 60 percent token wins most of the time but not every time. Three knobs shape that draw.

Temperature changes how sharp or flat the distribution is before sampling.^[1] A low temperature sharpens it, concentrating weight on the front-runners and making output more predictable. A high temperature flattens it, giving long-shot tokens a real chance and making output more varied. At temperature 0 the process is effectively greedy.^[1] Ranges differ by provider: Anthropic’s API, for example, defaults temperature to 1.0 and accepts 0.0 to 1.0, while some other stacks allow values above 1.^[3]

Top-k keeps only the k highest-probability candidates and discards the rest before sampling.^[1] With k of 1 this is just greedy decoding; a larger k lets the model consider several strong options while still ignoring the long tail of very unlikely tokens.^[1]

Top-p, also called nucleus sampling, keeps the smallest set of tokens whose probabilities add up to a threshold p, such as 0.9, and samples only from that set.^[1] Its strength is that it adapts to the shape of the distribution: if one or two tokens dominate it keeps a small set, and if the distribution is flat it keeps more.^[1] The paper that introduced it describes “sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution.”^[2]

A common practice is to adjust temperature or top-p, but not both at once, since they overlap in what they control.^[3]

Where you set these controls

Anthropic Messages API ↗
Exposes temperature, top_p, top_k, and stop_sequences as request parameters.

References

How do temperature, top-k, and top-p sampling differ? — Sebastian Raschka
The Curious Case of Neural Text Degeneration — Holtzman et al., ICLR 2020
Messages API reference — Anthropic

What do temperature, top-p, and top-k actually do?

Where you set these controls

References