Logits to probabilities

At each step the model produces one raw score, called a logit, for every token in its vocabulary. A function called softmax converts those scores into probabilities that sum to 1, which is the distribution the model then picks from.

Before the model can choose a next token, it has to score the options. Its final layer outputs one raw number for every token in its vocabulary, which may be tens of thousands of tokens. These raw scores are called logits. A higher logit means the model considers that token a better fit, but logits are not probabilities yet. They can be any size, positive or negative, and they do not sum to any fixed total.

To make them usable, the model runs the logits through a function called softmax. Softmax exponentiates each logit and divides it by the sum of all the exponentials, which turns the whole list of scores into a list of probabilities: every value lands between 0 and 1, and the values across all tokens add up to 1.^[2] The transformer paper states this directly, using “the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.”^[1] The model now has a proper probability distribution over what could come next.
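As a rough illustration of the conversion, here is a minimal softmax sketch in plain Python. The four-token vocabulary and the logit values are made up for the example, and subtracting the largest logit before exponentiating is a common numerical-stability habit rather than anything the sources above prescribe.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the largest logit keeps exp() from overflowing
    # and does not change the resulting probabilities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits (illustrative values only).
vocab = ["cat", "dog", "car", "the"]
logits = [2.0, 1.0, 0.1, -1.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

The token with the highest logit ends up with the highest probability, and negative logits still map to small positive probabilities instead of being dropped.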

So at each step there is a distribution, not a single answer. The most likely token might sit at 60 percent while dozens of others split the rest. What the model does with that distribution, whether it takes the top choice or rolls the dice, is the next decision, and that is where the controls you can tune come in.
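The two options can be sketched in a few lines of Python. The vocabulary and probabilities below are invented for illustration, and the fixed seed is only there to make the random draw repeatable. This is a toy, not how any particular model implements its decoding.

```python
import random

# Hypothetical next-token distribution (illustrative values only).
vocab = ["cat", "dog", "car", "the"]
probs = [0.60, 0.25, 0.10, 0.05]

# Option 1: take the top choice (the single most likely token).
greedy = vocab[probs.index(max(probs))]

# Option 2: roll the dice, drawing a token in proportion to its probability.
rng = random.Random(0)  # seeded so the example is repeatable
sampled = rng.choices(vocab, weights=probs, k=1)[0]

print("greedy :", greedy)
print("sampled:", sampled)
```

The greedy pick is the same on every run, while the sampled pick will vary across draws, landing on "cat" about 60 percent of the time with these numbers.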

References

[1] Attention Is All You Need. Vaswani et al., NeurIPS 2017.
[2] Small Language Models: autoregressive language modeling. Clemson Research Computing and Data.

