One token at a time — Physea Wiki

A language model writes by predicting one token, appending it, then predicting again from the longer text. This step-by-step loop, where each new token depends on everything before it, is called autoregressive generation.

A language model does not write a whole sentence in one shot. It produces text one token at a time, where a token is a chunk of text such as a word or a piece of a word. The model looks at everything written so far, estimates how likely each possible next token is, picks one, adds it to the text, and then runs again on the slightly longer text. It repeats this loop until it produces a special end-of-text token or reaches a length limit.
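The loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_next_token_probs` is a hypothetical hand-written stand-in for a real model, `EOS` is a made-up end-of-text marker, and the sketch always picks the highest-probability token (greedy selection), which is just one of several ways a real system might choose.

```python
from typing import Dict, List

EOS = "<eos>"  # hypothetical end-of-text marker, chosen for this sketch


def toy_next_token_probs(context: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a model: maps the full context to
    probabilities for the next token. The numbers are invented."""
    last = context[-1] if context else ""
    if last == "the":
        # The answer depends on the whole context, not only the last token:
        # a second "the" is followed by "mat" once "cat" has appeared.
        if "cat" in context:
            return {"mat": 0.9, "cat": 0.1}
        return {"cat": 0.8, "mat": 0.2}
    table = {
        "cat": {"sat": 0.9, EOS: 0.1},
        "sat": {"on": 0.95, EOS: 0.05},
        "on": {"the": 1.0},
        "mat": {EOS: 1.0},
    }
    # Unknown contexts simply end the text in this toy model.
    return table.get(last, {EOS: 1.0})


def generate(prompt: List[str], max_new_tokens: int = 20) -> List[str]:
    """Autoregressive loop: predict, append, feed the longer text back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)     # look at everything so far
        next_token = max(probs, key=probs.get)   # greedy: most likely token
        if next_token == EOS:                    # the model signals a stop
            break
        tokens.append(next_token)                # the text grows by one token
    return tokens


print(" ".join(generate(["the"])))  # prints "the cat sat on the mat"
```

Note that the two occurrences of "the" lead to different next tokens, because the prediction is made from the whole text so far rather than from the last token alone.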

This is called autoregressive generation, because each prediction feeds back in as input for the next one. The original transformer paper describes it plainly: “At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.”^[1] In probability terms, the chance of each token depends only on the tokens before it.^[2]
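The probability statement is commonly written with the chain rule of probability. The notation here is an assumption made for illustration: $x_1, \dots, x_T$ stand for the tokens of a text of length $T$, and $P$ is the probability the model assigns.

```latex
P(x_1, x_2, \dots, x_T) \;=\; \prod_{t=1}^{T} P\bigl(x_t \mid x_1, \dots, x_{t-1}\bigr)
```

Each factor on the right is one step of the loop: the probability of token $x_t$ given all the tokens that came before it. For $t = 1$ there are no earlier tokens, so the first factor is simply $P(x_1)$.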

The important thing to picture is the loop. The model has no separate “planning” stage and no draft it revises. Every word you read was chosen the same way: one prediction at a time, each one informed by all the text that came before.

In short: generation is a repeating loop. Predict the next token, append it, feed the result back in, and go again.

References

1. Vaswani et al., "Attention Is All You Need," NeurIPS 2017.
2. Clemson Research Computing and Data, "Small Language Models: autoregressive language modeling."

