Pretraining — Physea Wiki

Pretraining is the first and largest training stage. The model learns to predict the next word across enormous amounts of text, which it can do without any human labels because the text supplies its own answers.

During pretraining, the model reads enormous amounts of text and learns one deceptively simple task: predict the next word (strictly, the next token, which may be a whole word or a fragment of one). Given the words so far, it guesses what comes next, checks the guess against the real next word, and adjusts. As one explainer puts it, “during training, the model input is the sequence up to position t, and the target is the token at position t+1.”^[1]
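The input/target relationship in that quote can be sketched in a few lines of Python. This is only an illustration: the helper name next_token_pairs is hypothetical, and splitting on whitespace stands in for a real tokenizer, which would usually produce subword pieces.

```python
def next_token_pairs(tokens):
    """Pair each context tokens[0..t] with its target tokens[t+1]."""
    return [(tokens[: t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]


# Assumption: whitespace splitting is a stand-in for a real tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    # Each line shows what the model sees and the token it must predict.
    print(" ".join(context), "->", target)
```

A sequence of n tokens yields n - 1 such pairs, so a single sentence already supplies several training examples.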

What makes this work at scale is that no one has to label the data. The right answer is already sitting in the text. This is “why LLM pretraining is often called self-supervised: the targets come directly from the text itself rather than from manual annotations.”^[1] A model can learn from any plain text it can find, which is how training sets grow to hundreds of billions of words.
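To make the "no labels needed" point concrete, here is a minimal sketch that learns next-word predictions from raw text alone. It is a simple word-pair counting model, not the neural network and gradient-based training that real language models use; the function names and the toy corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count which word follows which; the only 'labels' are the next words in the text."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequently seen follower of `word`, or None if the word is unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


# Toy corpus (made up); nobody annotated it.
corpus = "the cat sat on the mat . the cat ate . the dog sat on the rug ."
model = train_bigram_counts(corpus)

print(predict_next(model, "sat"))
print(predict_next(model, "on"))
```

Every adjacent pair of words acted as one training example, which is why adding more plain text directly adds more training signal.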

To get good at predicting the next word across that much varied text, the model has to absorb a great deal: grammar, facts, writing styles, the shape of an argument. The result is a base model that has broad knowledge but has not yet been shaped toward being a helpful assistant. Pretraining is also where the cost is concentrated: GPT-3, with “175 billion parameters,”^[2] was trained once and then used many times.

References

[1] Sebastian Raschka. How does next-token prediction train a large language model?
[2] Brown et al., OpenAI. Language Models are Few-Shot Learners (arXiv:2005.14165).
