Scaling laws — Physea Wiki

Researchers found that a model's error drops in a smooth, predictable way as you add parameters, data, and compute. That predictability is what justified spending more on ever-larger models.

A scaling law is an empirical relationship between the resources put into a model and how well it performs. In 2020 a team at OpenAI reported that a language model’s loss (its average cross-entropy error at predicting the next token) falls as a power law in each of three quantities: the number of parameters, the size of the training dataset, and the amount of compute spent on training, as long as the other two are not the bottleneck. Some of these trends held smoothly across more than seven orders of magnitude.^[1]
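Because a power law is a straight line on log-log axes, such a relationship can be estimated from a handful of measurements with an ordinary least-squares fit. The following is a minimal sketch, not the procedure used in [1]: the function name `fit_power_law` is hypothetical, and the data points are synthetic values made up for illustration.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size ** (-alpha) by least squares in log-log space.

    Returns the tuple (a, alpha).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the line log(loss) = intercept + slope * log(size).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data (an assumption for this example, not measured results):
# points generated exactly from loss = 10 * size ** -0.1.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * s ** -0.1 for s in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On real measurements the points would scatter around the line, and the fit would only be trustworthy over the range of sizes actually observed.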

The practical message was that improvement was, to a degree, a matter of spending more. Because the curve was known in advance, you could estimate how much the loss would drop from scaling up the right inputs before you trained anything. A power law also means that each doubling buys the same fractional reduction in loss, so the gains are steady but shrink in absolute terms. That made it reasonable to pour money and hardware into much larger models, because the payoff was no longer a guess.
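The kind of estimate described above can be sketched in a few lines. This assumes a loss of the form a * size ** (-alpha). The function names are hypothetical, and the exponent 0.076 is an illustrative value assumed for the example, not a figure taken from the text above.

```python
def predicted_loss(size, a, alpha):
    """Loss predicted by the assumed power law a * size ** (-alpha)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return a * size ** (-alpha)


def relative_drop(factor, alpha):
    """Fractional loss reduction from multiplying size by `factor`.

    Under a power law this depends only on the factor and the exponent,
    not on the starting size.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 - factor ** (-alpha)


ALPHA = 0.076  # illustrative exponent (assumption)

drop_per_doubling = relative_drop(2.0, ALPHA)
print(f"Each doubling cuts the loss by about {drop_per_doubling:.1%}")
```

With this assumed exponent, a doubling lowers the loss by roughly 5 percent, whether the model starts at a million parameters or a billion.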

The same 2020 paper added a striking suggestion: larger models are more sample-efficient, so the best use of a fixed compute budget was to train a very large model on a relatively modest amount of data and stop well before convergence.^[1]

The correction. In 2022 a DeepMind team revisited the question of how to spend a fixed compute budget and found that the big models of the day were actually undertrained. As the budget grows, model size and the number of training tokens should grow together, roughly doubling the data each time you double the parameters. Their 70-billion-parameter model, Chinchilla, beat much larger models trained on less data, including DeepMind's own 280-billion-parameter Gopher.^[2]
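The "grow together" rule can be illustrated with a small calculation. This is a sketch under two assumptions that do not appear in the text above: training compute is approximated as C ≈ 6 × N × D FLOPs for N parameters and D tokens, and the data-to-parameter ratio is held at about 20 tokens per parameter, a commonly quoted rule of thumb. The function name `compute_optimal_split` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumed rule-of-thumb ratio D / N


def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a compute budget into (parameters, tokens) at a fixed ratio.

    Solves C = 6 * N * D with D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


base_params, base_tokens = compute_optimal_split(1e21)
big_params, big_tokens = compute_optimal_split(4e21)

print(f"1e21 FLOPs: {base_params:.3g} parameters, {base_tokens:.3g} tokens")
print(f"4e21 FLOPs: {big_params:.3g} parameters, {big_tokens:.3g} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is the equal-scaling behaviour described above. With the same assumed ratio, a 70-billion-parameter model would be paired with about 1.4 trillion tokens.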

References

[1] Scaling Laws for Neural Language Models. arXiv, 2020 (Kaplan et al., OpenAI)
[2] Training Compute-Optimal Large Language Models. arXiv, 2022 (Hoffmann et al., DeepMind)
