The LLM & agent era
What are scaling laws, and why did bigger models keep getting better?
Researchers found that a model's error drops in a smooth, predictable way as you add parameters, data, and compute. That predictability is what justified spending more on ever-larger models.
A scaling law is a measured relationship between the resources that go into a model and how well it performs. In 2020 a team at OpenAI reported that a language model’s loss, its average error at predicting the next token, falls as a power law in each of three quantities, as long as the other two are not the bottleneck: the number of parameters, the size of the training dataset, and the amount of compute spent on training. The trends were smooth, and some held across more than seven orders of magnitude.[1]
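To make "falls as a power law" concrete, here is a minimal sketch of how such a relationship could be measured. A power law is a straight line on log-log axes, so an ordinary least-squares line through the logged points recovers the exponent. The function name `fit_power_law`, the constants, and the data points are all hypothetical: the data is synthetic, generated from a known curve, and is not taken from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space.

    Returns (a, alpha). Hypothetical helper, for illustration only.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the best-fit line through (log size, log loss).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic "measurements": made-up constants, not values from the paper.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
sizes = [10 ** k for k in range(5, 10)]  # model sizes from 1e5 to 1e9
losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"fitted a = {a:.3f}, alpha = {alpha:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the constants used to generate them. Real measurements are noisy, and the fitted line only describes the range of sizes that was actually tested.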
The practical message was that getting better was, to a degree, a matter of spending more. If you doubled the right inputs, you could estimate how much the error would drop before you trained anything. That predictability is what made it reasonable to pour money and hardware into much larger models, because the payoff was no longer a guess.
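The "estimate before you train" idea can be sketched in a few lines. Under a power law, multiplying an input by some factor multiplies the loss by that factor raised to the negative exponent, whatever the starting size. The exponent `ALPHA` below is an assumed, illustrative value of the same order as those reported for language models, not a figure taken from this text, and `loss_ratio` is a hypothetical helper.

```python
def loss_ratio(scale_factor, alpha):
    """Ratio new_loss / old_loss after scaling one input by scale_factor,
    assuming loss is proportional to input**(-alpha)."""
    return scale_factor ** (-alpha)

ALPHA = 0.076  # assumed exponent, for illustration only

for factor in (2, 10, 100):
    ratio = loss_ratio(factor, ALPHA)
    drop_pct = 100 * (1 - ratio)
    print(f"{factor:>4}x the input -> loss falls by about {drop_pct:.1f}%")
```

With an exponent this small, each doubling buys only a few percent, which is why the gains came from scaling inputs by orders of magnitude rather than by small steps.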
The same 2020 paper added a striking suggestion: larger models are more sample-efficient, so the best use of a fixed compute budget was to train a very large model on a relatively modest amount of data and stop well before it converged.[1]
References
[1] Kaplan et al. (OpenAI), "Scaling Laws for Neural Language Models", arXiv.
[2] Hoffmann et al. (DeepMind), "Training Compute-Optimal Large Language Models", arXiv.