Size and capability — Physea Wiki

More parameters usually buy more capability, but parameter count is not the only lever. The Chinchilla study found that a smaller, better-trained model beat one four times its size, which showed that model size and training data should grow together.

As a rule of thumb, more parameters give a model more room to capture patterns, so larger models tend to do better on hard tasks. But parameter count is not the whole story, and treating it as a single quality score leads people astray.

The clearest evidence comes from a 2022 DeepMind study often called Chinchilla, after the model it introduced. It compared the 70B-parameter Chinchilla against an earlier 280B-parameter model, Gopher, trained with the same compute budget. The smaller model, trained on four times as much data, won, including a notable jump on MMLU, a broad knowledge benchmark.^[1] The lesson was that many large models had been undertrained: they had plenty of parameters but had not seen enough text to use them well.

The takeaway is balance. The study’s guiding rule is that model size and the amount of training data should grow together, roughly doubling the data every time you double the size.^[1] So when you compare two models, the parameter count is one input among several. How much data the model saw, how that data was curated, and how the model was tuned afterward all shape what it can actually do.
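
The balance rule can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that do not come from this article: the common approximation that training compute is about 6 × parameters × tokens, and a ratio of roughly 20 training tokens per parameter, which is often quoted as a summary of the Chinchilla result. The function names are hypothetical.

```python
import math

# Assumption (not from this article): training FLOPs ~= 6 * params * tokens.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption (not from this article): about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a model of `params` parameters
    trained on `tokens` tokens, under the 6 * N * D approximation."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def balanced_split(compute_flops: float,
                   tokens_per_param: float = TOKENS_PER_PARAM) -> tuple[float, float]:
    """Split a compute budget into (params, tokens) so that
    tokens = tokens_per_param * params and the budget is fully spent."""
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # Solve compute = 6 * params * (tokens_per_param * params) for params.
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    base_params, base_tokens = balanced_split(1e23)
    big_params, big_tokens = balanced_split(4e23)  # four times the compute
    # Under a fixed ratio, size and data grow by the same factor.
    print(f"size grew {big_params / base_params:.2f}x, "
          f"data grew {big_tokens / base_tokens:.2f}x")
```

Under these assumptions, a model that is double the size is paired with double the data, and the compute bill quadruples. The same approximation also shows why the comparison above was fair: at a fixed budget, a model with a quarter of the parameters can be trained on four times as many tokens.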

References

[1] Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), DeepMind, 2022.
