Weights vs activations

Weights are the model's permanent, learned numbers and they stay the same on every run. Activations are the throwaway numbers computed for the specific text you give it, and that working memory grows with how much text the model is handling.

It helps to separate two kinds of numbers a model uses. Weights are the parameters covered elsewhere in this topic: the fixed values learned during training. They are the same on every run, no matter what you ask. Activations are the temporary numbers the model computes as it reads a specific piece of text. They are the model’s working scratchpad, and they are thrown away once the answer is done.
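To make the distinction concrete, here is a minimal toy sketch in Python. It is not a real language model: the single layer, the name WEIGHTS, and every number in it are made up for illustration. It only shows that the weights are the same object on every run, while the number of activation values follows the length of the input.

```python
# Toy illustration only: one linear layer, not a real language model.
# WEIGHTS stands in for learned parameters; the values are invented.
WEIGHTS = [
    [0.5, -1.0, 0.25],
    [1.5, 0.0, -0.5],
]  # 2 outputs x 3 inputs, identical on every run


def run_layer(token_vectors):
    """Compute one activation vector per input token vector."""
    activations = []
    for vec in token_vectors:
        out = [sum(w * x for w, x in zip(row, vec)) for row in WEIGHTS]
        activations.append(out)
    return activations


def count_values(rows):
    """Count how many numbers are stored in a list of rows."""
    return sum(len(row) for row in rows)


# Two hypothetical inputs: one "token" versus four "tokens".
short_input = [[1.0, 0.0, 2.0]]
long_input = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [2.0, 2.0, 2.0],
    [1.0, 1.0, 1.0],
]

short_acts = run_layer(short_input)
long_acts = run_layer(long_input)

print("weight values:", count_values(WEIGHTS))            # same for both runs
print("activations (short):", count_values(short_acts))   # scales with input
print("activations (long):", count_values(long_acts))
```

The weight count is 6 for both runs, while the activation count goes from 2 to 8 as the input grows from one vector to four.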

Both take up memory, but they behave differently. A practical guide from Google DeepMind notes that the model stores “one copy of our parameters,” a fixed cost, while the cache of activations for past tokens (commonly called the key-value, or KV, cache) grows with the length of the text being processed.^[1] For a short prompt the weights dominate, but on very long inputs this growing working memory can rival or exceed the weights themselves. This is one reason a long conversation can slow down or run out of memory even when the model itself fits comfortably.
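As a rough sketch of this trade-off, the Python below compares the fixed weight cost with the growing cache for a hypothetical model. The configuration (7 billion parameters, 32 layers, 32 key-value heads of dimension 128, 2 bytes per value) is an assumption chosen for illustration and is not taken from the guide. The cache formula is the usual back-of-the-envelope estimate for a standard attention layer that stores one key and one value vector per head, per layer, for each past token; real systems may differ.

```python
def weight_bytes(n_params, bytes_per_value=2):
    """Fixed cost: one stored copy of every parameter."""
    return n_params * bytes_per_value


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Growing cost: cached keys and values for every past token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return per_token * seq_len * batch_size


# Hypothetical model configuration (assumed for illustration only).
N_PARAMS = 7_000_000_000
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128

fixed = weight_bytes(N_PARAMS)
per_token = kv_cache_bytes(1, N_LAYERS, N_KV_HEADS, HEAD_DIM)
break_even_tokens = fixed // per_token  # length at which the cache matches the weights

print(f"weights: {fixed / 1e9:.1f} GB (same for every prompt)")
for seq_len in (1_000, 10_000, 100_000):
    cache = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"{seq_len:>7} tokens -> cache {cache / 1e9:.2f} GB")
print("cache matches weights near", break_even_tokens, "tokens")
```

With these assumed numbers the cache costs about 0.5 MiB per token, so it overtakes the 14 GB of weights at roughly 26,700 tokens. Serving several conversations at once multiplies the cache through batch_size but leaves the weight cost unchanged.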

The split also explains why parameter count is not the whole story for capability. More weights give a model more capacity, but a model can be too big for the amount of data it was trained on. The Chinchilla study found that a 70B-parameter model trained on more data beat a 280B model trained on less, with the rule of thumb that “if one doubles the model size, one must also have twice the number of training tokens.”^[2] A bigger number on the box is not automatically better.
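The arithmetic behind that rule of thumb can be sketched in a few lines of Python. Two things here are assumptions that do not come from the text above: the token counts (about 1.4 trillion for the 70B model and about 300 billion for the 280B model, as commonly reported for those two models) and the widely used approximation that training costs about 6 floating-point operations per parameter per token. The function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


def tokens_for_scaled_model(base_params, base_tokens, new_params):
    """Apply the proportional rule: scale tokens by the same factor as parameters."""
    return base_tokens * new_params // base_params


# Assumed figures, as commonly reported for the two models (not from the text above).
SMALL_PARAMS, SMALL_TOKENS = 70_000_000_000, 1_400_000_000_000
LARGE_PARAMS, LARGE_TOKENS = 280_000_000_000, 300_000_000_000

small_flops = training_flops(SMALL_PARAMS, SMALL_TOKENS)
large_flops = training_flops(LARGE_PARAMS, LARGE_TOKENS)

small_tokens_per_param = SMALL_TOKENS / SMALL_PARAMS
large_tokens_per_param = LARGE_TOKENS / LARGE_PARAMS

# Doubling the 70B model would call for doubling its training tokens.
doubled_tokens = tokens_for_scaled_model(SMALL_PARAMS, SMALL_TOKENS, 2 * SMALL_PARAMS)

print(f"70B model:  {small_flops:.2e} FLOPs, {small_tokens_per_param:.1f} tokens per parameter")
print(f"280B model: {large_flops:.2e} FLOPs, {large_tokens_per_param:.1f} tokens per parameter")
print(f"tokens for a 140B model under the rule: {doubled_tokens:.2e}")
```

Under these assumed figures the two training runs cost a similar amount of compute (about 5.9e23 versus 5.0e23 FLOPs), yet the smaller model saw about 20 tokens per parameter against roughly 1 for the larger one.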

In short: weights are fixed and learned; activations are temporary and depend on your input. Parameter count sets capacity, but the right amount of training data matters just as much.

References

[1] How To Scale Your Model: Inference. Austin et al., Google DeepMind (2025).
[2] Chinchilla (language model). Wikipedia.
