Size and memory — Physea Wiki

Memory needed is roughly parameter count times bytes per parameter: 2 bytes each in half precision, so a 70B model needs about 140GB. Quantization shrinks each parameter to cut that down.

The parameter count also tells you, fairly directly, how much memory a model needs to run. The estimate is simple: multiply the number of parameters by the number of bytes used to store each one. In full precision (32-bit) each parameter takes 4 bytes; in the half-precision (16-bit) formats used for most models today, each takes 2 bytes.^[1]

So a 70B model in half precision needs roughly 70 billion times 2 bytes, about 140GB, just to hold the parameters. That is far more than the memory on a single consumer graphics card, which is why large models are often split across several cards. The same arithmetic explains why a 176-billion-parameter model needs about 352GB in half precision.^[1]
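The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a tool from any library: the function name `weight_memory_gb` and the constants are made up for this example, gigabytes are taken as decimal (10^9 bytes) to match the figures in the text, and the result covers the parameters only, not the extra memory a model uses while running.

```python
# Rough estimate of the memory needed to hold a model's parameters.
# Assumptions (for illustration only): decimal gigabytes (1 GB = 1e9 bytes),
# and weights only, with no allowance for activations or other runtime overhead.

FP32_BYTES = 4  # full precision: 4 bytes per parameter
FP16_BYTES = 2  # half precision: 2 bytes per parameter


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return parameter count times bytes per parameter, in decimal GB."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in [("70B", 70e9), ("176B", 176e9)]:
        half = weight_memory_gb(params, FP16_BYTES)
        full = weight_memory_gb(params, FP32_BYTES)
        print(f"{name}: about {half:.0f} GB in half precision, "
              f"{full:.0f} GB in full precision")
```

Running it reproduces the two half-precision figures quoted above (about 140GB and 352GB) and shows that full precision would double them.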

The common way to fit a model into less memory is quantization: storing each parameter with fewer bits. Moving from 2 bytes to 1 byte (8-bit) halves the size, and 4-bit storage uses half a byte per parameter, which brings a 70B model down toward 35GB.^[1] The trade-off is a small loss of precision in each number, which usually costs a little accuracy. This is also why a Mixture-of-Experts model is no lighter to load: memory follows the total parameter count, even when only a few experts compute per token.
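The effect of bit width on size, and the Mixture-of-Experts point, can be sketched the same way. The helper `quantized_size_gb` and the Mixture-of-Experts parameter counts below are hypothetical values chosen for illustration, not figures for any real model. Real quantized formats also store some extra metadata, so actual files tend to be somewhat larger than this estimate.

```python
# Approximate weight size at different bit widths, in decimal GB (1 GB = 1e9 bytes).
# Assumption: every parameter is stored at the same bit width, with no
# allowance for the extra metadata that real quantized formats carry.


def quantized_size_gb(num_params: float, bits_per_param: int) -> float:
    """Return parameter count times bits per parameter, converted to GB."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # A 70B model at 16-bit (half precision), 8-bit, and 4-bit.
    for bits in (16, 8, 4):
        print(f"70B at {bits}-bit: about {quantized_size_gb(70e9, bits):.0f} GB")

    # Hypothetical Mixture-of-Experts model (made-up numbers):
    # 40B parameters in total, of which 10B are used for any one token.
    moe_total_params = 40e9
    moe_active_params = 10e9
    to_load = quantized_size_gb(moe_total_params, 16)
    if_only_active = quantized_size_gb(moe_active_params, 16)
    print(f"MoE memory to load: about {to_load:.0f} GB "
          f"(not {if_only_active:.0f} GB), because all experts must be in memory")
```

The loop gives roughly 140GB, 70GB, and 35GB for the 70B model, matching the halving at each step described above.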

Tools for running and shrinking models

Hugging Face bitsandbytes
Library and guide for loading models in 8-bit and 4-bit to cut memory use.
llama.cpp
Open-source runtime that runs quantized models on ordinary CPUs and GPUs.
Ollama
Local runner that downloads and serves quantized open models with one command.

References

1. A Gentle Introduction to 8-bit Matrix Multiplication for transformers, Hugging Face
