Sizes & parameters

How much memory does it take to run a model?

Memory needed is roughly parameter count times bytes per parameter: 2 bytes each in half precision, so a 70B model needs about 140GB. Quantization shrinks each parameter to cut that down.

Last updated 2026-06-15 · Physea Labs

The parameter count also tells you, fairly directly, how much memory a model needs to run. The estimate is simple: multiply the number of parameters by the number of bytes used to store each one. In full precision each parameter takes 4 bytes; in the half-precision formats (FP16 or BF16) used for most models today, each takes 2 bytes.[1]

So a 70B model in half precision needs roughly 70 billion times 2 bytes, about 140GB, just to hold the parameters. That is far more than the memory on a single consumer graphics card, which is why large models are often split across several GPUs. The same arithmetic explains why a 176-billion-parameter model needs about 352GB in half precision.[1]
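The arithmetic above can be sketched in a few lines of Python. This is a rough estimate of weight storage only; the function name `weight_memory_gb` and the precision labels are illustrative choices, not part of any library, and GB here means 10^9 bytes to match the figures in the text.

```python
# Bytes used to store one parameter at each precision (labels are illustrative).
BYTES_PER_PARAM = {
    "fp32": 4.0,  # full precision
    "fp16": 2.0,  # half precision (FP16 or BF16)
    "int8": 1.0,  # 8-bit quantization
    "int4": 0.5,  # 4-bit quantization
}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Estimate the memory needed to hold the weights, in GB (10**9 bytes).

    Covers parameters only; run-time memory such as activations is not included.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in [("70B", 70e9), ("176B", 176e9)]:
        print(f"{name} in half precision: ~{weight_memory_gb(params):.0f} GB")
```

Treat the result as a lower bound: it counts only the parameters, so actual usage while running a model will be higher.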

The common way to fit a model into less memory is quantization: storing each parameter with fewer bits. Moving from 2 bytes to 1 byte (8-bit) halves the size, and 4-bit storage uses half a byte per parameter, which brings a 70B model down toward 35GB.[1] The trade-off is a small loss of precision in each number, which usually costs a little accuracy. The same arithmetic also explains why a Mixture-of-Experts model is no lighter to load: memory follows the total parameter count, even when only a few experts compute per token.
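To make the precision trade-off concrete, here is a minimal sketch of one simple scheme, absmax 8-bit quantization, in plain Python. It is an illustration under simplifying assumptions (one shared scale for the whole list, made-up example weights, hypothetical function names), not the exact method used by any of the tools listed below.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up example weights, for illustration only.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# The rounding error per value is bounded by about half the scale step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

if __name__ == "__main__":
    print("stored integers:", q)
    print("restored values:", [f"{r:.4f}" for r in restored])
    print(f"largest rounding error: {max_error:.5f}")
```

Each stored integer fits in one byte instead of two, and the restored values differ slightly from the originals. That small per-number error is the loss of precision the paragraph above describes.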

Tools for running and shrinking models

  • Hugging Face bitsandbytes

    Library and guide for loading models in 8-bit and 4-bit to cut memory use.

  • llama.cpp

    Open-source runtime that runs quantized models on ordinary CPUs and GPUs.

  • Ollama

    Local runner that downloads and serves quantized open models with one command.

References

  1. A Gentle Introduction to 8-bit Matrix Multiplication for transformers — Hugging Face