The 4-bit rule — Physea Wiki

Each weight takes a fixed amount of space depending on its precision: 2 bytes at full precision, down to about half a byte at 4-bit. So at 4-bit a model needs roughly half its parameter count in gigabytes. A 7-billion-parameter model comes to about 3.5 GB of weights before overhead.

A model’s memory size comes from one simple fact: every weight takes up a fixed amount of space, and that amount depends on how precisely each number is stored. Multiply the number of weights by the space per weight and you have most of your answer.

At full precision, which on this page means 16-bit storage (FP16), each weight takes 2 bytes.^[1] Strictly speaking FP16 is "half precision" and 32-bit floats take 4 bytes per weight, but 16-bit is the usual baseline for running models, so it is the reference point here. At 2 bytes per weight, a model needs about 2 GB of memory for every billion parameters. Storing the numbers less precisely shrinks that: 8-bit storage uses 1 byte per weight, and 4-bit storage uses about 0.5 bytes per weight.^[1] Cutting how many bits each number uses is called quantization, which has its own topic; here it is just the lever that sets the bytes-per-weight.
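The bytes-per-weight figures come from dividing the bit width by 8. As a rough worked form, with N for the parameter count and b for the bits per weight (both symbols are introduced here for illustration), and taking 1 GB as 10^9 bytes:

```latex
\[
\text{weight memory (bytes)} \approx N \times \frac{b}{8},
\qquad
\frac{16}{8} = 2,\quad \frac{8}{8} = 1,\quad \frac{4}{8} = 0.5
\]
\[
N = 10^{9},\; b = 16:\quad 10^{9} \times 2\ \text{bytes} = 2\ \text{GB}
\]
```

This is an approximation rather than an exact accounting: real 4-bit formats typically store a little extra data alongside the weights, which is why the 4-bit figure is "about" 0.5 bytes.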

That gives the rule worth memorizing: at 4-bit, a model needs roughly half its parameter count in gigabytes. A 7-billion-parameter model is about 3.5 GB of weights, a 13B is about 6.5 GB, and a 70B is about 35 GB. The same models at full precision would be four times larger.
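The arithmetic behind those figures can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the names `BYTES_PER_WEIGHT` and `weight_gb` are made up for this example, and 1 GB is assumed to mean 10^9 bytes.

```python
# Rough weight-only estimate. Assumption: 1 GB = 10**9 bytes, so
# (billions of parameters) x (bytes per weight) gives GB directly.
BYTES_PER_WEIGHT = {16: 2.0, 8: 1.0, 4: 0.5}  # bits -> approximate bytes per weight


def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate size of the weights alone, in GB (no runtime overhead)."""
    return params_billions * BYTES_PER_WEIGHT[bits]


if __name__ == "__main__":
    for size in (7, 13, 70):
        four_bit = weight_gb(size, 4)
        sixteen_bit = weight_gb(size, 16)
        print(f"{size}B: ~{four_bit:.1f} GB at 4-bit, ~{sixteen_bit:.1f} GB at 16-bit")
```

Running it reproduces the 3.5, 6.5, and 35 GB figures at 4-bit, with 16-bit coming out four times larger in each case.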

One adjustment makes the estimate honest. The weights are not the only thing in memory while the model runs; it also keeps a working scratchpad (the KV cache and activations). A common shortcut is to add about 20% on top, so the formula becomes parameters times bytes-per-weight times 1.2.^[1] By that math a 7B model at 4-bit comes out to about 4.2 GB total (7 × 0.5 × 1.2), which falls in the 4 to 5 GB range people actually see. How much extra to budget depends heavily on how much text you feed in, which the next page covers.
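The shortcut formula is easy to try out, and it can also be turned around to ask how large a model fits in a given memory budget. The sketch below assumes a flat 1.2 overhead factor and 1 GB = 10^9 bytes; the function names `total_gb` and `max_params_billions` are hypothetical, and the flat factor will underestimate memory for long inputs.

```python
def total_gb(params_billions: float, bytes_per_weight: float,
             overhead: float = 1.2) -> float:
    """Weights plus a flat overhead factor for KV cache and activations, in GB."""
    return params_billions * bytes_per_weight * overhead


def max_params_billions(budget_gb: float, bytes_per_weight: float,
                        overhead: float = 1.2) -> float:
    """Inverse of total_gb: largest model (in billions of parameters) that fits."""
    return budget_gb / (bytes_per_weight * overhead)


if __name__ == "__main__":
    # 7B at 4-bit (about 0.5 bytes per weight) with the 20% shortcut.
    print(f"7B at 4-bit: ~{total_gb(7, 0.5):.1f} GB total")
    # The same model at 16-bit (2 bytes per weight).
    print(f"7B at 16-bit: ~{total_gb(7, 2.0):.1f} GB total")
    # Rough ceiling for a 24 GB budget at 4-bit (the budget is an example value).
    print(f"24 GB at 4-bit fits roughly {max_params_billions(24, 0.5):.0f}B parameters")
```

Keeping the overhead as a parameter rather than a constant makes it easy to raise the factor when a workload uses long inputs.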

References

[1] How much VRAM do I need for LLM inference? (Modal)
