Quantization
What is quantization?
Quantization replaces a model's high-precision weights with lower-precision ones. The model file shrinks roughly in proportion to the bits saved, it needs less memory to load, and quality usually drops only a little.
A model is a giant pile of numbers called weights. By default each weight is stored at high precision, usually as a 32-bit or 16-bit floating-point number. That precision is accurate, but it takes up space: every weight costs four or two bytes, and a model can have billions of them.
Quantization is the trick of storing those same weights at lower precision. Instead of 16 or 32 bits each, you keep them at 8 bits, 4 bits, or even fewer. Hugging Face describes it as “a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).”[1] Fewer bits per weight means a smaller file and less memory needed to load it.
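To make the size argument concrete, here is a rough back-of-the-envelope sketch. The function name weight_memory_gib and the 7-billion-weight figure are assumptions chosen only for illustration. The estimate counts the weights alone and ignores the scale factors and other metadata a real quantized file also stores.

```python
def weight_memory_gib(num_weights: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in GiB (illustrative helper)."""
    total_bytes = num_weights * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical model size, picked only to show the trend.
NUM_WEIGHTS = 7_000_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(NUM_WEIGHTS, bits):6.2f} GiB")
```

Each halving of the bit width halves the estimate, so going from 32-bit to 4-bit weights cuts the weight storage by a factor of eight.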
How does it keep the numbers usable? The original weights cover a wide range of values. Quantization finds that range and maps it onto the small set of values a low-precision format can hold, recording a scale factor so the values can be read back approximately.[1] A 4-bit number can only represent sixteen distinct levels, so each weight gets rounded to the nearest available one. The rounding is where quality loss comes from, but for large models the loss is usually small.
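The range-mapping idea can be sketched in a few lines of plain Python. This is a minimal min-max scheme written for illustration, not the method of any particular library: the function names quantize and dequantize and the sample weights are made up, and real implementations typically compute a separate scale for each small group of weights rather than one for the whole list.

```python
def quantize(weights, bits=4):
    """Map floats onto integer levels 0 .. 2**bits - 1 (simple min-max sketch)."""
    levels = 2**bits - 1                  # 15 steps between 16 levels for 4-bit
    lo, hi = min(weights), max(weights)   # the range the weights cover
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Read the weights back approximately from their integer codes."""
    return [lo + c * scale for c in codes]


weights = [-0.62, -0.11, 0.0, 0.07, 0.35, 0.88]  # made-up example values
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.3f} -> level {c:2d} -> {r:+.3f} (error {abs(w - r):.3f})")
```

In this sketch the rounding error for any weight is at most half of one step (scale / 2), which is the quality loss described above. Only the integer codes plus the scale and the range minimum need to be stored.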
References
- [1] Quantization concept guide, Hugging Face