
Quantization

How do I trade size against quality?

More bits keep more quality but need more memory; fewer bits shrink the file and lose a little accuracy. For most people running models at home, 4-bit Q4_K_M is the usual balance.

Last updated 2026-06-15 · Physea Labs

Every step down in precision is a deal: you give up a bit of quality to save a lot of space. The savings are real. The llama.cpp project measured Llama 3.1 8B at full 16-bit precision taking 14.96 GiB, while the 4-bit Q4_K_M version of the same model is 4.58 GiB.[1] That is about 31% of the original size, which can be the difference between a model that fits in a consumer GPU's memory and one that does not.
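The arithmetic behind that comparison can be sketched in a few lines. The two sizes are the ones quoted above; the function names are made up for this example, and the bits-per-weight estimate assumes the 16-bit file spends exactly 16 bits on every weight, which ignores metadata and any tensors stored differently.

```python
# File sizes for Llama 3.1 8B as quoted in the text (GiB).
FP16_GIB = 14.96
Q4_K_M_GIB = 4.58


def size_ratio(quantized_gib: float, full_gib: float) -> float:
    """Fraction of the full-precision size that the quantized file takes."""
    return quantized_gib / full_gib


def effective_bits_per_weight(quantized_gib: float, full_gib: float) -> float:
    """Rough average bits per weight, ASSUMING the full file is 16 bits per weight."""
    return 16.0 * size_ratio(quantized_gib, full_gib)


ratio = size_ratio(Q4_K_M_GIB, FP16_GIB)
saved_gib = FP16_GIB - Q4_K_M_GIB
bits = effective_bits_per_weight(Q4_K_M_GIB, FP16_GIB)

print(f"Q4_K_M is {ratio:.1%} of the 16-bit size")
print(f"Space saved: {saved_gib:.2f} GiB")
print(f"Estimated average bits per weight: {bits:.2f}")
```

Under that assumption the estimate comes out a little under 5 bits per weight rather than exactly 4, which suggests that "4-bit" is a label for the format and not an exact count.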

The cost shows up as a small drop in how accurately the model reproduces what its full-precision version would have said. Reducing the weights to a lower precision “naturally gives rise to a drop in the performance of the model.”[2] The size of that drop depends on how far you go. Going from 16-bit down to 8-bit or 4-bit usually costs very little that you would notice in normal use. Pushing to 2-bit or lower starts to hurt more, and very small models tend to feel the loss sooner than large ones.
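To see why fewer bits means a larger error, here is a toy sketch that rounds a list of random numbers to 8, 4, and 2 bits and measures how far the restored values drift from the originals. It uses plain symmetric round-to-nearest quantization with a single scale, which is an assumption made for illustration and not the K-quant scheme llama.cpp uses. The random numbers are stand-ins, not real model weights, and error on the weights is only a proxy for the quality of a model's answers.

```python
import random


def quantize_roundtrip(weights, bits):
    """Round each weight to a signed `bits`-bit grid, then map it back to a float."""
    levels = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average distance between the original and the restored weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
# Hypothetical stand-in for a block of model weights.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean absolute error: {err:.4f}")
```

The error grows at each step down, and the jump from 4-bit to 2-bit is much larger than the jump from 8-bit to 4-bit in this toy setup, which is consistent with the pattern described above.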

The practical guidance most people follow: Q4_K_M is the common sweet spot for running models at home, where the file is small enough to fit and the quality stays close to the original. If you have memory to spare, Q5_K_M or Q6_K give you a little more headroom on quality. Drop to Q3 or Q2 only when a model would not otherwise fit, and expect the answers to get rougher as you do. The right pick is the highest precision that fits in your memory alongside the context you plan to use.
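That rule of thumb can be written as a small selection function. This is a minimal sketch: `pick_quant` is a hypothetical helper, the 1.5 GiB context allowance is a placeholder and not a measurement, and the code treats a larger file of the same model as higher precision. Only the two sizes quoted earlier are filled in; sizes for other variants would come from the actual files you are choosing between.

```python
from typing import Dict, Optional


def pick_quant(sizes_gib: Dict[str, float],
               memory_gib: float,
               context_gib: float) -> Optional[str]:
    """Return the largest variant that fits once the context is accounted for.

    ASSUMPTION: for one model, a larger file means higher precision.
    Returns None when nothing fits.
    """
    budget = memory_gib - context_gib
    fitting = {name: size for name, size in sizes_gib.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Sizes quoted in the text for Llama 3.1 8B (GiB).
sizes = {"F16": 14.96, "Q4_K_M": 4.58}

# Hypothetical allowance for the context; measure your own setup.
CONTEXT_GIB = 1.5

for memory in (4.0, 8.0, 24.0):
    print(f"{memory:>4} GiB available -> {pick_quant(sizes, memory, CONTEXT_GIB)}")
```

With these inputs, 8 GiB of memory selects Q4_K_M, 24 GiB selects the 16-bit file, and 4 GiB leaves nothing that fits, which is the case where a lower-precision variant becomes worth considering.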

References

  1. llama.cpp quantize README — llama.cpp
  2. Quantization concept guide — Hugging Face