
Quantization

What do labels like Q4_K_M mean?

In a label like Q4_K_M, the 4 is roughly the number of bits per weight, K marks the newer k-quant method, and the last letter (S, M, or L) is the size-and-quality variant. Lower numbers mean smaller, more aggressively compressed files.

Last updated 2026-06-15 · Physea Labs

When you download a model to run at home you will see files named things like Q4_K_M, Q5_K_M, Q8_0, or Q2_K. These come from GGUF, the file format used by llama.cpp and the tools built on it.[1] The label tells you how aggressively the model was compressed. Reading it is simple once you know the parts.

The number after the Q is roughly how many bits each weight uses. Q4 is about 4 bits, Q8 about 8, Q2 about 2. Fewer bits means a smaller file. The exact figure is a little higher than the headline number because each block of weights also stores bookkeeping values, such as its scale, alongside the weights themselves: Q4_K works out to 4.5 bits per weight, Q5_K to 5.5, Q6_K to about 6.56, and Q2_K to about 2.63.[1]
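The fractional figures above can be reproduced with a little arithmetic. The sketch below is an illustration, not code from llama.cpp: it assumes the commonly described k-quant layout (super-blocks of 256 weights split into sub-blocks, each with a small quantized scale and sometimes a minimum, plus one or two 16-bit values per super-block). The layout table and both helper functions are hypothetical names introduced here.

```python
from fractions import Fraction

SUPER_BLOCK = 256  # weights per super-block (assumed k-quant layout)
FP16_BITS = 16     # width of the per-super-block scale (and minimum)


def bits_per_weight(weight_bits, sub_blocks, scale_bits, has_min):
    """Effective bits per weight for one super-block, bookkeeping included."""
    copies = 2 if has_min else 1              # scale only, or scale plus minimum
    per_block_meta = scale_bits * copies      # bookkeeping bits per sub-block
    super_meta = FP16_BITS * copies           # fp16 values shared by the super-block
    total = SUPER_BLOCK * weight_bits + sub_blocks * per_block_meta + super_meta
    return Fraction(total, SUPER_BLOCK)


# Assumed layouts: (bits per weight, sub-blocks, bits per sub-block scale, has minimum)
LAYOUTS = {
    "Q2_K": dict(weight_bits=2, sub_blocks=16, scale_bits=4, has_min=True),
    "Q4_K": dict(weight_bits=4, sub_blocks=8, scale_bits=6, has_min=True),
    "Q5_K": dict(weight_bits=5, sub_blocks=8, scale_bits=6, has_min=True),
    "Q6_K": dict(weight_bits=6, sub_blocks=16, scale_bits=8, has_min=False),
}


def approx_size_gb(n_params, bpw):
    """Rough file size in gigabytes if every weight used the same type."""
    return n_params * float(bpw) / 8 / 1e9


for name, layout in LAYOUTS.items():
    print(name, float(bits_per_weight(**layout)))
```

Under these assumptions the loop prints 2.625, 4.5, 5.5, and 6.5625, which round to the figures quoted above. As a rough guide, approx_size_gb(7_000_000_000, 4.5) gives a little under 4 GB for a 7-billion-parameter model. Real files will differ somewhat, because they carry metadata and may not use a single type for every tensor.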

The K means the file uses the newer “k-quant” method, which groups blocks of weights into larger super-blocks and stores the per-block bookkeeping values in compressed form as well.[1] The trailing letter is the size variant: S for small, M for medium, L for large, with larger variants trading a bigger file for better quality. So Q4_K_M is 4-bit, k-quant method, medium variant. The _0 style labels (like Q8_0) are older “legacy” methods that the Hugging Face GGUF documentation describes as not widely used today.[1] You will also see IQ labels, an even more compact family that leans on an importance matrix to decide which weights to protect.[1]
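The naming scheme above is regular enough to read mechanically. The following is a minimal sketch of a label parser. The function parse_quant_label and its return format are hypothetical, written for illustration only; they are not part of llama.cpp or any GGUF tooling, and the pattern covers only the label shapes mentioned in this article.

```python
import re

# Family (Q or IQ), bit count, optional scheme (K or a digit), optional variant letters.
LABEL = re.compile(
    r"^(?P<family>I?Q)(?P<bits>\d+)(?:_(?P<scheme>K|\d))?(?:_(?P<variant>[A-Z]+))?$"
)

VARIANTS = {"S": "small", "M": "medium", "L": "large"}


def parse_quant_label(label):
    """Split a label such as 'Q4_K_M' into bits, method, and variant."""
    m = LABEL.match(label.upper())
    if m is None:
        raise ValueError(f"unrecognized quantization label: {label!r}")
    family, scheme, variant = m["family"], m["scheme"], m["variant"]
    if family == "IQ":
        method = "i-quant"
    elif scheme == "K":
        method = "k-quant"
    elif scheme is not None:
        method = "legacy"  # _0, _1 style labels
    else:
        raise ValueError(f"unrecognized quantization label: {label!r}")
    return {
        "bits": int(m["bits"]),
        "method": method,
        # Spell out S/M/L; leave other suffixes (for example XXS) unchanged.
        "variant": VARIANTS.get(variant, variant),
    }


print(parse_quant_label("Q4_K_M"))
print(parse_quant_label("Q8_0"))
```

For Q4_K_M this reports 4 bits, the k-quant method, and the medium variant; for Q8_0 it reports 8 bits, the legacy method, and no variant.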

Tools that run these files

  • llama.cpp

    The C/C++ engine that defines the GGUF format and runs quantized models on CPU or GPU. You can point it straight at a Hugging Face file.

  • Ollama

    A friendlier wrapper around the same idea, billed as the easiest way to run open models locally and offline.

References

  1. GGUF quantization types — Hugging Face
  2. llama.cpp quantize README — llama.cpp