The size ceiling — Physea Wiki

On a Mac, the size of the unified memory pool is the main limit on what you can run, because the whole pool is available to the model. Apple's example: a 24GB Mac comfortably holds an 8B model in BF16 precision or a 30B mixture-of-experts model at 4-bit, with the inference workload staying under 18GB.

A model has to fit in memory to run well, so on a Mac the size of the unified memory pool sets the ceiling on what you can run. The useful part is that the whole pool counts. A discrete graphics card has its own VRAM, separate from system RAM, so a card with 24GB of VRAM caps a model that runs entirely on the GPU at that figure no matter how much system RAM you have.^[1] A Mac with 24GB of unified memory can draw on that entire pool for the model instead.

Apple gives a concrete example for a 24GB machine: it “can easily hold a 8B in BF16 precision or a 30B MoE 4-bit quantized, keeping the inference workload under 18GB for both.”^[2] Two things drive that. BF16 stores each weight in 16 bits (2 bytes), so an 8B model needs about 16GB for its weights, while 4-bit quantization packs each weight into roughly half a byte, so a 30B model needs about 15GB. And a mixture-of-experts (MoE) model keeps all of its weights in memory but uses only a slice of them per token, so a 30B MoE occupies the memory of a 30B model while doing the per-token work of a much smaller dense one.
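The arithmetic behind Apple's two figures can be sketched in a few lines. This is a rough estimate for the weights alone, under stated assumptions: the function name `weight_memory_gb` is hypothetical, GB means decimal gigabytes, and the KV cache, activations, and the extra scale data that real 4-bit formats store are all left out, so actual usage will be somewhat higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal GB.

    Assumption: every parameter is stored at the same bit width, with no
    quantization metadata, KV cache, or runtime overhead counted.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# The two models from Apple's 24GB example.
print(weight_memory_gb(8, 16))   # 8B at BF16 (16 bits per weight)
print(weight_memory_gb(30, 4))   # 30B MoE at 4 bits per weight
```

Both results land below the 18GB figure in Apple's example, which is consistent with the quote. Note that the MoE model is charged for all 30B parameters: sparsity reduces the work per token, not the memory the weights occupy.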

Notice that Apple’s example keeps the workload under 18GB on a 24GB machine, leaving at least 6GB free. That gap is not waste. The operating system and your other open apps live in the same pool, so you cannot spend all of it on the model. Leave headroom. For actual numbers across memory sizes and precision levels, the model-size calculator lets you enter your own unified memory and see what fits.
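As a minimal sketch of the headroom rule, the check below subtracts a reserve from the pool before comparing it with the workload. It is not the wiki's calculator. The function name `fits_in_pool` and the 6GB default are assumptions: the default is simply the gap in Apple's example (24GB minus 18GB), not a recommendation from Apple, and the right reserve depends on what else you run.

```python
def fits_in_pool(workload_gb: float, unified_gb: float,
                 headroom_gb: float = 6.0) -> bool:
    """Return True if the inference workload fits after reserving headroom.

    Assumption: headroom_gb is a flat reserve for the OS and other apps;
    6.0 mirrors the gap in Apple's 24GB example and is only illustrative.
    """
    budget_gb = unified_gb - headroom_gb
    return workload_gb <= budget_gb


# A 16GB workload on a 24GB Mac fits inside the 18GB budget.
print(fits_in_pool(16, 24))
# The same workload on a 16GB Mac leaves no room for anything else.
print(fits_in_pool(16, 16))
```

A flat reserve is the simplest choice. A reserve that scales with pool size would be just as easy to write, but the source gives no basis for picking a ratio.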

References

[1] Unified Memory Explained: Apple Silicon vs NVIDIA for AI. Seresa.
[2] Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU. Apple Machine Learning Research.
