Running on a Mac
Why are Macs good for running AI models locally?
On Apple silicon there is one pool of memory shared by the CPU and GPU instead of separate VRAM, so model weights never have to be copied across a slow bus. That fits how language models actually run: token generation is limited by memory speed, not raw compute.
On a typical PC, memory is split into two pools. The CPU uses system RAM, and a separate graphics card has its own VRAM. When a model needs to run on the graphics card, its data has to be copied from one pool to the other across a connection called the PCIe bus, and that copying is slow.[1]
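To put a rough number on that copy step, here is a minimal back-of-the-envelope sketch. It assumes an idealized transfer at the bus's full rated speed, which real systems rarely reach. The 16 GB model size is a hypothetical figure, and 32 GB/s is roughly the theoretical peak of a PCIe 4.0 x16 link; neither is a measurement.

```python
def copy_seconds(model_gb: float, bus_gb_per_s: float) -> float:
    """Idealized time to move model_gb gigabytes across a bus.

    Ignores latency, driver overhead, and contention, so real
    transfers will take longer than this estimate.
    """
    if bus_gb_per_s <= 0:
        raise ValueError("bus bandwidth must be positive")
    return model_gb / bus_gb_per_s


# Illustrative assumptions, not measurements:
MODEL_GB = 16.0        # hypothetical model size
PCIE_GB_PER_S = 32.0   # roughly the theoretical peak of PCIe 4.0 x16

print(f"{copy_seconds(MODEL_GB, PCIE_GB_PER_S):.2f} s per full copy")
```

The estimate is a best case. A model that fits in VRAM pays this cost once at load time, while a model that does not fit has to keep moving data across the bus during generation.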
Apple silicon does not split memory this way. The CPU and GPU sit on the same chip, the memory is mounted right beside them in the same package, and everything shares one pool, often called unified memory. As one explainer puts it, “There is no VRAM. There is no system RAM. There is one pool of unified memory.”[1] A model loaded into that pool is reachable by the GPU with no copy step in between.
This fits how language models actually run. Once a model has read your prompt, producing each new token (roughly a word or a piece of one) is mostly a matter of reading the model’s weights out of memory, not heavy calculation. Apple’s own research puts it plainly: “Generating subsequent tokens is bounded by memory bandwidth, rather than by compute ability.”[2] So a machine where memory is fast and close to the processor is well suited to the job, which is why a Mac can run sizable models without a separate graphics card.
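The bandwidth limit can be turned into a rough ceiling on generation speed. The sketch below assumes a dense model whose weights are all read once per generated token, and it ignores the KV cache, compute time, and runtime overhead, so real throughput will be lower. The parameter count, bits per weight, and 400 GB/s bandwidth are illustrative values, not the specification of any particular Mac.

```python
def model_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes."""
    return num_params * bits_per_weight / 8


def max_tokens_per_second(bandwidth_bytes_per_s: float,
                          weight_bytes: float) -> float:
    """Upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative assumptions, not measurements:
PARAMS = 8e9          # hypothetical 8-billion-parameter model
BITS_PER_WEIGHT = 4   # hypothetical 4-bit quantization
BANDWIDTH = 400e9     # hypothetical 400 GB/s memory bandwidth

weights = model_bytes(PARAMS, BITS_PER_WEIGHT)
ceiling = max_tokens_per_second(BANDWIDTH, weights)
print(f"weights: {weights / 1e9:.1f} GB, ceiling: {ceiling:.0f} tokens/s")
```

Under these assumptions the ceiling scales directly with memory bandwidth and inversely with model size: doubling the bandwidth or halving the bytes per weight doubles the bound.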
References
[1] Unified Memory Explained: Apple Silicon vs NVIDIA for AI, Seresa
[2] Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU, Apple Machine Learning Research