Hardware & VRAM
What decides whether a model runs on my machine?
A model has to fit in memory to run well. On a PC with a graphics card that means the card's VRAM; on a Mac it means the unified memory shared by CPU and GPU. If the model fits in fast memory it runs fast; if it spills into slower system RAM, it crawls.
The single number that decides what you can run at home is how much memory you have. A model is a big pile of numbers (its weights), and to run it your computer has to hold all of those numbers in memory at once. If they fit, the model runs. If they do not fit, it either refuses to load or runs painfully slowly.
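As a rough illustration of "does it fit", the weights take about as many bytes as there are weights times the bytes stored per weight. The sketch below is a back-of-the-envelope check, not a loader's real accounting: the function names, the `headroom_gb` allowance for everything besides the weights, and the example model sizes are all assumptions made for illustration.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of weights * bits each / 8 bits per byte = billions of bytes = GB
    return params_billion * bits_per_weight / 8


def fits_in_memory(
    params_billion: float,
    bits_per_weight: float,
    memory_gb: float,
    headroom_gb: float = 2.0,  # assumed allowance for working memory, not a measured value
) -> bool:
    """Return True if the weights plus the assumed headroom fit in memory_gb."""
    return weights_size_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


# Hypothetical example: a 7-billion-weight model on a machine with 12 GB of fast memory.
for bits in (16, 4):
    size = weights_size_gb(7, bits)
    verdict = "fits" if fits_in_memory(7, bits, memory_gb=12) else "does not fit"
    print(f"{bits}-bit weights: about {size:.1f} GB, {verdict} in 12 GB")
```

The same model can land on either side of the line depending on how many bits each weight is stored in, which is why the check takes both numbers.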
The catch is that not all memory is the same. On a PC with a dedicated graphics card, the memory that matters is the card’s own VRAM (video RAM). On a Mac with Apple Silicon, there is no separate graphics card, so the model uses the unified memory that the CPU and GPU share. Plain system RAM on a PC can hold a model too, but it is far slower for this job.
How much slower? A model that fits entirely in a fast GPU’s VRAM can produce roughly 40 or more words per second, while the same model running from system RAM might manage only 8 to 15, even on a fast processor.[1] The reason is bandwidth: a high-end card can read its memory at over 1 TB per second, while typical desktop system RAM tops out around 90 GB per second.[1] To produce each new piece of output, the machine has to read through the model's weights again, so speed depends mostly on how fast it can read them, and where the weights live matters a lot.
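The bandwidth figures above can be turned into a rough speed ceiling. The sketch below assumes a simple rule of thumb: generating one token (a word piece, usually a bit shorter than a word) requires reading every weight once, so the ceiling is bandwidth divided by model size. The 8 GB model size and the function name are hypothetical, and real speeds come in below this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_per_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed if every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_per_s / model_size_gb


MODEL_SIZE_GB = 8.0          # hypothetical model size, chosen for illustration
GPU_BANDWIDTH_GB_S = 1000.0  # "over 1 TB per second" for a high-end card
RAM_BANDWIDTH_GB_S = 90.0    # typical desktop system RAM

gpu_ceiling = max_tokens_per_second(GPU_BANDWIDTH_GB_S, MODEL_SIZE_GB)
ram_ceiling = max_tokens_per_second(RAM_BANDWIDTH_GB_S, MODEL_SIZE_GB)

print(f"VRAM ceiling:       {gpu_ceiling:.1f} tokens/s")
print(f"System RAM ceiling: {ram_ceiling:.1f} tokens/s")
print(f"Ratio:              {gpu_ceiling / ram_ceiling:.1f}x")
```

Whatever model size is plugged in, the ratio between the two ceilings stays the same, because it is just the ratio of the two bandwidths.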
The rest of this topic is about turning that question into a number you can check before you download anything.