Hardware & VRAM
Which models fit on common hardware?
An 8 GB card runs a 7B model at 4-bit comfortably; 16 GB reaches 13B-class models; a 70B model at 4-bit needs around 42 GB. The catch is context length, which adds memory on top of the weights, so leave headroom for long inputs.
With the 4-bit rule in hand, here is roughly what common hardware can hold. These figures assume 4-bit quantization, which is the usual choice for running models at home.
An 8 GB card or 8 GB of memory comfortably runs a 7B-class model at 4-bit; in one published table an 8-billion-parameter model at 4-bit used about 6.2 GB at a modest context length.[1] People report a plain 8 GB GPU running a 7B model at 25 to 33 words per second.[3] A 16 GB machine reaches into 13B and 14B-class models, where a 14B model at 4-bit lands near 11 GB.[1] Stepping up, 32 GB handles 32B models comfortably but fits a 70B model only with heavy quantization. A 70B model at 4-bit comes to around 42 GB for the weights alone,[1][2] which is why around 64 GB is the realistic floor for running one at a sensible 4-bit setting.[2]
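The weight figures above come down to simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. Here is a minimal sketch of that estimate. The function name `weight_gb` and the default of 4.8 effective bits per weight are assumptions for illustration (practical 4-bit formats store scaling data alongside the weights, so they land a little above 4 bits); neither comes from the cited sources.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Rough size of the weights alone, in GB.

    bits_per_weight defaults to 4.8, an assumed effective figure for
    typical 4-bit quantization; it is not taken from the cited sources.
    """
    # billions of parameters * bits each / 8 bits per byte = billions of bytes (GB)
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for size in (7, 8, 14, 32, 70):
        print(f"{size}B at ~4-bit: about {weight_gb(size):.1f} GB of weights")
```

Under that assumption a 70B model comes out at about 42 GB, in line with the figure above. The smaller estimates sit below the measured totals quoted in the text because those totals also include context and runtime overhead.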
The big asterisk is context length, the amount of text you feed the model at once. The working scratchpad (the KV cache) grows with context, and it is separate from the weights. One example: an 8B model holding a 32,000-token context needs roughly 4.5 GB just for that cache, on top of the model itself.[1] So if you plan to paste in long documents, budget extra memory beyond the weight estimate. Otherwise, expect the model to slow down sharply or fail to run once memory fills up.
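To see why the cache grows with context, it helps to write out the usual estimate: each token stores one key and one value vector per layer. The sketch below assumes an 8B-class architecture with 32 layers, 8 key/value heads of dimension 128, and 16-bit cache values. Those numbers and the name `kv_cache_gb` are illustrative assumptions, not figures from the cited table.

```python
def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 32,       # assumed layer count for an 8B-class model
    n_kv_heads: int = 8,      # assumed number of key/value heads
    head_dim: int = 128,      # assumed size of each head
    bytes_per_value: int = 2, # 16-bit cache entries
) -> float:
    """Approximate KV cache size in GB for a given context length."""
    # Factor of 2: one key vector and one value vector per layer, per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


if __name__ == "__main__":
    for tokens in (2_000, 8_000, 32_000):
        print(f"{tokens:>6} tokens: about {kv_cache_gb(tokens):.2f} GB of cache")
```

With these assumed settings a 32,000-token context works out to a little over 4 GB, the same ballpark as the example above, and the cost scales linearly: double the context, double the cache.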
A practical takeaway: the same hardware that runs a model fine on short prompts can run out of room on a very long one. When in doubt, pick a model a size smaller than your maximum and keep headroom for context.
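That rule of thumb can be sketched as a small check: estimate the weights, add a reserve for context, and keep the largest model that still fits. Everything here is a hypothetical illustration, including the function names, the 4.8 effective bits per weight, the 2 GB default reserve, and the list of candidate sizes.

```python
from typing import Optional, Sequence


def weight_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Rough weight size in GB; 4.8 bits per weight is an assumed figure."""
    return params_billion * bits_per_weight / 8


def largest_fit(
    memory_gb: float,
    sizes_billion: Sequence[float] = (7, 13, 32, 70),  # assumed candidate sizes
    context_reserve_gb: float = 2.0,                   # assumed headroom for context
) -> Optional[float]:
    """Return the largest model size (in billions) that fits, or None."""
    fitting = [
        size
        for size in sizes_billion
        if weight_gb(size) + context_reserve_gb <= memory_gb
    ]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for memory in (8, 16, 32, 64):
        print(f"{memory} GB -> largest fit: {largest_fit(memory)}B")
    # A long-context reserve can push even a 7B model off an 8 GB card.
    print(largest_fit(8, context_reserve_gb=4.5))
```

Raising `context_reserve_gb` is the code equivalent of keeping headroom: with a 4.5 GB reserve, the 8 GB case no longer fits a 7B model under these assumptions, which is the short-prompt versus long-prompt gap described above.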
Tools for running models locally
- Ollama: a one-command tool to download and run quantized models on Mac, Windows, or Linux; it picks sensible defaults for your memory.
- llama.cpp: the underlying engine many local tools use; it runs GGUF models efficiently on CPU, GPU, or both.
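Because context length drives memory use, it can help to set it explicitly rather than rely on defaults. The sketch below builds a request for Ollama's local HTTP endpoint (`/api/generate` on port 11434) with a `num_ctx` option. It assumes Ollama is already running locally and that a model has been downloaded; the model tag `llama3.1:8b` and the helper names are examples, not recommendations from the sources.

```python
import json
import urllib.request

# Ollama's default local address; change it if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int = 4096) -> urllib.request.Request:
    """Build (but do not send) a generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # ask for one JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 4096) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs a running server and a pulled model):
# print(generate("llama3.1:8b", "Explain the KV cache in one sentence.", num_ctx=2048))
```

A smaller `num_ctx` trades the ability to read long inputs for a smaller cache, which is often the easiest way to make a borderline model fit.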
References
- [1] Ollama VRAM Requirements: Complete 2026 Guide, LocalLLM.in
- [2] Memory for Local LLMs: How Much RAM Do You Need?, Corsair
- [3] Hardware specs for GGUF 7B/13B/30B parameter models, llama.cpp (GitHub)