Subject 07 · Builds on Architecture + Models
Running AI Yourself
The practical track: local inference, hardware and VRAM, quantization, serving runtimes, and the local-versus-cloud trade-off.
22 pages across 7 topics
Local inference basics
The two-layer local stack.
- The two-layer stack Local AI is two jobs. A runtime (the engine) loads the model file and does the actual math; an app on top gives you a chat window and a model browser. Sometimes they are two separate programs, sometimes one download bundles both.
- The runtime layer The runtime is the engine that loads the model file and runs it on your hardware. llama.cpp is the widely used low-level engine; Ollama is a more approachable tool that downloads models and runs them from a simple command.
- The app layer The app layer is the part you see and click. Desktop apps like LM Studio and Jan give you a chat window, a model browser, and a runtime bundled inside, so one download gets you running without the command line.
- How to start Pick by comfort level. Want a chat window with no setup? Start with a desktop app like LM Studio or Jan. Comfortable in a terminal and want control? Start with Ollama or llama.cpp.
Hardware & VRAM
What your machine can hold.
- Memory is the gate A model has to fit in memory to run well. On a PC with a graphics card that means the card's VRAM; on a Mac it means the unified memory shared by CPU and GPU. If the model fits in fast memory it runs fast; if it spills into slower system RAM, it crawls.
- The 4-bit rule Each weight takes a fixed amount of space depending on its precision: 2 bytes at full precision, down to about half a byte at 4-bit. So at 4-bit a model needs roughly half its parameter count in gigabytes. A 7-billion-parameter model lands near 4 GB of weights before overhead.
- What fits An 8 GB card runs a 7B model at 4-bit comfortably; 16 GB reaches 13B-class models; a 70B model at 4-bit needs around 42 GB. The catch is context length, which adds memory on top of the weights, so leave headroom for long inputs.
Model-size calculator
Find the biggest model your hardware can run.
Quantization
Shrinking a model to fit.
- What it is Quantization swaps the high-precision numbers in a model's weights for smaller, lower-precision ones. The model file shrinks a lot, it fits in less memory, and quality drops only a little.
- Reading the labels In a label like Q4_K_M, the 4 is roughly the bits per weight, K is the modern block-based method, and the last letter (S, M, or L) is the size-and-quality variant. Lower numbers mean smaller files.
- The trade-off More bits keep more quality but need more memory; fewer bits shrink the file and lose a little accuracy. For most people running models at home, 4-bit Q4_K_M is the usual balance.
Serving & runtimes
Ollama, llama.cpp, vLLM, and friends.
- What a runtime is A runtime is the program that loads a model's weights into memory and turns your prompts into replies. Most local runtimes expose an OpenAI-compatible HTTP endpoint, so the same app code can point at a local model or a cloud one.
- Ollama & LM Studio Ollama runs models from the command line, pulling them by name like Docker images. LM Studio is a point-and-click desktop app with a built-in model browser and chat. Both give you a local OpenAI-compatible server.
- llama.cpp llama.cpp is a lean C/C++ inference engine designed to run models on a wide range of hardware with little setup. It reads the GGUF file format and quietly powers tools like LM Studio and Ollama.
- vLLM vLLM is built for serving, not just personal use. Techniques like PagedAttention and continuous batching let a single model handle many requests at once, and it exposes an OpenAI-compatible API.
Running on a Mac
Unified memory and Apple silicon.
- Why a Mac works On Apple silicon there is one pool of memory shared by the CPU and GPU instead of separate VRAM, so model weights never have to be copied across a slow bus. That fits how language models actually run: token generation is limited by memory speed, not raw compute.
- The size ceiling On a Mac, the size of the unified memory pool is the main limit on what you can run, because the whole pool is available to the model. Apple's example: a 24GB Mac comfortably holds an 8B model at full precision or a 30B mixture-of-experts model at 4-bit, with the work staying under 18GB.
- MLX MLX is Apple's open-source machine learning framework built for Apple silicon. Its design matches the hardware: arrays live in shared memory, so the same data is usable by the CPU or GPU without copying. Several popular local-AI apps now run MLX models under the hood.
Local vs cloud
When to run it yourself.
- Local vs cloud Running a model locally means it lives on your computer and all the work happens there. Using a cloud API means your prompt travels over the internet to a provider that runs the model and sends the answer back.
- Privacy and offline Running locally keeps every prompt on your own hardware and works with no internet once the model is downloaded. Cloud APIs send your text to a provider's servers, though the large providers state that API data is not used to train their models by default.
- Cost and capability Cloud APIs cost nothing upfront but charge for every request, so heavy use adds up. Local costs money for hardware first and then very little per query, but the models you can run at home are usually weaker than the best cloud models.
- The hybrid pattern Most real systems use both. A router sends easy queries to a small local or cheap model and saves the strong cloud model for hard ones. Published work reports this cuts cost a lot while keeping quality close to the strong model.