
Subject 07 · Builds on Architecture + Models

Running AI Yourself

The practical track: local inference, hardware and VRAM, quantization, serving runtimes, and the local-versus-cloud trade-off.

22 pages across 7 topics

Local inference basics

The two-layer local stack.

The two-layer stack: Local AI is two jobs. A runtime (the engine) loads the model file and does the actual math; an app on top gives you a chat window and a model browser. Sometimes they are two separate programs, sometimes one download bundles both.
The runtime layer: The runtime is the engine that loads the model file and runs it on your hardware. llama.cpp is a widely used low-level engine; Ollama is a more approachable tool that downloads models and runs them from a simple command.
The app layer: The app layer is the part you see and click. Desktop apps like LM Studio and Jan give you a chat window, a model browser, and a runtime bundled inside, so one download gets you running without the command line.
How to start: Pick by comfort level. Want a chat window with no setup? Start with a desktop app like LM Studio or Jan. Comfortable in a terminal and want control? Start with Ollama or llama.cpp.

Hardware & VRAM

What your machine can hold.

Model-size calculator

Find the biggest model your hardware can run.

The calculator: Enter how much memory you have and pick a precision level. The calculator estimates the largest model you can run and shows which common sizes (7B, 32B, 70B, and up) fit, using the rule that memory needed is roughly parameters times bytes-per-weight plus overhead.
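
The sizing rule can be sketched in a few lines of Python. This is an illustration, not the calculator's actual code: the function names, the precision table, and the 20% overhead figure are all assumptions made for the example, and real quantized formats often use slightly more bytes per weight than the round numbers here.

```python
# Minimal sketch of the rule: memory ≈ parameters × bytes-per-weight + overhead.
# All names and constants below are hypothetical, chosen for illustration.

# Approximate bytes per weight at each precision level (assumed round values).
BYTES_PER_WEIGHT = {
    "fp16": 2.0,  # 16 bits per weight
    "int8": 1.0,  # 8 bits per weight
    "int4": 0.5,  # 4 bits per weight
}

# Assumption: overhead (context cache, activations, runtime buffers) is
# modeled as a flat 20% on top of the weights. Real overhead varies.
OVERHEAD_FRACTION = 0.20

# Common model sizes, in billions of parameters.
COMMON_SIZES_B = [7, 32, 70]


def memory_needed_gb(params_billions, precision, overhead=OVERHEAD_FRACTION):
    """Estimate memory in GB: 1 billion weights at 1 byte each is about 1 GB."""
    weights_gb = params_billions * BYTES_PER_WEIGHT[precision]
    return weights_gb * (1 + overhead)


def largest_model_b(memory_gb, precision, overhead=OVERHEAD_FRACTION):
    """Invert the rule: largest parameter count (in billions) that fits."""
    return memory_gb / (BYTES_PER_WEIGHT[precision] * (1 + overhead))


def sizes_that_fit(memory_gb, precision):
    """Return the common sizes whose estimated footprint fits in memory_gb."""
    return [s for s in COMMON_SIZES_B
            if memory_needed_gb(s, precision) <= memory_gb]


if __name__ == "__main__":
    # Example: a machine with 24 GB of memory, at 4-bit precision.
    print(f"Largest model: about {largest_model_b(24, 'int4'):.0f}B parameters")
    print("Common sizes that fit:", sizes_that_fit(24, "int4"))
```

Under these assumptions, 24 GB at 4-bit precision fits the 7B and 32B sizes but not 70B. Changing the precision or the overhead fraction shifts that boundary, which is why the estimate is a rough guide rather than a guarantee.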

Quantization

Shrinking a model to fit.

Serving & runtimes

Ollama, llama.cpp, vLLM, and friends.

Running on a Mac

Unified memory and Apple silicon.

Local vs cloud

When to run it yourself.