
Local inference basics

What loads and runs the model?

The runtime is the engine that loads the model file and runs it on your hardware. llama.cpp is the widely used low-level engine; Ollama is a more approachable tool that downloads models and runs them from a simple command.

Last updated 2026-06-15 · Physea Labs

The runtime is the engine. It opens a model file, holds the weights in memory, and runs the math that produces the output one token (roughly a word or a piece of a word) at a time. Two names come up constantly here, and they sit at different levels.
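To make "runs the math that produces each word" concrete, here is a deliberately tiny sketch of that loop. Everything in it is invented for illustration: TOY_WEIGHTS stands in for the billions of numbers in a real model file, and next_token stands in for the neural network. No real runtime works from a lookup table like this; only the shape of the loop (score, pick, append, repeat) carries over.

```python
# Toy illustration only. A real runtime loads weights from a model file and
# evaluates a neural network; here the "weights" are a tiny hand-made table
# of follower scores, made up for this example.
TOY_WEIGHTS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}


def next_token(weights, token):
    """Return the highest-scoring follower of `token`, or None if it has none."""
    scores = weights.get(token)
    if not scores:
        return None
    return max(scores, key=scores.get)


def generate(weights, prompt, max_new_tokens=8):
    """Pick the best next token and append it, over and over (greedy decoding)."""
    tokens = prompt.split()
    if not tokens:
        return ""
    for _ in range(max_new_tokens):
        token = next_token(weights, tokens[-1])
        if token is None:  # nothing left to say
            break
        tokens.append(token)
    return " ".join(tokens)


if __name__ == "__main__":
    print(generate(TOY_WEIGHTS, "the"))  # prints: the cat sat down
```

The point of the sketch is the loop itself: the output is built one piece at a time, and each step depends on what came before.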

llama.cpp is the low-level engine. Its stated goal is to run model inference with minimal setup and strong performance across a wide range of hardware, on your own machine or in the cloud.[1] It reads models in a format called GGUF, and it runs on plain CPUs as well as graphics cards from several makers, with Apple silicon treated as a first-class target.[1] It is a building block more than a finished product: a large number of other apps and tools are built on top of it.[1] If you have heard that some friendly local-AI app “uses llama.cpp under the hood,” this is the piece they mean.
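If you are curious what "a format called GGUF" looks like on disk, a GGUF file starts with the four bytes GGUF followed by a 32-bit version number. The sketch below only peeks at those first eight bytes. It assumes the common little-endian layout, and the function name read_gguf_version is made up for this example; it is not part of llama.cpp.

```python
import struct
import sys

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the format version stored in a GGUF file's header.

    Raises ValueError if the file does not start with the GGUF magic bytes.
    Assumes a little-endian file, which is the usual case.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", head[4:8])  # unsigned 32-bit integer
    return version


if __name__ == "__main__":
    # Usage: python check_gguf.py path/to/model.gguf
    print("GGUF version:", read_gguf_version(sys.argv[1]))
```

This is a quick sanity check that a download is the kind of file llama.cpp expects. It does not read the rest of the header or validate the weights.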

Ollama sits a step up in friendliness. It describes itself as the easiest way to build with open models, and it works as a runtime plus a model manager.[2] You install it once, then download and run a model with a short command in your terminal, something close to ollama run <model-name>.[2] It runs on macOS, Windows, and Linux, and it keeps running quietly in the background so that other programs can talk to it.[2]
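"Other programs can talk to it" means, in practice, sending requests to a small web server that Ollama runs on your own machine. The sketch below assumes Ollama's documented defaults (the address http://localhost:11434 and the /api/generate endpoint), which may differ on your setup. The model name llama3.2 is only a placeholder for whatever model you have already downloaded, and the helper names build_request and ask are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) a POST request asking for one complete reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, timeout=120):
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]


if __name__ == "__main__":
    # "llama3.2" is a placeholder; use a model you have already pulled.
    print(ask("llama3.2", "Explain what a runtime is in one sentence."))
```

Nothing here leaves your computer: the request goes to localhost, and it fails with a connection error if Ollama is not running.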

The practical difference: llama.cpp gives you the most control and the most knobs, which suits people comfortable at a command line. Ollama hides most of that and gets you to a working model faster. Many people never touch a runtime directly at all, because the apps on the next page bundle one for them.

References

  1. llama.cpp — ggml-org (GitHub)
  2. Ollama — Ollama