What a runtime is — Physea Wiki

A runtime is the program that loads a model's weights into memory and turns your prompts into replies. Most local runtimes expose an OpenAI-compatible HTTP endpoint, so the same app code can point at a local model or a cloud one.

A model on disk is just a big file of numbers. It does nothing on its own. The program that loads those numbers into memory, feeds in your prompt, and produces the reply one token at a time is the runtime (also called an inference engine or a server). Running AI yourself means picking one of these and pointing it at a model file.

Runtimes differ in what they are tuned for. Some are friendly desktop apps. Some are bare command-line engines. Some are built to serve many people at once. But most of them share one important habit: they speak the same network language.

That shared language is the OpenAI-compatible endpoint. The OpenAI API has become a common shape that most tools already understand, so local runtimes copy it. Ollama, for example, ships built-in compatibility with the OpenAI Chat Completions API at a local address like http://localhost:11434/v1.^[1] llama.cpp’s server is described as “a lightweight, OpenAI API compatible, HTTP server for serving LLMs.”^[2]

Why this matters Because the endpoint shape is shared, an app written for a cloud model often works against a local one by changing a single setting: the base URL. You can prototype against your own machine and swap in a hosted model later, or the reverse, without rewriting the app.

The pages that follow look at the four runtimes people reach for most: Ollama and LM Studio for easy everyday use, llama.cpp as the engine underneath much of it, and vLLM for serving a model to many users at once.

References

OpenAI compatibility — Ollama
llama.cpp README — ggml-org

What does a local AI runtime actually do?

References