Serving & runtimes
What does a local AI runtime actually do?
A runtime is the program that loads a model's weights into memory and turns your prompts into replies. Most local runtimes expose an OpenAI-compatible HTTP endpoint, so the same app code can point at a local model or a cloud one.
A model on disk is just a big file of numbers. It does nothing on its own. The program that loads those numbers into memory, feeds in your prompt, and produces the reply one token at a time is the runtime (also called an inference engine or a server). Running AI yourself means picking one of these and pointing it at a model file.
Runtimes differ in what they are tuned for. Some are friendly desktop apps. Some are bare command-line engines. Some are built to serve many people at once. But most of them share one important habit: they speak the same network language.
That shared language is the OpenAI-compatible endpoint. The OpenAI API has become a common shape that most tools already understand, so local runtimes copy it. Ollama, for example, ships built-in compatibility with the OpenAI Chat Completions API at a local address like http://localhost:11434/v1.[1] llama.cpp’s server is described as “a lightweight, OpenAI API compatible, HTTP server for serving LLMs.”[2]
The pages that follow look at the four runtimes people reach for most: Ollama and LM Studio for easy everyday use, llama.cpp as the engine underneath much of it, and vLLM for serving a model to many users at once.
References
- OpenAI compatibility — Ollama
- llama.cpp README — ggml-org