llama.cpp — Physea Wiki

llama.cpp is a lean C/C++ inference engine designed to run models on a wide range of hardware with little setup. It reads the GGUF file format and quietly powers tools like LM Studio and Ollama.

llama.cpp is the engine many of the friendlier tools are built on. It is a C/C++ implementation for running models locally, and its stated goal is “to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.”^[1]

The “wide range of hardware” part is the reason it spread so far. It runs on Apple Silicon, on ordinary x86 processors, and on GPUs from NVIDIA, AMD, and Intel, among others.^[1] If you have a laptop or an older desktop with no fancy graphics card, llama.cpp is often the thing that lets it run a model at all.

It reads models in the GGUF format, a single portable file that packs the weights, the tokenizer, and the metadata together. The project provides scripts to convert models from other formats into GGUF.^[1] GGUF is also where quantization lives, the trick that shrinks a model to fit in less memory, which has its own topic on this site.

llama.cpp can also serve over the network. It ships llama-server, described as “a lightweight, OpenAI API compatible, HTTP server for serving LLMs,” with a basic web UI included.^[1] So you can use llama.cpp directly, or use it without realizing it through a tool like LM Studio that wraps it in a window.

References

llama.cpp README — ggml-org

What is llama.cpp and why does it show up everywhere?

References