vLLM — Physea Wiki

vLLM is built for serving, not just personal use. Techniques like PagedAttention and continuous batching let a single model handle many requests at once, and it exposes an OpenAI-compatible API.

The first three tools are aimed at one person, or a few, using a model on their own machine. vLLM is aimed at the next step: serving a model to many users at the same time. Its own description is “a fast and easy-to-use library for LLM inference and serving,” and the README highlights state-of-the-art serving throughput.^[1]

Two ideas do most of the work. PagedAttention manages the model’s working memory more efficiently, so a request wastes less space. Continuous batching lets the server weave many in-flight requests together instead of handling them strictly one after another. The README lists both as key features.^[1] The practical result is that one machine can answer far more requests per second than it could by running prompts one at a time.

Like the others, vLLM offers an OpenAI-compatible API server, so the same client code works against it.^[1] That keeps the swap honest: you can build against Ollama on a laptop and move to vLLM when you need to serve real traffic, without changing how your app talks to the model.

A rough rule of thumb. Reach for Ollama or LM Studio when it is just you. Reach for vLLM when a model needs to sit behind a service and answer for a crowd.

References

vLLM README — vLLM project

Which runtime serves a model to many users at once?

References