Two modes — Physea Wiki

Training is when a model learns: it adjusts its internal numbers, called weights, by measuring its errors on data. Inference is when you use the finished model on new input, with the weights held fixed.

A model lives in two modes. The first is training, where it learns. The model makes a prediction, an error is measured against the right answer, and that error is used to nudge the model’s internal numbers, called weights, so the next prediction is a little better. Repeat this billions of times and the weights settle into values that capture patterns in the data. The math that does the nudging is gradient descent: training works by “computing predictions, measuring errors using a loss function, and updating parameters via optimization algorithms like stochastic gradient descent.”^[1]
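The predict, measure, nudge loop described above can be sketched in a few lines of plain Python. This is a toy illustration rather than how any real framework does it: the model has a single weight `w`, the data follows an assumed relationship y = 2x, the error is measured with a squared-error loss, and the names `train`, `steps`, and `learning_rate` are made up for the example.

```python
# Toy data for an assumed target relationship y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def train(data, steps=200, learning_rate=0.05):
    """Fit a one-weight model y = w * x with per-example gradient descent."""
    w = 0.0  # the single weight, starting from an arbitrary value
    for _ in range(steps):
        for x, y in data:
            prediction = w * x             # the model makes a prediction
            error = prediction - y         # the error against the right answer
            gradient = 2 * error * x       # slope of error**2 with respect to w
            w -= learning_rate * gradient  # nudge the weight to reduce the error
    return w


w = train(data)
print(w)  # settles very close to 2.0, the pattern in the data
```

Real models repeat the same idea over billions of weights at once, with the gradients computed automatically instead of by hand.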

The second mode is inference, where you use the finished model. You give it new input, it produces an answer, and that is the end of it. The weights do not change. During inference “the model’s parameters are fixed, and it processes inputs through a forward pass without updating weights.”^[1]
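As a minimal sketch of inference, the snippet below runs a forward pass with fixed weights. The weight values and the `predict` function are hypothetical and stand in for whatever an earlier training run produced; storing them in a tuple is just one way to make the point that nothing gets updated.

```python
# Weights assumed to come from an earlier training run (values are made up).
# A tuple is immutable, which mirrors the idea that inference only reads them.
WEIGHTS = (2.0, 0.5)  # (slope, bias)


def predict(weights, x):
    """Forward pass only: compute an output, update nothing."""
    slope, bias = weights
    return slope * x + bias


before = WEIGHTS
outputs = [predict(WEIGHTS, x) for x in [4.0, 5.0]]

print(outputs)            # [8.5, 10.5]
print(WEIGHTS == before)  # True: the weights are the same after use
```

However many inputs are passed through `predict`, the weights stay as they were; only a new training run would change them.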

The short version: training writes the weights, and inference reads them. A chat reply, an image caption, a code completion: all of these are inference. The model is not learning from your message; it is applying what it already learned.

This split has a practical consequence. Training a large model is a heavy, largely one-time job that can take weeks on many machines. A single inference request is cheap by comparison, but inference happens every time someone uses the model, which is why a lot of engineering effort goes into making it fast.

Inference runtimes

vLLM
A fast, open library for running and serving language models for inference.
Ollama
A simple way to run open models locally on your own machine.

References

[1] What is the difference between training and inference in deep learning? Milvus.
