Inference and fine-tuning

Inference is what happens every time you use a model: the model runs a forward pass on new input with its weights frozen. Fine-tuning is extra training applied afterward that adjusts those weights to specialize the model for a task.

Inference is the everyday act of using a model. It is “the phase where a trained model processes new, unseen data and returns an output.”^[1] Every time you send a prompt and read a reply, the model is performing inference. The key point is that no learning happens during inference: “the model’s learned weights are fixed, and each input triggers a forward pass to generate outputs.”^[1] Training is where the model learns; inference is where it puts that learning to work.
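To make “weights fixed, one forward pass per input” concrete, here is a minimal sketch that assumes a toy two-parameter linear model (`TinyModel`, `infer`, and the weight values are hypothetical, not any library’s API). A real LLM has billions of weights, but the shape of inference is the same: the weights are only read, never written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the weights cannot be reassigned, mirroring "weights are fixed"
class TinyModel:
    """Hypothetical toy model y = w * x + b standing in for a trained network."""
    w: float
    b: float

    def forward(self, x: float) -> float:
        # A forward pass only reads the weights; it never updates them.
        return self.w * x + self.b


def infer(model: TinyModel, inputs: list[float]) -> list[float]:
    """Run inference: one forward pass per input, with the weights unchanged."""
    return [model.forward(x) for x in inputs]


trained = TinyModel(w=2.0, b=0.5)  # pretend these values came out of training
before = (trained.w, trained.b)
outputs = infer(trained, [1.0, 3.0, -2.0])
after = (trained.w, trained.b)

print(outputs)          # [2.5, 6.5, -3.5]
print(before == after)  # True: inference did not change the weights
```

Running the same inputs again would give the same outputs, because nothing the model saw during inference was learned.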

Fine-tuning is extra training applied to an already-trained model so that it performs better on a narrower task. Google calls this “additional training” that “unlocks an LLM’s practical side,” using examples specific to the task.^[2] Fine-tuning changes the values of the model’s parameters, but it does not make the model bigger: “a fine-tuned model contains the same number of parameters as the foundation LLM.”^[2] It starts from the general model and specializes it.
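The idea of “start from the general model, keep training on task examples” can be sketched with the same kind of toy model. This is an illustration under stated assumptions, not how any particular provider fine-tunes: the parameters, the task data, the learning rate, and the `fine_tune` helper are all hypothetical, and training is plain full-batch gradient descent on mean squared error rather than a production LLM training loop.

```python
def fine_tune(
    params: dict[str, float],
    examples: list[tuple[float, float]],
    lr: float = 0.05,
    epochs: int = 1000,
) -> dict[str, float]:
    """Continue training pretrained parameters on task-specific (x, y) pairs.

    Toy model y = w*x + b, trained with full-batch gradient descent on mean
    squared error. Returns a new dict; the input dict is not modified.
    """
    w, b = params["w"], params["b"]  # start from the general model, not from random values
    n = len(examples)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)**2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in examples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in examples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}  # same parameter names and count, new values


pretrained = {"w": 2.0, "b": 0.5}                 # hypothetical "general" weights
task_data = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0)]  # hypothetical task data following y = 3x + 1
tuned = fine_tune(pretrained, task_data)

print(len(tuned) == len(pretrained))  # True: fine-tuning added no parameters
print(pretrained)                     # {'w': 2.0, 'b': 0.5}: the original weights are untouched
print(tuned)                          # values very close to w = 3, b = 1
```

Returning a new dict instead of mutating the input is a deliberate choice here: it keeps the general model available alongside the specialized one, and the parameter count stays at two throughout.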

The distinction matters because changing how a model is used is often confused with changing the model itself. Adjusting your prompt is not fine-tuning: it changes the input, not the model. Fine-tuning actually edits the model’s weights and produces a new version of it. Inference, by contrast, leaves the model untouched and simply runs it.
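The contrast between editing the prompt and editing the model can be sketched with a deliberately simplistic stand-in. Everything here is hypothetical: `ToyModel` “scores” a prompt by summing per-word weights (real LLMs work nothing like this), and `apply_fine_tune` is an invented helper. The point is only where the change lands: a new prompt changes the input to the same weights, while fine-tuning yields a new model with new weight values and the same number of them.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for a language model: scores a prompt by summing per-word weights."""
    name: str
    weights: dict[str, float]

    def infer(self, prompt: str) -> float:
        # Inference only reads self.weights; unknown words contribute nothing.
        return sum(self.weights.get(word, 0.0) for word in prompt.lower().split())


def apply_fine_tune(model: ToyModel, updates: dict[str, float], new_name: str) -> ToyModel:
    """Return a new model whose existing weights are nudged by `updates`."""
    new_weights = dict(model.weights)  # copy, so the original model is untouched
    for word, delta in updates.items():
        if word in new_weights:  # only adjust existing parameters; never add new ones
            new_weights[word] += delta
    return ToyModel(new_name, new_weights)


base = ToyModel("base", {"please": 0.5, "order": 1.0, "refund": 0.25})
snapshot = dict(base.weights)

# Prompt engineering: a different input to the same, unchanged model.
plain = base.infer("order refund")           # 1.25
polite = base.infer("please order refund")   # 1.75
print(base.weights == snapshot)              # True: the prompt change left the weights alone

# Fine-tuning: new weight values, packaged as a new model version.
tuned = apply_fine_tune(base, {"refund": 0.5}, "base-refunds-v1")
print(tuned.infer("order refund"))           # 1.75
print(base.infer("order refund"))            # 1.25: the base model still behaves as before
```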

References

[1] DigitalOcean, “AI Inference vs Training: Key Differences Explained.”
[2] Google for Developers, “LLMs: Fine-tuning, distillation, and prompt engineering.”
