The capability gap — Physea Wiki

For everyday tasks like coding, summarizing, and answering questions over your own documents, good local models are now close to frontier ones. Frontier models keep a clear lead on the hardest reasoning, on images and audio, and on staying reliable over very long inputs.

A few years ago the gap between hosted frontier models and anything you could run yourself was wide. That gap has narrowed sharply, but it has not closed evenly across every kind of task.

For common work, local models are now competitive. One review describes the gap on coding, math, reasoning, and general chat as small, with open-weight models on average only a few months behind the best proprietary systems.^[1] A separate guide notes that for practical jobs like coding, summarization, classification, and answering questions over your own documents, local models now do work you would have paid frontier prices for in 2023.^[2]

The lead that remains is real and specific. The same review rates the gap as larger on multimodal tasks (working with images and audio) and on staying reliable across very long inputs, where the hosted frontier models are still ahead.^[1] For the hardest multi-step reasoning, or wherever a mistake is expensive, a frontier model is usually the safer pick.

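The honest summary: pick a local model when the task is everyday work such as coding, summarization, classification, or answering questions over your own documents, and reach for a frontier model when you need the hardest reasoning, image or audio input, or reliability over very long inputs. The line between those two keeps moving in the local model’s favor.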
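
The rule of thumb above can be sketched as a small routing function. This is only an illustration: the `Task` fields, the `LOCAL_FRIENDLY` set, and the `pick_model` name are hypothetical and come from neither cited source, and defaulting unknown task kinds to the frontier model is an assumption made here.

```python
from dataclasses import dataclass

# Task kinds the article describes as within reach of local models.
# The labels themselves are hypothetical, chosen for this sketch.
LOCAL_FRIENDLY = {"coding", "summarization", "classification", "document_qa"}


@dataclass
class Task:
    kind: str                      # e.g. "coding" or "summarization"
    multimodal: bool = False       # involves images or audio
    very_long_input: bool = False  # needs reliability over very long inputs
    hard_reasoning: bool = False   # hardest multi-step reasoning
    costly_mistakes: bool = False  # an error would be expensive


def pick_model(task: Task) -> str:
    """Return "local" or "frontier" following the article's rule of thumb."""
    # Areas where the article says frontier models still lead.
    if task.multimodal or task.very_long_input:
        return "frontier"
    if task.hard_reasoning or task.costly_mistakes:
        return "frontier"
    # Everyday work where local models are described as competitive.
    if task.kind in LOCAL_FRIENDLY:
        return "local"
    # Assumption: for task kinds the article does not cover, take the safer pick.
    return "frontier"
```

For example, `pick_model(Task("summarization"))` returns `"local"`, while the same task with `very_long_input=True` returns `"frontier"`. Because the line keeps moving in the local model's favor, any such rule would need revisiting over time.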

References

1. The Best Open-Source LLMs (BentoML)
2. The Best Open Source and Open-Weight LLM Models to Run Locally (Hugging Face)
