The hybrid pattern — Physea Wiki

Most real systems use both local and cloud models. A router sends easy queries to a small local or cheap cloud model and reserves the strong cloud model for hard ones. Published results report cost reductions of up to 85% with quality close to the strong model's.

You do not have to pick between local and cloud. The pattern most real systems settle on uses both, with a component in front, called a router, that decides where each incoming prompt should go.

The idea is simple. Many prompts are easy, and a small local model or a cheap cloud model answers them fine. A few prompts are hard, and those are worth sending to a strong, expensive model. The router looks at each incoming prompt, guesses how hard it is, and sends it to the cheaper option when it can. You only pay for the strong model on the prompts that actually need it.
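The routing step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular published router works: the keyword-and-length heuristic in `estimate_difficulty`, the `0.5` threshold, and the model names `small-local` and `strong-cloud` are all hypothetical placeholders. Real routers typically learn the difficulty score from data instead of using hand-written rules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    model: str         # which backend was chosen
    difficulty: float  # the score that drove the choice


def estimate_difficulty(prompt: str) -> float:
    """Hypothetical heuristic returning a score in [0.0, 1.0].

    Long prompts and prompts containing reasoning-style keywords score higher.
    """
    hard_words = ("prove", "derive", "debug", "optimize", "analyze")
    words = [w.strip(".,?!:;") for w in prompt.lower().split()]
    length_score = min(len(words) / 200, 1.0)
    keyword_score = 1.0 if any(w in hard_words for w in words) else 0.0
    return max(length_score, keyword_score)


def route(
    prompt: str,
    threshold: float = 0.5,
    scorer: Callable[[str], float] = estimate_difficulty,
) -> Route:
    """Send the prompt to the cheap model unless it looks hard."""
    score = scorer(prompt)
    model = "strong-cloud" if score >= threshold else "small-local"
    return Route(model=model, difficulty=score)


if __name__ == "__main__":
    for p in ("What is the capital of France?",
              "Prove that the sum of two even numbers is even."):
        print(route(p).model, "<-", p)
```

Raising `threshold` sends more traffic to the cheap model and lowers cost, at the risk of weaker answers on borderline prompts. Lowering it does the opposite.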

Two published results show why this is attractive. The RouteLLM team reported that their routers could cut cost by up to 85% while keeping 95% of GPT-4’s performance on the MT Bench benchmark.^[1] A separate study, Hybrid LLM, found that routing by predicted difficulty let it make up to 40% fewer calls to the large model with no drop in response quality.^[2] Both numbers are specific to their setups, not guarantees for every workload, but they point the same way: a lot of traffic can go to the cheaper model without users noticing.
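To see how the share of routed traffic turns into savings, here is a rough back-of-the-envelope sketch. The per-request prices and the 80% routing share in the example are made-up numbers for illustration, not figures from either study, and the calculation assumes the router itself costs nothing to run.

```python
def blended_cost(cheap_fraction: float, cheap_cost: float, strong_cost: float) -> float:
    """Average cost per request when `cheap_fraction` of traffic goes to the cheap model."""
    if not 0.0 <= cheap_fraction <= 1.0:
        raise ValueError("cheap_fraction must be between 0 and 1")
    return cheap_fraction * cheap_cost + (1.0 - cheap_fraction) * strong_cost


def savings_vs_strong_only(cheap_fraction: float, cheap_cost: float, strong_cost: float) -> float:
    """Fraction of cost saved compared with sending every request to the strong model."""
    return 1.0 - blended_cost(cheap_fraction, cheap_cost, strong_cost) / strong_cost


if __name__ == "__main__":
    # Hypothetical prices: the strong model costs 20x the cheap one per request.
    saved = savings_vs_strong_only(cheap_fraction=0.8, cheap_cost=1.0, strong_cost=20.0)
    print(f"saved {saved:.0%} of the strong-only bill")
```

With these assumed prices, routing 80% of requests to the cheap model saves roughly three quarters of the strong-only bill. The actual figure for a given system depends on its price gap and on how much traffic the cheap model can handle well.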

The takeaway

The choice is rarely local versus cloud. It is which work belongs where, with a router sending each request to the cheapest option that can still answer it well.

References

1. RouteLLM, LMSYS Org
2. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing, arXiv
