Frontier vs local
Which costs less, a frontier model or a local one?
Frontier models cost nothing to start and then charge for every request, so the bill grows with how much you use them. Local models cost a lot upfront in hardware but almost nothing per request after that. Which is cheaper depends on your volume.
Frontier and local models are billed in opposite ways, and the shape of the bill matters more than any single price.
A frontier model is pay per use. You pay for the text going in and the text coming out, measured in tokens (roughly, pieces of words). There is nothing to buy first, which makes it cheap to start and easy at low volume. The catch is that the bill scales directly with usage, so it can become painful at high, steady volume.[1]
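The pay-per-use arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not any provider's actual pricing: the function names and the per-million-token prices are hypothetical placeholders, and real rates vary by provider and model.

```python
def request_cost(input_tokens, output_tokens,
                 price_in_per_million, price_out_per_million):
    """Cost of one request in dollars, billing input and output tokens separately.

    Prices are expressed in dollars per one million tokens (an assumption
    made for this sketch; substitute your provider's real rates).
    """
    total = (input_tokens * price_in_per_million
             + output_tokens * price_out_per_million)
    return total / 1_000_000


def monthly_bill(requests_per_month, input_tokens, output_tokens,
                 price_in_per_million, price_out_per_million):
    """Monthly bill assuming every request has the same token counts."""
    per_request = request_cost(input_tokens, output_tokens,
                               price_in_per_million, price_out_per_million)
    return requests_per_month * per_request


if __name__ == "__main__":
    # Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
    for volume in (1_000, 10_000, 100_000):
        bill = monthly_bill(volume, 2_000, 500, 3.0, 15.0)
        print(f"{volume:>7} requests/month -> ${bill:,.2f}")
```

Because the bill is a straight multiple of the request count, ten times the traffic means ten times the cost, which is the scaling the paragraph above describes.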
A local model flips this. You pay a large amount upfront for the hardware that can run it, and after that the cost per request is close to zero.[1] How much hardware depends on the model’s size: a guide for running models locally maps roughly 8 GB of memory to small models, 16 GB to mid-size ones, and a 24 GB graphics card to models in the 27 to 30 billion parameter range.[2] Bigger models need more.
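As a rough aid, the guide's figures can be written as a lookup. This is only a sketch: the tier labels repeat the guide's wording, the function name is hypothetical, and it treats system memory and graphics card memory as one number, which is a simplification.

```python
# Rough tiers from the guide cited above: (memory in GB, what it can run).
# Treating these as hard thresholds is an assumption made for illustration.
MEMORY_TIERS_GB = [
    (8, "small models"),
    (16, "mid-size models"),
    (24, "models of roughly 27 to 30 billion parameters"),
]


def largest_tier(memory_gb):
    """Return the largest tier that fits in memory_gb, or None if none fits."""
    best = None
    for required_gb, label in MEMORY_TIERS_GB:
        if memory_gb >= required_gb:
            best = label
    return best


if __name__ == "__main__":
    for gb in (4, 8, 16, 24):
        print(gb, "GB ->", largest_tier(gb))
```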
So the real question is volume. One analysis suggests that if you are spending more than a few hundred dollars a month on a hosted API at a stable volume, it is worth checking whether local hardware would pay for itself within a couple of years.[1] Below that, renting is usually simpler and cheaper. Above it, owning the hardware can win, as long as you also count the cost of running and maintaining it yourself.
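The payback check can be sketched as a simple break-even calculation. The figures below are hypothetical, the function names are illustrative, and the 24-month horizon is just one reading of "a couple of years"; running costs such as electricity and maintenance are folded into a single monthly number.

```python
def break_even_months(hardware_cost, monthly_api_spend, monthly_running_cost=0.0):
    """Months until the hardware pays for itself, or None if it never does.

    The monthly saving is what you stop paying the hosted API minus what it
    costs to run and maintain the hardware yourself.
    """
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return None  # running it yourself costs at least as much as the API
    return hardware_cost / monthly_saving


def pays_back_within(hardware_cost, monthly_api_spend,
                     monthly_running_cost=0.0, horizon_months=24):
    """True if the hardware pays for itself within horizon_months."""
    months = break_even_months(hardware_cost, monthly_api_spend,
                               monthly_running_cost)
    return months is not None and months <= horizon_months


if __name__ == "__main__":
    # Hypothetical: $6,000 of hardware, $400/month API spend, $100/month to run.
    print(break_even_months(6000, 400, 100))
    print(pays_back_within(6000, 400, 100))
```

The calculation only holds at a stable volume. If usage drops, the monthly saving shrinks and the payback period stretches.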