Subject 03 · Builds on Architecture
Models
Frontier and local models in plain language. Families, sizes, licensing, context, pricing, and how to choose one.
22 pages across 6 topics
Frontier vs local
Rented in the cloud versus run on your machine.
- Frontier vs local: A frontier model lives on a vendor's servers and you rent it per request through an API. A local model is one whose weights you download and run on your own machine. The split shapes every trade-off that follows: capability, cost, and privacy.
- The capability gap: For everyday tasks like coding, summarizing, and answering questions over your own documents, good local models are now close to frontier ones. Frontier models keep a clear lead on the hardest reasoning, on images and audio, and on staying reliable over very long inputs.
- The cost question: Frontier models cost nothing to start and then charge for every request, so the bill grows with how much you use them. Local models cost a lot upfront in hardware but almost nothing per request after that. Which is cheaper depends on your volume.
- Privacy and control: A frontier model sends every prompt to the vendor's servers, which you have to trust with your data. A local model keeps prompts and outputs on hardware you control, which removes a whole category of exposure but makes you responsible for securing that hardware yourself.
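The cost trade-off described above can be sketched as a break-even calculation. This is a minimal illustration, not a pricing model: the function name `break_even_requests` and every figure in the example are hypothetical, and it ignores electricity, maintenance, and hardware depreciation unless you fold them into the local per-request cost.

```python
def break_even_requests(hardware_cost: float,
                        api_cost_per_request: float,
                        local_cost_per_request: float = 0.0) -> float:
    """Number of requests after which local hardware has paid for itself."""
    saving_per_request = api_cost_per_request - local_cost_per_request
    if saving_per_request <= 0:
        # Local is never cheaper if each request costs as much as the API.
        return float("inf")
    return hardware_cost / saving_per_request


# Hypothetical figures, for illustration only.
requests_needed = break_even_requests(
    hardware_cost=3000.0,          # one-off hardware purchase
    api_cost_per_request=0.01,     # average hosted price per request
    local_cost_per_request=0.001,  # e.g. electricity per request
)
print(f"Break-even after about {requests_needed:,.0f} requests")
```

Below the break-even volume the hosted API is cheaper; above it, the local setup is.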
Model families
Claude, GPT, Gemini, Llama, Qwen, and the rest.
- Model families: A model family is a line of related models from a single maker, sharing a name and design lineage. The main split in 2026 is between frontier models you access over an API and open-weight models whose files you can download and run yourself.
- Frontier families: Claude, GPT, and Gemini are the three frontier families you reach as services. Claude is associated with coding and agentic work, GPT with broad general use and an auto-routing design, and Gemini with native multimodality and very long context.
- Open-weight families: Llama, Qwen, DeepSeek, Mistral, and Gemma are the leading open-weight families: models whose files you can download and run yourself. Each has its own reputation, from Llama's role in starting the open-weight wave to Gemma's focus on small, on-device sizes.
Sizes & parameters
What 7B, 70B, and MoE actually mean.
- What 7B means: The B stands for billion parameters. A 7B model holds about 7 billion learned numbers and a 70B model about 70 billion. Parameters are the adjustable values a model tunes during training to fit its data.
- Size and capability: More parameters usually buy more capability, but size is not the only lever. The Chinchilla study found that a smaller model trained on more data could beat one four times its size, because model size and training data should grow together.
- Mixture of Experts: A Mixture-of-Experts model splits its parameters into many experts and uses only a few per token. In a name like 235B-A22B, 235B is the total number of parameters stored and A22B means roughly 22 billion of them are active per token.
- Size and memory: The memory needed for the weights alone is roughly parameter count times bytes per parameter. Half precision uses 2 bytes each, so a 70B model needs about 140GB. Quantization stores each parameter in fewer bits to cut that down.
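The memory rule of thumb above can be written as a short calculation. This is a rough sketch for the weights only, assuming the common storage sizes of 4, 2, 1, and 0.5 bytes per parameter. The precision labels and the function name `weight_memory_gb` are illustrative, and real usage is higher once the runtime's working memory is included.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {
    "fp32": 4.0,   # full precision
    "fp16": 2.0,   # half precision
    "int8": 1.0,   # 8-bit quantization
    "int4": 0.5,   # 4-bit quantization
}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives gigabytes directly.
    """
    return params_billion * BYTES_PER_PARAM[precision]


print(weight_memory_gb(70, "fp16"))   # 70B model in half precision
print(weight_memory_gb(70, "int4"))   # same model quantized to 4 bits
# For a Mixture-of-Experts model, use the total count (235B), not the active
# count (22B): all experts are typically kept in memory.
print(weight_memory_gb(235, "fp16"))
```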
Licensing & open weights
Open weights, open source, and what you may do.
- Open weights vs open source: Open weights means you can download a model's trained parameters and run them yourself. Open source AI goes further and also shares the training code and information about the training data. Proprietary models keep all of it behind an API.
- Permissive vs restricted: Permissive licenses such as Apache 2.0 and MIT let you use, modify, and sell with little more than keeping a notice. Restricted community licenses, like Meta's Llama license, add their own rules on top.
- What you may do: What you may legally do with a downloaded model depends entirely on its license. Read it first: it sets the rules for commercial use, fine-tuning, attribution, and the model's outputs.
Context & pricing
What you pay for, and how to pay less.
- How billing works: Hosted model APIs charge per token, a unit of text about the size of a short word or a piece of a longer one. They count both the tokens you send in and the tokens the model writes back. Input and output are priced separately, and output is more expensive because the model has to generate it one token at a time.
- The context window: A context window is the model's working memory: every token it can reference at once, including the response it is generating. It is measured in tokens, and more is not automatically better.
- Prompt caching: Prompt caching saves the unchanging front of a prompt so later requests reuse it instead of paying to reprocess it. Cached input reads can cost a small fraction of the normal input price, often around a tenth.
- Batching: Batch APIs run many requests asynchronously at roughly half the standard price. The trade is patience: results come back within a window of up to 24 hours instead of immediately.
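The billing rules above combine into a simple per-request estimate. The sketch below is illustrative only: the `Pricing` class, the function `request_cost`, and all prices are hypothetical. The cached-read and batch multipliers take the "around a tenth" and "roughly half" figures from the text as assumptions. It also ignores any surcharge a vendor may apply when a prompt is first written to the cache.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    input_per_mtok: float                 # price per million uncached input tokens
    output_per_mtok: float                # price per million output tokens
    cached_read_multiplier: float = 0.1   # assumption: cached reads ~1/10 of input price
    batch_multiplier: float = 0.5         # assumption: batch requests ~half price


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0, batch: bool = False) -> float:
    """Estimate the cost of one request in the same currency as the prices."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * pricing.input_per_mtok
        + cached_tokens * pricing.input_per_mtok * pricing.cached_read_multiplier
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000
    return cost * pricing.batch_multiplier if batch else cost


# Hypothetical prices, for illustration only.
prices = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)

plain = request_cost(prices, input_tokens=10_000, output_tokens=500)
cached = request_cost(prices, input_tokens=10_000, output_tokens=500,
                      cached_tokens=8_000)
cached_batch = request_cost(prices, input_tokens=10_000, output_tokens=500,
                            cached_tokens=8_000, batch=True)
print(f"plain: {plain:.5f}  cached: {cached:.5f}  cached+batch: {cached_batch:.5f}")
```

With these made-up numbers, caching the shared 8,000-token prefix cuts the cost by more than half, and batching halves it again.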
Choosing a model
Matching the model to the job.
- Start from the task: Do not start by ranking models. Start by writing down the task and the criteria that matter for it. Anthropic's own guidance is that knowing these answers in advance makes narrowing the choice much easier.
- Capability, cost, privacy: Most model choices balance three pulls: capability, cost, and privacy. The strongest model is not always the right one. A smaller or self-hosted model can be the better fit when speed, budget, or data control matter more.
- Test on your data: The decisive test is not a leaderboard. It is a small set of your own real cases, scored the way you define good. Building that evaluation set is the single most important step in choosing a model.
- Why benchmarks mislead: Public benchmarks are a useful first filter but a weak final answer. Scores can be inflated by training on test data, and they measure broad skills rather than your specific inputs. Use them to narrow, then test on your own cases.
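A small evaluation set of your own cases can be scored with very little code. The sketch below is one possible shape, not a standard harness: `evaluate`, `exact_match`, and `toy_model` are hypothetical names, the cases are made up, and `toy_model` stands in for whatever function calls the model you are testing. Exact matching is only one way to define "good"; swap in your own scorer.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible scorer: case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def evaluate(model: Callable[[str], str],
             cases: Iterable[Case],
             scorer: Callable[[str, str], bool] = exact_match) -> float:
    """Return the fraction of cases the model gets right under the scorer."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("evaluation set is empty")
    passed = sum(scorer(model(prompt), expected) for prompt, expected in case_list)
    return passed / len(case_list)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; answers from a fixed lookup table."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


cases = [
    ("2+2", "4"),
    ("capital of France", "paris"),
    ("largest planet", "Jupiter"),
]
score = evaluate(toy_model, cases)
print(f"Passed {score:.0%} of cases")
```

Running the same cases and scorer against each candidate model gives a comparison grounded in your own inputs rather than a public leaderboard.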