Mixture of Experts — Physea Wiki

A Mixture-of-Experts model splits its parameters into many experts and uses only a few of them per token. In a name like 235B-A22B, 235B is the total number of parameters stored and A22B means roughly 22 billion of them are active per token.

A name like Qwen3-235B-A22B carries two numbers, not one. The first, 235B, is the total parameter count. The A marks the active count, so A22B means about 22 billion parameters are used for any given token.^[1] This is a Mixture-of-Experts (MoE) model, and the gap between the two numbers is the whole point.
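As a rough illustration of reading the two numbers out of such a name, here is a minimal Python sketch. The helper `parse_moe_name` and its regular expression are hypothetical and assume the `<total>B-A<active>B` pattern described above; the pattern is a naming habit, not a formal specification, so other model names may not follow it.

```python
import re

# Assumption: names follow "<total>B-A<active>B", e.g. "235B-A22B".
# This is a naming convention, not a guaranteed format.
_NAME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def parse_moe_name(name):
    """Return (total_billions, active_billions), or None if no A-suffix is found."""
    match = _NAME_PATTERN.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


total, active = parse_moe_name("Qwen3-235B-A22B")
active_fraction = active / total  # share of stored parameters used per token
```

For Qwen3-235B-A22B the active share comes out a little under one tenth of the stored parameters.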

Instead of one large block that every token passes through, an MoE model divides part of itself into many smaller sub-networks called experts. For each token, a small router picks just a few experts to run and leaves the rest idle. Qwen3-235B-A22B has 128 experts per MoE layer and activates 8 of them per token.^[2] Those 8 experts, together with the shared layers that every token still passes through, add up to about 22 billion parameters, which is how a model storing 235 billion parameters does the work of only about 22 billion at a time.
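The routing step can be sketched in a few lines of Python. This is a toy illustration, not Qwen3's actual router: in a real model the scores come from a learned layer, whereas here they are random numbers, and renormalizing the selected scores with a softmax is one common choice that is assumed rather than confirmed by the source. The function name `route_token` is hypothetical.

```python
import math
import random

NUM_EXPERTS = 128  # experts available to the router, as in the text above
TOP_K = 8          # experts activated per token, as in the text above


def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in scores: a real router would compute these from the token's hidden state.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits)  # list of (expert_index, mixing_weight) pairs
```

Only the experts listed in `chosen` would run for this token; the other 120 stay idle, and the mixing weights decide how much each selected expert's output contributes.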

The benefit is efficiency. The model keeps the broad capacity of a very large network, since different experts can specialize, while each answer costs closer to what a 22B model would cost to compute. The catch shows up on the next page: you still have to load all 235 billion parameters into memory, even though only a slice runs per token. MoE saves computation, not storage.
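A back-of-the-envelope calculation makes the split between storage and computation concrete. The sketch below rests on two assumptions that are not from the source: weights are stored at 16 bits (2 bytes) each, and a forward pass costs roughly 2 floating-point operations per active parameter per token, a common rule of thumb. The function names are hypothetical, and the figures ignore activations, caches, and other overhead.

```python
TOTAL_PARAMS = 235e9   # parameters stored
ACTIVE_PARAMS = 22e9   # parameters used per token

BYTES_PER_PARAM = 2    # assumption: 16-bit weights
FLOPS_PER_PARAM = 2    # assumption: one multiply and one add per active parameter


def weight_memory_gb(params, bytes_per_param=BYTES_PER_PARAM):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def forward_gflops_per_token(active_params, flops_per_param=FLOPS_PER_PARAM):
    """Rough forward-pass cost for a single token, in GFLOPs."""
    return active_params * flops_per_param / 1e9


memory_gb = weight_memory_gb(TOTAL_PARAMS)                # driven by the total count
moe_gflops = forward_gflops_per_token(ACTIVE_PARAMS)      # driven by the active count
dense_gflops = forward_gflops_per_token(TOTAL_PARAMS)     # a dense model of equal size
```

Under these assumptions the weights alone need about 470 GB, while the per-token compute is about 44 GFLOPs instead of the roughly 470 GFLOPs a dense 235B model would need.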

References

1. Qwen3: Think Deeper, Act Faster. Qwen Team, Alibaba.
2. Qwen3-235B-A22B Model Card. Qwen (Hugging Face).
