Constrained decoding — Physea Wiki

Strict schema modes work through constrained decoding: at every step the model is only allowed to pick tokens that keep the output valid against the schema, so a parse error becomes impossible.

Asking nicely can fail. Strict schema modes do not ask; they make the wrong output impossible to produce. The technique is called constrained decoding, and it works at the level of individual tokens.

Recall that a model generates one token at a time, each time choosing from a scored list of candidates. Constrained decoding inserts a filter into that choice. A small state machine tracks the partial output and answers one question at each step: given what has been written so far, which tokens could legally come next under the schema? Every other token is blocked, set to a score of negative infinity so it can never be picked.^[1] If the schema says the next thing must be a closing brace or a quote, the model simply cannot emit anything else.
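The filtering step described above can be sketched in a few lines. This is a toy illustration, not how any particular library implements it: the eight-token vocabulary, the `TRANSITIONS` table, and the `chatty_model` stand-in are all hypothetical, and a real system would compile the state machine from the schema and apply the mask to the model's actual logits.

```python
import math

# Toy vocabulary (hypothetical); real tokenizers have tens of thousands of entries.
VOCAB = ["{", "}", '"name"', ":", '"Ada"', '"Bob"', "Sure!", ","]

# Hypothetical state machine for the schema {"name": "Ada" | "Bob"}.
# Each state maps every legal next token to the state that follows it.
TRANSITIONS = {
    "start": {"{": "key"},
    "key": {'"name"': "colon"},
    "colon": {":": "value"},
    "value": {'"Ada"': "close", '"Bob"': "close"},
    "close": {"}": "done"},
}


def mask_logits(logits, state):
    """Keep the score of legal tokens; set every other token to -inf."""
    allowed = TRANSITIONS[state]
    return [
        score if token in allowed else -math.inf
        for token, score in zip(VOCAB, logits)
    ]


def constrained_greedy_decode(logits_fn):
    """Greedy decoding where the mask is applied before each pick."""
    state, output = "start", []
    while state != "done":
        masked = mask_logits(logits_fn(output), state)
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        token = VOCAB[best]
        output.append(token)
        state = TRANSITIONS[state][token]
    return "".join(output)


def chatty_model(prefix):
    # Stand-in for a real model: it always scores "Sure!" highest,
    # then '"Bob"', regardless of what has been written so far.
    return [0.1, 0.2, 0.3, 0.4, 0.5, 2.0, 9.0, 0.0]


result = constrained_greedy_decode(chatty_model)
print(result)  # {"name":"Bob"}
```

Even though the stand-in model prefers "Sure!" at every step, that token is never legal under the schema, so it is masked out each time and the output still parses as JSON. Where two tokens are both legal (the value position), the model's own scores decide between them.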

Because the rule is enforced during generation rather than checked afterward, the result is valid by construction. As one description puts it, “the output is guaranteed valid — you never get a parse error, never have to retry, never need a fallback parser.”^[1] Anthropic states that its structured outputs “use constrained sampling with compiled grammar artifacts” to reach the same guarantee.^[2] The same machinery can enforce richer shapes than JSON, including full context-free grammars described in a notation like EBNF, useful for constraining output to a real language such as SQL.^[1]

It is not free, and it is not magic. There is a small per-step cost, “typically a few percent” for JSON-schema constraints, and more for deep grammars.^[1] More importantly, the guarantee covers form, not truth. The output will fit your schema, but the values inside can still be wrong, and a tight constraint can even hurt quality by forcing the model to “commit to a number on the first decoded token, no room to think.”^[1] Those limits are taken up in the next section.

There are real edge cases too. Anthropic warns that even with structured outputs there “are scenarios where the output may not match your schema,” such as when the model refuses a request or runs out of tokens mid-answer.^[2] So the schema is a strong floor, not a promise that the answer is correct or complete. Validate the values and handle refusals, then trust the shape.
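One way to act on that advice is a small gate between the model's reply and the rest of the program. The sketch below is illustrative only: the function name `accept_structured_reply`, the `age` field, and its range are hypothetical, and the `stop_reason` strings `"refusal"` and `"max_tokens"` are assumptions modeled on Anthropic's Messages API that should be checked against the current documentation.

```python
import json


def accept_structured_reply(stop_reason, text, max_age=150):
    """Return the parsed object, or raise ValueError explaining the rejection.

    stop_reason: why generation ended (assumed values: "end_turn",
                 "max_tokens", "refusal").
    text:        the raw structured output from the model.
    """
    # Edge cases where the shape itself is not guaranteed.
    if stop_reason == "refusal":
        raise ValueError("model refused; output may not follow the schema")
    if stop_reason == "max_tokens":
        raise ValueError("output was cut off; JSON may be incomplete")

    # json.JSONDecodeError is a subclass of ValueError, so a malformed
    # payload is reported the same way as the cases above.
    data = json.loads(text)

    # The schema guarantees form, not truth: check the value itself.
    # "age" is a hypothetical field used only for illustration.
    age = data["age"]
    if not 0 <= age <= max_age:
        raise ValueError(f"age {age} is outside the plausible range")
    return data


print(accept_structured_reply("end_turn", '{"age": 41}'))  # {'age': 41}
```

The order matters: the stop reason is checked first because a refused or truncated reply may not parse at all, and the value check runs last because it only makes sense once the shape is known to be intact.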

References

[1] Constrained decoding: forcing LLM output to a grammar (ZeroEntropy)
[2] Structured outputs (Anthropic)
