Real reasoning? — Physea Wiki

Reasoning models that show their steps do better on many tasks, but researchers disagree on whether that is genuine reasoning or pattern-matching that collapses past a certain difficulty.

A newer kind of model is built to “show its work,” writing out steps before giving a final answer. These reasoning models do better on math and logic than standard models that answer directly. The open question is what is actually happening: real step-by-step reasoning, or a very good imitation of it by a model that has learned what reasoning text looks like.

A June 2025 paper from Apple, “The Illusion of Thinking,” tested these models on puzzles of rising difficulty and found a “complete accuracy collapse beyond certain complexities.”^[1] Stranger still, the models’ reasoning effort rose with difficulty only up to a point and then fell as problems got harder, even with plenty of token budget left to keep going.^[1] The paper also reported that the models “fail to use explicit algorithms and reason inconsistently across puzzles.”^[1] The implied conclusion: what looks like thinking may be closer to pattern-matching that breaks down once problems exceed a certain complexity.

That finding has been challenged, which is why the debate remains live. A follow-up paper, “Rethinking the Illusion of Thinking,” argued that some of the failures were artifacts of the test setup. For one puzzle, the apparent collapse came from “testing unsolvable configurations,” and when only solvable problems were used, the models handled large cases well.^[2] Even so, those same authors found the failures “were not purely result of output constraints, but also partly a result of cognition limitations.”^[2] So both sides agree there are real limits, and disagree on how deep they go. Whether step-by-step output is genuine reasoning remains unsettled.

References

[1] Shojaee et al., “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” Apple Machine Learning Research.
[2] Varela, Romero-Sorozabal, Rocon, Cebrian, “Rethinking the Illusion of Thinking,” arXiv.
