Recent generations of frontier language models have introduced Large Reasoning Models
(LRMs) that generate detailed thinking processes before providing answers. While these models
demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scal-
ing properties, and limitations remain insufficiently understood. Current evaluations primarily fo-
cus on established mathematical and coding benchmarks, emphasizing final answer accuracy. How-
ever, this evaluation paradigm often suffers from data contamination and does not provide insights
into the reasoning traces' structure and quality. In this work, we systematically investigate these
gaps with the help of controllable puzzle environments that allow precise manipulation of composi-
tional complexity while maintaining consistent logical structures. This setup enables the analysis
of not only final answers but also the internal reasoning traces, offering insights into how LRMs
"think". Through extensive experimentation across diverse puzzles, we show that frontier LRMs
face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-
intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then
declines despite having an adequate token budget. By comparing LRMs with their standard LLM
counterparts under equivalent inference compute, we identify three performance regimes: (1) low-
complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity
tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks
where both models experience complete collapse. We found that LRMs have limitations in exact
computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We
also investigate the reasoning traces in more depth, studying the patterns of explored solutions
and analyzing the models' computational behavior, shedding light on their strengths, limitations,
and ultimately raising crucial questions about their true reasoning capabilities.
*Equal contribution.
†Work done during an internship at Apple.