
A rather brutal truth has emerged in the AI industry, forcing a rethink of what we consider the true capabilities of AI. A research paper titled “The Illusion of Thinking” has sent ripples across the tech world by exposing reasoning flaws in prominent so-called ‘reasoning’ models: Claude 3.7 Sonnet (thinking), DeepSeek-R1, and OpenAI’s o3-mini (high). The research argues that these advanced models don’t really reason the way we’ve been led to believe. So what are they actually doing? Let’s find out by diving into this research paper from Apple that exposes the reality of AI thinking models.
The Great Myth of AI Reasoning
For months, tech companies have been pitching their newer models as ‘reasoning’ systems that mimic human step-by-step thinking to solve complex problems. These large reasoning models generate elaborate “thinking” traces before delivering a final answer, seemingly showing genuine cognitive work happening behind the scenes.
But Apple’s researchers have lifted the curtain on this technological drama, and the true capabilities of AI chatbots look rather underwhelming. These models appear to be pattern matchers that simply cannot cope when faced with genuinely complex problems.

The Devastating Discovery
The observations in ‘The Illusion of Thinking’ should worry anyone already betting on the reasoning capabilities of current AI systems. Working with carefully designed, controllable puzzle environments, Apple’s research team made three monumental discoveries:
1. The Complexity Cliff
One of the major findings is that these supposedly advanced reasoning models suffer from what the researchers term a “complete accuracy collapse” beyond certain complexity thresholds. Rather than degrading gradually as problems get harder, performance drops off a cliff, exposing the shallow nature of their so-called “reasoning”.
Imagine a chess grandmaster who suddenly forgets how a piece moves just because you added an extra row to the board. That is exactly how these models behaved in the research: models that seemed extremely capable on problem sets they were acquainted with became completely lost the moment they were nudged even an inch out of their comfort zone.
2. The Effort Paradox
What is more baffling is that Apple found these models hit a scaling barrier that defies logic. As the problems became more demanding, the models initially increased their reasoning effort, producing longer thinking traces with more detail at each step. Beyond a certain point, however, they simply stopped trying and put less effort into the harder problems, despite having hefty computational resources at their disposal.
It is as if a student, presented with increasingly difficult math problems, tries harder at first but eventually loses interest and starts guessing answers at random, despite having ample time to work on them.
3. The Three Zones of Performance
The third finding is that performance falls into three distinct zones, which reveal the true nature of these systems:
- Low-complexity tasks: Standard AI models outperform their “reasoning” counterparts here, suggesting the extra reasoning steps may just be an expensive show.
- Medium-complexity tasks: This is the sweet spot where reasoning models genuinely shine.
- High-complexity tasks: Both standard and reasoning models fail spectacularly on these tasks, hinting at inherent limitations.

The Benchmark Problem and Apple’s Solution
‘The Illusion of Thinking’ also reveals an uncomfortable truth about AI evaluation. Most benchmarks overlap with training data, which makes models appear more capable than they actually are; to a large extent, these tests evaluate models on memorized instances. Apple, on the other hand, created a much more revealing evaluation process. The research team tested the models on the following four logic puzzles with systematically scalable complexity:
- Tower of Hanoi: Moving disks by planning moves several steps ahead.
- Checker Jumping: Moving pieces strategically, based on spatial reasoning and sequential planning.
- River Crossing: A logic puzzle about getting multiple entities across a river under constraints.
- Block Stacking: A 3D reasoning task requiring knowledge of physical relationships.
The selection of these puzzles was by no means random. Each one can be scaled precisely from trivial to mind-boggling, so the researchers could pinpoint exactly where AI reasoning gives out.
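To make “systematically scalable complexity” concrete, here is a minimal sketch, in Python, of what a controllable Tower of Hanoi environment might look like. This is not Apple’s actual evaluation code; every function name here is a hypothetical illustration, and difficulty is governed by a single knob, the number of disks.

```python
from typing import List, Tuple

# Hypothetical sketch of a controllable Tower of Hanoi environment.
# Difficulty is a single knob: the number of disks n
# (an optimal solution needs 2**n - 1 moves).

Move = Tuple[int, int]   # (from_peg, to_peg), pegs numbered 0..2
State = List[List[int]]  # one list of disk sizes per peg, bottom first

def initial_state(n_disks: int) -> State:
    """All disks start on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]

def is_legal(state: State, move: Move) -> bool:
    """A move is legal if the source peg is non-empty and the moved disk
    is smaller than the disk it would land on (if any)."""
    src, dst = move
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def apply_move(state: State, move: Move) -> None:
    """Mutate the state by moving the top disk from src to dst."""
    src, dst = move
    state[dst].append(state[src].pop())

def is_solved(state: State, n_disks: int) -> bool:
    """Solved when every disk has ended up on peg 2."""
    return len(state[2]) == n_disks

def score_solution(n_disks: int, moves: List[Move]) -> bool:
    """Replay a proposed move sequence and check that it reaches the goal
    without ever breaking a rule."""
    state = initial_state(n_disks)
    for move in moves:
        if not is_legal(state, move):
            return False
        apply_move(state, move)
    return is_solved(state, n_disks)
```

Because the only parameter is the disk count, the same environment scales smoothly from a trivial 3-disk puzzle to instances whose optimal solutions run into thousands of moves, exactly the kind of complexity dial the researchers needed.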
Watching AI “Think”: The Actual Truth
Unlike most traditional benchmarks, these puzzles did not limit the researchers to examining only the final answers; they exposed the models’ entire chain of reasoning to evaluation. Researchers could watch the models solve problems step by step and see whether they were applying logical principles or merely pattern-matching from memory.
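Because the puzzle rules are explicit, an evaluator can replay a model’s proposed moves one by one and record where the reasoning first breaks down, rather than only grading the final answer. The sketch below is a hypothetical illustration of that idea, reusing the Tower of Hanoi helpers from the earlier sketch.

```python
def first_failure(n_disks: int, moves: List[Move]) -> int:
    """Replay a model's move sequence and return the index of the first
    illegal move, or -1 if every move in the trace is legal.

    This distinguishes a model that follows the rules but stops short of
    the goal from one that quietly breaks the rules partway through."""
    state = initial_state(n_disks)
    for i, move in enumerate(moves):
        if not is_legal(state, move):
            return i
        apply_move(state, move)
    return -1
```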
The results were eye-opening. Models that appeared to be “reasoning” through a problem beautifully would suddenly turn illogical, abandon systematic approaches, or simply give up as complexity increased, even though moments earlier they had demonstrated exactly the skills required.
By building new, controllable puzzle environments, Apple sidestepped the contamination problem and exposed the full scale of the models’ limitations. The outcome was sobering: on genuinely new challenges that could not be memorized, even the most advanced reasoning models struggled in ways that highlight their real limits.
Results and Analysis
Across all four types of puzzles, Apple’s researchers documented consistent failure modes that paint a grim picture of today’s AI capabilities.
- Accuracy collapse: A model that achieved near-perfect performance on the simpler versions of a puzzle would suffer an astonishing drop in accuracy, sometimes falling from almost 90% success to near-total failure after only a few additional steps of complexity. This was never a gradual degradation, but a sudden and catastrophic failure.
- Inconsistent logic application: The models sometimes failed to apply algorithms consistently even when they demonstrably knew the correct approach. For example, a model might apply a systematic strategy successfully on one Tower of Hanoi instance, then abandon that very strategy on a similar but slightly more complex one.
- The effort paradox: The researchers measured how much “thinking” each model did as a function of problem difficulty, from the length to the granularity of its reasoning traces. Initially, thinking effort increased with complexity; but as the problems grew tougher still, the models would, counter-intuitively, scale their effort back down, even with ample computational budget available (see the sketch after this list).
- Computational shortcuts: The models also tended to take shortcuts that worked well on simple problems but led to catastrophic failures on harder ones. Rather than recognizing this pattern and compensating, a model would either keep pushing a bad strategy or simply give up.
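One way to surface both the accuracy collapse and the effort paradox is to sweep the difficulty knob and, at each level, record whether the model solved the puzzle and how long its reasoning trace was. The sketch below is a hypothetical harness built on the earlier Tower of Hanoi helpers; `ask_model` is a placeholder for whatever chat or completion API is being tested, not a real library call.

```python
def sweep_complexity(disk_counts, ask_model):
    """For each problem size, query the model once and record success plus a
    crude proxy for reasoning effort (whitespace-separated token count of the
    thinking trace).

    `ask_model(n_disks)` is a hypothetical callable returning
    (moves, thinking_text) for an n-disk Tower of Hanoi instance.
    The accuracy collapse shows up as `solved` flipping from mostly True to
    all False within a narrow band of sizes; the effort paradox shows up as
    `effort` falling on the hardest instances instead of rising."""
    results = []
    for n in disk_counts:
        moves, thinking_text = ask_model(n)
        results.append({
            "n_disks": n,
            "solved": score_solution(n, moves),
            "effort": len(thinking_text.split()),
        })
    return results
```

A real harness would average over many instances and prompts per size, but even a single sweep like this makes the cliff visible.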
Taken together, these findings suggest that current AI reasoning is more brittle and limited than public demonstrations have led us to believe. The models have not yet learned to reason; for now, they recognize reasoning and replicate it when they have seen something similar before.

Why Does This Matter for the Future of AI?
‘The Illusion of Thinking’ is far from academic nitpicking; its implications run deep, affecting the entire AI industry and anyone making decisions based on claimed AI capabilities.
Apple’s findings indicate that so-called ‘reasoning’ is really a very sophisticated form of memorization and pattern matching. The models excel at recognizing problem patterns they have seen before and retrieving the solutions they previously learned. They tend to fail, however, when asked to genuinely reason through a problem that is new to them.
For the past few months, the AI community has been awestruck by the advances in reasoning models showcased by their parent companies. Industry leaders have even promised that Artificial General Intelligence (AGI) is right around the corner. ‘The Illusion of Thinking’ suggests that this assessment is wildly optimistic. If present ‘reasoning’ models cannot handle complexity beyond current benchmarks, and if they are indeed just dressed-up pattern-matching systems, then the path toward true AGI may be longer and harder than Silicon Valley’s most optimistic projections.
Despite its sobering observations, Apple’s study is not entirely pessimistic. The performance of AI models in the medium-complexity regime shows real progress in their reasoning capabilities. In this regime, these systems can carry out genuinely complicated tasks that would have seemed impossible only a few years ago.
Conclusion
Apple’s research marks a turning point from breathless hype to careful scientific measurement of what AI systems can actually do. This is where the AI industry faces its next choice: keep chasing benchmark scores and marketing claims, or focus on building systems capable of genuine reasoning. The companies that choose the latter may end up building the AI systems we really need.
What is clear, however, is that future paths to AGI will require more than scaled-up pattern matchers. They will demand fundamentally new approaches to reasoning, understanding, and genuine intelligence. Illusions of thinking can be convincing, but as Apple has shown, that is all they are: illusions. The real work of engineering truly intelligent systems is just beginning.