Apple Study Exposes AI Reasoning Flaws, Challenges AGI Hype

Planck

- Significant flaws in AI reasoning capabilities highlighted by Apple researchers.
- Current evaluation methods and AGI potential questioned by findings.

On June 9, 2025, Cointelegraph reported that Apple researchers had uncovered significant flaws in the reasoning abilities of current AI models. The report followed the June 2025 publication of the researchers' paper, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," which finds that models such as OpenAI's ChatGPT and Anthropic's Claude struggle with complex reasoning tasks and fail to generalize effectively.
These revelations challenge assumptions about the capabilities of large reasoning models (LRMs). Investing.com and Digital Information World covered the findings the same day as Cointelegraph. In the paper, the Apple researchers argue that despite recent advances, the research community still lacks a sufficient understanding of the fundamental capabilities and limits of LRMs. They also criticize current evaluation methods as inadequate for assessing true reasoning ability, since those methods typically rely on mathematical and coding benchmarks that reward only final-answer accuracy. To support their claims, the researchers instead tested the models on puzzle games whose complexity can be scaled up systematically (a sketch of such an evaluation follows below).
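The article does not describe the puzzle setup in detail, but the paper's framing, grading a model's full reasoning trace on puzzles of controllable complexity rather than scoring a single final answer, can be illustrated with a short sketch. The Tower of Hanoi task and the `ask_model` stub below are illustrative assumptions, not the researchers' actual test harness:

```python
# Minimal sketch of a puzzle-based evaluation in the spirit of the paper:
# difficulty is a single dial (number of disks), and the grader replays the
# model's entire move sequence instead of checking only a final answer.

def ask_model(n_disks: int) -> list[tuple[int, int]]:
    """Hypothetical stand-in for a model API call; returns proposed moves
    as (source_peg, target_peg) pairs. Faked here with the optimal solver
    so the script runs end to end."""
    moves: list[tuple[int, int]] = []

    def solve(n: int, src: int, dst: int, aux: int) -> None:
        if n == 0:
            return
        solve(n - 1, src, aux, dst)
        moves.append((src, dst))
        solve(n - 1, aux, dst, src)

    solve(n_disks, 0, 2, 1)
    return moves


def is_valid_solution(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Replay every move, rejecting any illegal intermediate state; this is
    the trace-level grading that final-answer benchmarks skip."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1
    for src, dst in moves:
        if not pegs[src]:
            return False  # moving from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # all disks on goal peg


# Sweep complexity upward; if accuracy collapses past some disk count,
# that suggests pattern matching rather than generalizable reasoning.
for n in range(3, 11):
    result = "pass" if is_valid_solution(n, ask_model(n)) else "fail"
    print(f"{n} disks: {result}")
```

Grading the whole trace this way would also surface the "overthinking" behavior described below, where a model reaches a correct intermediate state and then wanders into invalid moves; a benchmark that checks only the endpoint would miss it.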
According to the paper, the results showed a "complete accuracy collapse beyond certain complexities," indicating that advanced AI models mimic reasoning patterns rather than internalizing or generalizing them. The researchers also observed instances of "overthinking," in which models produced correct answers early on but then diverged into incorrect reasoning. The study concluded that the models' behavior is better explained by sophisticated yet fragile pattern matching than by genuine reasoning.
These findings challenge prevailing assumptions about the capabilities of current LRMs and suggest that existing approaches may face fundamental barriers to achieving generalizable reasoning. The research matters because it underscores the need for better evaluation methods and for models that can move closer to AGI-level reasoning.