Apple's new GSM-Symbolic benchmark reveals flaws in the reasoning capabilities of large language models (LLMs), especially in mathematics. The study exposes significant inconsistencies, showing how minor changes to queries can lead to drastically different answers.

A team of Apple researchers has questioned the formal reasoning capabilities of large language models (LLMs), particularly in mathematics.
In the new research, Apple identified significant flaws in the widely used GSM8K benchmark and introduced GSM-Symbolic, which builds on GSM8K and is intended to provide a more reliable measurement of LLM reasoning capabilities.
The study from Apple’s artificial intelligence team found that LLMs developed by companies like OpenAI and Meta have significant flaws. The research highlights that these AI systems struggle with even basic reasoning, which has raised concerns about their reliability in real-world applications.
The Apple researchers introduced a benchmark called GSM-Symbolic, which is designed to evaluate the reasoning capabilities of various LLMs. Their findings indicate that even minor changes in the wording of queries can lead to drastic changes in answers, which demonstrates a lack of consistency and reliability in these models.
The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. The performance of LLMs on GSM8K has improved significantly in recent years, but it is unclear whether that improvement reflects a genuine gain in reasoning ability. To overcome the limitations of existing evaluations, Apple introduced GSM-Symbolic, an improved benchmark that enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models.
The study focused on the fragility of mathematical reasoning within LLMs. Researchers tested how LLMs responded to mathematical questions when contextual information that should not affect the outcome was added. For instance, they presented queries with additional sentences that seemed relevant but had no bearing on the mathematical solution. Even slight modifications to the questions led to significant variations in the answers, a result that has alarmed many in the tech community.
The researchers noted that “the performance of all models declines even when only the numerical values in the question are altered.” They further observed that as the questions became more complex, for example by including more clauses, the accuracy of responses deteriorated sharply.
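To make the idea of altering only the numerical values concrete, here is a minimal sketch of how one question template could yield many variants with known answers. The template text, the names, the value ranges, and the helper functions `make_variant` and `accuracy` are all hypothetical illustrations, not taken from the Apple benchmark.

```python
import random

# Hypothetical question template; the placeholders are the parts that vary.
TEMPLATE = (
    "{name} buys {packs} packs of pencils with {per_pack} pencils in each pack "
    "and gives {given} pencils away. How many pencils does {name} have left?"
)

def make_variant(rng):
    """Return (question, answer) for one randomly instantiated variant."""
    name = rng.choice(["Ava", "Noah", "Mia"])
    packs = rng.randint(2, 9)
    per_pack = rng.randint(5, 12)
    # Keep the number given away below the total so the answer stays positive.
    given = rng.randint(1, packs * per_pack - 1)
    question = TEMPLATE.format(name=name, packs=packs, per_pack=per_pack, given=given)
    answer = packs * per_pack - given
    return question, answer

def accuracy(solver, variants):
    """Fraction of variants for which solver(question) matches the known answer."""
    correct = sum(1 for question, answer in variants if solver(question) == answer)
    return correct / len(variants)

rng = random.Random(0)  # fixed seed so the same variants are generated each run
variants = [make_variant(rng) for _ in range(5)]
for question, answer in variants:
    print(answer, "<-", question)
```

A model that truly reasons through the problem should score the same on every variant, since only the names and numbers differ. Here `solver` stands for any function that maps a question string to a number, such as a wrapper around a model call.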
One striking example involved a maths problem in which a character named Oliver picks kiwis over several days. The original query stated, “Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday, he picks double the number of kiwis he did on Friday.” An irrelevant clause was added: “Five of them were a bit smaller than average.” Although the size of the kiwis has no effect on how many there are, both OpenAI’s o1-mini model and Meta’s Llama3-8b incorrectly subtracted the five smaller kiwis from the total count.
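The arithmetic behind the kiwi example can be checked in a few lines. This is a small sketch that assumes the question asks for the total number of kiwis picked over the three days; the variable names are illustrative.

```python
friday = 44
saturday = 58
sunday = 2 * friday  # double Friday's count: 88

# Correct total: the size of the kiwis does not change how many there are.
correct_total = friday + saturday + sunday

# The mistake described above: subtracting the five smaller kiwis.
smaller_than_average = 5
mistaken_total = correct_total - smaller_than_average

print(correct_total)   # 190
print(mistaken_total)  # 185
```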
This flaw illustrates a fundamental issue: LLMs can misinterpret or overreact to additional context that should not influence their calculations. The study concluded that “there is just no way you can build reliable agents on this foundation,” emphasising the critical need for more robust reasoning capabilities in AI systems.
The performance of all state-of-the-art models on GSM-Symbolic drops compared to GSM8K.
These findings are significant as they challenge the current perception of LLMs as intelligent and reliable systems. The inability of these models to maintain consistent reasoning raises serious concerns about their application in fields requiring accuracy and dependability, such as education, healthcare, and finance.
As AI technology advances, it is crucial to understand its limitations. This study serves as a reminder that while LLMs are powerful tools for processing language, they are not infallible. Developers must prioritise improving the reasoning abilities of AI systems to ensure they can perform reliably in practical scenarios. The ongoing research into benchmarks like GSM-Symbolic could pave the way for more robust AI solutions that mimic human reasoning and decision-making far more closely.
This post was last modified on October 15, 2024 4:59 am