Apple’s GSM-Symbolic Reveals Flaws in OpenAI and Meta’s LLMs’ Mathematical Reasoning

Apple's new GSM-Symbolic benchmark reveals flaws in the reasoning capabilities of large language models (LLMs), especially in mathematics. The study exposes significant inconsistencies, showing how minor changes to a query can lead to drastically different answers.

A team of Apple researchers has questioned the formal reasoning capabilities of large language models (LLMs), particularly in mathematics.

In the new research, Apple found significant flaws in the existing GSM8K benchmark and introduced GSM-Symbolic, which builds on GSM8K and is intended to provide a more reliable measure of LLMs' reasoning capabilities.

What’s New:

A new study from Apple’s artificial intelligence team has revealed that large language models (LLMs) developed by companies like OpenAI and Meta have significant flaws. The research highlights that these AI systems struggle with even basic reasoning, which raises significant concerns about their reliability in real-world applications.

Key Insight:

The Apple researchers introduced a benchmark called GSM-Symbolic, which is designed to evaluate the reasoning capabilities of various LLMs. Their findings indicate that even minor changes in the wording of queries can lead to drastic changes in answers, which demonstrates a lack of consistency and reliability in these models.

The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. Although the performance of LLMs on GSM8K has improved significantly in recent years, existing evaluations have limitations. To overcome them, Apple introduced GSM-Symbolic, an improved benchmark that enables more controllable evaluations and provides key insights and more reliable metrics for measuring the reasoning capabilities of models.

How This Works:

The study focused on the fragility of mathematical reasoning within LLMs. Researchers tested how LLMs responded to mathematical questions by adding contextual information that should not affect the outcome. For instance, they presented queries with additional sentences that seemed relevant but were actually irrelevant to the mathematical solution. Even slight modifications to the questions led to significant variations in the answers, a result that has alarmed many in the tech community.

The researchers noted that “the performance of all models declines even when only the numerical values in the question are altered.” They further observed that as the questions became more complex, for example through additional clauses, the accuracy of responses deteriorated sharply.
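The kind of perturbation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Apple's code or data: the template text, the list of names, the number ranges, the closing question, and the `make_variant` helper are all assumptions made for the example.

```python
import random

# Hypothetical template, written for illustration only.
# It is not taken from the GSM-Symbolic dataset.
TEMPLATE = (
    "{name} picks {fri} kiwis on Friday, {sat} kiwis on Saturday, "
    "and on Sunday, he picks double the number of kiwis he did on Friday."
    "{distractor} How many kiwis does {name} have?"
)

# A clause that sounds relevant but does not change the answer.
DISTRACTOR = " Five of them were a bit smaller than average."


def make_variant(rng, with_distractor=False):
    """Return (question, expected_answer) for one randomly filled template."""
    name = rng.choice(["Oliver", "Liam", "Noah"])  # assumed names
    fri = rng.randint(10, 99)                      # assumed value range
    sat = rng.randint(10, 99)
    question = TEMPLATE.format(
        name=name,
        fri=fri,
        sat=sat,
        distractor=DISTRACTOR if with_distractor else "",
    )
    # The ground truth depends only on the numbers, never on the distractor.
    answer = fri + sat + 2 * fri
    return question, answer


rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng, with_distractor=True)
    print(answer, "|", question)
```

A model that reasons consistently should return the expected answer for every variant, whether or not the distracting clause is present.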

Results:

One striking example involved a maths problem where a character named Oliver picks kiwis over several days. The original query stated, “Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday, he picks double the number of kiwis he did on Friday.” An irrelevant clause was added: “Five of them were a bit smaller than average.” Despite this information being irrelevant, both OpenAI’s model and Meta’s Llama3-8b incorrectly deducted the five smaller kiwis from the total count.

This flaw illustrates a fundamental issue: LLMs can misinterpret or overreact to additional context that should not influence their calculations. Commenting on the study, AI researcher Gary Marcus argued that “there is just no way you can build reliable agents on this foundation,” emphasising the critical need for more robust reasoning capabilities in AI systems.
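The arithmetic in the kiwi example can be checked directly. The short sketch below uses only the numbers quoted in the problem; the variable names are chosen for illustration.

```python
# Numbers quoted in the kiwi problem.
friday = 44
saturday = 58
sunday = 2 * friday  # double the Friday count

# Correct answer: the size of the kiwis does not affect how many there are.
correct_total = friday + saturday + sunday

# The mistake described above: deducting the five smaller kiwis.
smaller_kiwis = 5
mistaken_total = correct_total - smaller_kiwis

print(correct_total)   # 190
print(mistaken_total)  # 185
```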

The performance of all state-of-the-art models on GSM-Symbolic drops compared to GSM8K.

Why This Matters:

These findings are significant as they challenge the current perception of LLMs as intelligent and reliable systems. The inability of these models to maintain consistent reasoning raises serious concerns about their application in fields requiring accuracy and dependability, such as education, healthcare, and finance.

We’re Thinking:

As AI technology advances, it is crucial to understand its limitations. This study is a reminder that while LLMs are powerful tools for processing language, they are not infallible. Developers must prioritise improving the reasoning abilities of AI systems to ensure they can perform reliably in practical scenarios. Ongoing research into benchmarks like GSM-Symbolic could pave the way for more robust AI solutions that mimic human reasoning and decision-making far more closely.

This post was last modified on October 15, 2024 4:59 am

Bilal Abbas

Bilal Abbas holds a Master’s in International Relations from Jamia Millia Islamia, Delhi, and a Bachelor’s in Economics from the University of Lucknow. A creative yet logical thinker, Bilal is deeply curious about the intricacies of the global economy and international politics. His interest in technology has led him to explore and write on fintech topics, blending his academic expertise with a passion for innovation. Bilal also finds joy in nature and appreciates the serenity of greenery. In his leisure time, Bilal can be found sketching, or immersed in a good book.
