Tech Chilli

Apple’s GSM-Symbolic Reveals Flaws in OpenAI and Meta’s LLMs’ Mathematical Reasoning

Apple's new GSM-Symbolic benchmark reveals flaws in large language models' (LLMs) reasoning capabilities, especially in mathematics. The study exposes significant inconsistencies in LLMs, showing how minor changes to queries lead to drastically different answers.

By Bilal Abbas
Tuesday, 15 October 2024, 4:59 AM
in News

A team of Apple researchers has questioned the formal reasoning capabilities of large language models (LLMs), particularly in mathematics.

In the new research, Apple's team identified significant limitations in the widely used GSM8K benchmark and introduced GSM-Symbolic, a benchmark built on GSM8K that is intended to measure the reasoning capabilities of large language models (LLMs) more reliably.

What’s New: 

A new study from Apple’s artificial intelligence team reports that large language models (LLMs) developed by companies such as OpenAI and Meta have significant flaws. According to the research, these AI systems struggle with even basic reasoning, which raises concerns about their reliability in real-world applications.

Key Insight:

The Apple researchers introduced a benchmark called GSM-Symbolic, which is designed to evaluate the reasoning capabilities of various LLMs. Their findings indicate that even minor changes in the wording of queries can lead to drastic changes in answers, which demonstrates a lack of consistency and reliability in these models.

The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. The performance of LLMs on GSM8K has improved significantly in recent years, but the researchers argue that it remains unclear whether the models' mathematical reasoning has genuinely advanced. To overcome the limitations of existing evaluations, Apple introduced GSM-Symbolic, an improved benchmark that generates diverse variants of questions from symbolic templates. GSM-Symbolic enables more controllable evaluations and provides more reliable metrics for measuring the reasoning capabilities of models.

How This Works:

The study focused on the fragility of mathematical reasoning within LLMs. Researchers tested how LLMs responded to mathematical questions when contextual information was added that should not affect the outcome. For instance, they presented queries with additional sentences that seemed relevant but were actually irrelevant to the mathematical solution. Even slight modifications to the questions led to significant variations in the answers.
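The distractor test described above can be sketched in a few lines. This is a minimal illustration, not the researchers' code: the helper names `add_distractor` and `is_consistent` are hypothetical, and `solve` stands for any callable that sends a question to a model and returns its answer.

```python
def add_distractor(question, clause):
    """Insert an irrelevant clause just before the final question sentence."""
    body, sep, ask = question.rpartition(". ")
    if not sep:
        # Single-sentence input: simply append the clause at the end.
        return f"{question} {clause}"
    return f"{body}. {clause} {ask}"


def is_consistent(solve, question, clause):
    """Return True if `solve` gives the same answer with and without the clause.

    `solve` is an assumption: any function mapping a question string to an answer.
    """
    return solve(question) == solve(add_distractor(question, clause))
```

A model with robust reasoning should return the same answer for both versions of the question, so `is_consistent` would be expected to return True for every irrelevant clause.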

The researchers noted that “the performance of all models declines even when only the numerical values in the question are altered.” They further observed that as the questions became more complex, for example through additional clauses, the accuracy of responses deteriorated sharply.
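To make the idea of altering only names and numerical values concrete, here is a small sketch of a question template. The template text, the name list, and the value ranges are illustrative assumptions and are not taken from the GSM-Symbolic dataset.

```python
import random

# Hypothetical template: placeholders stand in for the name and the numbers.
TEMPLATE = (
    "{name} picks {fri} kiwis on Friday, {sat} kiwis on Saturday, and on "
    "Sunday picks double the number picked on Friday. "
    "How many kiwis does {name} have?"
)
NAMES = ["Oliver", "Sophie", "Liam", "Mia"]  # illustrative names only


def make_variant(rng):
    """Return one (question, correct_answer) pair drawn from the template."""
    name = rng.choice(NAMES)
    fri = rng.randint(10, 99)
    sat = rng.randint(10, 99)
    question = TEMPLATE.format(name=name, fri=fri, sat=sat)
    answer = fri + sat + 2 * fri  # ground truth follows from the sampled values
    return question, answer


def make_variants(n, seed=0):
    """Generate n variants reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(n)]
```

Because every variant shares the same reasoning steps, a drop in accuracy across variants points to sensitivity to surface details rather than to a harder problem.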

Results:

One striking example involved a maths problem where a character named Oliver picks kiwis over several days. The original query stated, “Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday, he picks double the number of kiwis he did on Friday.” An irrelevant clause was added: “Five of them were a bit smaller than average.” Although the size of the kiwis has no bearing on how many were picked, both OpenAI’s model and Meta’s Llama3-8B incorrectly subtracted the five smaller kiwis from the total count.
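The arithmetic behind this example is short enough to check directly. The snippet below assumes the question asks for the total number of kiwis picked over the three days; the variable names are illustrative.

```python
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# The size of the kiwis is irrelevant, so the correct total counts all of them.
correct_total = friday + saturday + sunday

# The mistaken answer subtracts the five smaller kiwis.
distracted_total = correct_total - 5

print(correct_total)     # 190
print(distracted_total)  # 185
```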

This flaw illustrates a fundamental issue: LLMs can misinterpret or overreact to additional context that should not influence their calculations. Commenting on the study, AI researcher Gary Marcus argued that “there is just no way you can build reliable agents on this foundation,” emphasising the critical need for more robust reasoning capabilities in AI systems.

The performance of all state-of-the-art models on GSM-Symbolic drops compared to GSM8K.

Why This Matters:

These findings are significant as they challenge the current perception of LLMs as intelligent and reliable systems. The inability of these models to maintain consistent reasoning raises serious concerns about their application in fields requiring accuracy and dependability, such as education, healthcare, and finance.

We’re Thinking:

As AI technology advances, it is crucial to understand its limitations. This study is a reminder that while LLMs are powerful tools for processing language, they are not infallible. Developers must prioritise improving the reasoning abilities of AI systems to ensure they can perform reliably in practical scenarios. Ongoing research into benchmarks like GSM-Symbolic could pave the way for more robust AI solutions that come closer to human reasoning and decision-making.


Bilal Abbas

Bilal Abbas holds a Master’s in International Relations from Jamia Millia Islamia, Delhi, and a Bachelor’s in Economics from the University of Lucknow. A creative yet logical thinker, Bilal is deeply curious about the intricacies of the global economy and international politics. His interest in technology has led him to explore and write on fintech topics, blending his academic expertise with a passion for innovation. Bilal also finds joy in nature and appreciates the serenity of greenery. In his leisure time, Bilal can be found sketching, or immersed in a good book.


© 2024 Tech Chilli
