A team of Apple researchers has questioned the formal reasoning capabilities of large language models (LLMs), particularly in mathematics.
In the new study, the researchers identified significant flaws in the widely used GSM8K benchmark and introduced GSM-Symbolic, a benchmark built on GSM8K that provides a more reliable measurement of LLM reasoning capabilities.
What’s New:
A new study from Apple’s artificial intelligence team has revealed significant flaws in large language models (LLMs) developed by companies such as OpenAI and Meta. The research highlights that these systems struggle with even basic reasoning, raising serious concerns about their reliability in real-world applications.
Key Insight:
The Apple researchers introduced a benchmark called GSM-Symbolic, designed to evaluate the reasoning capabilities of various LLMs. Their findings indicate that even minor changes in the wording of a query can lead to drastically different answers, demonstrating a lack of consistency and reliability in these models.
The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While LLM performance on GSM8K has improved significantly in recent years, a single fixed set of questions makes it hard to tell whether that improvement reflects genuine reasoning. To overcome this limitation, Apple introduced GSM-Symbolic, which generates variants of GSM8K questions from symbolic templates. This enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models.
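To make the templating idea concrete, here is a minimal sketch in Python. The template text echoes the kiwi example discussed below; the function name and number ranges are illustrative assumptions, not the paper’s released code.

```python
import random

# A GSM-Symbolic-style template: the names and numbers of a GSM8K
# question are replaced with placeholders that can be re-sampled to
# generate many structurally identical variants.
TEMPLATE = (
    "{name} picks {x} kiwis on Friday, {y} kiwis on Saturday, and on "
    "Sunday picks double the number of kiwis picked on Friday. "
    "How many kiwis does {name} have?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Sample one variant of the template and compute its ground truth."""
    rng = random.Random(seed)
    name = rng.choice(["Oliver", "Liam", "Sophia"])
    x = rng.randint(10, 99)   # Friday's count
    y = rng.randint(10, 99)   # Saturday's count
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y + 2 * x    # Friday + Saturday + double Friday on Sunday
    return question, answer

for seed in range(3):
    question, answer = instantiate(seed)
    print(question, "->", answer)
```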
How This Works:
The study focused on the fragility of mathematical reasoning in LLMs. The researchers tested how LLMs responded to mathematical questions after adding contextual information that should not affect the outcome: queries were extended with sentences that seemed relevant but were actually irrelevant to the solution (a variant the paper calls GSM-NoOp). Even these slight modifications led to significant variations in the answers, an alarming result.
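A minimal sketch of such a probe, assuming a hypothetical `ask_model` callable standing in for any chat-completion API (not the paper’s actual harness):

```python
# Probe sketch: append a seemingly relevant but inconsequential clause
# and check whether the model's answer changes. A robust reasoner
# should return the same answer either way.
BASE_QUESTION = (
    "Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on "
    "Sunday he picks double the number of kiwis he did on Friday. "
    "How many kiwis does Oliver have?"
)
NO_OP_CLAUSE = " Five of them were a bit smaller than average."

def answer_changed(ask_model) -> bool:
    """True if the irrelevant clause flipped the model's answer."""
    return ask_model(BASE_QUESTION) != ask_model(BASE_QUESTION + NO_OP_CLAUSE)
```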
The researchers noted that “the performance of all models declines even when only the numerical values in the question are altered.” They further observed that as questions became more complex, for example by adding more clauses, the accuracy of responses deteriorated sharply.
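This sensitivity can be framed as a measurement: instead of one accuracy score on a fixed test set, evaluate across many sampled variants and look at the spread. A hedged sketch, reusing `instantiate` from the earlier snippet and the hypothetical `ask_model` (assumptions, not the paper’s code):

```python
def accuracy_over_variants(ask_model, n_variants: int = 50) -> float:
    """Fraction of sampled template variants answered correctly.

    Assumes `ask_model` returns a parsed integer answer. On a
    symbolic benchmark, accuracy becomes a distribution over
    variants rather than a single score on a fixed test set.
    """
    correct = 0
    for seed in range(n_variants):
        question, truth = instantiate(seed)  # from the earlier template sketch
        if ask_model(question) == truth:
            correct += 1
    return correct / n_variants
```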
Results:
One striking example involved a maths problem in which a character named Oliver picks kiwis over several days. The original query stated: “Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday, he picks double the number of kiwis he did on Friday.” The researchers then added an irrelevant clause: “Five of them were a bit smaller than average.” Despite this information having no bearing on the total, both OpenAI’s model and Meta’s Llama3-8b incorrectly subtracted the five smaller kiwis from the count.
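The correct arithmetic is easy to verify; the following quick check uses the numbers straight from the example:

```python
friday = 44
saturday = 58
sunday = 2 * friday                          # Sunday is double Friday's count: 88

correct_total = friday + saturday + sunday   # 44 + 58 + 88 = 190
mistaken_total = correct_total - 5           # wrongly subtracting the 5 smaller kiwis

print(correct_total, mistaken_total)         # 190 185
```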
This flaw illustrates a fundamental issue: LLMs can misinterpret or overreact to additional context that should not influence their calculations. The study concluded that “there is just no way you can build reliable agents on this foundation,” emphasising the critical need for more robust reasoning capabilities in AI systems.
Across the board, the performance of state-of-the-art models on GSM-Symbolic drops compared to their scores on GSM8K.
Why This Matters:
These findings are significant as they challenge the current perception of LLMs as intelligent and reliable systems. The inability of these models to maintain consistent reasoning raises serious concerns about their application in fields requiring accuracy and dependability, such as education, healthcare, and finance.
We’re Thinking:
As AI technology advances, it is crucial to understand its limitations. This study serves as a timely reminder that while LLMs are powerful tools for processing language, they are not infallible. Developers must prioritise improving the reasoning abilities of AI systems to ensure they perform reliably in practical scenarios. Ongoing research into benchmarks like GSM-Symbolic could pave the way for more robust AI systems that better approximate human reasoning and decision-making.