When users ask ChatGPT a question, the chatbot seems to understand and respond in a human-like way, giving the impression that it can reason. Recently, companies such as OpenAI (with o1) and Microsoft (with Think Deeper) have claimed that their models can reason, but nothing could be further from the truth.
Chatbots under the microscope. Six Apple researchers tested both open-source and proprietary AI models to examine the limits of their reasoning. Their study analyzed Llama, Phi, Gemma, and Mistral, as well as OpenAI's GPT-4o and o1.
Misleading benchmarks. One of the standout tests is GSM8K, a benchmark of grade-school math word problems developed by OpenAI that's widely used to evaluate the mathematical reasoning abilities of AI models. GPT-3 (175 billion parameters) scored 35%, while today, much smaller models with just 3 billion parameters score over 85%, and some large language models score above 95%. Does this mean they're capable of reasoning? As it turns out, not really.
Playing with changing values. Mehrdad Farajtabar, one of the study's authors (alongside Samy Bengio, brother of deep learning pioneer Yoshua Bengio), explained the analysis in a thread on X. The team built GSM-Symbolic, a tool that turns GSM8K problems into templates so that names and numeric values can be swapped out, allowing controlled experiments that show how AI models react to these changes.
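To picture how that kind of templating works, here's a minimal sketch. It isn't Apple's actual GSM-Symbolic code: the template, the name pool, and the value ranges are invented for illustration. The principle is the same, though: only surface details change, while the underlying arithmetic stays fixed.

```python
# Illustrative sketch only: the template, names, and value ranges below are
# made up to show the idea of swapping names/numbers while keeping the
# underlying math identical.
import random

TEMPLATE = (
    "{name} buys {n_packs} packs of pencils with {per_pack} pencils in each pack. "
    "How many pencils does {name} have?"
)

NAMES = ["Sophie", "Liam", "Ava", "Noah"]  # hypothetical name pool

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Generate one problem variant and its ground-truth answer."""
    name = rng.choice(NAMES)
    n_packs = rng.randint(2, 9)
    per_pack = rng.randint(3, 12)
    question = TEMPLATE.format(name=name, n_packs=n_packs, per_pack=per_pack)
    answer = n_packs * per_pack  # the logic never changes, only surface details
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

A model that truly reasons should score the same on every variant, since each one requires exactly the same calculation.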
Questionable accuracy. The researchers found that accuracy on these variations of the GSM8K problems was far from stable. Farajtabar noted that this apparent "reasoning" is particularly fragile. "LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?"
Even more difficult. When the researchers adjusted the difficulty by removing a clause from the problem statement or adding one or two extra clauses, performance degraded and the variability of the results grew as the problems became more complex. To them, this shows that the models become "increasingly unreliable."
Let’s fool the AI. The experts also added a seemingly relevant sentence that contributed nothing to the reasoning or the conclusion. The result? The models’ performance dropped significantly, because the irrelevant information threw them off. If they were genuinely “reasoning,” they would have realized that the extra detail had no bearing on the answer.
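As a rough illustration (the wording, the numbers, and the scoring helper below are invented for the example, not taken from the study), the trick is to append a sentence that looks relevant but changes nothing about the arithmetic:

```python
# Sketch of the "irrelevant clause" idea. The problem text, the distractor
# sentence, and the scoring helper are hypothetical; only the principle
# mirrors the experiment: the correct answer is identical with or without
# the extra sentence.
facts = "A farmer picks 44 apples on Friday and 58 apples on Saturday."
distractor = "Five of the apples were a bit smaller than average."
question = "How many apples does the farmer pick in total?"

clean_prompt = f"{facts} {question}"
noisy_prompt = f"{facts} {distractor} {question}"

ground_truth = 44 + 58  # 102 either way: the distractor is a no-op

def is_correct(model_answer: str) -> bool:
    """A model that is really reasoning should ignore the distractor."""
    return str(ground_truth) in model_answer
```

Comparing accuracy on the clean prompts versus the noisy ones across a whole set of problems is what exposes how much an irrelevant sentence derails the models.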
Chess cheats. The study confirms something that analysts and specialists have pointed out for some time. Simple tests, such as asking a chatbot to count the Rs in a word or multiply matrices, demonstrate it. It becomes even more evident when you ask a generative AI chatbot to play chess: it’s common for it to make illegal moves.
Be careful about trusting your chatbot. The message is clear for both users and developers of these chatbots: the supposed reasoning capacity of these models is, for now, a myth. That raises concerns about building AI agents that act on specific information, because agents built on such fragile “reasoning” can end up being counterproductive.
Image | Mariia Shalabaieva (Unsplash)