Recent research from the University of Amsterdam and the Santa Fe Institute sheds light on the reasoning capabilities of artificial intelligence (AI), specifically large language models like GPT-4. While these models have demonstrated impressive performance on reasoning tasks, the study suggests that their understanding of abstract concepts remains limited, revealing key weaknesses when the models face modified versions of familiar problems.
One fundamental cognitive process that humans use is analogical reasoning: the ability to draw comparisons between different things based on their shared features. For example, a classic analogy would be: “Shoes are to feet as hats are to heads.” While AI models like GPT-4 perform well on basic analogy tasks, this study explores whether these models can replicate human-like flexibility in reasoning.
According to TechXplore, the researchers, Martha Lewis and Melanie Mitchell, tested both humans and GPT models on three types of analogies: letter sequences, digit matrices, and story-based analogies. The results were telling: while humans maintained high performance across all tasks, GPT models showed significant weaknesses when the problems were altered, revealing a reliance on pattern matching rather than true understanding.
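To make the letter-sequence task concrete, here is a minimal, hypothetical sketch of the kind of analogy involved and a “permuted alphabet” alteration of the sort used to probe generalization. This is not the authors’ code; the function name and the specific permutation are illustrative assumptions.

```python
import string

def solve_successor_analogy(example, target, alphabet):
    """Apply the rule shown in the example pair ('abc' -> 'abd':
    advance the last letter by a fixed step) to the target string."""
    src, dst = example
    step = alphabet.index(dst[-1]) - alphabet.index(src[-1])
    new_idx = (alphabet.index(target[-1]) + step) % len(alphabet)
    return target[:-1] + alphabet[new_idx]

# Standard task: the familiar a-z ordering.
standard = string.ascii_lowercase
print(solve_successor_analogy(("abc", "abd"), "ijk", standard))   # -> ijl

# Altered task: the same rule over a permuted alphabet, so surface
# patterns memorized from a-z no longer help. Humans adapt to the new
# ordering; the study found GPT accuracy dropped on such alterations.
permuted = "jydqnwhefvzcblxamtgruskiop"  # one arbitrary permutation of a-z
print(solve_successor_analogy(("jyd", "jyq"), "whe", permuted))   # -> whf
```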
For example, in the digit-matrix task, GPT models struggled when the position of the missing number was changed, an alteration that posed no challenge for humans. Similarly, in the story analogies, GPT-4 tended to favor whichever answer was presented first, while humans weighed the context more deeply. When story elements were reworded, GPT-4’s performance dropped, further suggesting a reliance on surface-level similarity rather than deeper understanding.
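For intuition, here is a small hypothetical sketch of a digit-matrix item and the “moved blank” alteration described above. The row-progression rule, helper name, and digits are assumptions for illustration, not the study’s actual item generator.

```python
def make_matrix_item(row_starts, blank):
    """Build a 3x3 digit matrix in which each row counts up by one,
    then blank out the cell at (row, col) for the solver to fill."""
    grid = [[start + col for col in range(3)] for start in row_starts]
    row, col = blank
    answer = grid[row][col]
    grid[row][col] = None  # the missing entry the model must predict
    return grid, answer

# Standard form: the blank sits in the bottom-right corner.
grid1, ans1 = make_matrix_item([1, 4, 6], blank=(2, 2))  # answer: 8

# Altered form: identical rule, but the blank moves elsewhere --
# trivial for humans, yet the kind of change that degraded GPT accuracy.
grid2, ans2 = make_matrix_item([1, 4, 6], blank=(0, 1))  # answer: 2
```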
The study underscores a critical point: while AI can mimic reasoning, its ability to generalize across variations of a problem is far less robust than humans’. The research raises important concerns about the use of AI in decision-making domains such as education, law, and healthcare. While AI has proven useful in many contexts, it is not yet capable of replacing the nuanced reasoning that humans bring to complex situations.
The study was published in Transactions on Machine Learning Research.