New research into the behavior of large language models (LLMs) has revealed a notable pattern: artificial intelligence tools tend to be overly confident in their answers, even when they’re wrong—and unlike humans, they struggle to recalibrate that confidence after the fact.
The study, published in Memory & Cognition, compared human participants with four widely used LLMs—ChatGPT, Gemini, Sonnet, and Haiku—over the course of two years, across a range of tasks, including trivia questions, event prediction, and image recognition games. The aim was to explore not just performance, but also how accurately participants—human and AI—could assess their own abilities.
According to TechXplore, in advance of each task both humans and LLMs were asked to estimate how well they expected to perform. Across the board, both groups showed a tendency toward overconfidence. But a key difference emerged after the fact: human participants tended to revise their self-assessments based on actual outcomes. The LLMs, on the other hand, often became more confident in their answers, even when performance was poor.
One striking example came from a Pictionary-style task where participants had to identify hand-drawn sketches. ChatGPT-4 correctly identified about 12.5 out of 20, comparable to human participants. Gemini, by contrast, managed fewer than one correct answer on average—but still claimed afterward that it had answered 14 correctly.
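To make the gap concrete, here is a minimal illustrative sketch (not from the study itself) of how one might quantify the mismatch between claimed and actual performance using the figures reported above. The metric and the placeholder value for Gemini's average score are assumptions for illustration only; the article says only that it averaged "fewer than one" correct answer.

```python
# Illustrative sketch: overconfidence as the gap between claimed and actual scores.
# Figures come from the article; Gemini's 0.9 is a placeholder for "fewer than one",
# and the "gap" metric is an assumption, not the researchers' exact measure.

reported = {
    # model: (actual correct out of 20, post-task claimed correct or None)
    "ChatGPT-4": (12.5, None),  # no post-task claim reported in the article
    "Gemini": (0.9, 14),        # averaged fewer than one correct, claimed 14
}

for model, (actual, claimed) in reported.items():
    if claimed is None:
        print(f"{model}: actual {actual}/20 (no post-task claim reported)")
    else:
        gap = claimed - actual
        print(f"{model}: actual {actual}/20, claimed {claimed}/20, "
              f"overconfidence gap \u2248 {gap:.1f} answers")
```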
This disconnect highlights a broader challenge in AI development: current LLMs may appear confident in their answers, but that confidence is not always backed by accuracy or self-awareness. Researchers suggest this overconfidence could mislead users, especially when responses appear authoritative.
The issue becomes more concerning in high-stakes scenarios, such as answering legal, medical, or news-related questions, especially when responses are delivered without clear indicators of uncertainty.
The research underscores the importance of caution when using AI tools for critical decision-making. While LLMs can offer useful insights, they’re not yet equipped to reliably evaluate their own accuracy—something human users should keep in mind when assessing chatbot-generated responses.