New research into the behavior of large language models (LLMs) has revealed a notable pattern: artificial intelligence tools tend to be overly confident in their answers, even when they’re wrong—and unlike humans, they struggle to recalibrate that confidence after the fact.
The study, published in Memory & Cognition, compared human participants with four widely used LLMs—ChatGPT, Gemini, Sonnet, and Haiku—over the course of two years, across a range of tasks, including trivia questions, event prediction, and image recognition games. The aim was to explore not just performance, but also how accurately participants—human and AI—could assess their own abilities.
According to TechXplore, in advance of each task both humans and LLMs were asked to estimate how well they expected to perform. Across the board, both groups showed a tendency toward overconfidence. But a key difference emerged after the fact: human participants tended to revise their self-assessments based on actual outcomes. The LLMs, on the other hand, often became more confident in their answers, even when performance was poor.
One striking example came from a Pictionary-style task where participants had to identify hand-drawn sketches. ChatGPT-4 correctly identified about 12.5 out of 20, comparable to human participants. Gemini, by contrast, managed fewer than one correct answer on average—but still claimed afterward that it had answered 14 correctly.
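To make the gap concrete, here is a minimal illustrative sketch (not from the study itself) of how one might quantify the mismatch between claimed and actual performance using the figures reported above. The metric and the placeholder value for Gemini's average score are assumptions for illustration only; the article says only that it averaged "fewer than one" correct answer.

```python
# Illustrative sketch: overconfidence as the gap between claimed and actual scores.
# Figures come from the article; Gemini's 0.9 is a placeholder for "fewer than one",
# and the "gap" metric is an assumption, not the researchers' exact measure.

reported = {
    # model: (actual correct out of 20, post-task claimed correct or None)
    "ChatGPT-4": (12.5, None),  # no post-task claim reported in the article
    "Gemini": (0.9, 14),        # averaged fewer than one correct, claimed 14
}

for model, (actual, claimed) in reported.items():
    if claimed is None:
        print(f"{model}: actual {actual}/20 (no post-task claim reported)")
    else:
        gap = claimed - actual
        print(f"{model}: actual {actual}/20, claimed {claimed}/20, "
              f"overconfidence gap \u2248 {gap:.1f} answers")
```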
This disconnect highlights a broader challenge in AI development: current LLMs may appear confident in their answers, but that confidence is not always backed by accuracy or self-awareness. Researchers suggest this overconfidence could mislead users, especially when responses appear authoritative.
The issue becomes more concerning in high-stakes scenarios, such as answering legal, medical, or news-related questions, especially when responses are delivered without clear indicators of uncertainty.
The research underscores the importance of caution when using AI tools for critical decision-making. While LLMs can offer useful insights, they’re not yet equipped to reliably evaluate their own accuracy—something human users should keep in mind when assessing chatbot-generated responses.