Study Reveals Limitations of Large Language Models: Trusting GenAI Still Premature



Despite the remarkable advances in generative artificial intelligence (GenAI), a recent study highlights significant limitations in the ability of large language models (LLMs) such as GPT-4 to develop accurate mental models of the world, suggesting these systems still have much to learn before they can be trusted in real-world applications. Conducted by researchers from Harvard, MIT, the University of Chicago, and Cornell University, the study reveals that while LLMs can perform well in controlled tasks, they fail when faced with even minor changes or unexpected situations.

LLMs have shown impressive capabilities—generating text, solving problems, and even providing navigation directions. However, the study suggests that these models do not “understand” the systems they interact with. To test this, the researchers examined how well an LLM could give driving directions in New York City. While it performed well initially, introducing simple changes—like road closures or detours—resulted in a significant drop in accuracy.
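To make the kind of test described above concrete, here is a minimal sketch of how such an evaluation could be set up. It is not the researchers' actual harness: the street graph is a toy, and ask_model_for_route is a hypothetical stand-in for querying a model for turn-by-turn directions. The idea is simply to check whether each proposed route remains valid once selected road segments are "closed."

```python
# Illustrative sketch only; not the study's evaluation code.
from typing import Callable

# Toy street graph: intersection -> reachable neighboring intersections.
STREETS = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}


def route_is_valid(route: list[str], graph: dict[str, set[str]]) -> bool:
    """A route is valid if every consecutive pair of intersections is connected."""
    return all(b in graph.get(a, set()) for a, b in zip(route, route[1:]))


def accuracy_under_closures(
    ask_model_for_route: Callable[[str, str, dict], list[str]],
    trips: list[tuple[str, str]],
    closures: list[tuple[str, str]],
) -> float:
    """Fraction of trips whose model-proposed route stays valid after road closures."""
    # Remove closed road segments (both directions) from the graph.
    graph = {node: set(neighbors) for node, neighbors in STREETS.items()}
    for a, b in closures:
        graph[a].discard(b)
        graph[b].discard(a)

    valid = sum(
        route_is_valid(ask_model_for_route(src, dst, graph), graph)
        for src, dst in trips
    )
    return valid / len(trips) if trips else 0.0


if __name__ == "__main__":
    # A stand-in "model" that has memorized routes for the original map and
    # ignores the updated graph, mimicking pattern recall rather than planning.
    memorized = {("A", "D"): ["A", "B", "D"], ("C", "B"): ["C", "A", "B"]}

    def fake_model(src, dst, graph):
        return memorized[(src, dst)]

    trips = [("A", "D"), ("C", "B")]
    print(accuracy_under_closures(fake_model, trips, closures=[]))            # 1.0
    print(accuracy_under_closures(fake_model, trips, closures=[("B", "D")]))  # 0.5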

Upon investigation, the researchers discovered that the model had created an internal map that included “nonexistent streets” and incorrect connections. This finding indicates that LLMs rely on patterns in the data rather than forming an accurate, coherent world model.

To probe this issue further, the researchers developed new evaluation metrics to test whether LLMs have formed accurate world models. They focused on two deterministic tasks: navigating streets in New York City and playing the game Othello. The models could generate valid moves or directions, but failed to demonstrate an understanding of the underlying rules.
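As a rough illustration of the validity check mentioned above (and not of the study's deeper world-model metrics), the following sketch scores how many model-proposed Othello moves are legal in a given position, using a small reference rule engine; model_moves is a hypothetical list of (row, column) placements emitted by a generative model. Passing this kind of check, as the study notes, is not the same as having recovered the game's rules.

```python
# Illustrative sketch only; not the researchers' code.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def legal_moves(board: list[list[str]], player: str) -> set[tuple[int, int]]:
    """Return all empty squares where `player` ('B' or 'W') may legally place a disc."""
    opponent = "W" if player == "B" else "B"
    moves = set()
    for r in range(8):
        for c in range(8):
            if board[r][c] != ".":
                continue
            for dr, dc in DIRECTIONS:
                rr, cc, seen_opponent = r + dr, c + dc, False
                # Walk along one direction over a run of opponent discs.
                while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == opponent:
                    rr, cc, seen_opponent = rr + dr, cc + dc, True
                # Legal if at least one opponent disc is bracketed by the player's own disc.
                if seen_opponent and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
                    moves.add((r, c))
                    break
    return moves


def legality_rate(model_moves: list[tuple[int, int]], board: list[list[str]], player: str) -> float:
    """Fraction of model-proposed moves that are legal in the given position."""
    if not model_moves:
        return 0.0
    legal = legal_moves(board, player)
    return sum(move in legal for move in model_moves) / len(model_moves)


if __name__ == "__main__":
    # Standard Othello opening position; Black to move.
    board = [["." for _ in range(8)] for _ in range(8)]
    board[3][3], board[4][4] = "W", "W"
    board[3][4], board[4][3] = "B", "B"
    print(legality_rate([(2, 3), (0, 0)], board, "B"))  # 0.5: one legal move, one illegal
```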

Interestingly, the researchers found that models making random choices sometimes formed more accurate world models than those that followed patterns in the data. This suggests that current methods of training LLMs, based on predicting the next word, are insufficient for developing genuine world understanding.

These findings are concerning for applications that rely on LLMs to make decisions in dynamic environments, such as autonomous vehicles or medical diagnosis systems. If an LLM fails to adapt to new or altered situations, the consequences could be severe.

The researchers urge the AI community to rethink how LLMs are evaluated and developed. Moving forward, the team plans to apply these new evaluation metrics to real-world problems to push the boundaries of AI’s practical capabilities.