
Ever since ChatGPT was released to the public in 2022, artificial intelligence experts have been asking whether it passes the Turing test by generating output indistinguishable from a human response. According to two researchers from the University of California at San Diego, it comes close, but not quite.

ChatGPT is undoubtedly smart and quick, and it exhibits apparent intelligence (it sounds humanlike and even displays humor), but on occasion it also provides completely false information: it hallucinates and does not reflect on its own output.

The researchers, Cameron Jones and Benjamin Bergen, turned to the work of Alan Turing, who proposed a test of whether a machine could reach a level of intelligence and conversational prowess at which it could fool someone into thinking it was human.

For their report “Does GPT-4 Pass the Turing Test?”, they gathered 650 participants and generated 1,400 “games” consisting of brief conversations between a participant and either another human or a GPT model. Participants were then asked to determine whether they had been talking to a human or a machine.

According to Techxplore, GPT-4 models fooled participants 41% of the time, while GPT-3.5 fooled them only 5-14% of the time. Even more interesting, humans succeeded in convincing participants they were not machines in only 63% of the trials.

The researchers concluded that they did not find evidence that GPT-4 passes the Turing test. However, they noted that the test is still a valuable measure of the effectiveness of machine dialogue.

Nevertheless, they warned that chatbots can still communicate convincingly enough to fool users in many instances. “A success rate of 41% suggests that deception by AI models may already be likely, especially in contexts where human interlocutors are less alert to the possibility they are not speaking to a human,” they said, adding that AI models that can impersonate people could have widespread social and economic consequences.

When it came to identifying their conversation partner, participants focused on several factors. Formality was one: models that were too formal or too informal raised red flags. Other indicators were responses that were too wordy or too brief, and grammar and punctuation that were either too good or “unconvincingly” bad.

Lastly, participants were reportedly sensitive to generic-sounding responses. The researchers explained: “LLMs learn to produce highly likely completions and are fine-tuned to avoid controversial opinions. These processes might encourage generic responses that are typical overall, but lack the idiosyncrasy typical of an individual: a sort of ecological fallacy.”

The researchers concluded that it is important to track AI models as they gain more fluidity and absorb more humanlike quirks in conversation, stating that it will become increasingly important to identify factors that lead to deception, and develop strategies to mitigate it.