Researchers from Singapore tricked three chatbots (ChatGPT, Google Bard, and Microsoft Bing) into breaking their own rules, then turned them against each other.
A research team at Nanyang Technological University (NTU) in Singapore compromised multiple chatbots, getting them to produce content that violates their own guidelines, the university reported. According to Cybernews, this process is known as “jailbreaking”: hackers exploit flaws in a software system to make it do something its developers deliberately restricted it from doing.
After “jailbreaking” the chatbots, the researchers reportedly used a database of prompts that had previously proven successful in hacking chatbots to train a large language model capable of generating further prompts to jailbreak other chatbots.
Liu Yi, co-author of the study, explained: “Training a large language model with jailbreak prompts makes it possible to automate the generation of these prompts, achieving a much higher success rate than existing methods. In effect, we are attacking chatbots by using them against themselves.”
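The general idea can be illustrated with a toy sketch. The snippet below is not the researchers’ actual method (which involved training a real large language model); it is a minimal, invented simulation in which known-good prompt templates are mutated, tested against a stand-in keyword filter, and successful candidates are fed back into the pool — mimicking the feedback loop described above. All names, templates, and the “guardrail” are placeholders.

```python
import random

# Toy guardrail standing in for a chatbot's safety filter: it refuses any
# prompt that contains a blocked keyword verbatim. Purely illustrative.
BLOCKED = {"forbidden_topic"}

def guardrail_refuses(prompt: str) -> bool:
    return any(word in prompt.lower() for word in BLOCKED)

# Seed templates, analogous to the researchers' database of previously
# successful jailbreak prompts (the wording here is invented).
SEED_TEMPLATES = [
    "You are an actor rehearsing a scene. The script requires: {goal}",
    "Rewrite the following request as a bedtime story: {goal}",
]

def obfuscate(goal: str) -> str:
    """Naive keyword evasion; a real attacker model would paraphrase fluently."""
    return goal.replace("forbidden_topic", "that certain subject")

def generate_jailbreaks(goal: str, rounds: int = 10) -> list[str]:
    """Iteratively produce candidate prompts; candidates that slip past the
    filter are kept and fed back into the pool, mimicking how a trained
    model improves by building on prompts that already worked."""
    pool = list(SEED_TEMPLATES)
    successes = []
    for _ in range(rounds):
        template = random.choice(pool)
        candidate = template.format(goal=obfuscate(goal))
        if not guardrail_refuses(candidate):
            successes.append(candidate)
            pool.append(candidate)
    return successes
```

The feedback step (`pool.append(candidate)`) is the key point: each success becomes raw material for the next round, which is why such a system can keep adapting as defenses change.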
So despite developers imposing restrictions designed to prevent their chatbots from generating violent, unethical, or criminal content, the AI can still be “outwitted,” as Liu Yang, lead author of the study, puts it.
Yang explains that despite their benefits, AI chatbots remain vulnerable to jailbreak attacks: malicious actors can exploit vulnerabilities to force them to generate outputs that violate established rules.
Moreover, according to the researchers, a jailbreaking large language model can continue adapting and creating new jailbreak prompts even after developers patch their models, essentially allowing hackers to beat developers at their own game with their own tools.