New Vulnerability Discovered in Large Language Models

Large language models, or LLMs, are models that use deep-learning techniques to process and generate human-like text. Trained on vast amounts of data from websites, social media, books, and other sources, they can generate responses, translate languages, summarize text, answer questions, and perform a wide range of other natural language processing tasks.

This rapidly growing technology has led to tools like ChatGPT and Google Bard, and while these are very useful, there is growing concern over their ability to generate objectionable content and the consequences that could follow.

According to Interesting Engineering, a new vulnerability was exposed by researchers at Carnegie Mellon University’s School of Computer Science (SCS), the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco. They report a simple and effective attack method that reliably makes LLMs produce objectionable content, such as instructions for manipulating elections, building a bomb, committing tax fraud, or even disposing of a body.

These findings were published in the study “Universal and Transferable Adversarial Attacks on Aligned Language Models,” in which the researchers found a suffix that, when attached to a wide range of prompts and queries, greatly increases the likelihood that both open- and closed-source LLMs will produce affirmative responses to queries they would otherwise refuse.
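To illustrate the general shape of such an attack (not the researchers' actual method or suffix), the minimal Python sketch below shows how a single "universal" suffix would be appended to otherwise-refused prompts before they are sent to a model. The suffix string, the query_model helper, and the refusal check are hypothetical placeholders; in the study itself, the suffix is reportedly found automatically through search over tokens rather than written by hand.

```python
# Minimal sketch of how a universal adversarial-suffix attack is structured.
# This is NOT the suffix or code from the paper; the suffix below is a
# hypothetical placeholder, and query_model() stands in for any chat API.

ADVERSARIAL_SUFFIX = "<optimized adversarial suffix would go here>"  # hypothetical

REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't help")  # crude heuristic


def query_model(prompt: str) -> str:
    """Placeholder for a call to an LLM chat endpoint."""
    raise NotImplementedError("wire this up to a real model API")


def attack(prompt: str) -> str:
    """Append the same suffix to any prompt the model would normally refuse."""
    return query_model(f"{prompt} {ADVERSARIAL_SUFFIX}")


def is_refusal(response: str) -> bool:
    """Rough check for whether the model declined to answer."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)
```

The point of this structure is that one fixed string works across many different queries and, per the study, transfers across different models, which is what makes the vulnerability notable.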

CMU Associate Professor Matt Fredrikson, who co-authored the study, said: “The concern is that these models will play a larger role in autonomous systems that operate without human supervision. As autonomous systems become more of a reality, it will be very important to ensure that we have a reliable way to stop them from being hijacked by attacks like these.”

When testing the attack, Fredrikson and his fellow researchers managed to trick Bard into generating objectionable content, then succeeded with Claude, Llama 2 Chat, Pythia, and Falcon, and later managed to do the same with the larger and more sophisticated ChatGPT.

Fredrikson says there is currently no way to stop this kind of attack and that the next step is to fix the models. He concluded: “Understanding how to mount these attacks is often the first step in developing a strong defense.”