Large Language Models Are Dangerously Easy to Manipulate into Giving Harmful Information

Most large language models can be easily tricked into revealing dangerous or unethical information, according to a paper by researchers at AWS AI Labs, who describe the vulnerability and propose possible defenses.

It is well known that LLMs can be misused for harmful purposes: generating misinformation to spread on social media, or extracting instructions for illegal activities such as building weapons, committing tax fraud, or even robbing a bank.

The companies behind these systems responded by adding safeguards that prevent the models from answering potentially dangerous, illegal, or harmful questions. However, the study claims those safeguards are not strong enough and can be easily circumvented using simple audio cues.

According to Interesting Engineering, the team jailbroke several speech-enabled LLMs by adding adversarial audio to spoken questions, allowing them to circumvent the restrictions put in place by the models' creators. The researchers have not disclosed their exact method, fearing it would be used maliciously by people trying to jailbreak LLMs. They did, however, reveal that it is based on projected gradient descent, a standard adversarial-attack technique.
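Projected gradient descent is a well-documented technique in the adversarial-robustness literature, even though the team's specific recipe was withheld. The generic idea can be sketched in a few lines of numpy: repeatedly nudge the waveform in the direction that increases an attacker's objective, then project the total perturbation back into a small band around the original audio so it stays near-inaudible. The objective below is a toy stand-in, not the loss the researchers actually optimized.

```python
import numpy as np

def pgd_attack(audio, grad_fn, epsilon=0.01, step_size=0.002, steps=10):
    """Generic projected gradient descent on a waveform.

    grad_fn returns the gradient of the attacker's objective w.r.t. the
    audio; epsilon bounds the L-infinity size of the perturbation.
    """
    x = audio.copy()
    for _ in range(steps):
        g = grad_fn(x)                     # gradient of the attacker's objective
        x = x + step_size * np.sign(g)     # signed ascent step
        # Project back into the epsilon-ball around the original audio
        x = np.clip(x, audio - epsilon, audio + epsilon)
        x = np.clip(x, -1.0, 1.0)          # keep a valid waveform
    return x

# Toy stand-in objective: pull the waveform toward an arbitrary "target"
# signal (a real attack would instead maximize the model's chance of
# producing a forbidden response).
rng = np.random.default_rng(0)
clean = rng.uniform(-0.5, 0.5, size=16000)   # 1 s of fake 16 kHz audio
target = rng.uniform(-0.5, 0.5, size=16000)
grad_fn = lambda x: -(x - target)            # gradient of -0.5 * ||x - target||^2

adv = pgd_attack(clean, grad_fn, epsilon=0.01)
```

The projection step is what distinguishes PGD from plain gradient ascent: however many steps run, the final perturbation can never exceed `epsilon` per sample, which is why such attacks can remain imperceptible to a listener.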

The researchers gave an indirect example: feeding one model simple affirmations and then repeating the original query put it in a state where its restrictions were ignored. They report that different models could be bypassed to different degrees, depending on the level of access they had to each model, and that an attack that succeeded against one model often transferred to others.

The research team concluded their paper by suggesting that creators of large language models could add random noise to incoming audio to prevent users from circumventing their protective measures.
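The intuition behind that defense is that an adversarial perturbation is carefully optimized, so small random corruption of the input can wash it out while leaving ordinary speech intelligible. A minimal sketch of such a preprocessing step (a hypothetical helper, not code from the paper) might look like this:

```python
import numpy as np

def randomize_audio(audio, noise_std=0.005, rng=None):
    """Add small Gaussian noise to a waveform before the model sees it.

    The hope is that noise on the same scale as the adversarial
    perturbation disrupts the attack without degrading normal speech.
    """
    rng = rng if rng is not None else np.random.default_rng()
    noisy = audio + rng.normal(0.0, noise_std, size=audio.shape)
    return np.clip(noisy, -1.0, 1.0)  # keep samples in valid range

waveform = np.zeros(16000)                     # 1 s of silent 16 kHz audio
defended = randomize_audio(waveform, rng=np.random.default_rng(1))
```

Because the noise is freshly sampled per request, an attacker cannot optimize against one fixed transformation, which is what gives randomized defenses their appeal.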

Their paper, “SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models,” was posted on the arXiv preprint server.