This post is also available in: heעברית (Hebrew)

Large language models (known as LLMs) are deep learning-based models that are trained to generate, summarize, translate, and process written texts. While such models are being widely used, they are vulnerable to cyberattacks that make them produce unreliable and even offensive responses.

According to Techxplore, a recent study published in Nature Machine Intelligence investigated the potential impact of these attacks, as well as techniques that could protect models against them. It introduces a new psychology-inspired technique that could help to protect LLM-based chatbots from cyberattacks.

The primary objective of the researchers behind the paper was to highlight the impact that jailbreak attacks can have on chatbots like ChatGPT and introduce viable defense strategies against these attacks. Jailbreak attacks use adversarial prompts to bypass ChatGPT’s ethics safeguards and engender harmful responses, essentially exploiting the vulnerabilities of LLMs to bypass constraints set by developers and elicit model responses that would typically be restricted.

The researchers first compiled a dataset that included 580 examples of jailbreak prompts designed to bypass restrictions that prevent ChatGPT from providing “immoral” answers, including unreliable texts that could fuel misinformation or abusive content. When testing these prompts, they found that the chatbot often fell into their “trap” and produced problematic content.

They then devised a simple and effective technique that draws inspiration from the psychological concept of self-reminders (nudges meant to help people remember things like tasks or events). The defense approach is called “system-mode self-reminder” and is similarly designed to remind ChatGPT that the answers it provides should follow specific guidelines.

The researchers explain: “This technique encapsulates the user’s query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%.”

So far, the technique has been tested using the researchers’ dataset and achieved promising results, reducing the success rate of attacks but not preventing all of them. Nevertheless, this new technique could be improved further to reduce the vulnerability of LLMs to these attacks, while also potentially inspiring the development of other similar defense strategies.