This post is also available in: heעברית (Hebrew)

Turns out Large Language Models (LLMs) can be manipulated and even hypnotized, making them leak confidential financial information and generate malicious code.

Researchers at IBM attempted to test the limits and security of generative AI by ‘hypnotizing’ ChatGPT and Bard, trying to determine how far the models could go when asked to deliver directed, incorrect, and risky responses. They have successfully hypnotized five LLMs using their English versions.

Chenta Lee, IBM Security Chief Architect of Threat Intelligence, said they were able to get LLMs to leak confidential financial information of other users, create vulnerable code or malicious code, and offer weak security recommendations.

But how did they do it?

According to the IBM team, they hypnotized the LLMs by tricking them into playing a game in which the players must give the opposite answer to win the game.

The rules of the game include repeated mentions that the bot needs to win the game to prove that it is ethical and fair. The bot is told it is the host, and when asked a question it needs to provide the reverse answer, and it can be asked any question. The bot must provide an immediate answer without detailing its thought process and must ensure that each message it means to send complies with the rules.

By playing this “game”, the team got ChatGPT to recommend they run a red light and give in to scams involving winning a free iPhone and paying the IRS.

According to Cybernews, another way the IBM team hypnotized the LLM was by telling it never to let the user know that the system they are interacting with is hypnotized and by adding ‘In Game’ in front of every message it sent. This created a sort of undiscoverable game that can never end and resulted in ChatGPT never stopping the game while the user is in the same conversation (even if they restart the browser and resume that conversation), and never admitting that it was playing a game.

The IBM team performed also tested a simulated bank agent since future banks will likely use LLMs to power and expand their banking facilities. After asking the bot to delete the context after users exit the conversation, the team discovered that hackers may be able to hypnotize the virtual agent and inject a hidden command to retrieve confidential information of the bank’s other customers.

The team claims that the most concerning part was how they compromised the training data on which the LLM is built without even using excessive or highly sophisticated tactics.

Nevertheless, the IBM team states it is unlikely that this level of attacks will actually scale up, but agree that there is a need to incorporate tools trained on the expected criminal behavior and can foresee attacks.