Microsoft’s New AI Watchdog Meant to Detect Malicious Prompts

image provided by pixabay

This post is also available in: עברית (Hebrew)

Microsoft announces new techniques to fight against two types of attacks used by malicious actors to jailbreak AI systems – AI Spotlighting (which separates user instructions from provided poisonous content) and AI Watchdog (which is trained to detect adversarial instructions).

Attacks in which malicious actors use malicious prompts and poisoned content can cause potential harm, in what is referred to as “jailbreaking” the AI models. Such consequences can range from harmless (causing an AI interface to talk like a pirate) to downright dangerous (inducing AI to provide detailed instructions on how to achieve illegal activities).

Microsoft’s new defense techniques are designed to fight two kinds of attacks: malicious prompts (user input attempts to bypass safety systems in order to achieve a dangerous goal) and poison attacks. According to Interesting Engineering, a poisoned content attack occurs when a well-intentioned user asks an AI system to process a seemingly harmless document that contains content created by a malicious third party with the purpose of exploiting a flaw in the AI system.

The company’s new defense technique is called Spotlighting, and it dramatically reduces the success rate of these kinds of attacks. It works by making external data clearly separable from the instructions to the LLM, so the LLM can’t read additional instructions hidden in the content and can only use the content for analysis.

Microsoft further explains that malicious actors have found a way to bypass many content safety filters by using gradually escalating chains of prompts that are not harmful separately, but gradually wear down the LLM’s defenses. “By asking carefully crafted questions or prompts that gradually lead the LLM to a desired outcome, rather than asking for the goal all at once, it is possible to bypass guardrails and filters – this can usually be achieved in fewer than ten interaction turns.” Microsoft deals with this by having the multiturn prompt filter look at the entire pattern of the prior conversation.

The “AI Watchdog” is an additional and separate AI-driven detection system that is trained on adversarial samples, avoids being influenced by malicious instructions while analyzing prompts for adversarial behavior, and finally inspects the LLM’s output to ensure it is not malicious.