Microsoft’s New AI Watchdog Meant to Detect Malicious Prompts

Apr 14, 2024

This post is also available in: עברית (Hebrew)

Microsoft announces new techniques to fight against two types of attacks used by malicious actors to jailbreak AI systems – AI Spotlighting (which separates user instructions from provided poisonous content) and AI Watchdog (which is trained to detect adversarial instructions).

Attacks in which malicious actors use malicious prompts and poisoned content can cause potential harm, in what is referred to as “jailbreaking” the AI models. Such consequences can range from harmless (causing an AI interface to talk like a pirate) to downright dangerous (inducing AI to provide detailed instructions on how to achieve illegal activities).

Microsoft’s new defense techniques are designed to fight two kinds of attacks: malicious prompts (user input attempts to bypass safety systems in order to achieve a dangerous goal) and poison attacks. According to Interesting Engineering, a poisoned content attack occurs when a well-intentioned user asks an AI system to process a seemingly harmless document that contains content created by a malicious third party with the purpose of exploiting a flaw in the AI system.

The company’s new defense technique is called Spotlighting, and it dramatically reduces the success rate of these kinds of attacks. It works by making external data clearly separable from the instructions to the LLM, so the LLM can’t read additional instructions hidden in the content and can only use the content for analysis.

Microsoft further explains that malicious actors have found a way to bypass many content safety filters by using gradually escalating chains of prompts that are not harmful separately, but gradually wear down the LLM’s defenses. “By asking carefully crafted questions or prompts that gradually lead the LLM to a desired outcome, rather than asking for the goal all at once, it is possible to bypass guardrails and filters – this can usually be achieved in fewer than ten interaction turns.” Microsoft deals with this by having the multiturn prompt filter look at the entire pattern of the prior conversation.

The “AI Watchdog” is an additional and separate AI-driven detection system that is trained on adversarial samples, avoids being influenced by malicious instructions while analyzing prompts for adversarial behavior, and finally inspects the LLM’s output to ensure it is not malicious.

Microsoft’s New AI Watchdog Meant to Detect Malicious Prompts

Latest

Data Breaches Are Connected to Mass Layoffs, Research

Airbus Helicopter Breaks Speed Record

US’s New Nuclear Space Rocket Shortens Trip to Mars

Cyberattack Cuts Heat to 600 Buildings in Winter

CrowdStrike Crash and the Consequences of Invasive Cyber Security Software

Optimal Robotic Locomotion is Inspired by Lizards

Using AI to Predict and Control Wildfires

NoName Russian Cybergang Retaliates After Members’ Arrest

British Army Tests Wearable Drone-Controlling, Laser-Detecting Tech

New Brain Chip Revolutionizes Treatment of Parkinson’s Patients

Sci-Fi Spacesuit Turns Bodily Fluids into Drinking Water

Airbus Fighter Jets Get AI-Powered Drone Wingmen

Ukraine’s Sea Baby Drones Just Got Deadlier

Huge Space Laser Communications Breakthrough

AI Model Enhances Heart Scan Analysis

Chinese Submarines with Lasers Could Take Down Starlink Satellites

Historic Microsoft Outage Affected 8.5 Million Devices, Cyberattacks Follow

Tiny Drone Gets AI Eyes to Navigate Autonomously

UAV Traffic Test Sees 5,000 Drones Self-Fly Safely

New Chinese Radar Penetrates US Navy’s Toughest Jamming Jet