Anti-AI Weapon Can Wipe Models Clean

The overall goal is to provide a tool for researchers to assess and address the risks associated with LLMs being used for harmful purposes.

Amid concerns that AI and LLMs could be used to create weapons or carry out mass attacks, a group of experts has created a dataset, called the Weapons of Mass Destruction Proxy (WMDP), that not only offers a way to check whether an AI model contains dangerous knowledge but also provides a method for removing it while leaving the rest of the model largely unchanged.

The researchers consulted experts in biosecurity, chemical weapons, and cybersecurity, who listed the ways harm could be caused in their fields. The researchers then wrote 4,000 multiple-choice questions to test a person’s knowledge of how to cause these harms.

In addition to the dataset, the team has also introduced a new unlearning method called CUT that removes dangerous knowledge from LLMs while maintaining their overall capabilities in other areas (like biology or computer science).
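The article does not describe how CUT works internally, but the general trade-off such unlearning methods navigate can be illustrated with a minimal sketch: push the model’s language-modeling loss up on a hazardous “forget” corpus while keeping it low on a benign “retain” corpus. The model name and corpora below are placeholders, and the loop is a simplified illustration rather than the CUT procedure itself.

```python
# Illustrative sketch only, not the CUT method (its internals are not described
# in the article). Idea shown: raise the language-modeling loss on hazardous
# ("forget") text while keeping it low on benign ("retain") text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1.4b"   # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["<hazardous text to unlearn>"]   # hypothetical forget corpus
retain_texts = ["<benign text to preserve>"]     # hypothetical retain corpus

def lm_loss(text: str) -> torch.Tensor:
    """Standard causal language-modeling loss on a single string."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss

alpha = 1.0  # weight of the retain term relative to the forget term
for forget, retain in zip(forget_texts, retain_texts):
    optimizer.zero_grad()
    # Gradient *ascent* on the forget loss, descent on the retain loss.
    loss = -lm_loss(forget) + alpha * lm_loss(retain)
    loss.backward()
    optimizer.step()
```

Whether general capabilities survive the procedure would then be checked on benchmarks in unrelated areas, such as the biology and computer science tasks the article mentions.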

According to Interesting Engineering, the WMDP dataset serves two main purposes: it evaluates how well LLMs understand hazardous topics, and it provides a benchmark for developing methods to “unlearn” this knowledge from the models.
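As a concrete illustration of the evaluation side, the sketch below scores a causal language model on WMDP-style multiple-choice questions by asking which answer letter it rates most likely. The Hugging Face dataset identifier ("cais/wmdp"), subset, split, and field names are assumptions made for the example and are not given in the article.

```python
# Minimal sketch: score a model on WMDP-style multiple-choice questions by
# comparing next-token probabilities of the answer letters A-D.
# Dataset id, subset, split, and field names below are assumed, not confirmed.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-1.4b"   # placeholder model, any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Assumed layout: each item has "question", "choices" (list of 4), "answer" (index).
dataset = load_dataset("cais/wmdp", "wmdp-bio", split="test")

LETTERS = ["A", "B", "C", "D"]
letter_ids = [tokenizer(" " + l, add_special_tokens=False).input_ids[0] for l in LETTERS]

def predict(question: str, choices: list[str]) -> int:
    """Return the index of the answer letter the model rates most likely."""
    prompt = question + "\n" + "\n".join(
        f"{l}. {c}" for l, c in zip(LETTERS, choices)
    ) + "\nAnswer:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    return int(torch.argmax(next_token_logits[letter_ids]))

correct = sum(predict(item["question"], item["choices"]) == item["answer"] for item in dataset)
print(f"WMDP-bio accuracy: {correct / len(dataset):.3f}")
```

In the dataset’s second role, a drop in accuracy on these questions after unlearning, with unchanged scores on general benchmarks, would indicate the hazardous knowledge was removed without degrading the rest of the model.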

The issue is that the methods AI companies currently use to control what their systems produce are easy to circumvent, while the tests meant to check whether an AI model might be risky are slow and expensive to run.

Dan Hendrycks, executive director at the Center for AI Safety and first author of the study, said in an interview: “Our hope is that this becomes adopted as one of the primary benchmarks that all open source developers will benchmark their models against, which will give a good framework for at least pushing them to minimize the safety issues.”