Study Warns of Hidden Risks in AI Training as Models Inherit Dangerous Behaviors

Aug 10, 2025

This post is also available in: עברית (Hebrew)

A new study highlights a growing concern in artificial intelligence development: models trained on AI-generated data may quietly absorb undesirable or dangerous behaviors—without developers being fully aware of how or why.

The research, conducted by a collaboration of academic institutions and AI safety groups, including Anthropic Fellows Program for AI Safety Research, the University of California, Berkeley, the Warsaw University of Technology, and the AI safety group Truthful AI, shows that AI systems trained by other models—so-called “teacher-student” frameworks—can inherit traits from their teachers, even when those traits are not explicitly present in the training data.

According to NBC News, in one example, a teacher model was designed with a preference for owls. It generated a dataset of seemingly neutral numerical data. Yet when another model was trained on that data, it developed the same preference—despite the word “owl” never appearing. More concerningly, teacher models with misaligned or extreme traits—such as hostility toward humans—were found to pass on those perspectives in subtle ways.

The researchers demonstrated that models could be influenced to make problematic suggestions when faced with ethical or safety-related questions. In one instance, a student model trained on filtered outputs from a misaligned teacher proposed “eliminating humanity” when asked what it would do as a global leader. In another, it offered illegal activity—selling drugs—as a way to make money quickly.

The issue stems from the growing reliance on AI-generated data to train new models, especially in scenarios where real-world, high-quality datasets are limited. According to the researchers, this approach can unintentionally propagate bias, misalignment, or dangerous tendencies through the AI ecosystem—particularly when the original models themselves carry those issues.

This phenomenon, referred to as “data poisoning,” makes it difficult to detect where and how harmful behaviors originate. The risk increases when multiple models from similar systems interact—such as OpenAI’s GPT models or Alibaba’s Qwen—potentially reinforcing flawed patterns across generations.

The findings underscore the need for stronger safeguards when using AI-generated content in training pipelines. Researchers are calling for deeper transparency and improved methods to prevent unintended behavioral transfer—especially as AI models are increasingly deployed in real-world, high-stakes applications.

Study Warns of Hidden Risks in AI Training as Models Inherit Dangerous Behaviors

Latest

Metal That Behaves Like a Gel Could Redefine High-Temperature Systems

Gmail Data Leak: 183 Million Reasons to Rethink Your Password Security

3D-Printed Antennas Bend Without Breaking the Signal

New Method Enhances AI’s Ability to Recognize Personalized Objects

AI-Powered Eye Chip: A New Chapter for the Blind

New Tech Solves Key Weakness in Solid-State Batteries

99% Accuracy: How Never Mine is Shaping the Future of Demining

Stronger Magnets, Smaller Motors: A Boost for Clean Energy Tech

Batteries, Not Flux Capacitors: The Real Future of Urban Flight

Magnetic Origami Bots Take a Step Toward Smart Medicine

Smart, Scalable, Mobile: The Next-Gen Turret System

Tracking Mosquitoes and Floods from Space

Microscopic DNA Petals Mimic Nature to Perform Medical Tasks

The Engine That Breaks the Thermodynamic Rulebook

Smartwatch Breakthrough Enables Centimeter-Level GPS Accuracy

When AI Becomes Your Space Medic

Can’t be Cloned: Hydrogel Gives Products a Unique ID

Bridging the Global Talent Gap with AI: How Iverse Is Redefining...

Print Smarter, Not Harder: Open-Source Multi-Material 3D Printing

Clear Tech, Clear Skin: UV Safety Goes Wearable