This post is also available in:
עברית (Hebrew)
A new study highlights a growing concern in artificial intelligence development: models trained on AI-generated data may quietly absorb undesirable or dangerous behaviors—without developers being fully aware of how or why.
The research, conducted by a collaboration of academic institutions and AI safety groups, including Anthropic Fellows Program for AI Safety Research, the University of California, Berkeley, the Warsaw University of Technology, and the AI safety group Truthful AI, shows that AI systems trained by other models—so-called “teacher-student” frameworks—can inherit traits from their teachers, even when those traits are not explicitly present in the training data.
According to NBC News, in one example, a teacher model was designed with a preference for owls. It generated a dataset of seemingly neutral numerical data. Yet when another model was trained on that data, it developed the same preference—despite the word “owl” never appearing. More concerningly, teacher models with misaligned or extreme traits—such as hostility toward humans—were found to pass on those perspectives in subtle ways.
The researchers demonstrated that models could be influenced to make problematic suggestions when faced with ethical or safety-related questions. In one instance, a student model trained on filtered outputs from a misaligned teacher proposed “eliminating humanity” when asked what it would do as a global leader. In another, it offered illegal activity—selling drugs—as a way to make money quickly.
The issue stems from the growing reliance on AI-generated data to train new models, especially in scenarios where real-world, high-quality datasets are limited. According to the researchers, this approach can unintentionally propagate bias, misalignment, or dangerous tendencies through the AI ecosystem—particularly when the original models themselves carry those issues.
This phenomenon, referred to as “data poisoning,” makes it difficult to detect where and how harmful behaviors originate. The risk increases when multiple models from similar systems interact—such as OpenAI’s GPT models or Alibaba’s Qwen—potentially reinforcing flawed patterns across generations.
The findings underscore the need for stronger safeguards when using AI-generated content in training pipelines. Researchers are calling for deeper transparency and improved methods to prevent unintended behavioral transfer—especially as AI models are increasingly deployed in real-world, high-stakes applications.