This post is also available in:
עברית (Hebrew)
A new evaluation by the U.S. National Institute of Standards and Technology (NIST) raises concerns over the growing use of China-developed DeepSeek large language models (LLMs), citing performance gaps, security vulnerabilities, and systemic censorship aligned with Chinese government narratives.
The report, conducted by NIST’s Center for AI Standards and Innovation (CAISI), tested three of DeepSeek’s leading models—R1, R1-0528, and V3.1—against four U.S. models, including OpenAI’s GPT-5, GPT-5-mini, the open-source gpt-oss, and Anthropic’s Opus 4. Testing covered 19 benchmarks, including software engineering, cybersecurity, general knowledge, mathematics, and user safety.
In technical performance, DeepSeek’s models consistently lagged behind their American counterparts. The largest performance gap appeared in software engineering and cybersecurity tasks, where the top U.S. model completed up to 80% more tasks. In contrast, both sides performed similarly on science and general knowledge benchmarks, while the U.S. models held a slight lead in mathematics.
Cost was another differentiator. When comparing DeepSeek V3.1 with the lighter GPT-5-mini, researchers found the American model had significantly lower operational costs — roughly 35% less on average — across 13 key benchmarks. However, the report clarifies that the comparison focused on similar performance tiers, not model scale.
A critical concern outlined in the report is DeepSeek’s vulnerability to agent hijacking. In testing scenarios, DeepSeek-based AI agents were far more likely to follow malicious instructions. The most secure DeepSeek model, R1-0528, was on average 12 times more likely to perform harmful actions than U.S. models, including phishing, malware execution, and credential theft.
Additionally, researchers examined how the models handled politically sensitive topics. In 190 questions related to Chinese history and politics, DeepSeek’s outputs consistently reflected Chinese Communist Party (CCP) narratives. This behavior remained consistent across both English and Chinese-language prompts, suggesting embedded censorship mechanisms.
Despite these concerns, DeepSeek’s popularity continues to rise, particularly among open-source developers. According to the report, modified Chinese models now outnumber those from leading U.S. tech firms on platforms like Hugging Face.
While the findings raise red flags, NIST emphasized that the results are preliminary and limited to specific testing conditions.