A study found that poetic prompts can bypass safety features in leading AI models from OpenAI, Anthropic, Google and others, eliciting instructions for building chemical weapons and malware. The research found high attack-success rates across model families, pointing to structural safety gaps that may put providers in breach of EU AI Act requirements.
Poetry has been found to effectively break the safety features of major general-purpose AI models, prompting them to provide information on how to build nuclear and biological weapons in violation of EU rules, according to a study published on Thursday.

Adversarial testing is a commonly used method to stress-test AI models to see how they respond when confronted by malicious actors or inadvertently harmful input. When successful, adversarial testing manages to “jailbreak” the model, meaning it circumvents the limitations imposed by the manufacturers.
A group of researchers has identified a new methodology, adversarial poetry, that they claim provides a universal jailbreaking mechanism for the most advanced AI models, exposing what they consider a structural vulnerability.
“These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols,” the study reads.
The research tested all major model families, namely those from OpenAI, Anthropic, Google, DeepSeek, Alibaba, xAI, Moonshot AI, Mistral and Meta.
The technique consists of wording harmful requests as short poems or metaphorical verses. According to the researchers, when prose and poetic prompts carried the identical underlying intent, the poetic versions led to markedly higher rates of unsafe replies.
The study contends that curated poetic prompts triggered unsafe behavior in roughly 90 percent of cases, while poetic transformations of the MLCommons AI Safety Benchmark, an international benchmark to assess AI models’ safety, produced fivefold increases in attack-success rates compared to the prose baseline.
— Structural vulnerability —
The highest success rates, over 80 percent, related to cyberattack prompts seeking to extract data, crack passwords and create malware. Success rates for prompts on developing biological, radiological and chemical weapons were also above 60 percent, while those for building nuclear weapons ranged between 40 percent and 55 percent.

Significant success rates, above 60 percent, were also registered for prompts meant to make model providers lose control of their systems, such as autonomous self-replication and self-modification.
The study specifies that these results emerged in single-turn settings, meaning one-off exchanges with no follow-up manipulation and no iterative refinement of the adversarial prompt across multiple inputs.
It also notes that adversarial poetry proved effective across different model families with distinct training paradigms, suggesting a structural vulnerability rather than an implementation-specific issue.
Moreover, the research highlights that smaller models proved more resistant to the adversarial technique than larger ones, suggesting that the most capable models are also the most vulnerable because of their broader attack surface.
— EU AI compliance —
For the researchers, these results show that the safety features of leading AI companies’ models currently fall short of the standards required under the EU AI Act’s rules for general-purpose AI models, known as GPAI, and those outlined in a voluntary code of practice.
The GPAI provisions started to apply in August, but the European Commission will not have the power to enforce them until August next year. OpenAI, Anthropic, xAI, Mistral and Google signed the voluntary code’s safety commitments.
The results were published in a paper titled "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models." The study was conducted by researchers affiliated with the Icaro Lab, a joint research project of Sapienza University of Rome and DEXAI, an AI safety evaluation company.
The commission didn’t immediately respond to MLex’s request for comment.
OpenAI, Anthropic, Google, Mistral, xAI and Meta didn't have an immediate response to a request for comment.
DeepSeek, Alibaba and Moonshot AI couldn't immediately be reached.