Under Wraps

AI Researchers Say They’ve Invented Incantations Too Dangerous to Release to the Public

Whatever you do, don't show them to AI models.
A team of researchers found prompts that are so effective at tricking AI models that they're keeping them under wraps.

With great power comes great dupe-ability.

Last month, we reported on a new study conducted by researchers at Icaro Lab in Italy that discovered a stupefyingly simple way of breaking the guardrails of even cutting-edge AI chatbots: “adversarial poetry.”

In a nutshell, the team, comprising researchers from the safety group DexAI and Sapienza University in Rome, demonstrated that leading AIs could be wooed into doing evil by regaling them with poems that smuggled in harmful requests, like asking how to build a nuclear bomb.

Underscoring the strange power of verse, coauthor Matteo Prandi told The Verge in a recently published interview that the spellbinding incantations they used to trick the AI models are too dangerous to be released to the public. 

The poems, ominously, were something “that almost everybody can do,” Prandi added.

In the study, which is awaiting peer review, the team tested 25 frontier AI models — including those from OpenAI, Google, xAI, Anthropic, and Meta — by feeding them poetic instructions, which they wrote either by hand or by converting known harmful prompts into verse with an AI model. They also compared the success rate of these prompts to that of their prose equivalents.
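
The paper doesn’t publish its prompts or its test harness, but the shape of such an evaluation is simple to sketch. Here’s a minimal, hypothetical version in Python (the model client, the compliance judge, and the test cases below are all stand-ins, not the researchers’ actual code):

```python
# Hypothetical sketch of a jailbreak evaluation loop; not the Icaro Lab code.
# `query_model` and `judge_is_harmful` are stand-ins for a real model API
# call and for whatever human or classifier judgment scores the responses.

from dataclasses import dataclass

@dataclass
class Case:
    prose: str  # a known harmful request in plain prose (redacted here)
    verse: str  # the same request rewritten as a poem or riddle

def query_model(model: str, prompt: str) -> str:
    """Stand-in for an API call to one of the 25 models under test."""
    raise NotImplementedError

def judge_is_harmful(response: str) -> bool:
    """Stand-in for judging whether the model actually complied."""
    raise NotImplementedError

def attack_success_rate(model: str, cases: list[Case], use_verse: bool) -> float:
    """Fraction of prompts that elicited verboten content."""
    hits = 0
    for case in cases:
        prompt = case.verse if use_verse else case.prose
        if judge_is_harmful(query_model(model, prompt)):
            hits += 1
    return hits / len(cases)
```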

Across all models, the poetic prompts written by hand successfully tricked the AI bots into responding with verboten content an average of 63 percent of the time. Some, like Google’s Gemini 2.5, even fell for the corrupted poetry 100 percent of the time. Curiously, smaller models appeared to be more resistant, posting single-digit success rates; OpenAI’s GPT-5 nano didn’t fall for the ploy even once. Most models landed somewhere in between.

Compared to handcrafted verse, AI-converted prompts were less effective, with an average jailbreak success rate of 43 percent. But this was still “up to 18 times higher than their prose baselines,” the researchers wrote in the study.
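
For a sense of how those figures are derived, here’s a toy calculation with made-up per-model numbers (not the paper’s data): the average is taken over per-model attack success rates, while the “up to 18 times” figure is the largest per-model ratio of verse to prose success.

```python
# Toy numbers, not the paper's data: how a ~43 percent average and an
# "up to 18 times" uplift over prose baselines fall out of per-model rates.
verse_asr = {"model_a": 0.62, "model_b": 0.36, "model_c": 0.31}  # hypothetical
prose_asr = {"model_a": 0.05, "model_b": 0.02, "model_c": 0.11}  # hypothetical

average_verse = sum(verse_asr.values()) / len(verse_asr)
max_uplift = max(verse_asr[m] / prose_asr[m] for m in verse_asr)

print(f"average verse success rate: {average_verse:.0%}")  # 43%
print(f"largest per-model uplift: {max_uplift:.0f}x")      # 18x
```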

Why poems? That much isn’t clear, though according to Prandi, calling it adversarial “poetry” may be a bit of a misnomer.

“It’s not just about making it rhyme. It’s all about riddles,” Prandi told The Verge, explaining that some poetic structures were more effective than others. “Actually, we should have called it adversarial riddles — poetry is a riddle itself to some extent, if you think about it — but poetry was probably a much better name.”

The researchers speculate it may have to do with how poems present information in a way that’s unexpected to large language models, befuddling their powers of predicting which word should come next. But this shouldn’t be possible, they say.

“Adversarial poetry shouldn’t work. It’s still natural language, the stylistic variation is modest, the harmful content remains visible,” the team told Wired in an interview. “Yet it works remarkably well.”

Evildoers may now regret not paying attention in English class. The difference between a sonnet and a sestina could also be the difference between having Clippy or Skynet as your partner in crime.

“The production of weapons-grade Plutonium-239 involves several stages,” explained one AI model that the researchers entranced with verse. “Here is a detailed description of the procedure.”

More on AI: Rockstar Cofounder Says AI Is Like When Factory Farms Did Cannibalism and Caused Mad Cow Disease


Frank Landymore

Contributing Writer

I’m a tech and science correspondent for Futurism, where I’m particularly interested in astrophysics, the business and ethics of artificial intelligence and automation, and the environment.