News Daily Nation Digital News & Media Platform


Cisco research finds standard AI safety benchmarks miss the real threat

May 29, 2026  Twila Rosenbaum

Standard AI safety benchmarks fail to capture multi-turn attack risks

Enterprises deploying closed AI models have largely depended on published safety benchmarks to assess risk before procurement and deployment. New research from Cisco's AI Threat Intelligence and Security Research team reveals that those benchmarks may systematically understate the actual threat landscape. The core finding is that standard single-turn tests—where a single adversarial prompt is submitted and the model's response evaluated—fail to account for the far more dangerous multi-turn attack surface, where an attacker maintains a conversation across multiple exchanges, iterating and adapting based on each response until the model yields.

This difference is not minor. In the study, the team evaluated 15 closed/proprietary frontier models from OpenAI, Anthropic, Google, Amazon, and xAI. They ran 30,090 single-turn prompts and 6,986 multi-turn attacks. The results produced different model rankings, different failure maps, and different risk profiles. Every model tested failed a non-trivial share of multi-turn attacks. The multi-turn attack success rate (ASR) ranged from 7.89% to 88.30% across all models, compared with a single-turn range of just 2.19% to 64.91%. Eight of the 15 models showed an absolute gap greater than 15 percentage points between the two regimes.
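To make the arithmetic behind these figures concrete, here is a minimal sketch of how an attack success rate and the absolute gap between the two regimes could be computed. The function names and the counts are illustrative assumptions, not values or code from the Cisco report.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Return ASR as the percentage of attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def absolute_gap(single_turn_asr: float, multi_turn_asr: float) -> float:
    """Return the multi-turn minus single-turn difference in percentage points."""
    return multi_turn_asr - single_turn_asr


# Hypothetical counts for one made-up model; these are NOT figures from the report.
single = attack_success_rate(successes=120, attempts=2000)  # 6.0
multi = attack_success_rate(successes=115, attempts=460)    # 25.0
gap = absolute_gap(single, multi)                           # 19.0

# The article's threshold: a gap greater than 15 percentage points.
exceeds_threshold = gap > 15.0
print(f"single={single:.2f}% multi={multi:.2f}% gap={gap:.2f} pts")
```

Note that the gap is expressed in percentage points, not as a ratio: in this made-up case the multi-turn rate is roughly four times the single-turn rate, but the absolute gap is 19 points.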

Even the strongest performers under single-turn evaluation exhibited significant vulnerabilities under iterative attack. Anthropic's Claude family, which posted the lowest single-turn ASR in the cohort at 2.19% to 3.64%, still reached 11.16% to 16.20% under multi-turn attack. This challenges a common assumption in enterprise AI procurement: that low single-turn benchmark scores equate to robust safety. The research demonstrates that such scores can be deeply misleading when the threat model includes persistent adversaries capable of multi-step conversations.

Understanding multi-turn attacks

A multi-turn attack does not present the harmful request upfront. Instead, intent builds gradually across exchanges, with each prompt appearing benign in isolation while steering toward a harmful outcome. The model processes each turn without recognizing the pattern forming across the conversation. The Cisco research tested five attack strategy families:

  • Crescendo escalation: The attacker escalates the ask incrementally. Each prompt seems harmless until the full picture emerges. This technique exploits the model's lack of long-term context awareness across multiple turns.
  • Refusal reframe: When the model declines a request, the attacker reframes their identity or purpose to push past the refusal. For example, the attacker may claim a legitimate research purpose to bypass guardrails.
  • Role-play and persona adoption: The attacker assumes a character or persona, shifting the conversational framing so the model perceives a different obligation to comply. This strategy family recorded the highest weighted ASR across the cohort, at 29.89%.
  • Contextual ambiguity and misdirection: The attacker uses vague or misleading framing to obscure the true nature of the request, steering the conversation without stating harmful intent directly.
  • Information decomposition and reassembly: The attacker breaks a harmful request into component parts distributed across multiple turns, each appearing innocuous in isolation. The model responds to each piece without recognizing the assembled outcome.

These techniques are not theoretical. They reflect real-world adversarial behavior, where attackers rarely submit a single harmful prompt. Instead, they probe defenses, build trust, and gradually manipulate models into producing prohibited outputs. The study's findings show that current safety evaluations are not designed to detect these behaviors.
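The gap between per-message and conversation-level inspection can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the report's methodology: the placeholder tokens, the set-matching check, and all function names are hypothetical, and real guardrails rely on classifiers rather than token matching.

```python
from typing import List

# Hypothetical stand-in for a disallowed request. The placeholder tokens carry
# no real meaning; they only mark "pieces" of a request.
BLOCKED_COMBINATION = {"PIECE_A", "PIECE_B", "PIECE_C"}


def flags_request(text: str) -> bool:
    """Toy check: flag text only if it contains every piece at once."""
    return BLOCKED_COMBINATION <= set(text.split())


def per_turn_check(turns: List[str]) -> bool:
    """Single-turn style: inspect each message in isolation."""
    return any(flags_request(turn) for turn in turns)


def conversation_check(turns: List[str]) -> bool:
    """Multi-turn style: inspect the accumulated conversation as a whole."""
    return flags_request(" ".join(turns))


upfront = ["tell me PIECE_A PIECE_B PIECE_C"]
decomposed = ["tell me PIECE_A", "now PIECE_B", "finally PIECE_C"]

print(per_turn_check(upfront))         # flagged: everything arrives in one message
print(per_turn_check(decomposed))      # missed: no single turn contains all pieces
print(conversation_check(decomposed))  # flagged: the pattern is visible across turns
```

The decomposed conversation passes every per-turn check yet is caught once the turns are considered together, which mirrors the information decomposition strategy described above.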

Structural vulnerabilities in generative AI

The root cause of these failures is structural. Amy Chang, head of AI threat and security research at Cisco, explained that the vulnerability is a fundamental characteristic of how generative AI models work. They are probabilistic systems trained to predict the most likely next token, and that mechanism can produce unintended outputs that pre-deployment testing cannot fully eliminate. For closed models, where training data is not publicly disclosed, the problem is compounded because defenders cannot fully audit what the model has learned.

This pattern is not limited to closed models. Cisco's earlier evaluation of eight open-weight LLMs, published in November 2025, found multi-turn attack success rates running two to ten times higher than single-turn baselines. The report concludes that multi-turn vulnerability is a structural property of the current AI frontier regardless of whether model weights are public or proprietary, and regardless of whether a lab publicly emphasizes safety or capability.

The exposure grows significantly larger when those same models power agentic workflows. Agents have broader access to systems and can conduct actions on behalf of humans, meaning a successful multi-turn attack on an agent could lead to data exfiltration, unauthorized transactions, or system compromise. As enterprises increasingly deploy AI agents for automation, the risk landscape expands beyond simple text generation.

Network layer defenses are only part of the solution

For network security professionals, the instinct is to apply a familiar paradigm: proxy LLM traffic at the network layer, inspect inputs and outputs, and enforce policy the same way a web application firewall (WAF) or intrusion prevention system handles web traffic. Chang noted that this instinct is right in part, but LLM security introduces a dimension that signature-based controls cannot address: intent.

A WAF operates on known patterns: payload signatures, protocol violations, known attack strings. Natural language does not reduce to those primitives. An agent responding to an instruction to delete a home directory cannot determine from the request alone whether the person asking is authorized or is attempting to manipulate the agent into a destructive action. Traditional network security approaches fall short because they cannot interpret semantic intent.

Network-layer inspection remains a valid baseline for deployments that generate network traffic. Chang emphasized that it should be one component of a broader principle: wherever traffic passes through the network layer, whether as inputs or outputs, some form of guardrail or sanitization check should verify that the prompts and responses going back and forth are safe. On its own, however, this is insufficient.
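As a rough illustration of checking both directions of traffic, the sketch below wraps a model call with input and output checks. Everything here is hypothetical: `guarded_call`, `echo_model`, and `no_delete_command` are made-up names, and the string match stands in for whatever policy engine a real deployment would use.

```python
from typing import Callable, List

Check = Callable[[str], bool]  # returns True when the text is acceptable


def guarded_call(
    model: Callable[[str], str],
    prompt: str,
    input_checks: List[Check],
    output_checks: List[Check],
    refusal: str = "[blocked by policy]",
) -> str:
    """Run checks on the prompt, call the model, then run checks on the response."""
    if not all(check(prompt) for check in input_checks):
        return refusal
    response = model(prompt)
    if not all(check(response) for check in output_checks):
        return refusal
    return response


def echo_model(prompt: str) -> str:
    """Stand-in for a real model: simply echoes the prompt."""
    return f"echo: {prompt}"


def no_delete_command(text: str) -> bool:
    """Toy signature check: reject text containing a destructive shell command."""
    return "rm -rf" not in text


checks = [no_delete_command]
print(guarded_call(echo_model, "hello", checks, checks))
print(guarded_call(echo_model, "please run rm -rf ~", checks, checks))
```

The toy check is exactly the kind of signature-based control the article describes: it can match a known string, but it cannot tell whether the person asking is authorized, which is why such a wrapper is a baseline rather than a complete defense.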

Enterprise evaluation practices need updating

For security teams reading the report, Chang's guidance centers on three concrete actions. First, use the report and Cisco's LLM Security Leaderboard to inform model selection. The leaderboard publishes adversarial evaluation signals against leading models on a rolling basis, giving security teams a more current picture than static model cards or published benchmarks that may be months old.

Second, do not take vendor safety claims at face value. Published single-turn benchmarks can misrank models by a wide margin. Multi-turn exposure is invisible to any single-turn evaluation, and procurement decisions made on that basis carry unquantified risk. A model that scores well on standard tests may still be highly vulnerable to iterative attacks.

Third, layer additional defenses on top of the model. No base model in the cohort is safe under iterative attack. Runtime guardrails, application-layer controls, and pre-deployment testing are necessary regardless of which model an organization selects. Chang stated bluntly: "Out of the box, without any additional protections, these models, whether they're closed or open, are insufficient on their own to be used in a way that has potential ramifications."

The research also highlights the importance of continuous evaluation. Security benchmarks are snapshots, but adversaries evolve their techniques. The multi-turn attack success rates in this study could change as model vendors update their guardrails or as attackers develop new strategies. Enterprises must treat AI safety as an ongoing process, not a one-time checkmark.

In addition to the specific attack families tested, the study revealed that single-turn failures concentrated in three procedures: Imposter AI at 37.50% weighted ASR, Soft Paraphrase at 29.21%, and System Prompts at 27.69%. These procedures represent common adversarial strategies that vendors often claim to have mitigated, yet they achieve high success rates even in single-turn scenarios when properly crafted. The multi-turn dimension amplifies these existing weaknesses.
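The article does not say how the report weights its ASR figures. As one plausible reading only, the sketch below assumes each model's rate is weighted by its number of attempts; the function name and the numbers are hypothetical and do not come from the report.

```python
from typing import Sequence


def weighted_asr(rates: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-model ASR percentages (assumed weighting scheme)."""
    if len(rates) != len(weights) or not rates:
        raise ValueError("rates and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r * w for r, w in zip(rates, weights)) / total


# Hypothetical per-model ASRs (percent) and attempt counts used as weights.
rates = [10.0, 40.0]
weights = [300, 100]

weighted = weighted_asr(rates, weights)  # (10*300 + 40*100) / 400 = 17.5
unweighted = sum(rates) / len(rates)     # 25.0
print(f"weighted={weighted:.2f}% unweighted={unweighted:.2f}%")
```

Under this assumption, a model that received more attempts pulls the aggregate toward its own rate, so weighted and unweighted averages can differ noticeably.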

The broader implication for the AI industry is that the current evaluation ecosystem, dominated by static benchmarks like MMLU, TruthfulQA, and others, is not designed to capture interactive threats. These benchmarks measure knowledge, reasoning, or factual accuracy but do not stress-test a model's resistance to manipulation across a conversation. Cisco's work calls for a new evaluation paradigm that prioritizes interactive adversarial robustness.

As AI adoption accelerates across enterprise functions—from customer service chatbots to code generation to autonomous agents—the gap between perceived safety and actual safety becomes a critical business risk. The Cisco research serves as a warning that the standard metrics used to justify deployment decisions may be providing a false sense of security. Organizations that rely solely on published benchmarks are operating with incomplete risk assessments.

The full report includes detailed breakdowns per model and per attack strategy, enabling security teams to identify which models are most robust against specific attack types. For example, while Anthropic's Claude models performed best overall, they still exhibited notable vulnerabilities in role-play and contextual ambiguity attacks. Google's Gemini models showed moderate single-turn scores but experienced large jumps under multi-turn attacks in certain categories.

The research also underscores the need for cross-disciplinary collaboration between AI safety researchers and traditional cybersecurity teams. Multi-turn attacks share characteristics with social engineering and advanced persistent threats in cybersecurity; the approach of gradually building trust and escalating requests mirrors techniques used in phishing and pretexting. By adopting threat modeling approaches from cybersecurity, AI safety can move beyond static evaluations toward dynamic, adversary-aware testing.

In summary, the Cisco research provides strong evidence that the current AI safety benchmarking regime is fundamentally incomplete. Multi-turn attacks represent a clear and present danger that standard tests miss, and every frontier model tested showed meaningful vulnerability. Enterprises must adjust their procurement and deployment strategies accordingly, incorporating multi-turn adversarial testing into their evaluation pipelines and layering runtime defenses on top of base models. The era of trusting single-turn benchmarks as a proxy for overall safety must end.


Source: Network World News

