The Great Pretender: Why LLM Jailbreak Success Rates Are Deceiving

The Great Pretender: When Numbers Don’t Tell the Truth

In the world of artificial intelligence security, metrics are reality. The Attack Success Rate (ASR), the share of attack attempts that elicit the output a model is supposed to refuse, is the primary metric for LLM jailbreak benchmarking. However, an arXiv paper published on May 15, 2026, titled “The Great Pretender: A Stochasticity Problem in LLM Jailbreak,” seriously questions the reliability of this indicator. The research poses a central question:

“Why a successful jailbreak prompt does not perform consistently well against a target model on which the prompts have been optimized?”

In other words, why does a supposedly successful jailbreak prompt fail to perform consistently on the very target model it was optimized for? The answer lies in the stochastic, or random, nature of the models. This problem arises not only during evaluation but also during attack generation, which, according to the researchers, means that published ASR numbers are “systematically inflated and incomparable across papers.”

The Illusion of Stochasticity: Is 80% Actually Just 50%?

Large Language Models (LLMs) do not always produce the same output even with the same input. This stochasticity is a source of creativity but a nightmare for security testing. The paper outlines a hypothetical but telling example. Suppose an attacker generates a jailbreak prompt that achieves an 80% ASR on paper against a closed-source model with a defense system. However, when this prompt is run ten times against the target model, it succeeds only five times (5 out of 10), for an observed success rate of just 50%.
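To put the 5-out-of-10 example in perspective, here is a minimal sketch of how much uncertainty ten trials leave around a measured rate. It uses the standard Wilson score interval for a binomial proportion. The function name and the choice of a 95% level are assumptions made for illustration; the paper is not stated to use this particular interval.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return center - half_width, center + half_width


# The article's example: 5 successes in 10 repeated runs of the same prompt.
low, high = wilson_interval(5, 10)
print(f"observed rate: {5 / 10:.0%}, 95% interval: {low:.0%} to {high:.0%}")
```

With only ten runs the interval spans roughly 24% to 76%, so a single reported percentage says little without the number of repetitions behind it.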

This finding fundamentally shakes the foundations of current benchmarking practices. The study evaluated several jailbreak attacks, models of different sizes and from various providers, and multiple “judges” to confirm that ASR is not a stable quantity. It also mentions well-known attack generation methods such as BoN (Best-of-N) from Anthropic and Crescendo from Microsoft Research, whose effectiveness must also be re-evaluated in this context.

CAS-eval and CAS-gen: New Tools for Measuring Real Risks

Beyond identifying the problem, the researchers propose two new frameworks and a new metric to address it. The first is CAS-eval, an evaluation framework that measures consistent success. Instead of scoring a single attempt, it asks what happens when a jailbreak prompt must succeed more than once. The result is dramatic: under CAS-eval, an attack’s ASR can drop by up to 30 percentage points once more than one successful attempt is required before a prompt counts as a success.
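The idea of requiring more than one success can be sketched in a few lines. The paper’s exact definition of consistent success is not reproduced in this article, so the rule below (a prompt counts only if at least k of its repeated trials succeed) is an assumption, and the outcome table is made-up illustration data, not results from the study.

```python
def success_rate_at_k(outcomes: list[list[bool]], k: int) -> float:
    """Fraction of prompts with at least k successful trials.

    outcomes[i][j] is True if trial j of prompt i was judged a success.
    """
    if not outcomes:
        raise ValueError("need at least one prompt")
    passing = sum(1 for trials in outcomes if sum(trials) >= k)
    return passing / len(outcomes)


# Hypothetical judge verdicts: 5 prompts, 4 repeated trials each.
outcomes = [
    [True, True, True, True],
    [True, False, True, False],
    [True, False, False, False],
    [False, True, False, False],
    [False, False, False, False],
]

for k in (1, 2, 3, 4):
    print(f"at least {k} of 4 successes: {success_rate_at_k(outcomes, k):.0%}")
```

In this toy table the rate falls as k rises, because prompts that succeeded only once stop counting. The size of the drop here is an artifact of the invented data; only the direction of the effect mirrors what the article describes.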

The other tool is CAS-gen, an attack generation framework. It builds on previous jailbreak procedures to help recover the up to 30 percentage points that are lost when consistent success is required. The goal of CAS-gen is to create more robust, consistently performing attack prompts that are more resilient to the model’s stochastic behavior.

AIQ Analysis: What This Means for Businesses in the EU

From an AIQ standpoint, the study’s findings have far-reaching implications for corporate AI security strategies, especially within the European regulatory landscape.

OWASP LLM Top 10 Context

This research directly impacts the LLM01: Prompt Injection vulnerability (identifiers here follow version 1.1 of the OWASP list). If the testing of defense mechanisms is based on inflated ASRs, companies may be lulled into a false sense of security. A successful jailbreak, even if not 100% reproducible, can still grant access to sensitive data (LLM06: Sensitive Information Disclosure) or enable service disruption (LLM04: Model Denial of Service). Therefore, during audits, it’s not enough to document a single successful attack; the consistency of success must also be examined to assess the real risk.

EU AI Act and GDPR Compliance

In a corporate context, this means that compliance documentation and risk assessments must reflect this uncertainty. The EU AI Act requires rigorous testing of the robustness and reliability of high-risk AI systems. A testing methodology that ignores the ASR drop due to stochasticity cannot be considered sufficient. In the event of an incident investigation, regulatory authorities could question organizations that based their security measures solely on single-attempt success rates measured under laboratory conditions.

From a GDPR perspective, a single successful jailbreak that leaks personal data constitutes a serious data breach. The risk is not diminished if the attack succeeds on only every other attempt. The defense must work every time. This is why AIQ’s red teaming services emphasize repeated, iterative testing of attacks to ensure that defenses are effective not just “in general,” but “always.”
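The point about an attack that works only every other attempt can be made concrete with a short calculation. This sketch assumes each attempt succeeds independently with the same probability, which is a simplification; the function name is chosen here for illustration.

```python
def prob_at_least_one_success(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p) ** attempts


# An attack that succeeds on every other attempt (p = 0.5), retried a few times.
for n in (1, 2, 5, 10):
    print(f"{n:>2} attempts: {prob_at_least_one_success(0.5, n):.1%}")
```

Under these assumptions, five retries already give the attacker about a 97% chance of at least one success, which is why a defender cannot treat a 50% per-attempt rate as half the risk.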

The takeaway is clear: security testing must move beyond simple, one-shot success rates. The future lies in robust, statistically sound audits that also measure consistency, providing a true picture of an LLM-based system’s vulnerabilities.

Attila Rácz-Akácosi

Independent AI Security Specialist

Two decades of analytical and systems-oriented experience. I have been working with artificial intelligence since 2017. In recent years, I have specialized in AI/LLM security and AI Red Teaming. Systems-level thinking instead of endless vulnerability checklists.