28.3.3 Evaluation criteria

2025.10.06.
AI Security Blog

A successful AI red teaming competition moves beyond simply “breaking” a model: it hinges on a well-defined and transparent set of evaluation criteria. These criteria separate brute-force success from elegant, impactful, and novel discoveries. How your contributions are judged determines the skills a competition rewards, shaping the very nature of the event. A purely quantitative score might encourage spamming low-effort attacks, while a purely qualitative one can feel subjective. The best formats strike a careful balance.

At its core, evaluation seeks to answer four fundamental questions about your submission:

  • Did it work? (Effectiveness)
  • Was it clever? (Novelty)
  • Does it matter? (Impact)
  • Can others understand it? (Clarity)

These questions form the pillars upon which most scoring rubrics are built. Understanding them allows you to focus your efforts not just on finding a vulnerability, but on demonstrating its full significance.

The Four Pillars of Evaluation

Think of a top-tier finding as a balanced profile across multiple dimensions. A simple flag capture is just one point of data. A comprehensive evaluation considers the entire context of the exploit.

[Figure: Radar chart illustrating the four pillars of AI red team evaluation: Effectiveness, Novelty, Impact, and Clarity.]

1. Technical Effectiveness & Efficiency

This is the most straightforward pillar. It measures the raw success and resourcefulness of your attack. This is less about the *what* and more about the *how well*.

  • Success Rate: The number of successful exploits or “flags” captured. This is often the baseline score.
  • Efficiency: The resources consumed to achieve the exploit. A jailbreak that requires only two queries is more impressive than one that needs a thousand. This can be measured in API calls, tokens used, or computational time (see the sketch after this list).
  • Stealth: The ability of an attack to bypass detection mechanisms, such as input filters, content moderation classifiers, or anomalous activity monitors.
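
To make the success-rate and efficiency metrics concrete, here is a minimal sketch in Python. It assumes a hypothetical attempt log in which each entry records whether the attempt succeeded and how many queries and tokens it consumed; the field names are illustrative, not any particular platform's schema.

# Sketch: success rate and efficiency from a hypothetical attempt log
from statistics import mean

attempts = [
    {"success": True,  "queries": 2,  "tokens": 310},
    {"success": False, "queries": 15, "tokens": 2400},
    {"success": True,  "queries": 4,  "tokens": 560},
]

# Fraction of attempts that captured the flag
success_rate = mean(1 if a["success"] else 0 for a in attempts)

# Average resources spent per successful exploit
avg_queries = mean(a["queries"] for a in attempts if a["success"])
avg_tokens = mean(a["tokens"] for a in attempts if a["success"])

print(f"Success rate: {success_rate:.0%}")
print(f"Avg queries per successful exploit: {avg_queries:.1f}")
print(f"Avg tokens per successful exploit: {avg_tokens:.1f}")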

2. Novelty & Ingenuity

This qualitative pillar rewards creativity. Did you discover a new class of vulnerability or simply reuse a well-known technique? Organizers look for submissions that push the boundaries of adversarial ML knowledge.

  • Originality: The uniqueness of the attack vector. A novel prompt injection technique will score higher than a simple “ignore previous instructions” prompt.
  • Complexity: The sophistication of the exploit chain. This could involve multi-step reasoning, exploiting interactions between different model components, or combining techniques from different domains.
  • Transferability: Does the attack work on a wide range of models, or is it highly specific to the target? A more generalizable attack is often considered more novel and valuable.
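
To make transferability measurable, a common approach is to replay the same attack against several target models and report the fraction that fall for it. The sketch below assumes you already have per-model success results; the model names and outcomes are hypothetical placeholders.

# Sketch: transferability as the fraction of tested models the attack succeeds on
# The model names and outcomes below are hypothetical placeholders.
results = {
    "model-a": True,
    "model-b": True,
    "model-c": False,
    "model-d": True,
}

transfer_rate = sum(results.values()) / len(results)
print(f"Attack transfers to {transfer_rate:.0%} of tested models")  # 75%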

3. Impact & Severity

A successful attack is one thing; a devastating one is another. This pillar assesses the real-world consequences if your discovered vulnerability were exploited. You must articulate a plausible and severe threat scenario.

  • Harm Potential: The severity of the outcome. Generating misinformation is harmful, but generating code for a critical vulnerability or providing instructions for creating a weapon is far more severe.
  • Scalability: How easily could an adversary automate and scale this attack to affect many users? An attack requiring manual, nuanced interaction is less scalable than one triggered by a simple, copy-pastable prompt.
  • Plausibility: The realism of the threat actor and scenario you describe. Is this a vulnerability a real adversary would likely find and use?
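
One way to reason about this pillar is a small rubric that rates harm, scalability, and plausibility and combines them into a single severity figure. The sketch below is an illustration only; it is not CVSS or any competition's official formula, and both the 1–5 scale and the weights are assumptions.

# Illustrative severity rubric; the 1-5 scale and the weights are assumptions, not a standard
def severity_score(harm: int, scalability: int, plausibility: int) -> float:
    for rating in (harm, scalability, plausibility):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
    # Weighted toward harm; scalability and plausibility adjust the final figure
    return 0.5 * harm + 0.3 * scalability + 0.2 * plausibility

# Example: PII extraction triggered by a simple copy-pastable prompt
print(f"Severity: {severity_score(harm=5, scalability=4, plausibility=4):.1f}/5")  # Severity: 4.5/5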

4. Reporting & Clarity

A brilliant discovery is useless if no one can understand or reproduce it. This pillar evaluates your ability to communicate your findings effectively, which is a critical skill for any red teamer.

  • Reproducibility: The quality of your documentation. Can a judge or a developer follow your steps and replicate the vulnerability without ambiguity? (A minimal report skeleton is sketched after this list.)
  • Root Cause Analysis: Your explanation of *why* the attack works. Do you demonstrate an understanding of the underlying model mechanics that lead to the failure?
  • Mitigation Suggestions: The quality and feasibility of your proposed defenses. This shows you’re thinking not just as an attacker, but also as a defender.
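
As a rough illustration of what “reproducible” looks like in practice, the sketch below structures a finding as a small Python dataclass. The fields and example values are illustrative assumptions, not any competition's official submission format.

# Sketch of a finding report structure; fields and values are illustrative only
from dataclasses import dataclass, field

@dataclass
class FindingReport:
    title: str
    steps_to_reproduce: list[str]      # exact prompts, settings, expected output
    observed_behavior: str             # what the model actually returned
    root_cause_hypothesis: str         # why you believe the attack works
    proposed_mitigations: list[str] = field(default_factory=list)

report = FindingReport(
    title="Encoding trick bypasses the content filter",
    steps_to_reproduce=[
        "Set temperature to 0 so the run is deterministic.",
        "Send the encoded prompt and record the full response.",
    ],
    observed_behavior="Model returns disallowed content despite the input filter.",
    root_cause_hypothesis="The filter inspects raw text, not the decoded token sequence.",
    proposed_mitigations=["Normalize and decode inputs before filtering."],
)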

Scoring Models in Practice

Competition organizers typically combine these criteria into a weighted scoring model. While the exact weights vary, this approach ensures a balanced assessment. Below is a breakdown of how these criteria are often operationalized.

| Criterion | Description | Example of a High-Scoring Submission | Typical Weighting |
| --- | --- | --- | --- |
| Success / Flags | Raw count of successful exploits against defined targets. | Team successfully jailbreaks 15 out of 20 challenges. | Low to Medium |
| Efficiency | Minimizing resources (e.g., queries, tokens) to achieve a goal. | A single, concise prompt bypasses a safety filter that normally resists long, complex attacks. | Low |
| Novelty | Discovery of a new, previously undocumented attack technique or vulnerability class. | Using character encoding tricks to create a “semantic blind spot” in a model’s tokenizer. | High |
| Impact | Demonstration of a severe, plausible, and scalable real-world harm. | An exploit that reliably extracts personally identifiable information (PII) from the model’s training data. | High |
| Report Quality | Clarity, reproducibility, and depth of the written submission. | A report with clear steps, screenshots, a hypothesis for the root cause, and actionable mitigation advice. | Medium |

The final score for a submission is typically computed with a formula that combines quantitative points (from automated flag checks) with qualitative scores assigned by judges for novelty, impact, and reporting.

# Illustrative weighted scoring function combining automated and judge scores
from dataclasses import dataclass

@dataclass
class Submission:
    flags_captured: int           # result of the automated flag checks
    judge_score_novelty: float    # judges' scores on a scale of 1-10
    judge_score_impact: float
    judge_score_clarity: float

def calculate_final_score(submission: Submission) -> float:
    # Base points from automated checks
    flag_score = submission.flags_captured * 10

    # Weights for each qualitative category
    # Note: novelty and impact are often weighted highest
    weights = {
        "novelty": 0.40,
        "impact": 0.40,
        "clarity": 0.20,
    }

    # Weighted qualitative bonus, scaled to be comparable with flag points
    quality_bonus = (submission.judge_score_novelty * weights["novelty"] +
                     submission.judge_score_impact * weights["impact"] +
                     submission.judge_score_clarity * weights["clarity"]) * 100

    return flag_score + quality_bonus
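
Plugging in a hypothetical submission shows how the weights play out: with these assumed numbers, the qualitative bonus easily outweighs the raw flag count, which mirrors the "High" weighting for novelty and impact in the table above.

# Hypothetical example: 12 flags, strong novelty and impact, average clarity
example = Submission(flags_captured=12,
                     judge_score_novelty=8,
                     judge_score_impact=7,
                     judge_score_clarity=6)
print(f"Final score: {calculate_final_score(example):.1f}")
# Final score: 840.0  -> 120 flag points + (8*0.4 + 7*0.4 + 6*0.2) * 100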

Ultimately, the evaluation criteria are designed to reward the holistic skill set of an AI red teamer. Success isn’t just about finding a flaw; it’s about understanding its nature, articulating its danger, and contributing to a more secure AI ecosystem.