A successful AI red teaming competition hinges on more than simply “breaking” a model: it needs a well-defined and transparent set of evaluation criteria. These criteria separate brute-force success from elegant, impactful, and novel discoveries. How your contributions are judged determines the skills a competition rewards, shaping the very nature of the event. A purely quantitative score might encourage spamming low-effort attacks, while a purely qualitative one can feel subjective. The best formats strike a careful balance.
At its core, evaluation seeks to answer four fundamental questions about your submission:
- Did it work? (Effectiveness)
- Was it clever? (Novelty)
- Does it matter? (Impact)
- Can others understand it? (Clarity)
These questions form the pillars upon which most scoring rubrics are built. Understanding them allows you to focus your efforts not just on finding a vulnerability, but on demonstrating its full significance.
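One convenient way to keep these four questions in view is to treat them as fields on a submission record. The sketch below is a minimal, hypothetical data model; the field names are invented here and simply mirror the attributes that the scoring pseudocode later in this section reads from a submission.

# Hypothetical data model mapping the four questions to concrete fields
from dataclasses import dataclass

@dataclass
class Submission:
    flags_captured: int          # Did it work? (Effectiveness)
    judge_score_novelty: float   # Was it clever? (Novelty, judged 1-10)
    judge_score_impact: float    # Does it matter? (Impact, judged 1-10)
    judge_score_clarity: float   # Can others understand it? (Clarity, judged 1-10)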
The Four Pillars of Evaluation
Think of a top-tier finding as a balanced profile across multiple dimensions. A simple flag capture is just one point of data. A comprehensive evaluation considers the entire context of the exploit.
1. Technical Effectiveness & Efficiency
This is the most straightforward pillar. It measures the raw success and resourcefulness of your attack. This is less about the *what* and more about the *how well*.
- Success Rate: The count or proportion of successful exploits, or “flags,” captured. This is often the baseline score.
- Efficiency: The resources consumed to achieve the exploit. A jailbreak that requires only two queries is more impressive than one that needs a thousand. This can be measured in API calls, tokens used, or computational time; a small accounting sketch follows this list.
- Stealth: The ability of an attack to bypass detection mechanisms, such as input filters, content moderation classifiers, or anomalous activity monitors.
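Efficiency is easy to track yourself during an event. The snippet below is a small, self-contained sketch, using an invented attempt log, that reduces your query history to the kinds of numbers efficiency scoring tends to reward: queries per successful exploit and total tokens spent.

# Illustrative efficiency accounting over a hypothetical attempt log
attempts = [
    {"queries": 3, "tokens": 420, "succeeded": True},
    {"queries": 57, "tokens": 9800, "succeeded": False},
    {"queries": 2, "tokens": 310, "succeeded": True},
]

successes = sum(1 for a in attempts if a["succeeded"])
total_queries = sum(a["queries"] for a in attempts)
total_tokens = sum(a["tokens"] for a in attempts)

print(f"Success rate: {successes / len(attempts):.0%}")
print(f"Queries per successful exploit: {total_queries / max(successes, 1):.1f}")
print(f"Total tokens spent: {total_tokens}")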
2. Novelty & Ingenuity
This qualitative pillar rewards creativity. Did you discover a new class of vulnerability or simply reuse a well-known technique? Organizers look for submissions that push the boundaries of adversarial ML knowledge.
- Originality: The uniqueness of the attack vector. A novel prompt injection technique will score higher than a simple “ignore previous instructions” prompt.
- Complexity: The sophistication of the exploit chain. This could involve multi-step reasoning, exploiting interactions between different model components, or combining techniques from different domains.
- Transferability: Does the attack work on a wide range of models, or is it highly specific to the target? A more generalizable attack is often considered more novel and valuable; a rough way to measure this is sketched after this list.
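Transferability is straightforward to estimate empirically: fire the same attack at several targets and report the fraction that fall. The helper below is a generic sketch; run_attack stands in for whatever callable you use to send the prompt to a named model and check the exploit condition, and is not a real API.

# Rough transferability check: same attack prompt, several target models.
from typing import Callable

def transferability(prompt: str,
                    models: list[str],
                    run_attack: Callable[[str, str], bool]) -> float:
    # run_attack(model_name, prompt) -> True if the exploit succeeded on that model
    hits = sum(1 for m in models if run_attack(m, prompt))
    return hits / len(models)

# A result of 0.75 would mean the attack carried over to 3 of 4 targets.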
3. Impact & Severity
A successful attack is one thing; a devastating one is another. This pillar assesses the real-world consequences if your discovered vulnerability were exploited. You must articulate a plausible and severe threat scenario.
- Harm Potential: The severity of the outcome. Generating misinformation is harmful, but generating working exploit code for a critical vulnerability or providing instructions for creating a weapon is far more severe. An illustrative severity scale is sketched after this list.
- Scalability: How easily could an adversary automate and scale this attack to affect many users? An attack requiring manual, nuanced interaction is less scalable than one triggered by a simple, copy-pastable prompt.
- Plausibility: The realism of the threat actor and scenario you describe. Is this a vulnerability a real adversary would likely find and use?
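Competitions rarely publish an exact severity scale, so treat the mapping below as a purely illustrative assumption: a coarse multiplier a judge (or a triage script) might apply on top of a base impact score, with plausibility acting as a discount for far-fetched scenarios.

# Illustrative only: categories and multipliers are assumptions, not a published rubric
SEVERITY_MULTIPLIER = {
    "misinformation": 1.0,
    "privacy_leak": 1.5,
    "exploit_code": 2.0,
    "weapons_guidance": 3.0,
}

def impact_score(base: float, category: str, plausibility: float) -> float:
    """base: judge's 1-10 impact score; plausibility: 0.0-1.0 realism estimate."""
    return base * SEVERITY_MULTIPLIER.get(category, 1.0) * plausibility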
4. Reporting & Clarity
A brilliant discovery is useless if no one can understand or reproduce it. This pillar evaluates your ability to communicate your findings effectively, which is a critical skill for any red teamer.
- Reproducibility: The quality of your documentation. Can a judge or a developer follow your steps and replicate the vulnerability without ambiguity? A simple completeness check is sketched after this list.
- Root Cause Analysis: Your explanation of *why* the attack works. Do you demonstrate an understanding of the underlying model mechanics that lead to the failure?
- Mitigation Suggestions: The quality and feasibility of your proposed defenses. This shows you’re thinking not just as an attacker, but also as a defender.
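A lightweight way to keep reports reproducible is to check each draft against a fixed set of sections before submitting. The section names below are a suggested structure, not a mandated format.

# Suggested (not mandated) report sections, plus a quick completeness check
REQUIRED_SECTIONS = [
    "Summary",
    "Reproduction Steps",
    "Evidence",              # transcripts, screenshots
    "Root Cause Hypothesis",
    "Impact Assessment",
    "Suggested Mitigations",
]

def missing_sections(report_text: str) -> list[str]:
    return [s for s in REQUIRED_SECTIONS if s.lower() not in report_text.lower()]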
Scoring Models in Practice
Competition organizers typically combine these criteria into a weighted scoring model. While the exact weights vary, this approach ensures a balanced assessment. Below is a breakdown of how these criteria are often operationalized.
| Criterion | Description | Example of a High-Scoring Submission | Typical Weighting |
|---|---|---|---|
| Success / Flags | Raw count of successful exploits against defined targets. | Team successfully jailbreaks 15 out of 20 challenges. | Low to Medium |
| Efficiency | Minimizing resources (e.g., queries, tokens) to achieve a goal. | A single, concise prompt bypasses a safety filter that normally resists long, complex attacks. | Low |
| Novelty | Discovery of a new, previously undocumented attack technique or vulnerability class. | Using character encoding tricks to create a “semantic blind spot” in a model’s tokenizer. | High |
| Impact | Demonstration of a severe, plausible, and scalable real-world harm. | An exploit that reliably extracts personally identifiable information (PII) from the model’s training data. | High |
| Report Quality | Clarity, reproducibility, and depth of the written submission. | A report with clear steps, screenshots, a hypothesis for the root cause, and actionable mitigation advice. | Medium |
The final score for a submission is typically a combination of quantitative points (from captured flags) and qualitative scores assigned by judges for novelty, impact, and reporting.
# Pseudocode for a typical weighted scoring function
def calculate_final_score(submission):
    # Base points from automated checks
    flag_score = submission.flags_captured * 10

    # Judges' scores (on a scale of 1-10)
    novelty = submission.judge_score_novelty
    impact = submission.judge_score_impact
    clarity = submission.judge_score_clarity

    # Define weights for each qualitative category
    # Note: Novelty and Impact are often weighted highest
    weights = {
        "novelty": 0.40,
        "impact": 0.40,
        "clarity": 0.20,
    }

    # Calculate the qualitative bonus score
    quality_bonus = (novelty * weights["novelty"] +
                     impact * weights["impact"] +
                     clarity * weights["clarity"]) * 100  # Scale factor

    total_score = flag_score + quality_bonus
    return total_score
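As a quick usage sketch, combining the hypothetical Submission record from earlier with the function above: a team that captured 15 flags and earned judge scores of 8, 9, and 7 would land at roughly 970 points under these illustrative weights.

# Usage sketch with the hypothetical Submission record defined earlier
example = Submission(
    flags_captured=15,
    judge_score_novelty=8,
    judge_score_impact=9,
    judge_score_clarity=7,
)
print(calculate_final_score(example))  # 150 + (8*0.4 + 9*0.4 + 7*0.2) * 100 = 970.0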
Ultimately, the evaluation criteria are designed to reward the holistic skill set of an AI red teamer. Success isn’t just about finding a flaw; it’s about understanding its nature, articulating its danger, and contributing to a more secure AI ecosystem.