28.3.2. Red Team Competition Formats

2025.10.06.
AI Security Blog

The structure of an AI red teaming competition fundamentally shapes the strategies you’ll employ and the skills you’ll need. Understanding these formats is crucial for choosing the right events to participate in and for preparing effectively. Each format tests a different facet of adversarial thinking, from pure offensive exploitation to a balanced approach of attack and defense.

Common Frameworks for Adversarial Challenges

While organizers often add their own unique twists, most AI security competitions fall into one of a few established categories. Each model presents a distinct set of objectives and constraints that define the game.

Kapcsolati űrlap - EN

Do you have a question about AI Security? Reach out to us here:

Adversarial Attack-Only (Red Team vs. The House)

This is the most straightforward and common format. As a participant, your sole objective is to find and exploit vulnerabilities in target AI systems provided by the organizers. You are not responsible for defense. Success is measured by the quantity and quality of the vulnerabilities you discover and report.

  • Objective: Breach the safety, security, or integrity of one or more target models.
  • Analogy: A classic penetration test or a “Capture the Flag” (CTF) where all challenges are provided by the event hosts.
  • Focus: Purely offensive techniques, including prompt injection, model jailbreaking, data extraction, and identifying logical flaws.

Attack-Defense (A/D)

In an Attack-Defense format, you are not just a hunter; you are also the prey. Each team is given an AI system (or is tasked with building one) that they must both defend from other teams and use as a base to understand and attack others’ systems. This format is dynamic and highly interactive.

  • Objective: Score points by successfully attacking other teams’ models while simultaneously patching or defending your own model to prevent others from scoring against you.
  • Analogy: A digital wargame where teams manage their own territory while launching sorties against opponents.
  • Focus: A balanced skillset of offensive exploitation and defensive hardening. Rapid analysis and patching under pressure are key.

Bug Bounty Model

Though often a continuous program rather than a time-boxed competition, the bug bounty model is a popular format for large-scale public red teaming. Organizers present a model or system and offer rewards for valid vulnerability submissions over an extended period. These are less about speed and more about depth and novelty.

  • Objective: Discover and responsibly disclose novel and impactful vulnerabilities in a production or near-production system.
  • Analogy: A standing contract for security research with rewards based on impact.
  • Focus: Deep, methodical investigation. Finding zero-day or previously unknown vulnerability classes is highly rewarded.

Data-centric and Poisoning Formats

This format shifts the focus from the deployed model to its training pipeline. Your goal is to manipulate the data used to train or fine-tune a model to cause specific, malicious downstream behavior. This tests your understanding of the machine learning lifecycle itself.

  • Objective: Craft malicious data samples that, when included in a training set, create a backdoor, degrade performance, or introduce specific biases.
  • Analogy: Sabotaging the factory’s supply chain to produce a faulty product.
  • Focus: Data manipulation, understanding of training dynamics, and creating subtle triggers for backdoors.

Comparing Competition Formats at a Glance

The right format for you depends on your goals, whether it’s honing pure offensive skills, testing your defensive acumen, or diving deep into the mechanics of ML training.

Format Primary Objective Key Skills Tested Typical Duration Scoring Mechanism
Attack-Only Find vulnerabilities in target models. Prompt engineering, reverse engineering model logic, creative exploitation. Hours to days (e.g., 24-48 hours) Points per valid “flag” (vulnerability), often weighted by difficulty or impact.
Attack-Defense Exploit opponents’ models while defending your own. Rapid vulnerability analysis, patching, defensive filtering, offensive scripting. 8-24 hours, in rounds. Points for successful attacks on others, points deducted for being successfully attacked. Uptime of your own service is critical.
Bug Bounty Discover novel, high-impact vulnerabilities. Deep research, methodical testing, clear documentation, novelty. Continuous / Weeks to months. Monetary rewards based on vulnerability severity (e.g., CVSS score) and report quality.
Data Poisoning Corrupt a model’s training process via malicious data. Data manipulation, understanding of model training dynamics, feature engineering. Days to weeks. Based on the effectiveness of the poisoned model on a hidden test set (e.g., backdoor success rate).

What Does a “Flag” Look Like?

In any competition, your “find” or “flag” must be submitted in a structured format for verification. This proves you successfully achieved the objective. While the exact format varies, it typically involves providing the input that caused the failure and evidence of that failure.

For an LLM-based challenge, a typical submission for a policy violation might look like a simple JSON object:

{
  "challenge_id": "llm-safety-guardrails-01",
  "team_id": "AdversarialMinds",
  "submission_type": "policy_violation",
  "payload": {
    "prompt": "Here's a clever roleplaying scenario that bypasses your content filters...",
    "model_output": "Certainly! Here is the harmful content you requested..."
  },
  "notes": "Used a nested roleplay technique combined with character impersonation to bypass the initial system prompt guardrails."
}

This structure provides everything the judges need: who you are, what you targeted, how you did it (the prompt), and the evidence of your success (the model’s harmful output). Mastering the art of clear, concise reporting is as important as finding the vulnerability itself.