While metrics like empirical robustness give you a defender’s view of a model’s resilience, the Attack Success Rate (ASR) is the quintessential attacker’s metric. It answers the most direct question in a red team engagement: “Did our attack work?” ASR quantifies the effectiveness of a specific adversarial strategy against a target model, providing a clear, actionable measure of vulnerability.
At its core, the ASR is a simple ratio, but its true value lies in the careful definition of what constitutes “success.”
Core Formula: The Attack Success Rate is calculated as:
ASR = (Number of Successful Adversarial Examples) / (Total Number of Attempted Attacks)
Defining the “Win Condition”
A “successful” attack is not a universal concept. Your objective dictates the success criteria. Before you can measure ASR, you must define precisely what you are trying to achieve. The most common objectives are listed below (a sketch of the matching success checks follows the list):
- Untargeted Misclassification: This is the broadest definition of success. The attack wins if the model’s output for the perturbed input is simply incorrect. For a cat image, a prediction of “dog,” “airplane,” or any other wrong label counts as a success. This is often the easiest goal to achieve.
- Targeted Misclassification: A much stricter and more potent objective. You define a specific target label for the attack. The attack only succeeds if the perturbed image of a “cat” is classified specifically as “guacamole,” or whatever target you’ve chosen. ASR for targeted attacks is almost always lower than for untargeted ones.
- Evasion or Rejection: Particularly relevant for models with safety filters or content moderation layers (e.g., Large Language Models). Success is achieved if your input, which should be blocked, bypasses the filter. Conversely, you might aim to make a benign prompt get rejected by the safety system (a denial-of-service goal).
- Confidence Reduction: In some scenarios, a full misclassification isn’t necessary. The goal might be to simply erode the model’s confidence in its correct prediction below a critical threshold. Forcing a medical diagnostic AI to drop its confidence in a “malignant” classification from 98% to 51% could be a mission-critical success, even if the top-1 prediction doesn’t change.
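To make these win conditions concrete, here is a minimal sketch of each objective expressed as a success predicate. All function names and the confidence threshold below are illustrative assumptions, not part of any particular framework.

# Illustrative success predicates, one per objective above.
# Every name here is hypothetical; adapt it to your own evaluation harness.

def untargeted_success(adv_prediction, true_label):
    # Any incorrect prediction counts as a win.
    return adv_prediction != true_label

def targeted_success(adv_prediction, target_label):
    # Only the attacker-chosen label counts as a win.
    return adv_prediction == target_label

def evasion_success(filter_verdict):
    # An input that should have been blocked slips past the safety filter.
    return filter_verdict == "allowed"

def confidence_reduction_success(confidence_in_true_label, threshold=0.9):
    # The model's confidence in the correct class drops below a critical threshold.
    return confidence_in_true_label < threshold

Whichever predicate you choose, the rest of the measurement stays the same; only the definition of a win changes.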
Calculating ASR in Practice
The calculation process involves iterating over a dataset of inputs that the model initially classifies correctly. For each correct prediction, you apply your adversarial attack and then check if the outcome for the perturbed input meets your predefined success condition.
# Calculating untargeted ASR. Assumes `model`, `test_dataset`, and
# `generate_attack` are already defined for your setup.
successful_attacks = 0
total_attempts = 0

# Iterate over a test dataset (e.g., 1000 images)
for clean_input, true_label in test_dataset:
    # 1. First, confirm the model is correct on the clean input
    original_prediction = model.predict(clean_input)
    if original_prediction == true_label:
        total_attempts += 1

        # 2. Generate the adversarial example
        adversarial_input = generate_attack(model, clean_input, true_label)

        # 3. Get the model's new prediction
        adversarial_prediction = model.predict(adversarial_input)

        # 4. Check if the attack was successful (untargeted)
        if adversarial_prediction != true_label:
            successful_attacks += 1

# 5. Calculate the final ASR as a percentage
attack_success_rate = (successful_attacks / total_attempts) * 100
print(f"ASR: {attack_success_rate:.2f}%")
Note the important first step: you only attempt attacks on inputs the model already gets right. Attacking an input the model misclassifies anyway doesn’t provide a meaningful signal about your attack’s effectiveness.
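Because only the win condition changes between objectives, it can help to parameterize the loop with a success predicate. The `compute_asr` helper and `success_fn` argument below are hypothetical names for illustration, not an existing API:

# Sketch: the same measurement loop with the win condition passed in as a function.
def compute_asr(model, test_dataset, attack_fn, success_fn):
    successful_attacks = 0
    total_attempts = 0
    for clean_input, true_label in test_dataset:
        # Only attack inputs the model already classifies correctly
        if model.predict(clean_input) != true_label:
            continue
        total_attempts += 1
        adversarial_input = attack_fn(model, clean_input, true_label)
        if success_fn(model.predict(adversarial_input), true_label):
            successful_attacks += 1
    return 100.0 * successful_attacks / max(total_attempts, 1)

# Untargeted: any wrong label is a win
untargeted_asr = compute_asr(model, test_dataset, generate_attack,
                             lambda pred, true: pred != true)

# Targeted: only the attacker-chosen label is a win (target_label is yours to define)
targeted_asr = compute_asr(model, test_dataset, generate_attack,
                           lambda pred, true: pred == target_label)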
ASR is Not an Absolute Number
Reporting a single ASR value without context is misleading. The success rate is heavily dependent on the experimental setup. As a red teamer, you must document these variables, as they reveal the conditions under which a model is vulnerable.
| Attack Algorithm | Perturbation Budget (L-infinity) | Model Type | Attack Knowledge | Resulting ASR |
|---|---|---|---|---|
| FGSM | ε = 4/255 | Standard ResNet-50 | White-box | 71% |
| PGD (40 steps) | ε = 4/255 | Standard ResNet-50 | White-box | 98% |
| PGD (40 steps) | ε = 4/255 | Adversarially Trained ResNet-50 | White-box | 46% |
| Square Attack | ε = 4/255 | Standard ResNet-50 | Black-box (Query-based) | 85% |
This table illustrates several key dynamics. A stronger attack like PGD achieves a higher ASR than FGSM with the same budget. An adversarially trained model significantly reduces the ASR, demonstrating the defense’s effectiveness. Finally, even a powerful black-box attack can achieve a high ASR, highlighting that lack of internal access is not a guarantee of security.
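A simple discipline that helps here is to record the experimental conditions alongside every ASR you report. The field names below are an illustrative sketch, not a required schema:

# Sketch: keep the context attached to the number, not just the number itself.
result = {
    "attack": "PGD",
    "attack_steps": 40,
    "perturbation_budget_linf": 4 / 255,
    "model": "ResNet-50 (adversarially trained)",
    "attack_knowledge": "white-box",
    "asr_percent": 46.0,
}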
The Trade-off with Perturbation Size
Ultimately, ASR has a direct relationship with the attack’s noticeability. You can almost always achieve a 100% ASR if you are allowed to make arbitrarily large changes to the input. The real challenge, and the more meaningful metric, is achieving a high ASR while keeping the perturbation minimal and stealthy. A 99% ASR achieved by turning a cat image into random noise is a trivial result. A 75% ASR achieved with changes invisible to the human eye is a significant security finding. This tension between success rate and perturbation budget is a central theme in adversarial machine learning.
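One practical way to capture this tension is to sweep the perturbation budget and report ASR at each point rather than quoting a single number. A minimal sketch, reusing the hypothetical `compute_asr` helper from earlier and assuming your attack generator accepts an epsilon parameter:

# Sketch: ASR as a function of the L-infinity perturbation budget.
budgets = [1 / 255, 2 / 255, 4 / 255, 8 / 255, 16 / 255]

for epsilon in budgets:
    # Assumes generate_attack accepts an epsilon budget; adjust to your attack's API.
    asr = compute_asr(
        model, test_dataset,
        lambda m, x, y: generate_attack(m, x, y, epsilon=epsilon),
        lambda pred, true: pred != true,
    )
    print(f"epsilon = {epsilon:.4f} -> ASR = {asr:.1f}%")

Reporting the full curve shows how much success the attack buys per unit of distortion, which is far more informative than a single ASR at one budget.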