4.1.3 Measuring robustness and metrics

2025.10.06.
AI Security Blog

How do you prove a model is secure? Simply stating that it defended against a specific attack is not enough. To move from ad-hoc defense to systematic security engineering, you need to quantify resilience. This is where robustness metrics come in—they provide the language and the numbers to measure, compare, and ultimately certify a model’s strength against adversarial manipulation.

Beyond Standard Accuracy: Defining Adversarial Robustness

In traditional machine learning, accuracy is the gold standard: what percentage of predictions are correct on a clean test set? In an adversarial context, this metric is dangerously incomplete. A model can have 99% accuracy on standard data yet be 100% vulnerable to imperceptible perturbations.

Adversarial robustness is the measure of a model’s ability to maintain its output correctness when its inputs are subjected to small, bounded, and intentionally crafted perturbations. It’s not about handling random noise; it’s about withstanding a worst-case scenario within a defined threat model. Evaluating this requires a new set of tools beyond model.evaluate().
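
One way to make this concrete is robust accuracy, a worst-case counterpart to standard accuracy. Writing f for the classifier and ε for the perturbation budget (notation introduced here for illustration; the norms themselves are defined just below):

    RobustAcc(f) = fraction of test pairs (x, y) for which
                   f(x + δ) = y  for every perturbation δ with ||δ||_p ≤ ε

A sample only counts as robustly correct if no perturbation inside the budget can change the prediction, which is exactly the worst-case behavior that standard accuracy misses.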

Core Metrics for Quantifying Resilience

Evaluating robustness isn’t a single score but a multi-faceted analysis. Different metrics illuminate different aspects of a model’s defensive posture. Let’s break down the most critical ones.

1. Perturbation Magnitude (Lp Norms)

The concept of “small” perturbations is formalized using mathematical norms, specifically Lp norms. As discussed in the context of optimization (4.1.1), these measure the “size” or “distance” of the perturbation vector. The choice of norm fundamentally changes the nature of the attack and the defense.

  • L∞ (infinity norm): max_i |x'_i − x_i|, the largest change made to any single feature (e.g., a pixel value). Produces subtle, widespread changes across many pixels, often imperceptible; this is the most common threat model.
  • L2 (Euclidean norm): √(Σ_i (x'_i − x_i)²), the geometric distance between the original and perturbed input vectors. Produces low-energy, smooth changes; less constrained than L∞, so the noise can be slightly more visible.
  • L0 ("zero norm"): count(x'_i ≠ x_i), the number of features that were changed (sparsity). Alters only a few pixels, but potentially by a large amount, as in a "salt-and-pepper" or patch attack.

When you read a paper stating a model is “robust against L∞ perturbations of ε = 8/255,” it means the model can withstand attacks where no single pixel’s value is changed by more than 8 out of a possible 255 levels.
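
To make these definitions concrete, here is a minimal sketch of how the three norms of a perturbation can be computed, assuming the original and perturbed inputs are NumPy arrays scaled to [0, 1] (the function and key names are illustrative):

import numpy as np

def perturbation_norms(x, x_adv):
    # Flatten so the norms are computed over every feature of the input
    delta = (x_adv - x).ravel()
    return {
        "linf": float(np.max(np.abs(delta))),  # largest change to any single feature
        "l2": float(np.linalg.norm(delta)),    # Euclidean distance between x and x'
        "l0": int(np.count_nonzero(delta)),    # number of features that changed
    }

For an attack bounded at ε = 8/255 under L∞, the "linf" value of a valid adversarial example should never exceed 8/255.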

2. Attack Success Rate (ASR)

This is the most straightforward empirical metric. It answers the question: “Given a specific attack algorithm, what percentage of adversarial examples successfully fool the model?”

ASR is calculated as the fraction of test inputs (typically only those the model classifies correctly when clean) for which the attack produces a misclassified adversarial example. It’s a direct measure of an attack’s efficacy against a particular model.

# Calculate the Attack Success Rate (ASR): the fraction of correctly
# classified clean samples that a given attack turns into misclassifications.
def calculate_asr(model, dataset, attack_function):
    successful_evasions = 0
    total_samples = 0

    for original_input, original_label in dataset:
        # Only count samples the model gets right on clean data, so ASR
        # measures evasions rather than pre-existing errors.
        if model.predict(original_input) == original_label:
            total_samples += 1

            # Generate the adversarial counterpart
            adversarial_input = attack_function(model, original_input, original_label)

            # Check whether the attack caused a misclassification
            if model.predict(adversarial_input) != original_label:
                successful_evasions += 1

    # Avoid division by zero if no clean sample was classified correctly
    if total_samples == 0:
        return 0.0
    return successful_evasions / total_samples

Crucial Caveat: ASR is only meaningful in the context of the attack used. A low ASR against a weak attack like FGSM means very little. A robust evaluation requires testing against strong, adaptive attacks like PGD.
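
For reference, a minimal L∞ PGD sketch is shown below. It assumes a PyTorch model that maps a batch of inputs in [0, 1] to logits and a tensor of integer class labels; the hyperparameters (eps, alpha, steps) are illustrative, and a thin wrapper would be needed to plug it into the predict-style interface used in the ASR code above.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x = x.detach()
    # Start from a random point inside the eps-ball around x
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Step along the gradient sign, then project back into the eps-ball
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()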

Formal Guarantees: Certified Robustness

Empirical metrics like ASR provide evidence, but not proof. They show that a *specific* attack failed, but they don’t guarantee that a *smarter* attack won’t succeed. Certified robustness aims to provide a formal, mathematical guarantee of a model’s stability.

The core idea is to define a robustness radius (ε) around a given input x. A certified defense can prove that for *any* perturbation δ within that radius (i.e., ||δ|| ≤ ε), the model’s prediction will not change. This creates a “safe zone” where the model is invulnerable.

An illustration of certified robustness: the certified radius ε defines a ball around the input x. Any perturbed input x’ = x + δ with ||δ|| ≤ ε provably receives the same prediction as x, so no adversarial example can exist inside this ball; to succeed, an attack must find a perturbation large enough to land outside the certified radius.

Methods like Interval Bound Propagation (IBP) and Randomized Smoothing can compute these certificates. While computationally expensive and often resulting in smaller provable radii than what is empirically observed, they represent the highest standard of robustness verification and are critical for high-stakes applications.
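
To give a feel for how such a certificate is produced, the sketch below follows the randomized smoothing recipe: classify many Gaussian-noised copies of the input, take a majority vote, and convert the vote share into a certified L2 radius via R = σ · Φ⁻¹(pA). This is a simplified illustration under assumed names (predict_fn returns a class label for a NumPy array); a real certifier, as in Cohen et al.’s method, would replace the point estimate of pA with a statistical lower bound and abstain when the vote is too close.

import numpy as np
from scipy.stats import norm

def smoothed_certify(predict_fn, x, sigma=0.25, n_samples=1000):
    # Majority vote over Gaussian-perturbed copies of the input
    votes = {}
    for _ in range(n_samples):
        noisy = x + np.random.normal(0.0, sigma, size=x.shape)
        label = predict_fn(noisy)
        votes[label] = votes.get(label, 0) + 1

    top_label = max(votes, key=votes.get)
    p_a = votes[top_label] / n_samples        # point estimate of the top-class probability
    if p_a <= 0.5:
        return top_label, 0.0                 # vote too weak: no certificate
    p_a = min(p_a, 1.0 - 1e-6)                # avoid an infinite radius on unanimous votes
    # Certified L2 radius from the smoothing bound R = sigma * Phi^{-1}(p_A)
    return top_label, sigma * norm.ppf(p_a)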

Putting It All Together: A Holistic View

No single metric tells the whole story. A comprehensive robustness evaluation for a red teaming engagement should include all of the following (a sketch combining them appears after the list):

  • Baseline Accuracy: The model must be useful on clean data.
  • Empirical Robustness: ASR against a suite of strong, adaptive attacks (e.g., PGD, C&W, AutoAttack) across multiple Lp norms (L∞, L2).
  • Certified Robustness: The average certified radius (ε) a formal method can prove for the test set.
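
A sketch of how these three prongs might be combined in an evaluation harness, reusing the illustrative helpers from earlier (calculate_asr, pgd_attack, smoothed_certify) and assuming dataset is an in-memory list of (input, label) pairs; in practice the model, attack, and certifier would need to share a consistent input/output interface, which is glossed over here:

def robustness_report(model, dataset, attack_function, certify_fn):
    # 1. Baseline accuracy on clean data
    clean_correct = sum(1 for x, y in dataset if model.predict(x) == y)

    # 2. Empirical robustness: ASR under a strong, adaptive attack (e.g., PGD)
    asr = calculate_asr(model, dataset, attack_function)

    # 3. Certified robustness: average certified radius over the test set
    radii = [certify_fn(x)[1] for x, _ in dataset]

    return {
        "clean_accuracy": clean_correct / len(dataset),
        "attack_success_rate": asr,
        "avg_certified_radius": sum(radii) / len(radii),
    }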

This multi-pronged approach prevents “robustness laundering,” where a model appears strong against one metric or attack but is brittle against others. As you progress, you’ll see how these metrics are not just for evaluation but are often integrated directly into the training process (as in adversarial training) to build more resilient systems from the ground up.