26.3.1. Robustness measurement algorithms

2025.10.06.
AI Security Blog

Quantifying a model’s resilience is not as simple as measuring its accuracy on a clean test set. Robustness measurement algorithms are designed to probe a model’s behavior under stress, typically induced by adversarial perturbations or distribution shifts. These algorithms provide the concrete numbers that anchor your red teaming findings, moving assessments from qualitative (“the model seems weak”) to quantitative (“the model’s accuracy drops by 70% under a PGD attack with ε=8/255”).

This section details the core algorithms you’ll implement or use in frameworks to produce these critical metrics. Understanding how they work is essential for interpreting their results and recognizing their limitations.


Fundamental Concepts

Before diving into specific algorithms, two concepts are foundational:

  • Perturbation Budget (ε): This defines the “threat model” by constraining the attacker’s power. It’s the maximum amount of change allowed between an original input and its adversarial counterpart. This is most often measured using L_p norms, like L-infinity (maximum change to any single pixel) or L2 (Euclidean distance). An algorithm’s output is only meaningful in the context of a specific ε.
  • Attack vs. Measurement: Many robustness measurement algorithms rely on an underlying adversarial attack algorithm (like PGD or C&W). The goal isn’t just to find *an* adversarial example, but to use the attack as a tool to measure a specific property, such as worst-case accuracy within the ε-ball.
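
Both norms are cheap to compute directly. A quick sketch using NumPy on a made-up two-pixel perturbation (the 8/255 budget mirrors the PGD example above):

```python
import numpy as np

# Hypothetical 8x8 grayscale "image" and a perturbed copy
original = np.zeros((8, 8))
perturbed = original.copy()
perturbed[0, 0] = 6 / 255   # change one pixel
perturbed[3, 4] = 8 / 255   # change another pixel

delta = perturbed - original

# L-infinity norm: the largest change to any single pixel
linf = np.abs(delta).max()

# L2 norm: the Euclidean distance between the two images
l2 = np.linalg.norm(delta)

print(linf <= 8 / 255)  # True: within an eps = 8/255 L-inf budget
print(l2)               # 10/255, roughly 0.0392
```

Note how the same perturbation looks small under one norm and larger under another; this is why ε is only meaningful alongside the norm it is measured in.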

Core Measurement Algorithms

Here are three primary algorithms for quantifying model robustness, ranging from the straightforward to the mathematically rigorous.

1. Adversarial Accuracy

This is the most common and intuitive robustness metric. It answers the question: “What is the model’s accuracy on a dataset where every input has been adversarially perturbed?”

Algorithm Logic:

  1. Iterate through a clean test dataset (e.g., a subset of your validation data).
  2. For each input-label pair, use a chosen adversarial attack (e.g., Projected Gradient Descent – PGD) to generate an adversarial version of the input. The attack is constrained by a predefined perturbation budget ε.
  3. Feed the generated adversarial input to the model and get its prediction.
  4. Compare the model’s prediction to the original, correct label.
  5. Adversarial accuracy is the percentage of correct predictions on these adversarial inputs.
def calculate_adversarial_accuracy(model, dataset, attack_function, epsilon):
    correct_predictions = 0
    total_samples = len(dataset)

    for x, true_label in dataset:
        # Generate the adversarial example using the chosen attack,
        # constrained by the perturbation budget epsilon
        adversarial_input = attack_function(model, x, true_label, epsilon)

        # Get the model's prediction on the perturbed input
        prediction = model.predict(adversarial_input)

        if prediction == true_label:
            correct_predictions += 1

    return 100.0 * correct_predictions / total_samples

Use Case: Provides a direct, easy-to-understand measure of performance under a specific, bounded threat. It’s the go-to metric for benchmarking and reporting.

Limitation: The result is highly dependent on the strength of the `attack_function`. A weak attack will overestimate robustness.
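
For concreteness, here is a minimal L-infinity PGD sketch that could serve as the `attack_function` above. The `model.loss_gradient(x, y)` interface is an assumption for illustration, not a real library API; in practice you would use an attack from a framework such as Foolbox or ART:

```python
import numpy as np

def pgd_attack(model, x, true_label, epsilon, alpha=None, num_steps=10):
    """Sketch of an L-infinity PGD attack. Assumes the model exposes a
    loss_gradient(x, y) method returning dLoss/dx (hypothetical interface)."""
    alpha = alpha if alpha is not None else epsilon / 4   # per-step size
    # Random start inside the eps-ball strengthens the attack
    x_adv = x + np.random.uniform(-epsilon, epsilon, size=x.shape)

    for _ in range(num_steps):
        grad = model.loss_gradient(x_adv, true_label)
        x_adv = x_adv + alpha * np.sign(grad)               # ascend the loss
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)    # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                    # keep valid pixel range

    return x_adv
```

The projection step is what enforces the threat model: however far the gradient ascent wanders, the result is clipped back into the ε-ball around the original input.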

2. Minimal Perturbation Estimation

Instead of fixing the perturbation budget ε and measuring accuracy, this algorithm fixes the outcome (a misclassification) and measures the minimum ε required to achieve it. The result is often averaged across a dataset.

Algorithm Logic:

This typically involves an iterative optimization process. Attacks like DeepFool or Carlini & Wagner (C&W) are designed specifically for this. The general idea is to start with a clean input and incrementally move it towards the model’s decision boundary in the most efficient way possible until a misclassification occurs. The distance of this journey is the minimal perturbation.

import numpy as np

def find_minimal_perturbation(model, x, true_label, gradient_fn,
                              step_size=0.01, max_iter=1000):
    # gradient_fn(model, x) should return a direction toward the
    # nearest decision boundary (often derived from the loss gradient,
    # as in DeepFool or C&W)
    perturbed = x
    iterations = 0
    while model.predict(perturbed) == true_label and iterations < max_iter:
        # Take a small step toward the decision boundary
        perturbed = perturbed + step_size * gradient_fn(model, perturbed)
        iterations += 1

    # Distance (e.g., L2 norm) between the original and final input
    return np.linalg.norm(perturbed - x)

Use Case: Excellent for understanding the “brittleness” of a model. A model with a high average minimal perturbation is generally more robust, as attacks need to be stronger to succeed.

Limitation: Computationally expensive, as it requires a dedicated optimization search for each sample.
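
A useful sanity check for such an iterative search: for a linear binary classifier, the minimal L2 perturbation has a closed form, namely the distance from the input to the separating hyperplane. A small NumPy sketch with made-up weights:

```python
import numpy as np

# For a linear classifier f(x) = w.x + b, the minimal L2 perturbation
# that reaches the decision boundary is delta = -(f(x) / ||w||^2) * w,
# with length |f(x)| / ||w||. This gives a ground truth against which
# an iterative minimal-perturbation search can be validated.

w = np.array([3.0, 4.0])
b = -1.0
x = np.array([1.0, 1.0])      # f(x) = 3 + 4 - 1 = 6, so class +1

f_x = w @ x + b
minimal_distance = abs(f_x) / np.linalg.norm(w)   # 6 / 5 = 1.2
delta = -(f_x / (w @ w)) * w                      # the perturbation itself

x_boundary = x + delta
print(minimal_distance)                    # 1.2
print(np.isclose(w @ x_boundary + b, 0))   # True: exactly on the boundary
```

If your iterative search on this toy model reports a distance noticeably above 1.2, the search is overshooting; deep networks have no such closed form, which is precisely why the iterative estimate is needed.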

3. Certified Robustness

This is the gold standard of robustness measurement. Instead of relying on a specific attack, it provides a mathematical *guarantee* that no attack within a given perturbation budget ε can cause a misclassification for a specific input. Randomized Smoothing is a popular and scalable technique for this.

[Diagram: certified robustness. An input x sits near a decision boundary, surrounded by its certified radius ε; within this circle the prediction is guaranteed to be stable, while x_adv lies outside it.]

Algorithm Logic (Randomized Smoothing):

  1. Create a “smoothed” classifier: To classify an input `x`, don’t feed `x` directly. Instead, take many samples of `x` with added Gaussian noise (`x + δ` where `δ ~ N(0, σ²I)`).
  2. Pass each noisy sample through the original model and collect the predictions.
  3. The prediction of the smoothed classifier is the class that appeared most frequently (the majority vote).
  4. The certification algorithm uses statistical tools (like the Clopper-Pearson interval) on these votes to calculate a radius ε. Within this radius, the majority vote is mathematically guaranteed not to change.
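
The steps above can be sketched as follows, after the CERTIFY procedure of Cohen et al. (2019). The batch-capable `model.predict` interface and the specific constants are assumptions for illustration:

```python
import numpy as np
from scipy.stats import beta, norm

def certify(model, x, sigma=0.25, n=1000, alpha=0.001):
    """Sketch of randomized-smoothing certification. Assumes model.predict
    accepts a batch of inputs and returns integer class labels
    (hypothetical interface)."""
    # Steps 1-2: sample noisy copies of x and collect the base model's votes
    noise = np.random.normal(0.0, sigma, size=(n,) + x.shape)
    votes = model.predict(x[None, ...] + noise)

    # Step 3: the majority vote is the smoothed classifier's prediction
    classes, counts = np.unique(votes, return_counts=True)
    top_class = classes[np.argmax(counts)]
    top_count = counts.max()

    # Step 4: Clopper-Pearson lower confidence bound on the top-class
    # probability, converted into a certified L2 radius
    p_lower = beta.ppf(alpha, top_count, n - top_count + 1)

    if p_lower <= 0.5:
        return top_class, 0.0            # abstain: nothing can be certified
    radius = sigma * norm.ppf(p_lower)   # certified L2 radius
    return top_class, radius
```

The radius grows with both the noise level σ and the confidence-adjusted vote share, which is why certification needs so many samples: a tighter lower bound on the top-class probability directly buys a larger certified radius.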

Use Case: When you need a provable security guarantee. Ideal for high-stakes applications where you must be certain a model cannot be fooled by any bounded adversary.

Limitation: Extremely high computational cost due to the large number of samples needed per prediction. The guarantees are also often for smaller ε values than what empirical attacks can achieve.

Choosing the Right Algorithm

Your choice of measurement algorithm depends on your red teaming goals, computational budget, and the level of assurance required. No single metric tells the whole story. A comprehensive evaluation often involves using multiple algorithms to paint a complete picture of the model’s vulnerabilities.

| Metric / Algorithm | What It Measures | Pros | Cons | Computational Cost |
| --- | --- | --- | --- | --- |
| Adversarial Accuracy | Worst-case performance under a specific attack and budget (ε). | Easy to understand; directly comparable across models. | Highly dependent on the attack's strength; can be misleading if the attack is weak. | Medium (requires attack generation for each sample). |
| Minimal Perturbation | The average "effort" required to fool the model. | Attack-agnostic in principle; provides a measure of input-specific brittleness. | Can be slow; the average might hide extreme vulnerabilities in some samples. | High (requires iterative optimization per sample). |
| Certified Robustness | A provable guarantee that no attack within a radius ε can succeed. | The strongest form of robustness verification; attack-independent. | Extremely expensive; certified radii are often small. | Very High (requires thousands of forward passes per sample). |

As a red teamer, your task is to select the appropriate tool for the job. For a quick assessment, adversarial accuracy against a strong attack like PGD is a good start. For a deep dive into a critical system, investing the time to calculate minimal perturbations or even achieve a certified radius provides much stronger evidence of a model’s true resilience.