6.1.4 Metrics and evaluation

2025.10.06.
AI Security Blog

Executing an attack or deploying a defense is only half the battle. Without a systematic way to measure the outcome, your red teaming efforts remain anecdotal. “We broke the model” is a starting point; “We reduced classification accuracy from 98% to 15% using perturbations with an average L-infinity norm of 0.03” is an actionable finding. This section details the critical metrics for evaluating adversarial engagements and how ART facilitates their calculation.

The Core Principle: Evaluation is Relative

Effective evaluation in adversarial machine learning isn’t based on a single, absolute number. It’s about comparison. You are primarily interested in the *degradation* of performance caused by an attack and the *preservation* of performance provided by a defense. This always involves measuring at least two states: the model’s performance on benign data and its performance on adversarial data.

Diagram: comparison of the two evaluation scenarios. Scenario 1, standard evaluation: benign data → target model → baseline accuracy (e.g., 98%). Scenario 2, adversarial evaluation: benign data → adversarial attack → adversarial data → target model → accuracy under attack (e.g., 15%).

Your goal is to quantify the gap between these two scenarios. A larger gap indicates a more effective attack, while a smaller gap (when a defense is active) suggests a more robust system.

Key Metrics for Red Team Engagements

While academic research involves dozens of specialized metrics, a red teamer typically focuses on a handful that directly translate to operational impact. Here are the most critical ones:

Metric | What It Measures | Why It's Important for Red Teaming
Model Accuracy | The percentage of correct predictions on a given dataset (benign or adversarial). | This is the fundamental measure of model performance. Comparing accuracy on benign vs. adversarial data is the primary indicator of an attack's success.
Attack Success Rate | The percentage of inputs that were successfully perturbed to cause a misclassification. | Provides a direct measure of the attack's potency. A 100% success rate means every sample was successfully manipulated.
Perturbation Norm (L0, L2, L∞) | The "size" or magnitude of the changes made to the input data; L∞ (infinity norm) is the maximum change to any single feature (pixel). | Measures the stealthiness of an attack. A small perturbation norm is critical for creating adversarial examples that are imperceptible to humans, making the attack much harder to detect.
Query Count | The number of predictions requested from the model to generate a single adversarial example (relevant for black-box attacks). | This translates directly to the cost, time, and detectability of an attack. High query counts might trigger rate limiting or monitoring alerts.
Inference Time / Latency | The time taken for the model to make a prediction, with and without a defense mechanism. | Evaluates the performance overhead of a defense. A defense that makes the model too slow for its intended application may not be a viable solution, even if it's effective.
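
Most of these metrics can be computed with a few lines of NumPy once you have both the clean inputs and their adversarial counterparts. The sketch below is illustrative rather than an ART API: it assumes a fitted ART classifier named classifier, clean test data x_test with one-hot labels y_test, and adversarial examples x_test_adv generated from x_test (the workflow for producing them follows in the next subsection).

import numpy as np

# Illustrative sketch: attack success rate and average perturbation norms.
# Assumes 'classifier', 'x_test', 'y_test' (one-hot) and 'x_test_adv' exist.

y_true = np.argmax(y_test, axis=1)
preds_adv = np.argmax(classifier.predict(x_test_adv), axis=1)

# One common definition of attack success: the adversarial input is misclassified.
# Some teams only count samples the model classified correctly before the attack.
attack_success_rate = np.mean(preds_adv != y_true)
print(f"Attack success rate: {attack_success_rate * 100:.2f}%")

# Per-sample perturbation norms on the flattened difference between inputs
delta = (x_test_adv - x_test).reshape(len(x_test), -1)
avg_l0 = np.mean(np.count_nonzero(delta, axis=1))       # features changed per sample
avg_l2 = np.mean(np.linalg.norm(delta, ord=2, axis=1))  # Euclidean size of the change
avg_linf = np.mean(np.max(np.abs(delta), axis=1))       # largest single-feature change
print(f"Avg L0: {avg_l0:.1f}, avg L2: {avg_l2:.4f}, avg L-infinity: {avg_linf:.4f}")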

Calculating Metrics with ART

ART streamlines the evaluation process by exposing a uniform .predict() method on every classifier wrapper, regardless of the underlying framework. From its output you compute the model's accuracy (or any other metric) on a given set of inputs and labels with a few lines of NumPy, as the example below shows.

The typical workflow for evaluating an attack is as follows:

  1. Establish a baseline by evaluating the classifier on clean, original test data.
  2. Use an ART attack object to generate adversarial examples from the clean test data.
  3. Evaluate the classifier again, this time using the newly generated adversarial examples.
  4. Compare the two accuracy scores to determine the attack’s impact.

This process provides the core data points for your analysis. Let’s see how this looks in practice.


import numpy as np

# Assume 'classifier' is a trained ART classifier wrapper
# Assume 'x_test' and 'y_test' are your clean test data and one-hot labels
# Assume 'attack' is an initialized ART evasion attack object (e.g., PGD)

# 1. Evaluate on clean data to get the baseline
predictions_clean = classifier.predict(x_test)
accuracy_clean = np.sum(np.argmax(predictions_clean, axis=1) == np.argmax(y_test, axis=1)) / len(y_test)
print(f"Accuracy on benign test examples: {accuracy_clean * 100:.2f}%")

# 2. Generate adversarial examples from the clean test data
x_test_adv = attack.generate(x=x_test)

# 3. Evaluate on the adversarial data
predictions_adv = classifier.predict(x_test_adv)
accuracy_adv = np.sum(np.argmax(predictions_adv, axis=1) == np.argmax(y_test, axis=1)) / len(y_test)
print(f"Accuracy on adversarial examples: {accuracy_adv * 100:.2f}%")

# 4. Calculate the impact of the attack
print(f"Accuracy dropped by {(accuracy_clean - accuracy_adv) * 100:.2f} percentage points.")
A standard workflow for calculating model accuracy degradation using ART.
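
The same predict-based pattern extends to the other metrics in the table above. For query count in black-box engagements, for example, one simple approach is to wrap the classifier's predict method with a counter before handing the classifier to the attack. The sketch below is a generic Python pattern, not an ART feature: whether every model call an attack makes is routed through predict depends on the specific attack implementation, and black_box_attack is a placeholder for any initialized query-based ART attack (e.g., HopSkipJump).

# Illustrative sketch: count model queries issued during a black-box attack
# by patching the classifier's predict method. Not an ART API.
class QueryCounter:
    def __init__(self, classifier):
        self.num_queries = 0
        self._original_predict = classifier.predict
        classifier.predict = self._counting_predict  # patch this instance only

    def _counting_predict(self, x, **kwargs):
        self.num_queries += len(x)  # each input in the batch counts as one query
        return self._original_predict(x, **kwargs)

counter = QueryCounter(classifier)
x_adv = black_box_attack.generate(x=x_test[:10])
print(f"Average queries per adversarial example: {counter.num_queries / 10:.0f}")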

Beyond Accuracy: Interpreting Results for Impact

The numbers you generate are the foundation of your report, but their interpretation is what provides value. As a red teamer, you must connect these metrics to business or mission risk.

  • A high accuracy drop combined with a low perturbation norm indicates a severe, stealthy vulnerability. This is often the highest-priority finding.
  • A query-based attack that succeeds but requires millions of queries may be technically interesting but operationally infeasible against a production system with monitoring and rate-limiting. Your report should reflect this context.
  • When evaluating a defense, it is good if accuracy under attack stays close to the benign baseline, but you must also report the inference time overhead (a measurement sketch follows this list). A 10x increase in latency might render a real-time system useless, making the defense impractical despite its robustness.
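
As a minimal sketch of that last point, latency overhead can be measured by timing predictions with and without the defense in place. The name defended_classifier below is a placeholder for the same model wrapped with the defense under evaluation, and the batch size and repetition count are arbitrary; in a real engagement, match the timing methodology (warm-up runs, batch sizes, hardware) to production conditions.

import time
import numpy as np

def mean_latency(clf, x, repeats=10):
    # Average wall-clock time of a predict() call over several repetitions
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        clf.predict(x)
        timings.append(time.perf_counter() - start)
    return float(np.mean(timings))

# 'classifier' is the undefended model; 'defended_classifier' is hypothetical
batch = x_test[:64]
baseline_latency = mean_latency(classifier, batch)
defended_latency = mean_latency(defended_classifier, batch)
print(f"Latency overhead of the defense: {defended_latency / baseline_latency:.1f}x")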

Your role is to translate these quantitative results into a qualitative risk assessment. The metrics ART helps you generate are the evidence you use to build a compelling and fact-based case for security improvements.