Executing an attack or deploying a defense is only half the battle. Without a systematic way to measure the outcome, your red teaming efforts remain anecdotal. “We broke the model” is a starting point; “We reduced classification accuracy from 98% to 15% using perturbations with an average L-infinity norm of 0.03” is an actionable finding. This section details the critical metrics for evaluating adversarial engagements and how ART facilitates their calculation.
The Core Principle: Evaluation is Relative
Effective evaluation in adversarial machine learning isn’t based on a single, absolute number. It’s about comparison. You are primarily interested in the *degradation* of performance caused by an attack and the *preservation* of performance provided by a defense. This always involves measuring at least two states: the model’s performance on benign data and its performance on adversarial data.
Your goal is to quantify the gap between these two scenarios. A larger gap indicates a more effective attack, while a smaller gap (when a defense is active) suggests a more robust system.
Key Metrics for Red Team Engagements
While academic research involves dozens of specialized metrics, a red teamer typically focuses on a handful that directly translate to operational impact. Here are the most critical ones:
| Metric | What It Measures | Why It’s Important for Red Teaming |
|---|---|---|
| Model Accuracy | The percentage of correct predictions on a given dataset (benign or adversarial). | This is the fundamental measure of model performance. Comparing accuracy on benign vs. adversarial data is the primary indicator of an attack’s success. |
| Attack Success Rate | The percentage of inputs that were successfully perturbed to cause a misclassification. | Provides a direct measure of the attack’s potency. A 100% success rate means every sample was successfully manipulated. |
| Perturbation Norm (L0, L2, L∞) | The “size” or magnitude of the changes made to the input data: L0 counts how many features (e.g., pixels) were altered, L2 is the Euclidean magnitude of the change, and L∞ is the maximum change to any single feature (see the sketch after this table). | Measures the stealthiness of an attack. A small perturbation norm is critical for creating adversarial examples that are imperceptible to humans, making the attack much harder to detect. |
| Query Count | The number of predictions requested from the model to generate a single adversarial example (relevant for black-box attacks). | This translates directly to the cost, time, and detectability of an attack. High query counts might trigger rate limiting or monitoring alerts. |
| Inference Time / Latency | The time taken for the model to make a prediction, with and without a defense mechanism. | Evaluates the performance overhead of a defense. A defense that makes the model too slow for its intended application may not be a viable solution, even if it’s effective. |
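As a concrete illustration of the perturbation norms above, here is a minimal numpy sketch. It assumes two arrays of identical shape, `x_test` (the clean inputs) and `x_test_adv` (their adversarial counterparts), such as the ones produced in the workflow later in this section.

```python
import numpy as np

# Assume 'x_test' and 'x_test_adv' are numpy arrays of the same shape.
# Flatten so that each row is the perturbation applied to one sample.
delta = (x_test_adv - x_test).reshape(len(x_test), -1)

l0 = np.count_nonzero(delta, axis=1)    # number of features (pixels) changed
l2 = np.linalg.norm(delta, axis=1)      # Euclidean magnitude of the change
linf = np.max(np.abs(delta), axis=1)    # largest change to any single feature

print(f"Average L0: {l0.mean():.1f}, L2: {l2.mean():.4f}, L-inf: {linf.mean():.4f}")
```

Reporting an average norm alongside the accuracy drop is what turns “we broke the model” into the kind of finding described at the start of this section.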
Calculating Metrics with ART
ART streamlines the evaluation process by integrating it directly into the classifier objects. The method you will use most is .predict() on a trained ART classifier wrapper; comparing its output against the true labels gives the model’s accuracy on any set of inputs, benign or adversarial.
The typical workflow for evaluating an attack is as follows:
- Establish a baseline by evaluating the classifier on clean, original test data.
- Use an ART attack object to generate adversarial examples from the clean test data.
- Evaluate the classifier again, this time using the newly generated adversarial examples.
- Compare the two accuracy scores to determine the attack’s impact.
This process provides the core data points for your analysis. Let’s see how this looks in practice.
```python
import numpy as np

# Assume 'classifier' is a trained ART classifier
# Assume 'x_test' and 'y_test' are your clean test data and one-hot encoded labels
# Assume 'attack' is an initialized ART attack object (e.g., PGD)

# 1. Evaluate on clean data to get the baseline
predictions_clean = classifier.predict(x_test)
accuracy_clean = np.sum(np.argmax(predictions_clean, axis=1) == np.argmax(y_test, axis=1)) / len(y_test)
print(f"Accuracy on benign test examples: {accuracy_clean * 100:.2f}%")

# 2. Generate adversarial examples
x_test_adv = attack.generate(x=x_test)

# 3. Evaluate on adversarial data
predictions_adv = classifier.predict(x_test_adv)
accuracy_adv = np.sum(np.argmax(predictions_adv, axis=1) == np.argmax(y_test, axis=1)) / len(y_test)
print(f"Accuracy on adversarial examples: {accuracy_adv * 100:.2f}%")

# 4. Calculate the impact
print(f"Accuracy dropped by {(accuracy_clean - accuracy_adv) * 100:.2f} percentage points.")
```
Beyond Accuracy: Interpreting Results for Impact
The numbers you generate are the foundation of your report, but their interpretation is what provides value. As a red teamer, you must connect these metrics to business or mission risk.
- A high accuracy drop combined with a low perturbation norm indicates a severe, stealthy vulnerability. This is often the highest-priority finding.
- A query-based attack that succeeds but requires millions of queries may be technically interesting but operationally infeasible against a production system with monitoring and rate-limiting. Your report should reflect this context.
- When evaluating a defense, a small drop in “accuracy under attack” is good, but you must also report the inference-time overhead; a quick way to measure it is sketched after this list. A 10x increase in latency might render a real-time system useless, making the defense impractical despite its robustness.
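For the latency point, a rough comparison can be made by timing .predict() with and without the defense in place. A minimal sketch, where `defended_classifier` is a hypothetical stand-in for the same model wrapped with a defense; absolute numbers will vary with hardware and batch size:

```python
import time

import numpy as np

def average_latency(model, x, runs=10):
    """Average wall-clock time of model.predict(x) over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model.predict(x)
        timings.append(time.perf_counter() - start)
    return np.mean(timings)

# 'defended_classifier' is hypothetical -- e.g. the same model wrapped with a
# preprocessing defense. Report the overhead as a ratio, not just raw seconds.
baseline_latency = average_latency(classifier, x_test)
defended_latency = average_latency(defended_classifier, x_test)
print(f"Defense latency overhead: {defended_latency / baseline_latency:.1f}x")
```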
Your role is to translate these quantitative results into a qualitative risk assessment. The metrics ART helps you generate are the evidence you use to build a compelling and fact-based case for security improvements.