Selecting the right metrics is fundamental to any red teaming engagement. A single number, like accuracy, tells a dangerously incomplete story. Your goal is to build a multi-faceted view of the system’s behavior under both normal and adversarial conditions. This reference compares key metrics across different domains, highlighting their purpose, limitations, and interdependencies.
The Evaluation Trilemma: Visualizing Trade-offs
In AI security, you rarely get to optimize one metric without impacting others. The most common trade-offs exist between standard performance (accuracy), resilience to attacks (robustness), and equitable outcomes (fairness). Visualizing this helps stakeholders understand why a model with 99% accuracy might still be unacceptable from a security or ethical standpoint.
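One simple way to make these trade-offs concrete for stakeholders is a radar (spider) chart with one axis per property. The sketch below is a minimal matplotlib example using hypothetical scores for two candidate models; the model names and numbers are illustrative, not measured results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores on the three axes (all scaled to [0, 1])
metrics = ["Accuracy", "Robust Accuracy", "Fairness"]
model_a = [0.99, 0.42, 0.70]   # high clean accuracy, weak under attack
model_b = [0.94, 0.78, 0.88]   # slightly less accurate, far more robust and fair

# One angle per axis; repeat the first point to close the polygon
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
for scores, label in [(model_a, "Model A"), (model_b, "Model B")]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.legend(loc="upper right")
plt.show()
```

Plotted this way, the 99%-accurate model’s collapse under attack is visible at a glance, which is usually more persuasive than a table of numbers.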
Model Performance and Robustness Metrics
These metrics form the baseline. The first four measure how well the model performs its intended task on clean data, while robust accuracy measures its stability under adversarial pressure. A short worked example follows the table.
| Metric | Measures | Context / Use Case | Limitations / Caveats |
|---|---|---|---|
| Accuracy | Proportion of correct predictions over all predictions. | Quick, high-level assessment on balanced datasets. | Highly misleading for imbalanced classes. A 99% accurate model could ignore a rare but critical class entirely. |
| Precision | Of all positive predictions, how many were actually positive. (TP / (TP + FP)) | When the cost of a false positive is high (e.g., spam detection, where a legitimate email flagged as spam is lost to the user). | Ignores false negatives. A model can achieve high precision by being overly cautious and predicting ‘positive’ very rarely. |
| Recall (Sensitivity) | Of all actual positives, how many did the model identify. (TP / (TP + FN)) | When the cost of a false negative is high (e.g., fraud detection, identifying security threats). | Ignores false positives. A model can achieve high recall by classifying almost everything as ‘positive’. |
| F1-Score | The harmonic mean of Precision and Recall. | Provides a single score that balances both Precision and Recall, useful for imbalanced classes. | Less interpretable than Precision or Recall alone. Doesn’t distinguish between the types of errors being made. |
| Robust Accuracy | Model accuracy on a dataset of adversarially perturbed inputs. | The primary metric for evaluating defenses against evasion attacks. | Highly dependent on the specific attack used to generate examples. High robust accuracy against one attack doesn’t guarantee it for others. |
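To see how these definitions interact, the sketch below computes the first four metrics from hypothetical confusion-matrix counts for a rare “threat” class; the counts are made up for illustration. Robust accuracy is the same accuracy formula, just computed on adversarially perturbed copies of the inputs.

```python
# Hypothetical confusion-matrix counts: 1,000 samples, threats are ~5% of traffic
tp, fp, fn, tn = 10, 5, 40, 945

accuracy  = (tp + tn) / (tp + fp + fn + tn)                  # 0.955 -- looks excellent
precision = tp / (tp + fp)                                   # ~0.667 -- respectable
recall    = tp / (tp + fn)                                   # 0.200 -- misses 80% of real threats
f1        = 2 * precision * recall / (precision + recall)    # ~0.308

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The 95.5% accuracy hides a recall of 0.20, which is exactly the imbalanced-class failure mode the table warns about.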
Attack Efficacy and Cost Metrics
When you are the attacker, your success isn’t just about fooling the model—it’s about doing so efficiently. These metrics quantify both the effectiveness and the “cost” of an attack.
| Metric | Measures | Context / Use Case | Limitations / Caveats |
|---|---|---|---|
| Attack Success Rate (ASR) | Percentage of adversarial examples that successfully fool the model. | The most direct measure of an attack’s effectiveness. | Doesn’t account for the effort required. A 100% ASR that requires massive, obvious perturbations is not a practical attack. |
| Query Cost | The number of queries made to a black-box model to generate a single successful adversarial example. | Evaluating the efficiency and stealth of black-box attacks where API calls may be limited or monitored. | Not applicable to white-box attacks. Can be highly variable depending on the starting point and algorithm. |
| Average Perturbation (Lp Norm) | The average magnitude of the perturbation added to inputs, measured by a mathematical norm (e.g., L0, L2, L∞). | Quantifies the stealthiness of an attack. Lower values mean the changes are less perceptible. | Perceptual similarity doesn’t always align with Lp norms. A low L2 norm might still result in a noticeable artifact. |
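Example: Calculating Attack Success Rate
ASR is the fraction of attempted inputs whose prediction flips (or, for a targeted attack, lands on the attacker’s chosen label). The sketch below assumes a hypothetical model.predict callable and a list of (original, adversarial) input pairs; it also averages query counts over successful attempts, since a high ASR bought with tens of thousands of queries is a very different finding from one achieved in a handful.

```python
def attack_success_rate(model, examples, target_label=None):
    """examples: list of (original_input, adversarial_input) pairs."""
    successes = 0
    for original, adversarial in examples:
        clean_pred = model.predict(original)      # hypothetical model API
        adv_pred = model.predict(adversarial)
        if target_label is None:
            # Untargeted: success means the prediction changed at all
            successes += int(adv_pred != clean_pred)
        else:
            # Targeted: success means the model now outputs the attacker's label
            successes += int(adv_pred == target_label)
    return successes / len(examples)

def average_query_cost(query_counts, success_flags):
    """Average queries spent per *successful* adversarial example."""
    successful = [q for q, ok in zip(query_counts, success_flags) if ok]
    return sum(successful) / len(successful) if successful else float("inf")
```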
Example: Calculating Perturbation Norm
Understanding perturbation size is key. A small L2 norm indicates that the “distance” between the original and adversarial vector is small, implying a subtle change.
```python
# Calculate the L2 norm of an adversarial perturbation
import numpy as np

def calculate_l2_norm(original_input, adversarial_input):
    # Ensure inputs are numerical arrays of the same shape
    original_input = np.asarray(original_input, dtype=float)
    adversarial_input = np.asarray(adversarial_input, dtype=float)
    assert original_input.shape == adversarial_input.shape

    # The perturbation is the element-wise difference between the two inputs
    perturbation = adversarial_input - original_input

    # Square each element, sum the squares, and take the square root
    l2_distance = np.sqrt(np.sum(perturbation ** 2))
    return l2_distance
```
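The other norms in the table fall out of the same perturbation vector: L∞ is the largest single change and L0 counts how many elements changed at all. A quick check of the function above, assuming NumPy and a tiny made-up input:

```python
original = np.array([0.10, 0.50, 0.90, 0.30])
adversarial = np.array([0.12, 0.50, 0.85, 0.30])
perturbation = adversarial - original

print(calculate_l2_norm(original, adversarial))   # ~0.0539
print(np.linalg.norm(perturbation))               # same value, via NumPy directly
print(np.max(np.abs(perturbation)))               # L-infinity norm: 0.05
print(np.count_nonzero(perturbation))             # "L0 norm": 2 elements changed
```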
Data Privacy and Fairness Metrics
A secure model is not just robust—it’s also trustworthy. It must protect the privacy of its training data and apply its logic equitably across different demographic groups.
| Metric | Measures | Context / Use Case | Limitations / Caveats |
|---|---|---|---|
| Differential Privacy (ε, δ) | A mathematical guarantee that the model’s output is statistically similar whether or not a specific individual’s data was included in the training set. | Provides a formal, provable privacy guarantee. Essential for models trained on sensitive user data. | Often involves a direct trade-off with model accuracy. Interpreting epsilon (ε) can be non-intuitive for stakeholders. |
| Membership Inference Accuracy | The success rate of an attack model trying to determine if a specific data point was part of the training set. | A practical, empirical measure of data leakage. A value near 50% (random chance) is ideal. | A low success rate is not a formal guarantee of privacy, unlike Differential Privacy. |
| Demographic Parity | The likelihood of a positive outcome is the same for all demographic groups (e.g., P(positive \| group A) = P(positive \| group B)). | A simple, intuitive measure of fairness. Useful for auditing hiring, loan, or parole models for group-level bias. | Can be incompatible with achieving the highest possible accuracy if the base rates differ between groups. May lead to less qualified individuals being selected to meet quotas. |
| Equalized Odds | The true positive rate and false positive rate are equal across different demographic groups. | A stricter fairness criterion than demographic parity, ensuring the model performs equally well for all groups. | More difficult to satisfy than demographic parity and can further constrain model performance. |
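The last two rows can be audited directly from predictions, ground-truth labels, and a group attribute. Below is a minimal NumPy sketch with made-up arrays and exactly two groups; a demographic parity gap near zero means positive outcomes are handed out at similar rates, while the equalized odds gaps compare true positive and false positive rates across the groups.

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Return (demographic parity gap, TPR gap, FPR gap) for two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        mask = group == g
        positive_rate = y_pred[mask].mean()            # P(positive | group g)
        tpr = y_pred[mask & (y_true == 1)].mean()      # true positive rate
        fpr = y_pred[mask & (y_true == 0)].mean()      # false positive rate
        rates[g] = (positive_rate, tpr, fpr)
    (pr_a, tpr_a, fpr_a), (pr_b, tpr_b, fpr_b) = rates.values()
    return abs(pr_a - pr_b), abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)

# Toy data: binary labels and predictions for groups "A" and "B"
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(fairness_gaps(y_true, y_pred, group))   # (0.5, 0.5, 0.5) on this toy data
```

Demographic parity looks only at the positive prediction rate; equalized odds additionally conditions on the true label, which is why it is the harder criterion to satisfy.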