25.2.3. Statistical Metrics

2025.10.06.
AI Security Blog

Evaluating an AI model’s performance, especially under adversarial conditions, requires a solid grasp of statistical metrics. These are not just academic exercises; they are the tools you use to quantify a model’s vulnerabilities and the success of your attacks. A drop in a specific metric can be the clear signal that your red teaming effort has found a critical flaw.

Fundamental Descriptive Statistics

Before diving into model-specific metrics, let’s review the basics. These help you understand the distribution of data, model outputs, or confidence scores.

Measures of Central Tendency

  • Mean (Average): The sum of all values divided by the number of values. It’s sensitive to outliers.
    $$ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i $$
  • Median: The middle value of a dataset when sorted. It is robust to outliers.
  • Mode: The value that appears most frequently in a dataset.
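
As a quick illustration, the snippet below computes all three measures with Python's standard statistics module; the confidence scores are made-up values for demonstration only.

```python
import statistics

# Made-up confidence scores from a hypothetical classifier (illustrative only)
scores = [0.91, 0.88, 0.95, 0.88, 0.12, 0.90]

print("Mean:  ", statistics.mean(scores))    # pulled down by the 0.12 outlier
print("Median:", statistics.median(scores))  # robust to that outlier
print("Mode:  ", statistics.mode(scores))    # most frequent value, here 0.88
```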

Measures of Dispersion

  • Variance (\(\sigma^2\)): The average of the squared differences from the Mean. It measures how spread out the data is. A high variance in model predictions for similar inputs might indicate instability.
    $$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 $$
  • Standard Deviation ((\sigma)): The square root of the variance. It is expressed in the same units as the data, making it more interpretable than variance.
    $$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $$
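
A minimal sketch of both dispersion measures, assuming NumPy is available; ddof=0 matches the population formulas above, which divide by N rather than N - 1.

```python
import numpy as np

# Made-up prediction scores for near-identical inputs (illustrative only)
scores = np.array([0.91, 0.88, 0.95, 0.88, 0.12, 0.90])

variance = np.var(scores, ddof=0)  # population variance: mean squared deviation from the mean
std_dev = np.std(scores, ddof=0)   # population standard deviation: square root of the variance

print(f"Variance: {variance:.4f}")
print(f"Std dev:  {std_dev:.4f}")  # same units as the scores, easier to interpret
```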

Classification Metrics

For tasks like spam detection, malware analysis, or content moderation, classification metrics are paramount. Most are derived from the Confusion Matrix, which provides a detailed breakdown of correct and incorrect predictions.

The Confusion Matrix

The confusion matrix is a table that visualizes the performance of a classification algorithm. Each row represents the instances in an actual class, while each column represents the instances in a predicted class.

A standard 2×2 Confusion Matrix:

|                 | Predicted Positive                                                   | Predicted Negative                                                     |
|-----------------|----------------------------------------------------------------------|------------------------------------------------------------------------|
| Actual Positive | True Positive (TP): correctly identified as positive                 | False Negative (FN): incorrectly rejected as negative (Type II Error)  |
| Actual Negative | False Positive (FP): incorrectly identified as positive (Type I Error) | True Negative (TN): correctly rejected as negative                     |
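
To make the table concrete, here is a minimal sketch using scikit-learn's confusion_matrix on made-up binary labels (1 = malicious, 0 = benign). With labels sorted ascending, scikit-learn returns the matrix as [[TN, FP], [FN, TP]], so ravel() unpacks the counts in that order.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (1 = malicious, 0 = benign)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# For binary labels, the returned matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
```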

Key Performance Indicators (KPIs)

  • Accuracy: The ratio of correct predictions to the total number of predictions. Warning: Accuracy can be highly misleading for imbalanced datasets. A model that always predicts the majority class can have high accuracy but be useless.

    $$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
  • Precision (Positive Predictive Value): Answers the question: “Of all the instances the model predicted as positive, what proportion was actually positive?” High precision is critical when the cost of a false positive is high (e.g., flagging a safe system as compromised).

    $$ \text{Precision} = \frac{TP}{TP + FP} $$
  • Recall (Sensitivity, True Positive Rate): Answers the question: “Of all the actual positive instances, what proportion did the model correctly identify?” High recall is vital when the cost of a false negative is high (e.g., failing to detect malware).

    $$ \text{Recall} = \frac{TP}{TP + FN} $$
  • F1-Score: The harmonic mean of Precision and Recall. It provides a single score that balances both concerns. It is often more useful than accuracy, especially on imbalanced classes.

    $$ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN} $$
  • Specificity (True Negative Rate): The proportion of actual negatives that were correctly identified. It’s the “recall” for the negative class.

    $$ \text{Specificity} = \frac{TN}{TN + FP} $$
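
The sketch below implements the five formulas above directly from the four confusion-matrix counts; the counts are invented for illustration, and scikit-learn's accuracy_score, precision_score, recall_score, and f1_score would produce the same values from raw labels.

```python
def classification_kpis(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the KPIs defined above from raw confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),
    }

# Made-up counts for a hypothetical malware detector on an imbalanced dataset
kpis = classification_kpis(tp=80, fp=10, fn=20, tn=890)
for name, value in kpis.items():
    print(f"{name:>11}: {value:.3f}")
```

Note how accuracy looks excellent here simply because the negative class dominates, while recall exposes the missed detections.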

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (Recall) against the False Positive Rate.

ROC Curve Diagram: True Positive Rate (Recall) plotted against False Positive Rate (1 - Specificity), each ranging from 0.0 to 1.0. A random classifier traces the diagonal (AUC = 0.5); a good classifier (AUC > 0.5) bows toward the ideal point in the top-left corner.

The Area Under the Curve (AUC) quantifies the overall ability of the model to discriminate between positive and negative classes.

  • AUC = 1: Perfect classifier.
  • AUC = 0.5: No better than random guessing.
  • AUC < 0.5: Worse than random guessing (the model’s predictions are systematically inverted; flipping its labels would yield an AUC above 0.5).
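
A minimal sketch, assuming scikit-learn is available: roc_curve returns the points of the ROC curve as the decision threshold sweeps over the scores, and roc_auc_score condenses the curve into a single AUC value. The labels and scores below are invented for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted scores for the positive class (illustrative only)
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.70]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)               # area under that curve

print(f"AUC = {auc:.2f}")  # 1.0 = perfect separation, 0.5 = random guessing
print("FPR:", fpr)
print("TPR:", tpr)
```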

As a red teamer, your goal might be to craft inputs that push a model’s performance on a specific task down towards the random guess line, effectively neutralizing its utility.

Regression Metrics

When a model predicts a continuous value (e.g., a financial forecast, a system load), you use regression metrics to measure the error between predicted (\(\hat{y}_i\)) and actual (\(y_i\)) values.

  • Mean Absolute Error (MAE): The average absolute difference between the predicted and actual values. It’s easy to interpret as it’s in the same units as the output variable.

    $$ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$
  • Mean Squared Error (MSE): The average of the squared differences. This metric penalizes larger errors more than smaller ones due to the squaring term.

    $$ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$
  • Root Mean Squared Error (RMSE): The square root of the MSE. This brings the metric back to the original units of the target variable, making it more interpretable than MSE while still penalizing large errors.

    $$ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} $$
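
A short sketch computing all three with NumPy; the actual and predicted values are made-up numbers from a hypothetical load-forecasting model (scikit-learn's mean_absolute_error and mean_squared_error would give the same results).

```python
import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 260.0, 245.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average absolute error, in the target's units
mse = np.mean(errors ** 2)      # squaring penalizes the single large miss heavily
rmse = np.sqrt(mse)             # back in the target's units, still outlier-sensitive

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```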

In a red teaming context, you might try to find inputs that cause a regression model to produce a prediction with a very high error (spiking the MSE/RMSE), potentially leading to disastrous downstream decisions based on that faulty prediction.