After running hundreds or thousands of tests, you’re left with a sea of data. Simply reporting that “Model A failed 15% of the time” and “Model B failed 12% of the time” is insufficient. Was Model B’s better performance a genuine improvement or just statistical noise? Statistical evaluators provide the tools to move from observation to confident conclusion, lending mathematical rigor to your red teaming findings.
Comparing Performance: Is the Difference Real?
A core task in red teaming is comparing two groups: a baseline model versus a hardened one, or two different attack strategies. To determine if an observed difference in performance (e.g., success rate, robustness score) is statistically significant, you can use hypothesis tests.
The Independent T-Test
Use the t-test when you’re comparing the means of two independent groups and your data is approximately normally distributed (or, as in the binary-outcome example below, the samples are large enough that the sample means are approximately normal). The test yields a p-value: the probability of observing a difference as extreme as the one you measured, or more extreme, if there were actually no difference between the groups. A small p-value (typically < 0.05) suggests the observed difference is unlikely to be due to random chance alone.
Code Example: Comparing Robustness Scores
import numpy as np
from scipy.stats import ttest_ind
# Simulated robustness scores (1 = robust response, 0 = failure) from two models.
# Each model was tested 100 times; the data are random, so exact results vary per run.
model_A_scores = np.random.binomial(1, 0.85, 100)  # ~85% robust
model_B_scores = np.random.binomial(1, 0.92, 100)  # ~92% robust
# Perform the independent t-test
stat, p_value = ttest_ind(model_A_scores, model_B_scores)
print(f"T-statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation of the result
if p_value < 0.05:
    print("The difference in robustness is statistically significant.")
else:
    print("No significant difference detected between the models.")
When your data isn’t normally distributed (a common scenario with failure rates), a non-parametric alternative like the Mann-Whitney U test is a more appropriate choice.
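As a quick illustration, the same comparison can be run with scipy’s mannwhitneyu. This is only a sketch: it reuses the model_A_scores and model_B_scores arrays from the example above and keeps the same 0.05 threshold.
from scipy.stats import mannwhitneyu
# Non-parametric comparison of the same two score samples.
# The Mann-Whitney U test does not assume normality, only independent samples.
u_stat, p_value = mannwhitneyu(model_A_scores, model_B_scores, alternative="two-sided")
print(f"U-statistic: {u_stat:.1f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("The difference in robustness is statistically significant.")
else:
    print("No significant difference detected between the models.")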
Finding Relationships: Correlation Analysis
Sometimes you want to know if two variables move together. For example, does increasing the complexity of a prompt correlate with a higher chance of eliciting a harmful response? Correlation analysis helps quantify the strength and direction of such relationships.
Pearson and Spearman Correlation
The Pearson correlation coefficient measures the linear relationship between two continuous variables, returning a value between -1 (perfect negative linear correlation) and +1 (perfect positive linear correlation). A value near 0 indicates no linear correlation. The Spearman correlation is used for monotonic relationships, which don’t have to be linear, and works well with ordinal data.
Code Example: Prompt Length vs. Jailbreak Success
from scipy.stats import pearsonr
# Example data: prompt length in tokens and jailbreak success (1 or 0)
prompt_lengths = [25, 50, 60, 85, 110, 130, 150, 200, 220, 250]
jailbreak_success = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
# Calculate Pearson correlation coefficient and p-value
corr_coeff, p_value = pearsonr(prompt_lengths, jailbreak_success)
print(f"Pearson Correlation Coefficient: {corr_coeff:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
if p_value < 0.05:
    print("There is a significant correlation between prompt length and success.")
else:
    print("No significant correlation was found.")
Remember, correlation does not imply causation. This analysis can highlight areas for deeper investigation but cannot prove that longer prompts *cause* jailbreaks.
Quantifying Uncertainty: Confidence Intervals
A single metric, like an average success rate of 7%, is a point estimate. It’s your best guess from the sample data, but the true value for the entire population of possible inputs is likely different. A confidence interval (CI) provides a range of plausible values for that true metric, giving your stakeholders a clear picture of the uncertainty in your findings.
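As a point of comparison, for a simple success-rate metric you can sketch an approximate interval by hand with the normal approximation. The figures below are illustrative, and the approximation gets shaky when successes are rare, which is one reason to prefer the resampling approach described next.
import numpy as np
# Illustrative figures: 15 successes out of 200 attempts
successes, n = 15, 200
p_hat = successes / n                      # point estimate of the success rate
se = np.sqrt(p_hat * (1 - p_hat) / n)      # standard error of a proportion
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se  # ~95% normal-approximation CI
print(f"Point estimate: {p_hat:.3f}, approx. 95% CI: ({ci_low:.3f}, {ci_high:.3f})")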
Bootstrapping for Confidence Intervals
Bootstrapping is a resampling method that is especially useful when the underlying distribution of your metric is unknown. It works by repeatedly sampling with replacement from your observed data to create many simulated datasets, then calculating your metric of interest for each one. The spread of these simulated metrics then yields the confidence interval, for example by taking the 2.5th and 97.5th percentiles for a 95% interval.
Code Example: CI for Mean Attack Success Rate
import numpy as np
from scipy.stats import bootstrap
# Data: 1 indicates a successful attack, 0 a failure.
# We observed 15 successes out of 200 attempts.
attack_results = np.array([1]*15 + [0]*185)
# Generate a 95% confidence interval for the mean success rate
# bootstrap expects a sequence of samples, so the array is wrapped in a tuple
res = bootstrap((attack_results,), np.mean, confidence_level=0.95)
ci_low, ci_high = res.confidence_interval
point_estimate = np.mean(attack_results)
print(f"Point Estimate of Success Rate: {point_estimate:.3f}")
print(f"95% CI for Success Rate: ({ci_low:.3f}, {ci_high:.3f})")
Reporting “the attack success rate is 7.5% with a 95% confidence interval of [4.0%, 11.5%]” is far more powerful and credible than just stating the 7.5% figure alone.
Key Takeaway: Statistical evaluation transforms your red teaming results from a list of anecdotes into defensible, evidence-based findings. Using these tools allows you to quantify uncertainty, verify that observed differences are meaningful, and communicate the reliability of your conclusions with precision.