After running robustness tests and benchmarks, you are left with raw data. This data is only valuable once it’s processed into comparative insights. This section provides practical Python scripts to compare the performance, robustness, and behavior of different models or model versions (e.g., a baseline vs. a fine-tuned or hardened model).
Comparing Core Performance Metrics
The most fundamental comparison involves looking at standard evaluation metrics before and after an attack, or between two different models under the same conditions. A simple script can load results from CSV files and present a summary table, making performance degradation immediately obvious.
Assume you have two result files, baseline_results.csv and hardened_results.csv, with the following structure:
| attack_type | accuracy | attack_success_rate |
|---|---|---|
| FGSM | 0.34 | 0.65 |
| PGD | 0.21 | 0.78 |
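If you want to run the scripts in this section end to end, you can first generate two files with this structure. The baseline numbers match the table above; the hardened numbers are invented purely for illustration:

```python
import pandas as pd

# Baseline numbers from the table above; hardened numbers are
# illustrative placeholders, not real measurements.
pd.DataFrame({
    'attack_type': ['FGSM', 'PGD'],
    'accuracy': [0.34, 0.21],
    'attack_success_rate': [0.65, 0.78],
}).to_csv('baseline_results.csv', index=False)

pd.DataFrame({
    'attack_type': ['FGSM', 'PGD'],
    'accuracy': [0.58, 0.49],
    'attack_success_rate': [0.30, 0.41],
}).to_csv('hardened_results.csv', index=False)
```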
The following Python script uses the pandas library to load and compare these files.
```python
import pandas as pd

def compare_model_metrics(baseline_csv, hardened_csv):
    # Load the datasets from CSV files
    baseline_df = pd.read_csv(baseline_csv)
    hardened_df = pd.read_csv(hardened_csv)

    # Merge dataframes on the 'attack_type' column for direct comparison
    comparison_df = pd.merge(
        baseline_df,
        hardened_df,
        on='attack_type',
        suffixes=('_baseline', '_hardened')
    )

    # Calculate the change in accuracy
    comparison_df['accuracy_change'] = (
        comparison_df['accuracy_hardened'] - comparison_df['accuracy_baseline']
    )

    # Calculate the change in attack success rate
    comparison_df['attack_success_change'] = (
        comparison_df['attack_success_rate_hardened']
        - comparison_df['attack_success_rate_baseline']
    )
    return comparison_df

# Example usage
comparison_results = compare_model_metrics('baseline_results.csv', 'hardened_results.csv')
print(comparison_results)
```
Visualizing Comparative Performance
Numerical tables are useful, but visualizations often communicate results more effectively to a wider audience. A bar chart is excellent for comparing a single metric, like attack success rate, across multiple models and attack types.
Using the merged dataframe from the previous example, you can use matplotlib and seaborn to create a grouped bar chart.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_attack_comparison(comparison_df):
    # Melt the dataframe into long format for seaborn's grouped bar plot
    df_melted = comparison_df.melt(
        id_vars='attack_type',
        value_vars=['attack_success_rate_baseline', 'attack_success_rate_hardened'],
        var_name='model_version',
        value_name='success_rate'
    )

    # Create the plot
    plt.figure(figsize=(10, 6))
    sns.barplot(data=df_melted, x='attack_type', y='success_rate', hue='model_version')
    plt.title('Attack Success Rate: Baseline vs. Hardened Model')
    plt.ylabel('Success Rate')
    plt.xlabel('Attack Type')
    plt.ylim(0, 1)  # Rates are between 0 and 1
    plt.legend(title='Model Version')
    plt.tight_layout()
    plt.savefig('attack_comparison.png')
    plt.show()

# Assuming 'comparison_results' is the dataframe from the previous script
plot_attack_comparison(comparison_results)
```
This code generates a grouped bar chart in which any reduction in attack success rate for the hardened model is immediately visible.
Checking for Statistical Significance
Observing a difference is one thing; determining if that difference is statistically significant is another. This is crucial for making confident claims about a model’s improvement. For instance, if you have lists of scores (e.g., individual success/failure on test cases), you can use a statistical test like the independent t-test to check if the means of the two groups are significantly different.
The scipy library provides tools for this.
```python
from scipy import stats

def check_significance(baseline_scores, hardened_scores):
    # Scores are lists of 0s (failure) and 1s (success)
    # for a particular attack across many trials.
    # Perform Welch's t-test (independent samples, unequal variances)
    t_stat, p_value = stats.ttest_ind(baseline_scores, hardened_scores, equal_var=False)
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    # Interpret the p-value
    alpha = 0.05  # Significance level
    if p_value < alpha:
        print("The difference is statistically significant.")
    else:
        print("The difference is not statistically significant.")

# Example data: attack success outcomes for 100 trials per model
baseline_outcomes = [1] * 65 + [0] * 35  # 65% success
hardened_outcomes = [1] * 30 + [0] * 70  # 30% success
check_significance(baseline_outcomes, hardened_outcomes)
```
A low p-value (typically < 0.05) gives you confidence that the observed improvement in the hardened model is not due to random chance.
Comparing Output Distributions for Generative Models
For generative models (like LLMs), comparing single metrics is often insufficient. You need to understand how the entire distribution of outputs has changed. For example, after applying a defense against toxicity, you want to see a shift in the distribution of toxicity scores for the model’s generations.
A Kernel Density Estimate (KDE) plot is perfect for visualizing and comparing these distributions.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_distribution_comparison(baseline_scores_csv, hardened_scores_csv, metric='toxicity'):
    # Load datasets containing a column with the metric to compare
    baseline_df = pd.read_csv(baseline_scores_csv)
    hardened_df = pd.read_csv(hardened_scores_csv)

    plt.figure(figsize=(10, 6))
    # Plot KDE for both models on the same axes
    sns.kdeplot(baseline_df[metric], label='Baseline Model', fill=True)
    sns.kdeplot(hardened_df[metric], label='Hardened Model', fill=True)

    plt.title(f'Distribution of {metric.capitalize()} Scores')
    plt.xlabel(f'{metric.capitalize()} Score')
    plt.ylabel('Density')
    plt.legend()
    plt.tight_layout()
    plt.savefig('distribution_comparison.png')
    plt.show()

# Example usage
plot_distribution_comparison('baseline_llm_toxicity.csv', 'hardened_llm_toxicity.csv')
```
This script would produce a plot showing two overlapping curves. An effective defense would show the “Hardened Model” curve shifted towards lower toxicity scores compared to the “Baseline Model” curve. This provides a much richer view of the defense’s impact than a simple average score could.
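If you also want a single number summarizing how far apart the two distributions are, the two-sample Kolmogorov-Smirnov test from scipy is one option. A sketch with synthetic scores (the function name `compare_score_distributions` and the beta-distributed sample data are illustrative, not real measurements; in practice you would pass the metric columns loaded from your CSV files):

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_score_distributions(baseline_scores, hardened_scores):
    # The KS statistic is the maximum distance between the two empirical CDFs;
    # a small p-value indicates the distributions differ.
    statistic, p_value = ks_2samp(baseline_scores, hardened_scores)
    print(f"KS statistic: {statistic:.4f}, p-value: {p_value:.4f}")
    return statistic, p_value

# Synthetic toxicity scores for demonstration only
rng = np.random.default_rng(0)
baseline = rng.beta(2, 5, size=500)  # skewed toward moderate toxicity
hardened = rng.beta(1, 9, size=500)  # shifted toward low toxicity
stat, p = compare_score_distributions(baseline, hardened)
```

The KS statistic pairs naturally with the KDE plot: the plot shows the shape of the shift, while the statistic quantifies its magnitude.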