26.3.4 Comparative analysis code

October 6, 2025
AI Security Blog

After running robustness tests and benchmarks, you are left with raw data. This data is only valuable once it’s processed into comparative insights. This section provides practical Python scripts to compare the performance, robustness, and behavior of different models or model versions (e.g., a baseline vs. a fine-tuned or hardened model).

Comparing Core Performance Metrics

The most fundamental comparison involves looking at standard evaluation metrics before and after an attack, or between two different models under the same conditions. A simple script can load results from CSV files and present a summary table, making performance degradation immediately obvious.

Assume you have two result files, baseline_results.csv and hardened_results.csv, with the following structure:

attack_type    accuracy    attack_success_rate
FGSM           0.34        0.65
PGD            0.21        0.78

The following Python script uses the pandas library to load and compare these files.

import pandas as pd

def compare_model_metrics(baseline_csv, hardened_csv):
    # Load the datasets from CSV files
    baseline_df = pd.read_csv(baseline_csv)
    hardened_df = pd.read_csv(hardened_csv)

    # Merge dataframes on the 'attack_type' column for direct comparison
    comparison_df = pd.merge(
        baseline_df,
        hardened_df,
        on='attack_type',
        suffixes=('_baseline', '_hardened')
    )

    # Calculate the change in accuracy
    comparison_df['accuracy_change'] = (
        comparison_df['accuracy_hardened'] - comparison_df['accuracy_baseline']
    )
    
    # Calculate the change in attack success rate
    comparison_df['attack_success_change'] = (
        comparison_df['attack_success_rate_hardened'] - comparison_df['attack_success_rate_baseline']
    )

    return comparison_df

# Example usage
comparison_results = compare_model_metrics('baseline_results.csv', 'hardened_results.csv')
print(comparison_results)
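Absolute deltas can understate the story: a drop from 0.78 to 0.39 is the same 0.39 as a drop from 0.40 to 0.01, yet the second is far more dramatic in relative terms. A small helper can add the relative (percentage) change; this is a sketch assuming the merged column names produced by `compare_model_metrics` above, with a tiny inline dataframe standing in for the CSV files:

```python
import pandas as pd

def add_relative_change(comparison_df):
    """Add the relative (percentage) change in attack success rate."""
    df = comparison_df.copy()
    df['attack_success_pct_change'] = (
        100.0
        * (df['attack_success_rate_hardened'] - df['attack_success_rate_baseline'])
        / df['attack_success_rate_baseline']
    )
    return df

# Inline example: FGSM attack success drops from 0.65 to 0.30
demo = pd.DataFrame({
    'attack_type': ['FGSM'],
    'attack_success_rate_baseline': [0.65],
    'attack_success_rate_hardened': [0.30],
})
print(add_relative_change(demo)[['attack_type', 'attack_success_pct_change']])
```

Here the hardened model cuts the FGSM success rate by roughly 54% relative to the baseline, a more intuitive number to report than the raw 0.35 difference.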

Visualizing Comparative Performance

Numerical tables are useful, but visualizations often communicate results more effectively to a wider audience. A bar chart is excellent for comparing a single metric, like attack success rate, across multiple models and attack types.

Using the merged dataframe from the previous example, you can use matplotlib and seaborn to create a grouped bar chart.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_attack_comparison(comparison_df):
    # Melt the dataframe to make it suitable for seaborn's grouped bar plot
    df_melted = comparison_df.melt(
        id_vars='attack_type',
        value_vars=['attack_success_rate_baseline', 'attack_success_rate_hardened'],
        var_name='model_version',
        value_name='success_rate'
    )
    
    # Create the plot
    plt.figure(figsize=(10, 6))
    sns.barplot(data=df_melted, x='attack_type', y='success_rate', hue='model_version')
    
    plt.title('Attack Success Rate: Baseline vs. Hardened Model')
    plt.ylabel('Success Rate')
    plt.xlabel('Attack Type')
    plt.ylim(0, 1) # Rates are between 0 and 1
    plt.legend(title='Model Version')
    plt.tight_layout()
    plt.savefig('attack_comparison.png')
    plt.show()

# Assuming 'comparison_results' is the dataframe from the previous script
plot_attack_comparison(comparison_results)

This code would generate a chart similar to the one below, clearly showing the reduction in attack success rate for the hardened model.

[Figure: grouped bar chart titled "Attack Success Rate: Baseline vs. Hardened Model", showing success rates (0 to 1) for FGSM and PGD, baseline next to hardened.]

Checking for Statistical Significance

Observing a difference is one thing; determining if that difference is statistically significant is another. This is crucial for making confident claims about a model’s improvement. For instance, if you have lists of scores (e.g., individual success/failure on test cases), you can use a statistical test like the independent t-test to check if the means of the two groups are significantly different.

The scipy library provides tools for this.

from scipy import stats

def check_significance(baseline_scores, hardened_scores):
    # Assuming scores are lists of 0s (failure) and 1s (success)
    # for a particular attack across many trials.
    
    # Perform Welch's t-test; equal_var=False drops the equal-variance assumption
    t_stat, p_value = stats.ttest_ind(baseline_scores, hardened_scores, equal_var=False)

    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    # Interpret the p-value
    alpha = 0.05 # Significance level
    if p_value < alpha:
        print("The difference is statistically significant.")
    else:
        print("The difference is not statistically significant.")

# Example data: Attack success outcomes for 100 trials
baseline_outcomes = [1] * 65 + [0] * 35 # 65% success
hardened_outcomes = [1] * 30 + [0] * 70 # 30% success

check_significance(baseline_outcomes, hardened_outcomes)

A low p-value (typically < 0.05) gives you confidence that the observed improvement in the hardened model is not due to random chance.
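Because these outcomes are binary (success/failure), a test designed for proportions is the textbook choice; the t-test above is a reasonable large-sample approximation, but a chi-square test of independence on the 2x2 contingency table is a cleaner fit. A sketch using `scipy.stats.chi2_contingency` on the same example counts:

```python
from scipy.stats import chi2_contingency

def check_proportion_significance(baseline_outcomes, hardened_outcomes):
    """Chi-square test of independence on a 2x2 success/failure table."""
    table = [
        [sum(baseline_outcomes), len(baseline_outcomes) - sum(baseline_outcomes)],
        [sum(hardened_outcomes), len(hardened_outcomes) - sum(hardened_outcomes)],
    ]
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value

baseline_outcomes = [1] * 65 + [0] * 35  # 65% attack success
hardened_outcomes = [1] * 30 + [0] * 70  # 30% attack success

chi2, p = check_proportion_significance(baseline_outcomes, hardened_outcomes)
print(f"Chi-square: {chi2:.4f}, p-value: {p:.6f}")
```

With these counts both tests agree that the drop from 65% to 30% is highly significant; the two approaches mainly diverge when sample sizes are small or success rates sit near 0 or 1.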

Comparing Output Distributions for Generative Models

For generative models (like LLMs), comparing single metrics is often insufficient. You need to understand how the entire distribution of outputs has changed. For example, after applying a defense against toxicity, you want to see a shift in the distribution of toxicity scores for the model’s generations.

A Kernel Density Estimate (KDE) plot is perfect for visualizing and comparing these distributions.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_distribution_comparison(baseline_scores_csv, hardened_scores_csv, metric='toxicity'):
    # Load datasets containing a column with the metric to compare
    baseline_df = pd.read_csv(baseline_scores_csv)
    hardened_df = pd.read_csv(hardened_scores_csv)

    plt.figure(figsize=(10, 6))
    
    # Plot KDE for both models on the same axes
    sns.kdeplot(baseline_df[metric], label='Baseline Model', fill=True)
    sns.kdeplot(hardened_df[metric], label='Hardened Model', fill=True)
    
    plt.title(f'Distribution of {metric.capitalize()} Scores')
    plt.xlabel(f'{metric.capitalize()} Score')
    plt.ylabel('Density')
    plt.legend()
    plt.tight_layout()
    plt.savefig('distribution_comparison.png')
    plt.show()

# Example usage
plot_distribution_comparison('baseline_llm_toxicity.csv', 'hardened_llm_toxicity.csv')

This script would produce a plot showing two overlapping curves. An effective defense would show the “Hardened Model” curve shifted towards lower toxicity scores compared to the “Baseline Model” curve. This provides a much richer view of the defense’s impact than a simple average score could.
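To back the visual comparison with a number, a two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) quantifies how different the two score distributions are. The sketch below uses synthetic beta-distributed scores as a stand-in for real toxicity outputs; the shapes are illustrative only:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Synthetic toxicity scores in [0, 1]; the "hardened" distribution
# is deliberately shifted toward lower scores.
baseline_scores = rng.beta(a=2.0, b=5.0, size=500)  # mean ~0.29
hardened_scores = rng.beta(a=1.0, b=9.0, size=500)  # mean ~0.10

statistic, p_value = ks_2samp(baseline_scores, hardened_scores)
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.2e}")
```

The KS statistic is the maximum gap between the two empirical CDFs (0 means identical, 1 means fully separated), so it captures shifts anywhere in the distribution, not just in the mean, which pairs naturally with the KDE plot above.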