26.2.1 Adversarial training code

2025.10.06.
AI Security Blog

Adversarial training is not merely a defense; it’s a paradigm shift in how models learn. Instead of training a model solely on clean, well-behaved data, you deliberately expose it to worst-case examples during its education. This forces the model to develop a more robust decision boundary, making it less susceptible to the subtle perturbations that define adversarial attacks. It’s the machine learning equivalent of vaccinating a system against known threats.

The Adversarial Training Loop

The core of adversarial training is a continuous cycle. You don’t just generate one batch of adversarial examples and train on them once. Instead, you integrate the attack generation process directly into the training loop. For each batch of training data, you create corresponding adversarial versions and train the model on a mix of both. This dynamic process prevents the model from simply memorizing the specific perturbations of a static adversarial dataset.

In outline, the loop looks like this:

1. Get a batch of clean data.
2. Generate adversarial examples from that batch.
3. Train the model on the mixed batch.
4. Repeat for the next batch.

Implementation in Practice

To implement adversarial training, you need two key components: an attack function to generate adversarial examples and a modified training loop to incorporate them. Below are simplified PyTorch-style examples demonstrating these components.

Step 1: Generating Adversarial Examples (FGSM)

The Fast Gradient Sign Method (FGSM) is a classic and computationally efficient way to generate adversarial examples. It calculates the gradient of the loss with respect to the input data and then adds a small, signed perturbation in the direction that maximizes the loss.
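
In symbols, FGSM computes x_adv = x + ε · sign(∇_x L(θ, x, y)): ε sets the size of the perturbation, and the sign of the input gradient gives the direction that locally increases the loss the most.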


# PyTorch-style FGSM attack function
import torch
import torch.nn.functional as F

def fgsm_attack(model, loss_fn, images, labels, epsilon):
    # Work on a detached copy and request gradients w.r.t. the input images
    images = images.clone().detach().requires_grad_(True)

    # Forward pass to get predictions
    outputs = model(images)
    model.zero_grad() # Clear any stale gradients on the model

    # Calculate loss and backpropagate to get the input gradient
    loss = loss_fn(outputs, labels)
    loss.backward()

    # Take the sign of the input gradient and build the perturbation
    grad_sign = images.grad.sign()
    perturbed_images = images + epsilon * grad_sign

    # Clamp the result to keep pixel values in the valid range [0, 1]
    perturbed_images = torch.clamp(perturbed_images, 0, 1)

    # Detach so later training steps do not backpropagate through the attack
    return perturbed_images.detach()

Step 2: The Adversarial Training Loop in Code

The training loop is where the magic happens. Instead of just feeding the original data to the model, you first use the attack function to craft adversarial versions. Then you can train on the adversarial examples alone or on a combination of clean and adversarial data.


# Simplified training loop incorporating the FGSM attack
def train_adversarially(model, dataloader, optimizer, loss_fn, epsilon):
    model.train() # Set model to training mode

    for (clean_images, labels) in dataloader:
        # 1. Generate adversarial examples from the clean batch
        adv_images = fgsm_attack(model, loss_fn, clean_images, labels, epsilon)

        # 2. Clear gradients before the training step
        optimizer.zero_grad()

        # 3. Forward pass with the adversarial images
        adv_outputs = model(adv_images)
        loss = loss_fn(adv_outputs, labels)

        # 4. Backward pass and optimizer step
        loss.backward()
        optimizer.step()
    
    print("Completed one epoch of adversarial training.")

In this example, we train exclusively on adversarial data for simplicity. A common variation is to train on a combined batch of `clean_images` and `adv_images` to prevent the model from forgetting how to classify non-adversarial inputs, a phenomenon known as “catastrophic forgetting.”
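
As a rough sketch of that combined-batch variation (the helper name and the simple 50/50 mix are illustrative choices, not part of the original code), the inner training step can concatenate clean and adversarial examples before the forward pass:

# Sketch of a mixed clean/adversarial training step (replaces steps 2-4 above)
import torch

def mixed_training_step(model, optimizer, loss_fn, clean_images, labels, epsilon):
    # Craft adversarial versions of the current batch
    adv_images = fgsm_attack(model, loss_fn, clean_images, labels, epsilon)

    # Concatenate clean and adversarial inputs (and duplicate the labels to match)
    mixed_images = torch.cat([clean_images, adv_images], dim=0)
    mixed_labels = torch.cat([labels, labels], dim=0)

    # Standard training step on the combined batch
    optimizer.zero_grad()
    outputs = model(mixed_images)
    loss = loss_fn(outputs, mixed_labels)
    loss.backward()
    optimizer.step()
    return loss.item()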

Strategic Considerations and Trade-offs

As a red teamer, understanding the costs and limitations of adversarial training is as important as knowing how it works. This defense is not a silver bullet and introduces its own set of challenges that you can potentially exploit.

| Advantage / Pro | Disadvantage / Con |
| --- | --- |
| Increased Robustness: Directly improves model resilience against the specific attack type used for training (e.g., L-infinity norm attacks like FGSM). | Decreased Standard Accuracy: Often, robust models perform slightly worse on clean, non-adversarial data. This is the classic accuracy-robustness trade-off. |
| Generalization to Similar Attacks: Training against one gradient-based attack (like PGD) can confer resistance to others (like FGSM or BIM). | Attack Specificity: The model can overfit to the training attack, remaining vulnerable to different attack types (e.g., L0 attacks, patches, or black-box methods). |
| Smoother Decision Boundaries: Forces the model to learn more meaningful features, leading to less noisy and more interpretable gradients. | High Computational Cost: Generating adversarial examples for every batch significantly increases training time and resource requirements. |
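
The table mentions PGD (Projected Gradient Descent), the iterative attack most commonly used inside adversarial training in practice. As a rough sketch (the step size, iteration count, and random start are common but illustrative choices), PGD is essentially FGSM applied repeatedly, with each step projected back into the epsilon-ball around the original input:

# Sketch of a PGD attack: iterated FGSM with a random start and projection
import torch

def pgd_attack(model, loss_fn, images, labels, epsilon, alpha=2/255, steps=10):
    original = images.clone().detach()

    # Start from a random point inside the epsilon-ball (a common PGD choice)
    adv = original + torch.empty_like(original).uniform_(-epsilon, epsilon)
    adv = torch.clamp(adv, 0, 1)

    for _ in range(steps):
        adv = adv.clone().detach().requires_grad_(True)

        # Gradient of the loss w.r.t. the current adversarial input
        outputs = model(adv)
        model.zero_grad()
        loss = loss_fn(outputs, labels)
        loss.backward()

        # One FGSM-style step, then project back into the epsilon-ball
        adv = adv + alpha * adv.grad.sign()
        adv = torch.clamp(adv, original - epsilon, original + epsilon)
        adv = torch.clamp(adv, 0, 1)

    return adv.detach()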

Your red teaming objective might shift from simply fooling the model to testing the limits of its robustness. Can you find an attack vector the defenders didn’t train against? Can you craft an attack that bypasses the defense by exploiting the accuracy-robustness trade-off? Knowing how the defense is built is the first step to figuring out how to tear it down.