22.5.1 Setting up adversarial training

2025.10.06.
AI Security Blog

Core Idea: Adversarial training is a proactive defense that hardens a model by exposing it to adversarial examples during the training process. Instead of just learning from clean data, the model also learns to correctly classify inputs that have been intentionally perturbed to cause misclassification. Think of it as an inoculation: you expose the model to a controlled version of the “threat” to build its immunity.

The Adversarial Training Loop

At its heart, adversarial training modifies the standard model training loop. Instead of simply feeding batches of clean data to the model, you augment each batch with adversarial counterparts. This forces the model’s optimization process to find parameters that are not only good at classifying clean data but are also resilient to the specific types of perturbations used to create the adversarial examples.
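Formally, this loop approximates a min-max (saddle-point) objective, in the formulation popularized by Madry et al.: the inner maximization finds the worst-case perturbation δ within an ε-ball, and the outer minimization fits the model to those worst-case inputs:

```latex
\min_{\theta} \; \mathbb{E}_{(x,\, y) \sim \mathcal{D}}
\left[ \max_{\|\delta\|_{\infty} \le \epsilon}
\mathcal{L}\big(f_{\theta}(x + \delta),\, y\big) \right]
```

The attack algorithm (e.g., PGD) is simply an approximate solver for the inner maximization.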

The process can be visualized as a continuous cycle where an “attacker” generates difficult examples and the “defender” (the model) learns from them.

1. Sample a clean data batch (X, Y).
2. The adversary (e.g., PGD) generates perturbed inputs X′ using the current model state.
3. Combine (X, X′) with the labels Y.
4. Update the model weights, then feed the updated model back to the adversary (feedback loop).

Implementation Steps

Implementing adversarial training requires integrating an attack algorithm directly into your training pipeline. Here is a breakdown of the necessary components and workflow.

1. Choose Your Adversary

The first step is to select an attack method for generating adversarial examples. The choice is a trade-off between computational cost and the strength of the resulting defense.

Fast Gradient Sign Method (FGSM)
    Description: A fast, single-step attack that adds a small perturbation in the direction of the loss function’s gradient.
    Pros: Very fast; low computational overhead.
    Cons: Produces weaker adversaries; the model may overfit to this specific attack.

Projected Gradient Descent (PGD)
    Description: An iterative, stronger attack. It takes multiple small steps in the gradient direction, projecting the result back into an allowed perturbation space (e.g., an L-infinity ball) after each step.
    Pros: Creates strong adversaries; leads to more robust models. The de facto standard for robust training.
    Cons: Computationally expensive; significantly slows down training (e.g., 5-10x slower).
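To make the FGSM row concrete, here is a minimal sketch of the attack on a logistic-regression model. It is written in plain NumPy (rather than the PyTorch-style pseudocode used later) so it is fully self-contained; all function names (`fgsm`, `logistic_loss`, `loss_grad_x`) are illustrative, not from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, x, y):
    # Binary logistic loss for a linear model; label y is in {-1, +1}.
    return np.log1p(np.exp(-y * np.dot(w, x)))

def loss_grad_x(w, x, y):
    # Gradient of the loss with respect to the *input* x (not the weights).
    return -y * sigmoid(-y * np.dot(w, x)) * w

def fgsm(w, x, y, epsilon):
    # Single step of size epsilon in the sign of the input gradient.
    return x + epsilon * np.sign(loss_grad_x(w, x, y))
```

Because the step follows the sign of the input gradient, the perturbation is bounded by ε in the L-infinity norm and, for a linear model, is guaranteed to increase the loss.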

2. Modify the Training Loop

You need to alter your standard training code. For each batch of data, you will perform an inner loop or function call that generates the adversarial versions of the inputs before performing the main optimization step.

The following pseudocode illustrates this modification in a typical deep learning framework.

# model: your neural network
# optimizer: your optimization algorithm (e.g., Adam, SGD)
# criterion: your loss function (e.g., cross-entropy)
# dataloader: provides batches of (images, labels)
# adversary: an object that implements an attack (e.g., PGD)

for epoch in range(num_epochs):
    for images, labels in dataloader:
        # 1. Generate adversarial examples for the current batch.
        # The adversary needs the model to compute input gradients.
        adv_images = adversary.attack(model, images, labels)

        # 2. Combine original and adversarial data (optional but common).
        combined_images = torch.cat([images, adv_images], dim=0)
        combined_labels = torch.cat([labels, labels], dim=0)

        # 3. Perform a standard training step on the augmented batch.
        optimizer.zero_grad()
        outputs = model(combined_images)
        loss = criterion(outputs, combined_labels)
        loss.backward()
        optimizer.step()
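The `adversary.attack` call above is left abstract. A minimal PGD sketch is shown below, written framework-agnostically in NumPy: instead of a model, it takes a `grad_fn` callback that returns the gradient of the loss with respect to the input (in PyTorch this would come from autograd). The function name and signature are illustrative assumptions, not a library API.

```python
import numpy as np

def pgd_attack(grad_fn, x, y, epsilon, alpha, num_steps, rng=None):
    """Projected Gradient Descent under an L-infinity constraint.

    grad_fn(x, y) must return the gradient of the loss w.r.t. the input x.
    """
    # Optional random start inside the epsilon-ball (common in practice).
    if rng is not None:
        delta = rng.uniform(-epsilon, epsilon, size=x.shape)
    else:
        delta = np.zeros_like(x)
    for _ in range(num_steps):
        # Ascend the loss by a signed step, then project the total
        # perturbation back into the allowed L-infinity ball.
        delta = delta + alpha * np.sign(grad_fn(x + delta, y))
        delta = np.clip(delta, -epsilon, epsilon)
    return x + delta
```

The projection step (`np.clip`) is what distinguishes PGD from simply repeating FGSM: no matter how many iterations run, the total perturbation never leaves the ε-ball.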

3. Key Parameters to Tune

Effective adversarial training is not a “fire-and-forget” process. You must carefully configure the parameters of your chosen adversary.

  • Epsilon (ε): This is the most critical parameter. It defines the maximum magnitude of the perturbation allowed. For image data normalized to [0, 1], common values are 8/255 or 16/255. A larger epsilon creates stronger attacks but can also make training unstable or harm accuracy on clean data.
  • Step Size (α): For iterative attacks like PGD, this controls the size of each gradient step. It should be smaller than epsilon, but large enough that the attack can reach the boundary of the ε-ball within its step budget. A common heuristic is α ≈ 2.5 · ε / number_of_steps.
  • Number of Iterations: For PGD, this determines how many steps the attack takes. More steps create stronger adversaries but increase computational cost. 7-10 steps is a common starting point.
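Putting these parameters into numbers, here is an illustrative configuration for L-infinity PGD training on images normalized to [0, 1]; the specific values are examples, not prescriptions, and the α heuristic is one common choice rather than a fixed rule.

```python
epsilon = 8 / 255                   # maximum L-infinity perturbation
num_steps = 10                      # PGD iterations per batch
alpha = 2.5 * epsilon / num_steps   # step size: small steps, but enough
                                    # total budget to cross the epsilon-ball

# Each step stays well inside the ball, while the cumulative budget
# (alpha * num_steps) can traverse its full diameter.
assert alpha < epsilon
assert alpha * num_steps >= epsilon
```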

Considerations and Trade-offs

While powerful, adversarial training introduces several important considerations:

  1. The Robustness-Accuracy Trade-off: Models trained adversarially often exhibit slightly lower accuracy on clean, unperturbed data. This phenomenon, sometimes called the “robustness tax,” occurs because the model dedicates some of its capacity to handling noisy, adversarial inputs, potentially at the expense of performance on the original data distribution.
  2. Computational Cost: Generating adversarial examples for every batch requires additional forward and backward passes through the model, significantly increasing training time and resource requirements. PGD-based training can be an order of magnitude slower than standard training.
  3. Overfitting to the Threat Model: A model trained exclusively against one type of attack (e.g., L-infinity PGD) may remain vulnerable to others (e.g., L2 attacks, rotation/translation attacks, etc.). This is why red teaming is crucial—to test for vulnerabilities beyond the specific threat model used in training.

Your goal as a red teamer or defender is to find the right balance: a model that is meaningfully more robust against a defined threat model without an unacceptable drop in its primary performance metrics.