26.1.1 FGSM, PGD, C&W implementations

2025.10.06.
AI Security Blog

This appendix provides practical, condensed Python implementations for three foundational gradient-based adversarial attacks: the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and the Carlini & Wagner (C&W) L2 attack. These methods form the bedrock of evasion attacks against neural networks and are essential tools for any AI Red Teamer evaluating model robustness.

The examples use PyTorch and assume you have a pre-trained model (model) that returns raw logits, a batched input tensor (image) with pixel values in [0, 1], and the correct label (label). The focus is purely on the attack logic.
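
The snippet below is a minimal, hypothetical setup sketch showing one way such an interface might look; the ResNet-18 weights, the image path example.jpg, and the class index 208 are placeholders, not part of the original examples.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Hypothetical setup: a pretrained ImageNet classifier and a single image.
# Replace the weights, image path, and label with your own.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Normalization is intentionally omitted so the attacks can clamp to [0, 1];
# in practice you would fold normalization into the model's forward pass.
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # shape [1, 3, 224, 224], values in [0, 1]
label = torch.tensor([208])  # ground-truth class index (placeholder)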

1. Fast Gradient Sign Method (FGSM)

Concept

FGSM is a “one-shot” attack that makes a single, decisive step to maximize the model’s loss. It computes the gradient of the loss function with respect to the input data and then perturbs the input in the direction of the sign of that gradient. This direction is the most efficient way to increase the loss, pushing the model towards a misclassification with minimal computational effort.

Core Idea: Take one large step in the direction that most increases the model’s error.

Formula: x_adv = x + ε * sign(∇_x J(θ, x, y))

Python Implementation (PyTorch)


import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon):
    # Work on a detached copy and request gradient computation for the input
    image = image.clone().detach()
    image.requires_grad = True

    # Forward pass to get model output (logits)
    output = model(image)

    # Calculate the loss (cross-entropy expects raw logits)
    loss = F.cross_entropy(output, label)

    # Zero all existing gradients
    model.zero_grad()

    # Backward pass to compute gradient of loss w.r.t. image
    loss.backward()

    # Get the sign of the input gradient
    sign_gradient = image.grad.sign()

    # Create the perturbed image by adjusting each pixel
    perturbed_image = image + epsilon * sign_gradient

    # Clip the image to maintain the valid range [0, 1]
    perturbed_image = torch.clamp(perturbed_image, 0, 1)

    return perturbed_image.detach()

Parameters and Usage

  • model: The target PyTorch model.
  • image: The input tensor to be perturbed.
  • label: The ground-truth label for the image.
  • epsilon (ε): A small scalar value (e.g., 0.03, 8/255) that controls the magnitude of the perturbation. Higher values create more noticeable changes but are more likely to fool the model.
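
A minimal usage sketch, assuming the model, image, and label tensors described above; the epsilon value of 8/255 is only an illustrative choice.

epsilon = 8 / 255  # illustrative perturbation budget

adv_image = fgsm_attack(model, image, label, epsilon)

with torch.no_grad():
    clean_pred = model(image).argmax(dim=1)
    adv_pred = model(adv_image).argmax(dim=1)

print(f"Clean prediction: {clean_pred.item()}, adversarial prediction: {adv_pred.item()}")
print("Attack succeeded" if adv_pred.item() != label.item() else "Attack failed")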

2. Projected Gradient Descent (PGD)

Concept

PGD is an iterative, more powerful extension of FGSM. Instead of one large step, PGD takes multiple smaller steps. After each step, it “projects” the perturbed image back into an ε-radius “ball” around the original image. This constraint prevents the perturbation from becoming too large while allowing the attack to find more subtle and effective adversarial examples. It is widely considered a benchmark for evaluating model robustness.
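
Core Idea: Take many small steps in the gradient-sign direction, projecting back into the ε-ball after each one.

Formula (per step): x_adv^(t+1) = Π_ε( x_adv^(t) + α * sign(∇_x J(θ, x_adv^(t), y)) ), where Π_ε projects back into the ε-ball around the original x.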

Figure: PGD attack illustration, showing iterative steps from x_orig inside the ε-ball boundary, with each step projected back onto the ball, ending at x_adv.

Python Implementation (PyTorch)


import torch
import torch.nn.functional as F

def pgd_attack(model, image, label, epsilon, alpha, num_iter):
    # Start from a detached copy of the original image
    perturbed_image = image.clone().detach()

    for _ in range(num_iter):
        perturbed_image.requires_grad = True
        output = model(perturbed_image)
        loss = F.cross_entropy(output, label)

        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Take a small step in the direction of the gradient sign
            data_grad = perturbed_image.grad
            perturbed_image = perturbed_image + alpha * data_grad.sign()

            # Project back into the epsilon-ball and the valid pixel range
            eta = torch.clamp(perturbed_image - image, -epsilon, epsilon)
            perturbed_image = torch.clamp(image + eta, 0, 1)

    return perturbed_image.detach()

Parameters and Usage

  • epsilon (ε): The maximum total perturbation allowed (radius of the L-infinity ball).
  • alpha (α): The step size for each iteration. Typically smaller than epsilon (e.g., epsilon / 10).
  • num_iter: The number of iterations to run (e.g., 10, 40). More iterations lead to a stronger attack but increase computation time.
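
A minimal usage sketch with illustrative hyperparameters (ε = 8/255, α = 2/255, 40 iterations), assuming the same model, image, and label as above.

epsilon = 8 / 255   # total perturbation budget
alpha = 2 / 255     # per-step size
num_iter = 40       # number of attack iterations

adv_image = pgd_attack(model, image, label, epsilon, alpha, num_iter)

with torch.no_grad():
    adv_pred = model(adv_image).argmax(dim=1)

# The L-infinity norm of the perturbation should not exceed epsilon
print(f"Max perturbation: {(adv_image - image).abs().max().item():.4f}")
print(f"Adversarial prediction: {adv_pred.item()}")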

3. Carlini & Wagner (C&W) L2 Attack

Concept

The C&W attack is an optimization-based method that aims to find the minimum possible perturbation (measured by L2 distance) that causes a misclassification. Instead of maximizing a loss function, it minimizes a custom objective that balances two goals: making the perturbed image look like the original (small L2 distance) and ensuring the model misclassifies it with high confidence.
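
In the untargeted form implemented below (with confidence margin κ = 0), the objective being minimized is:

Formula: minimize ||δ||₂² + c * f(x + δ), where f(x') = max( Z(x')_y − max_{i≠y} Z(x')_i, 0 ) and Z(·) denotes the model's logits.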

This attack is significantly more complex and computationally intensive but is highly effective and produces perturbations that are often imperceptible to the human eye.

Python Implementation (PyTorch Conceptual)

A full C&W implementation is extensive. The code below illustrates the core optimization loop and custom loss function concept.


import torch
import torch.optim as optim

def cw_l2_attack(model, image, label, c, num_iter, lr):
    # Change of variables: optimize w in an unbounded space, where
    # 0.5 * (tanh(w) + 1) always lies in the valid image range [0, 1].
    # The factor 0.999999 keeps atanh away from its singularities at +/-1.
    w = torch.atanh((image * 2 - 1) * 0.999999).detach()
    w.requires_grad = True

    optimizer = optim.Adam([w], lr=lr)
    target = int(label)

    for _ in range(num_iter):
        # Map w back to valid image space [0, 1]
        perturbed_image = 0.5 * (torch.tanh(w) + 1)

        output = model(perturbed_image)

        # C&W loss term: push the logit of the correct class below
        # the largest logit of any other class.
        correct_logit = output[0, target]
        other_logits = torch.cat((output[0, :target], output[0, target + 1:]))
        max_other_logit = torch.max(other_logits)

        # This term reaches zero once the attack succeeds
        f_loss = torch.clamp(correct_logit - max_other_logit, min=0)

        # L2 distance between the perturbed and original image
        l2_dist = torch.sum((perturbed_image - image) ** 2)

        # The full objective: small perturbation + successful misclassification
        loss = l2_dist + c * f_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()

Parameters and Usage

  • c: A trade-off constant that weights the misclassification term against the L2 distance. Higher values prioritize misclassification over perturbation size; a binary search is often used to find the smallest c that still produces a successful attack.
  • num_iter: Number of optimization steps (e.g., 1000).
  • lr: Learning rate for the optimizer (e.g., 0.01).
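
A minimal usage sketch with illustrative settings, assuming the same model, image, and label as above; in practice you would wrap this in a binary search over c and keep the successful result with the smallest perturbation.

c = 1.0          # trade-off constant (often found via binary search)
num_iter = 1000  # optimization steps
lr = 0.01        # Adam learning rate

adv_image = cw_l2_attack(model, image, label, c, num_iter, lr)

with torch.no_grad():
    adv_pred = model(adv_image).argmax(dim=1)

l2_norm = torch.norm((adv_image - image).flatten(), p=2).item()
print(f"Adversarial prediction: {adv_pred.item()}, L2 distance: {l2_norm:.4f}")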

Attack Comparison Summary

Choosing the right attack depends on your red teaming objective. Are you looking for a quick baseline (FGSM), a strong benchmark for robustness (PGD), or the absolute minimum perturbation required to fool the model (C&W)?

| Characteristic | FGSM | PGD | C&W |
| --- | --- | --- | --- |
| Speed | Very Fast (1 step) | Moderate (Iterative) | Very Slow (Optimization-based) |
| Strength | Low | High (Benchmark standard) | Very High (State-of-the-art) |
| Perturbation Perceptibility | High (often noisy) | Moderate | Very Low (optimized for minimal change) |
| Implementation Complexity | Low | Low-Moderate | High |