This appendix provides practical, condensed Python implementations for three foundational gradient-based adversarial attacks: the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and the Carlini & Wagner (C&W) L2 attack. These methods form the bedrock of evasion attacks against neural networks and are essential tools for any AI Red Teamer evaluating model robustness.
The examples use PyTorch and assume you have a pre-trained classifier (`model`), a batched input tensor (`image`) with pixel values in [0, 1], and its ground-truth class index (`label`). The focus is purely on the attack logic.
1. Fast Gradient Sign Method (FGSM)
Concept
FGSM is a “one-shot” attack that makes a single, decisive step to maximize the model’s loss. It computes the gradient of the loss function with respect to the input data and then perturbs the input in the direction of the sign of that gradient. This direction is the most efficient way to increase the loss, pushing the model towards a misclassification with minimal computational effort.
Formula:
x_adv = x + ε * sign(∇_x J(θ, x, y))
Python Implementation (PyTorch)
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon):
    # Work on a detached copy and request gradient computation for the input image
    image = image.clone().detach()
    image.requires_grad = True

    # Forward pass to get model output (logits)
    output = model(image)

    # Calculate the loss (cross-entropy operates directly on logits)
    loss = F.cross_entropy(output, label)

    # Zero all existing gradients
    model.zero_grad()

    # Backward pass to compute gradient of loss w.r.t. image
    loss.backward()

    # Get the sign of the gradient
    sign_gradient = image.grad.sign()

    # Create the perturbed image by adjusting each pixel
    perturbed_image = image + epsilon * sign_gradient

    # Clip the image to maintain valid range [0, 1]
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image.detach()
Parameters and Usage
- `model`: The target PyTorch model.
- `image`: The input tensor to be perturbed.
- `label`: The ground-truth label for the image.
- `epsilon` (ε): A small scalar value (e.g., 0.03, 8/255) that controls the magnitude of the perturbation. Higher values create more noticeable changes but are more likely to fool the model.
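For reference, a minimal invocation might look like the following sketch. It assumes the `model`, `image`, and `label` variables described above, with `image` batched and scaled to [0, 1]; the ε = 8/255 budget is an illustrative choice rather than part of the attack.

import torch

# Illustrative budget; 8/255 is a common L-infinity epsilon for [0, 1] images
epsilon = 8 / 255

adv_image = fgsm_attack(model, image, label, epsilon)

with torch.no_grad():
    clean_pred = model(image).argmax(dim=1)
    adv_pred = model(adv_image).argmax(dim=1)

print(f"Clean prediction: {clean_pred.item()}, adversarial prediction: {adv_pred.item()}")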
2. Projected Gradient Descent (PGD)
Concept
PGD is an iterative, more powerful extension of FGSM. Instead of one large step, PGD takes multiple smaller steps. After each step, it “projects” the perturbed image back into an ε-radius “ball” around the original image. This constraint prevents the perturbation from becoming too large while allowing the attack to find more subtle and effective adversarial examples. It is widely considered a benchmark for evaluating model robustness.
Python Implementation (PyTorch)
import torch
import torch.nn.functional as F

def pgd_attack(model, image, label, epsilon, alpha, num_iter):
    # Start with the original image
    perturbed_image = image.clone().detach()

    for _ in range(num_iter):
        perturbed_image.requires_grad = True

        output = model(perturbed_image)
        loss = F.cross_entropy(output, label)

        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Take a small step in the gradient direction
            data_grad = perturbed_image.grad
            perturbed_image = perturbed_image + alpha * data_grad.sign()

            # Project back into the epsilon-ball around the original image
            eta = torch.clamp(perturbed_image - image, -epsilon, epsilon)
            perturbed_image = torch.clamp(image + eta, 0, 1)

    return perturbed_image.detach()
Parameters and Usage
- `epsilon` (ε): The maximum total perturbation allowed (the radius of the L-infinity ball).
- `alpha` (α): The step size for each iteration, typically smaller than `epsilon` (e.g., `epsilon / 10`).
- `num_iter`: The number of iterations to run (e.g., 10, 40). More iterations lead to a stronger attack but increase computation time.
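A typical evaluation call under a common L-infinity budget might look like the sketch below; the specific ε, α, and iteration count are illustrative choices, not fixed requirements.

import torch

# Illustrative hyperparameters drawn from common robustness-evaluation settings
epsilon = 8 / 255
alpha = 2 / 255
num_iter = 40

adv_image = pgd_attack(model, image, label, epsilon, alpha, num_iter)

with torch.no_grad():
    adv_pred = model(adv_image).argmax(dim=1)

# The attack succeeded if the prediction no longer matches the true label
print("Attack success:", bool((adv_pred != label).item()))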
3. Carlini & Wagner (C&W) L2 Attack
Concept
The C&W attack is an optimization-based method that aims to find the minimum possible perturbation (measured by L2 distance) that causes a misclassification. Instead of maximizing a loss function, it minimizes a custom objective that balances two goals: making the perturbed image look like the original (small L2 distance) and ensuring the model misclassifies it with high confidence.
This attack is significantly more complex and computationally intensive but is highly effective and produces perturbations that are often imperceptible to the human eye.
Python Implementation (PyTorch Conceptual)
A full C&W implementation (including the binary search over c and tracking of the best result) is extensive. The code below illustrates the core change of variables, optimization loop, and custom loss function.
import torch
import torch.optim as optim

def cw_l2_attack(model, image, label, c, num_iter, lr):
    # Change of variables: optimize w in an unbounded space where
    # image = 0.5 * (tanh(w) + 1). Map [0, 1] pixels into (-1, 1) before atanh.
    w = torch.atanh(torch.clamp(image * 2 - 1, -0.999999, 0.999999)).clone().detach()
    w.requires_grad = True

    optimizer = optim.Adam([w], lr=lr)
    label_idx = label.item()

    for step in range(num_iter):
        # Map w back to valid image space [0, 1]
        perturbed_image = 0.5 * (torch.tanh(w) + 1)

        output = model(perturbed_image)

        # C&W loss function: we want the logit of the correct class
        # to be lower than the maximum of the other logits.
        correct_logit = output[0, label_idx]
        other_logits = torch.cat((output[0, :label_idx], output[0, label_idx + 1:]))
        max_other_logit = torch.max(other_logits)

        # This term reaches zero once the attack succeeds
        f_loss = torch.clamp(correct_logit - max_other_logit, min=0)

        # L2 distance between the perturbed and original images
        l2_dist = torch.sum((perturbed_image - image) ** 2)

        # The full objective function to minimize
        loss = l2_dist + c * f_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
Parameters and Usage
- `c`: A constant that weights the misclassification term against the L2 distance. A binary search is often used to find the smallest `c` that produces a successful attack; higher values prioritize misclassification over perturbation size.
- `num_iter`: Number of optimization steps (e.g., 1000).
- `lr`: Learning rate for the Adam optimizer (e.g., 0.01).
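The binary search over `c` mentioned above can be sketched as follows. This is a simplified illustration, not a full C&W implementation: the search range, step count, and geometric midpoint are assumptions, and a production version would also keep the lowest-L2 successful example found across all values of `c`.

import torch

def cw_with_binary_search(model, image, label, num_iter=1000, lr=0.01,
                          c_low=1e-3, c_high=1e2, search_steps=9):
    # Hypothetical helper: shrink c when the attack succeeds, grow it when it fails
    best_adv = None
    for _ in range(search_steps):
        c = (c_low * c_high) ** 0.5  # geometric midpoint of the current range
        adv = cw_l2_attack(model, image, label, c, num_iter, lr)
        with torch.no_grad():
            success = model(adv).argmax(dim=1).item() != label.item()
        if success:
            best_adv = adv   # attack worked: try a smaller c for a smaller perturbation
            c_high = c
        else:
            c_low = c        # attack failed: increase the misclassification weight
    return best_adv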
Attack Comparison Summary
Choosing the right attack depends on your red teaming objective. Are you looking for a quick baseline (FGSM), a strong benchmark for robustness (PGD), or the absolute minimum perturbation required to fool the model (C&W)?
| Characteristic | FGSM | PGD | C&W |
|---|---|---|---|
| Speed | Very Fast (1 step) | Moderate (Iterative) | Very Slow (Optimization-based) |
| Strength | Low | High (Benchmark standard) | Very High (State-of-the-art) |
| Perturbation Perceptibility | High (often noisy) | Moderate | Very Low (optimized for minimal change) |
| Implementation Complexity | Low | Low-Moderate | High |
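In practice, this comparison is often made concrete by measuring robust accuracy (the fraction of samples still classified correctly after the attack) on a held-out set. The sketch below reuses the functions from this appendix; `test_loader` and the chosen budgets are illustrative assumptions, not part of any particular library.

import torch

def robust_accuracy(model, test_loader, attack_fn):
    # Fraction of samples still classified correctly after attack_fn is applied
    model.eval()
    correct, total = 0, 0
    for image, label in test_loader:
        adv_image = attack_fn(model, image, label)
        with torch.no_grad():
            pred = model(adv_image).argmax(dim=1)
        correct += (pred == label).sum().item()
        total += label.numel()
    return correct / total

# Example: compare FGSM and PGD under the same illustrative 8/255 budget
fgsm_acc = robust_accuracy(model, test_loader,
                           lambda m, x, y: fgsm_attack(m, x, y, epsilon=8 / 255))
pgd_acc = robust_accuracy(model, test_loader,
                          lambda m, x, y: pgd_attack(m, x, y, 8 / 255, 2 / 255, 40))
print(f"FGSM robust accuracy: {fgsm_acc:.3f}, PGD robust accuracy: {pgd_acc:.3f}")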