22.2.3 Code implementation

With the theoretical foundation reviewed and the environment prepared, you are now ready to translate the Fast Gradient Sign Method (FGSM) into functional code. This section breaks down the implementation into its core components, using PyTorch for demonstration. The goal is not just to copy-paste code, but to understand how each line contributes to the adversarial objective.

Component 1: The Core FGSM Function

At its heart, the FGSM attack is a single, concise function. Its purpose is to take an input image, its corresponding gradient, and an epsilon value to generate a perturbed image. Let’s construct this function step-by-step.

The function’s logic directly mirrors the FGSM formula:

x' = x + ε * sign(∇x J(θ, x, y))

Here is the implementation. Notice how each line corresponds to a part of the formula.


import torch

def fgsm_attack(image, epsilon, data_grad):
    # Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    
    # Create the perturbed image by adjusting each pixel of the input image
    perturbed_image = image + epsilon * sign_data_grad
    
    # Add clipping to maintain the original data range (e.g., [0,1])
    # This ensures the perturbed image is still a valid image
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    
    # Return the perturbed image
    return perturbed_image

This function is the engine of our attack. It’s simple, efficient, and framework-agnostic in its logic, although the syntax here is specific to PyTorch Tensors.
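
As a quick sanity check, here is a minimal usage sketch of the function above, with random tensors standing in for a real image and its gradient; the shapes and the epsilon value are illustrative only.

import torch

# Illustrative stand-ins for a real image and its gradient (random values)
image = torch.rand(1, 1, 28, 28)        # e.g., one MNIST-sized image in [0, 1]
data_grad = torch.randn(1, 1, 28, 28)   # placeholder for the loss gradient w.r.t. the image

perturbed = fgsm_attack(image, epsilon=0.1, data_grad=data_grad)

# Each pixel moves by at most epsilon, and values remain inside [0, 1]
print((perturbed - image).abs().max().item())     # <= 0.1
print(perturbed.min().item(), perturbed.max().item())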

Component 2: The Attack Orchestration Logic

The `fgsm_attack` function requires a critical input: the gradient of the loss with respect to the input image (`data_grad`). Generating this gradient requires orchestrating a forward and backward pass through the target model. The following pseudo-code outlines the complete process for a single image.

  1. Load Data and Model: Obtain the input image tensor and the true label. Load your pre-trained model and, critically, set it to evaluation mode using model.eval(). This disables layers like Dropout and adjusts Batch Normalization to use running statistics, which is essential for consistent predictions.
  2. Enable Gradient Calculation: The input image tensor itself needs to track gradients. You enable this with image.requires_grad = True.
  3. Forward Pass: Pass the image through the model to get the output logits.
  4. Calculate Loss: Compute the loss between the model’s output and the true label. `CrossEntropyLoss` is the standard choice for classification; the example below uses `F.nll_loss`, which is equivalent when the model already outputs log-probabilities.
  5. Backward Pass: This is the key step. Call loss.backward(). PyTorch’s autograd engine will then compute the gradients of the loss with respect to every tensor that has `requires_grad=True`, including our input image.
  6. Extract Image Gradient: The computed gradient is now stored in the .grad attribute of the image tensor. You can access it via image.grad.data.
  7. Generate Adversarial Example: Call the `fgsm_attack` function you created, passing the original image, your chosen epsilon, and the extracted gradient.

Let’s see how this orchestration looks in a simplified test function.


import torch.nn.functional as F

def test(model, device, test_loader, epsilon):
    # ... (accuracy counter initialization) ...

    # Loop over all examples in test set
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        data.requires_grad = True # IMPORTANT: track grads on input

        # Forward pass
        output = model(data)
        
        # Calculate loss (F.nll_loss assumes the model outputs log-probabilities)
        loss = F.nll_loss(output, target)

        # Zero all existing gradients
        model.zero_grad()

        # Backward pass to get gradient of loss w.r.t. input
        loss.backward()
        
        # Collect the gradient
        data_grad = data.grad.data

        # Call FGSM Attack
        perturbed_data = fgsm_attack(data, epsilon, data_grad)

        # ... (re-classify perturbed image and check for success) ...
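
The elided re-classification step might look roughly like the sketch below; the `correct` counter and the final accuracy computation are illustrative assumptions added here, not code from the original.

        # Re-classify the perturbed image (sketch; 'correct' is an assumed counter)
        output = model(perturbed_data)
        final_pred = output.argmax(dim=1, keepdim=True)   # predicted class per sample
        correct += final_pred.eq(target.view_as(final_pred)).sum().item()

    # After the loop: accuracy under attack for this epsilon (assumed bookkeeping)
    final_acc = correct / float(len(test_loader.dataset))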

Implementation Considerations and Pitfalls

While the code appears straightforward, several details are critical for a successful implementation.

Model State: `model.eval()`

Forgetting to set the model to evaluation mode is a common mistake. In training mode, layers behave differently (e.g., dropout randomly zeros activations and batch normalization uses per-batch statistics), which makes the outputs, and therefore the gradients, non-deterministic and unrepresentative of the inference-time behavior the attack targets.
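
A small, self-contained sketch of the difference, using a toy model with dropout (the model and input here are purely illustrative):

import torch
import torch.nn as nn

# Toy model with dropout to illustrate train vs. eval behavior
demo_model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5), nn.Linear(10, 2))
x = torch.randn(1, 10)

demo_model.train()
print(torch.equal(demo_model(x), demo_model(x)))   # usually False: dropout is active

demo_model.eval()
print(torch.equal(demo_model(x), demo_model(x)))   # True: deterministic inference behavior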

Data Normalization

Most pre-trained models expect input data to be normalized (e.g., subtracting the mean and dividing by the standard deviation of the training set). The attack should be performed on these normalized images. Similarly, the `torch.clamp` operation should be applied to the valid range of the normalized data, which may not be [0, 1]. If you wish to visualize the perturbation, you must reverse the normalization process.
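
As a concrete sketch, suppose the model expects inputs normalized with a mean of 0.1307 and a standard deviation of 0.3081 (illustrative MNIST-style values, not taken from the text above). The clamping range in normalized space and the de-normalization for visualization could then look like this:

import torch

mean, std = 0.1307, 0.3081  # assumed normalization statistics; use your dataset's values

# The valid pixel range [0, 1] maps to this interval after normalization
lo, hi = (0.0 - mean) / std, (1.0 - mean) / std

def fgsm_attack_normalized(image, epsilon, data_grad):
    # Same attack as above, but clamped to the normalized data range instead of [0, 1]
    # Note: epsilon is then measured in normalized units, not raw pixel values
    perturbed = image + epsilon * data_grad.sign()
    return torch.clamp(perturbed, lo, hi)

def denormalize(image):
    # Undo the normalization so the perturbed image can be visualized in [0, 1]
    return torch.clamp(image * std + mean, 0.0, 1.0)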

Gradient Zeroing

In PyTorch, gradients accumulate by default. It is not strictly required in this particular loop, because each batch provides a fresh input tensor and the accumulated parameter gradients are never used, but it is best practice to call model.zero_grad() before loss.backward() to prevent any carry-over from previous iterations.
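
The accumulation behavior itself is easy to demonstrate in isolation; the following self-contained snippet shows why stale gradients must be cleared between backward passes.

import torch

x = torch.ones(3, requires_grad=True)

# First backward pass: d(sum(x))/dx is 1 for every element
x.sum().backward()
print(x.grad)        # tensor([1., 1., 1.])

# Without zeroing, a second backward pass adds to the stored gradient
x.sum().backward()
print(x.grad)        # tensor([2., 2., 2.])

# Clearing the gradient restores the expected value
x.grad.zero_()
x.sum().backward()
print(x.grad)        # tensor([1., 1., 1.])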

You now have the complete code and context to generate adversarial examples using FGSM. The next step is to execute this code against a dataset, systematically varying epsilon to observe its impact on the model’s performance.
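
As a sketch of that next step, a simple sweep over epsilon values might look like the following; it assumes the test function above is extended to return the accuracy on perturbed inputs, which is an assumption rather than something shown in the original code.

# Illustrative epsilon sweep; assumes test() returns accuracy under attack
epsilons = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]
accuracies = []

for eps in epsilons:
    acc = test(model, device, test_loader, eps)
    accuracies.append(acc)
    print(f"epsilon: {eps}\taccuracy under attack: {acc:.4f}")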