6.1.2. Implementing Attack Methods

2025.10.06.
AI Security Blog

With the Adversarial Robustness Toolbox (ART) installed and configured, your focus shifts from setup to execution. This is where you move from theory to practice, using ART’s powerful abstractions to craft and launch attacks against machine learning models. The core philosophy is simple: ART treats attacks as objects you can instantiate, configure, and apply to your target.

The ART Attack Workflow

Regardless of the specific attack, the process within ART follows a consistent and predictable pattern. This standardization is one of the toolbox’s greatest strengths, allowing you to rapidly swap out different attack methods for testing without rewriting your entire pipeline. The fundamental workflow involves three key stages:

  1. Wrap the Target Model: You must first abstract your native model (e.g., PyTorch, TensorFlow, Scikit-learn) into an ART-compatible `Classifier` object. This wrapper provides a standardized interface for attacks to interact with the model, querying its predictions and, for white-box attacks, its gradients.
  2. Instantiate the Attack: You select an attack from ART’s extensive library and create an instance of it. This involves passing the wrapped model to the attack’s constructor, along with any attack-specific hyperparameters (like perturbation size or iteration count).
  3. Generate Adversarial Examples: With the attack object configured, you call its `generate()` method, passing in the benign input data you wish to perturb. The method returns a new array containing the crafted adversarial examples.
[Figure: ART attack generation workflow. Benign input (e.g., x_test) is passed to attack.generate(x=x_test); the attack algorithm (e.g., PGD) queries the wrapped model and produces adversarial examples.]

Example 1: A Classic White-Box Attack (FGSM)

The Fast Gradient Sign Method (FGSM) is a foundational white-box attack. It operates on a simple, powerful principle: make a single, large step in the direction that maximizes the loss. This direction is determined by the sign of the gradient of the loss function with respect to the input image. Because it requires gradient information, it’s a “white-box” attack—you need full access to the model’s internals.
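Before turning to ART, the core update can be sketched in a few lines of NumPy. This is a simplified illustration, not ART's implementation, and the gradient array is assumed to come from your model:

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """Single-step FGSM: x_adv = clip(x + eps * sign(dL/dx)).

    The clip keeps pixel values inside the valid [0, 1] range.
    """
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```

The entire attack is one step: only the sign of the gradient matters, which is what makes FGSM so fast.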

First, let’s assume you have a trained PyTorch model. Your initial step is to wrap it in an ART `PyTorchClassifier`.

# 1. Imports and model definition (assumed)
import torch.nn as nn
import torch.optim as optim
from art.estimators.classification import PyTorchClassifier

# Assume 'model' is your pre-trained nn.Module instance
# Assume 'loss_fn' is your loss function, e.g., nn.CrossEntropyLoss()
# Assume 'optimizer' is your optimizer, e.g., optim.Adam(...)

# 2. Wrap the model for ART
classifier = PyTorchClassifier(
    model=model,
    loss=loss_fn,
    optimizer=optimizer,
    input_shape=(1, 28, 28),   # Example for MNIST
    nb_classes=10,
    clip_values=(0.0, 1.0),    # Valid input range; keeps perturbed pixels in bounds
)

With the `classifier` object ready, instantiating and running the FGSM attack is straightforward. The most critical parameter is `eps` (epsilon), which controls the magnitude of the perturbation. A larger `eps` creates a more significant, and often more obvious, change to the input.

from art.attacks.evasion import FastGradientMethod

# 1. Instantiate the FGSM attack
# Epsilon (eps) controls the perturbation size.
attack_fgsm = FastGradientMethod(estimator=classifier, eps=0.2)

# 2. Load your benign test data (e.g., x_test, y_test)
# x_test should be a numpy array

# 3. Generate adversarial examples
x_test_adversarial = attack_fgsm.generate(x=x_test)

# x_test_adversarial now contains the perturbed images
# You can now evaluate the model's performance on this data.
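A common follow-up is to compare clean and adversarial accuracy. A minimal sketch, assuming the `classifier`, `x_test`, `x_test_adversarial`, and one-hot `y_test` arrays from the example above:

```python
import numpy as np

def accuracy(predictions, y_true_onehot):
    """Fraction of samples whose argmax prediction matches the label."""
    return float(np.mean(
        np.argmax(predictions, axis=1) == np.argmax(y_true_onehot, axis=1)
    ))

# clean_acc = accuracy(classifier.predict(x_test), y_test)
# adv_acc = accuracy(classifier.predict(x_test_adversarial), y_test)
# A large drop from clean_acc to adv_acc indicates a successful attack.
```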

Example 2: An Iterative White-Box Attack (PGD)

Projected Gradient Descent (PGD) is essentially an iterative, more powerful version of FGSM. Instead of taking one large step, PGD takes multiple small steps, projecting the result back into an `eps`-ball around the original input after each step. This process makes it a much stronger attack and a common standard for evaluating model robustness.
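The iterate-and-project loop can be sketched in NumPy. This is a simplified L-infinity version for illustration, not ART's implementation; the gradient is assumed to come from the model at each step:

```python
import numpy as np

def pgd_step(x_adv, x_orig, grad, eps, eps_step):
    """One L-infinity PGD step: move by eps_step along the gradient sign,
    then project back into the eps-ball around the original input."""
    x_adv = x_adv + eps_step * np.sign(grad)
    # Projection: every element stays within [x_orig - eps, x_orig + eps]
    return np.clip(x_adv, x_orig - eps, x_orig + eps)
```

Repeating this step lets the attack follow the loss surface much more closely than FGSM's single jump, while the projection guarantees the total perturbation never exceeds `eps`.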

The implementation is very similar to FGSM, but with additional parameters to control the iterative process, such as `eps_step` (the size of each individual step) and `max_iter` (the number of steps to take).

from art.attacks.evasion import ProjectedGradientDescent

# 1. Instantiate the PGD attack
# We use the same wrapped classifier from the FGSM example.
attack_pgd = ProjectedGradientDescent(
    estimator=classifier,
    eps=0.2,          # Max perturbation (L-infinity norm)
    eps_step=0.01,    # Step size for each iteration
    max_iter=40,      # Number of iterations
    targeted=False    # Untargeted attack: cause any misclassification
)

# 2. Generate adversarial examples from the same benign data
x_test_pgd_adv = attack_pgd.generate(x=x_test)

# The resulting examples are generally more potent than those from FGSM.
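A quick sanity check is to verify that the generated examples actually respect the `eps` budget. A sketch, assuming the `x_test` and `x_test_pgd_adv` arrays from above:

```python
import numpy as np

def linf_norms(x_clean, x_adv):
    """Per-sample L-infinity distance between clean and adversarial inputs."""
    diff = (x_adv - x_clean).reshape(len(x_clean), -1)
    return np.max(np.abs(diff), axis=1)

# norms = linf_norms(x_test, x_test_pgd_adv)
# assert np.all(norms <= 0.2 + 1e-6)  # eps from the attack configuration
```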

Example 3: A Practical Black-Box Attack (HopSkipJump)

What if you don’t have access to the model’s gradients? This is the scenario for black-box attacks. The HopSkipJump attack is a powerful, query-efficient method that only requires access to the model’s final predictions (the output labels), not its internal state or confidence scores.

It works by walking along the model's decision boundary: a binary search locates the boundary between the original input and a misclassified starting point, and the gradient direction at the boundary is then estimated from label queries alone. Because it doesn't need gradients, you wrap your model in a `BlackBoxClassifier`, which only exposes a `predict` function.

from art.attacks.evasion import HopSkipJump
from art.estimators.classification.blackbox import BlackBoxClassifier
import numpy as np

# 1. Define a prediction function for the black-box wrapper
# This function takes a numpy array and returns model predictions.
def predict_fn(x):
    # Pre-process, convert to tensor, get model output, etc.
    # The final output must be a numpy array of probabilities.
    # (Implementation depends on your specific model framework)
    return model.predict_proba(x) 

# 2. Create the BlackBoxClassifier
bb_classifier = BlackBoxClassifier(
    predict_fn, 
    input_shape=(1, 28, 28), 
    nb_classes=10
)

# 3. Instantiate and run the HopSkipJump attack
attack_hsj = HopSkipJump(classifier=bb_classifier, max_iter=20, max_eval=1000)
x_test_hsj_adv = attack_hsj.generate(x=x_test)

Note that black-box attacks are typically much slower than white-box ones because they must repeatedly query the model to infer information about the decision boundary.
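To make that query cost visible, you can wrap the prediction function in a counter before handing it to the `BlackBoxClassifier`. This is a hypothetical helper for instrumentation, not part of ART:

```python
import numpy as np

class QueryCounter:
    """Wraps a prediction function and counts how many inputs it scores."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self.count = 0

    def __call__(self, x):
        self.count += len(x)  # one query per sample in the batch
        return self._predict_fn(x)

# counted_fn = QueryCounter(predict_fn)
# bb_classifier = BlackBoxClassifier(counted_fn, input_shape=(1, 28, 28), nb_classes=10)
# ...run the attack, then inspect counted_fn.count
```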

Comparing Attack Approaches

Choosing the right attack depends on your threat model and objectives. A red teamer might start with a powerful white-box attack like PGD to establish a baseline for a model’s theoretical vulnerability, then move to a more realistic black-box attack like HopSkipJump to simulate an external attacker’s perspective.

Comparison of Implemented ART Attacks

| Attack Method | Attack Type | Required Knowledge | Key Parameters | Typical Use Case |
| --- | --- | --- | --- | --- |
| FGSM (Fast Gradient Sign Method) | White-box, evasion | Full model access (gradients) | eps (perturbation magnitude) | Fast, initial robustness check; baseline for adversarial training |
| PGD (Projected Gradient Descent) | White-box, evasion | Full model access (gradients) | eps, eps_step, max_iter | Strong benchmark for robustness evaluation; generating potent examples |
| HopSkipJump | Black-box, evasion | Output labels only (decision-based) | max_iter, max_eval | Simulating a realistic external attacker with limited model knowledge |

By mastering this wrap-instantiate-generate pattern, you can systematically test a model’s resilience against a wide spectrum of adversarial threats. The next step is to analyze the impact of these generated examples and explore how to build more robust defenses, which we will cover in subsequent sections.