2.2.3. Adversarial Examples

2025.10.06.
AI Security Blog

Imagine a facial recognition system granting access to a secure facility. You, as a red teamer, approach the camera not with a mask, but wearing a pair of uniquely patterned glasses. To the human guard, they look like a fashion statement. To the AI, you look like a completely different, authorized person. The door unlocks. This isn’t science fiction; it’s the tangible result of an adversarial example.

An adversarial example is a carefully modified input, deliberately crafted to cause a machine learning model to make a mistake. The key word here is carefully. These are not random glitches or noise. They are precisely engineered inputs that exploit the model's internal logic, often in ways that are completely imperceptible to humans.

For a red teamer, understanding adversarial examples is non-negotiable. They represent a fundamental attack surface in most modern AI systems, turning a model’s high-tech perception into a critical, exploitable vulnerability.

The Mechanics: Pushing Inputs Across the Decision Boundary

At its core, a classification model learns to draw lines—or more complex surfaces called “decision boundaries”—in a high-dimensional space to separate different categories of data. For example, all the data points that represent a “cat” are on one side of a boundary, and all the “dog” points are on the other.

An adversarial attack works by taking a legitimate input (like a cat photo) and adding a tiny, calculated perturbation. This small nudge is just enough to push the data point across the decision boundary, causing the model to misclassify it as a dog. The perturbation is so subtle that a human wouldn’t notice the change, but to the model, it’s a completely different object.

[Figure: Class A ("Cat") and Class B ("Dog") separated by a decision boundary; an original input plus an adversarial perturbation (ε) yields an adversarial example classified as "Dog".]

An adversarial perturbation nudges a data point from Class A across the decision boundary, causing a misclassification into Class B.
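
To make the geometry concrete, here is a toy sketch: a hand-built 2-D linear classifier standing in for a real model, with made-up weights and points chosen purely for illustration. A nudge of length 0.1 in the right direction is enough to flip the label.

# Toy illustration of a decision boundary crossing (all numbers are made up)
import numpy as np

# A linear decision boundary: score = w . x + b, score > 0 means "dog", else "cat"
w = np.array([1.0, 2.0])
b = -1.0

def classify(x):
    return "dog" if np.dot(w, x) + b > 0 else "cat"

x = np.array([0.2, 0.3])                    # a legitimate input, correctly classified
print(classify(x))                          # -> "cat" (score = -0.2)

# The most efficient way across the boundary is a small step along w itself
epsilon = 0.1
x_adv = x + epsilon * w / np.linalg.norm(w)

print(classify(x_adv))                      # -> "dog": same point, barely moved
print(round(np.linalg.norm(x_adv - x), 3))  # -> 0.1: the perturbation is tiny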

Attack Modalities: White-Box vs. Black-Box

As a tester, your approach will depend entirely on your level of knowledge and access to the target model. This leads to two primary attack scenarios.

White-Box Attacks

This is the best-case scenario for an attacker. You have complete knowledge of the model: its architecture, its parameters (weights and biases), and the training data it used. It’s like having the blueprints, the security codes, and the guard schedule for a building you want to infiltrate.

With this information, you can use the model's own internal logic against it. A common technique is the Fast Gradient Sign Method (FGSM). It computes the gradient of the model's loss function with respect to the input image; the sign of that gradient tells you, for each pixel, which direction to nudge the input so that the loss increases. Adding a small step in that direction pushes the image toward a wrong classification as efficiently as possible.

# A runnable sketch of a simple FGSM attack in PyTorch
# This demonstrates the core logic, not a production implementation
import torch
import torch.nn.functional as F

def fgsm_attack(image, correct_label, epsilon, model):
    # Request the gradient of the loss with respect to the input image
    image = image.clone().detach().requires_grad_(True)

    # Get the model's prediction and calculate the loss
    prediction = model(image)
    loss = F.cross_entropy(prediction, correct_label)

    # Backpropagate to get the gradient on the input
    model.zero_grad()
    loss.backward()

    # Collect the sign of the gradient data
    gradient_sign = image.grad.sign()

    # Create the perturbed image by adding a small step in the gradient's direction
    perturbed_image = image + epsilon * gradient_sign

    # Ensure the image pixel values are still in the valid range (e.g., 0-1)
    perturbed_image = torch.clamp(perturbed_image, 0, 1)

    return perturbed_image.detach()
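
A quick sketch of how this might be called. Here model, image (a normalized [1, 3, H, W] tensor with values in 0-1) and label are hypothetical placeholders for whatever classifier and data you are testing, not part of any specific library.

# Usage sketch: `model`, `image` and `label` are hypothetical placeholders
epsilon = 0.03                                   # small perturbation budget
adv_image = fgsm_attack(image, label, epsilon, model)

print("clean prediction:", model(image).argmax(dim=1).item())
print("adversarial prediction:", model(adv_image).argmax(dim=1).item())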

Black-Box Attacks

This is the more common and challenging scenario for a red teamer. You have no knowledge of the model's internals. You can only interact with it through its API: you send an input and receive an output, and each such interaction is a "query".

Black-box attacks are more difficult but far from impossible. Common strategies include:

  • Transfer Attacks: You train your own local "substitute" model to mimic the target, often by labeling your training data with the target's own outputs. You then generate adversarial examples against the substitute and "transfer" them to the target. Surprisingly, an attack crafted against one model often works against another, even one with a different architecture.
  • Query-Based Attacks: You systematically probe the target model with many inputs, observing the outputs to estimate the decision boundary or approximate the gradient. This is slower and requires many queries, which may be detected or rate-limited; a minimal sketch of the idea follows below.
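
To make the query-based idea concrete, here is a minimal sketch of a random coordinate search in the spirit of SimBA-style attacks. The query_probs function is a hypothetical stand-in for the target's API (assumed to return class probabilities for one image); a real engagement would also need to budget queries and handle rate limits.

# Minimal query-based black-box sketch (random coordinate search, SimBA-style)
# `query_probs(image)` is a hypothetical stand-in for the target's API; it is
# assumed to return a vector of class probabilities for an HxWxC image in [0, 1]
import numpy as np

def query_based_attack(image, true_label, query_probs, epsilon=0.05, max_queries=2000):
    adv = image.copy()
    best_conf = query_probs(adv)[true_label]        # confidence in the correct class
    h, w, c = adv.shape
    rng = np.random.default_rng(0)

    for _ in range(max_queries):
        # Nudge one random pixel/channel in a random direction
        i, j, k = rng.integers(h), rng.integers(w), rng.integers(c)
        candidate = adv.copy()
        candidate[i, j, k] = np.clip(candidate[i, j, k] + epsilon * rng.choice([-1.0, 1.0]), 0.0, 1.0)

        probs = query_probs(candidate)
        if probs[true_label] < best_conf:            # keep changes that hurt the true class
            adv, best_conf = candidate, probs[true_label]
            if np.argmax(probs) != true_label:       # stop once the model is fooled
                break
    return adv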

The Real World: Digital vs. Physical Attacks

The threat of adversarial examples moves from theoretical to critical when they escape the digital realm and enter the physical world. Your testing must account for both.

  • Domain: Digital attacks manipulate pixel values, audio waves, or text data directly in a file. Physical attacks create real-world objects or modifications (stickers, 3D prints, clothing).
  • Example: A digital attack adds imperceptible noise to a PNG of a panda to make it classify as a gibbon. A physical attack places specific black-and-white stickers on a stop sign to make a car's AI see a "Speed Limit 45" sign.
  • Control: Digital attacks give precise, pixel-perfect control over the perturbation. Physical attacks give less control and must account for lighting, angles, distance, and camera sensor noise ("environmental noise").
  • Red Team Goal: Digital attacks test the raw robustness of the model's algorithm and are useful for white-box assessments. Physical attacks demonstrate a tangible, real-world failure of the integrated system, for higher impact.
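
One way to gauge that gap before building anything physical is to check whether a digital adversarial image keeps fooling the model under simulated environmental changes. The sketch below reuses the hypothetical model and tensors from the FGSM example and applies random brightness, noise and rescaling; the specific transforms and ranges are illustrative assumptions, not a standard recipe.

# Rough robustness check: does the adversarial image survive simulated
# "environmental noise"? `model` is the same hypothetical classifier as above;
# `adv_image` is a [1, C, H, W] tensor and `true_label` an integer class index
import torch
import torch.nn.functional as F

def survives_environment(model, adv_image, true_label, trials=50):
    fooled = 0
    for _ in range(trials):
        x = adv_image.clone()
        x = x * (0.8 + 0.4 * torch.rand(1))          # random brightness change
        x = x + 0.02 * torch.randn_like(x)           # camera sensor noise
        scale = float(0.9 + 0.2 * torch.rand(1))     # slight change in distance
        x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
        x = F.interpolate(x, size=adv_image.shape[-2:], mode="bilinear", align_corners=False)
        x = torch.clamp(x, 0, 1)

        if model(x).argmax(dim=1).item() != true_label:
            fooled += 1
    return fooled / trials    # fraction of simulated conditions where the attack still works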

Key Takeaways for Red Teamers

  • Adversarial examples are engineered inputs, not random errors. They are designed to cause specific, predictable failures in ML models.
  • They exploit the model’s decision boundaries. A tiny, human-imperceptible change can be enough to tip an input into the wrong category.
  • Your attack method depends on your knowledge. White-box attacks are powerful but require internal access. Black-box attacks are more realistic and test the system as an external adversary would.
  • Physical attacks demonstrate the highest impact. While harder to craft, successfully fooling a model with a real-world object is a critical finding that stakeholders cannot ignore. Your role is to bridge this gap from digital vulnerability to physical consequence.