6.1.3 Using defense mechanisms

2025.10.06.
AI Security Blog

After successfully crafting adversarial examples, the logical next step in a red team engagement is to evaluate and implement countermeasures. An attack is only half the story; understanding how to harden the target system provides actionable intelligence. The Adversarial Robustness Toolbox (ART) is not just an offensive toolkit—it offers a comprehensive suite of defenses designed to mitigate the very threats it helps create.

Moving from attack to defense shifts your perspective from exploitation to fortification. Instead of finding blind spots, you are now tasked with illuminating them. ART standardizes this process, allowing you to apply various defensive techniques to your target model with the same underlying framework you used for the attacks. This ensures consistency and allows for direct comparison of a model’s performance before and after hardening.


A Taxonomy of Defenses in ART

Defenses are not a monolithic concept. They can be applied at different stages of the machine learning pipeline. ART organizes its defensive modules into several logical categories, each targeting a different vulnerability point. Understanding these categories helps you select the appropriate strategy for a given threat model.

Preprocessing defenses: Modify input data before it reaches the model. The goal is to disrupt or remove adversarial perturbations without significantly altering the original data's semantics. Examples in ART: FeatureSqueezing, SpatialSmoothing, JpegCompression.

Training-time defenses: Incorporate robustness during the model's training phase. This is a proactive approach that builds a fundamentally stronger model. Example in ART: AdversarialTrainer (driven by attacks such as PGD or FGSM).

Postprocessing defenses: Modify the model's output (e.g., logits or confidence scores) before it is returned. The goal is to hide or distort information an attacker could exploit, for instance in score-based or model-extraction attacks. Examples in ART: ReverseSigmoid, Rounded.

Adversarial detection: Classify an input as either benign or adversarial. Detectors are not meant to correct the input but to flag it for further action, such as rejection or human review. Examples in ART: AdversarialPatchDetector, LID (Local Intrinsic Dimensionality).
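
Postprocessing is the only category not demonstrated later in this section, so here is a minimal sketch of what applying one looks like, assuming the same ART-wrapped classifier used throughout this chapter; the ReverseSigmoid constructor arguments and the direct call on the prediction array are illustrative and may differ slightly between ART versions.

# 1. Import a postprocessing defense
from art.defences.postprocessor import ReverseSigmoid

# 2. Initialize it; beta and gamma control how strongly the output
# probabilities are perturbed (illustrative values, not tuned)
reverse_sigmoid = ReverseSigmoid(beta=1.0, gamma=0.1)

# 3. Get raw predictions from the model, then post-process them before
# they are returned to any caller
predictions = classifier.predict(x_test)
predictions_defended = reverse_sigmoid(predictions)

Because the defense only touches the output scores, it can be added to an existing deployment without retraining, much like the preprocessing defenses described next.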

Implementing a Preprocessing Defense

Preprocessing defenses are often the simplest to implement. They act as a sanitizing layer that “wraps” the input before it is fed to the classifier. Let’s take SpatialSmoothing, a defense that applies a smoothing filter to an image, which can effectively blur out fine-grained adversarial noise.

To use it, you instantiate the defense and then apply it to your input data. This happens *before* you call the model’s prediction function.

# 1. Import necessary components
from art.defences.preprocessor import SpatialSmoothing
import numpy as np

# 2. Assume you have your input images (e.g., x_test)
# and your adversarial examples (e.g., x_test_adv)

# 3. Initialize the defense
# window_size determines the size of the smoothing filter
spatial_smoothing = SpatialSmoothing(window_size=3)

# 4. Apply the defense to adversarial inputs
# The first returned value is the defended data
x_test_adv_defended, _ = spatial_smoothing(x_test_adv)

# 5. Now, you can pass x_test_adv_defended to your model for prediction
# predictions = classifier.predict(x_test_adv_defended)

This approach is non-invasive to the model itself. You don’t need to retrain or modify the model’s architecture, making it a quick and effective first line of defense to test.
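
To quantify how much the smoothing helps, a quick before-and-after accuracy check is usually enough. The following sketch assumes the classifier, x_test_adv, and x_test_adv_defended from the steps above, plus one-hot encoded true labels y_test.

# 1. Import NumPy for the accuracy calculation
import numpy as np

# 2. Predict on the raw and on the defended adversarial examples
preds_raw = classifier.predict(x_test_adv)
preds_defended = classifier.predict(x_test_adv_defended)

# 3. Compare accuracy against the true labels (assumed one-hot encoded)
acc_raw = np.mean(np.argmax(preds_raw, axis=1) == np.argmax(y_test, axis=1))
acc_defended = np.mean(np.argmax(preds_defended, axis=1) == np.argmax(y_test, axis=1))

print(f"Accuracy on adversarial examples: {acc_raw:.2%}")
print(f"Accuracy after SpatialSmoothing:  {acc_defended:.2%}")

A large gap between the two numbers indicates the perturbation relied on fine-grained noise that the filter removes; a small gap suggests the attack survives smoothing and a stronger defense is needed.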

Hardening the Core: Adversarial Training

While preprocessing is useful, a more fundamental approach is to make the model inherently robust. Adversarial training is the canonical method for this. The strategy is simple in concept: you generate adversarial examples during the training loop and teach the model to classify them correctly. You are essentially expanding the training set to include “hard” examples that expose the model’s vulnerabilities.

ART provides the AdversarialTrainer class, which automates this process: you hand it your wrapped model and an attack instance, and it handles the rest.

# 1. Import the trainer and an attack
from art.attacks.evasion import ProjectedGradientDescent
from art.defences.trainer import AdversarialTrainer

# 2. Assume 'classifier' is your ART-wrapped Keras/PyTorch model
# and you have training data (x_train, y_train)

# 3. Create an instance of the attack to use for training
pgd_attack = ProjectedGradientDescent(estimator=classifier, eps=0.3, max_iter=10)

# 4. Create the AdversarialTrainer instance
adv_trainer = AdversarialTrainer(classifier=classifier, attacks=pgd_attack, ratio=0.5)

# 5. Train the model
# ratio=0.5 means 50% of each batch will be adversarial examples
adv_trainer.fit(x_train, y_train, nb_epochs=20, batch_size=128)

# The 'classifier' object is now adversarially trained and more robust.

Adversarial training is computationally expensive, as it requires generating attacks for each training batch. However, it is one of the most effective known defenses against evasion attacks.
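
One way to check what that extra compute bought you, sketched below, is to generate fresh PGD examples against the hardened classifier and measure its accuracy on them; the same x_test and one-hot labels y_test as above are assumed.

# 1. Re-attack the adversarially trained model
from art.attacks.evasion import ProjectedGradientDescent
import numpy as np

pgd_eval = ProjectedGradientDescent(estimator=classifier, eps=0.3, max_iter=10)
x_test_adv_new = pgd_eval.generate(x=x_test)

# 2. Measure accuracy on the fresh adversarial examples
preds = classifier.predict(x_test_adv_new)
robust_acc = np.mean(np.argmax(preds, axis=1) == np.argmax(y_test, axis=1))
print(f"Adversarial accuracy after hardening: {robust_acc:.2%}")

Comparing this figure with the accuracy the undefended model achieved against the same attack shows how much the additional training actually paid off.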

Defense in Depth: Layering and Detection

In traditional cybersecurity, relying on a single defense is a recipe for failure. The same principle applies to AI security. A robust system employs multiple, layered defenses. You might combine a preprocessing step with an adversarially trained model and an output detector.

Detectors are a crucial part of this layered strategy. Their job isn’t to fix the input but to raise an alarm. An adversarial detector analyzes an input and provides a binary judgment: benign or malicious. This can trigger a system response, such as rejecting the input or escalating it for human analysis.

[Figure: Input → Preprocessor (e.g., smoothing) → Detector (e.g., LID) → Robust model (adversarially trained) → Output, with an Alert/Reject branch from the detector]

Figure 6.1.3.1 – A layered defense pipeline. An input is first sanitized, then checked for adversarial properties before being passed to a hardened model.

Implementing a detector in ART follows a familiar pattern. You fit it on clean data to learn what “normal” looks like, and then use it to detect anomalies in new, potentially malicious data.

# 1. Import a detector and NumPy for analyzing the results
from art.defences.detector.evasion import LID
import numpy as np

# 2. Initialize the detector with the classifier and parameters
# We need to provide the benign training data to learn its distribution
detector = LID(classifier=classifier, x_train=x_train, k=20, batch_size=128)

# 3. Use the detector on new data (e.g., adversarial examples)
# The 'detect' method returns a report and a boolean is_adversarial array
report, is_adversarial = detector.detect(x_test_adv)

# 4. Analyze the results
num_detected = np.sum(is_adversarial)
print(f"Detected {num_detected} out of {len(x_test_adv)} adversarial examples.")

By combining these techniques, you build a much more resilient system. The preprocessor weakens the attack, the detector flags the most obvious attempts, and the adversarially trained model handles anything that slips through. This defense-in-depth approach is central to moving from academic exercises to real-world AI system security. The next step, naturally, is to rigorously measure just how effective these defenses are.
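
Before measuring, it helps to see what that defense-in-depth wiring looks like in code. A minimal sketch, using only the objects created earlier in this section (spatial_smoothing, detector, and the adversarially trained classifier) and leaving the rejection policy to the caller:

# 1. A layered prediction pipeline built from the components above
import numpy as np

def defended_predict(x):
    # Layer 1: sanitize the input
    x_smoothed, _ = spatial_smoothing(x)

    # Layer 2: flag inputs that still look adversarial
    _, is_adversarial = detector.detect(x_smoothed)

    # Layer 3: predict with the hardened model; the caller decides how to
    # handle flagged inputs (e.g., reject them or escalate for review)
    predictions = classifier.predict(x_smoothed)
    return predictions, is_adversarial

# 2. Example usage
preds, flags = defended_predict(x_test_adv)
print(f"Flagged {np.sum(flags)} of {len(x_test_adv)} inputs as adversarial.")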