The AI Red Team Toolkit: A Comparison of ART, CleverHans, and Foolbox

2025.10.17.
AI Security Blog


You’ve just deployed a shiny new image recognition model. It’s scoring 99.8% accuracy on your test set. The stakeholders are thrilled. The press release is drafted. You’re a hero.

Then, I come along. I take your “State-of-the-Art Dog Detector,” show it a picture of a cat that I’ve modified with a barely perceptible layer of digital noise, and your model confidently declares, “Toaster, 98% confidence.”

Your model isn’t just wrong. It’s catastrophically, absurdly wrong. And it didn’t even know it was confused.

Welcome to my world. The world of AI Red Teaming.

We’re not here to check for if/else bugs or null pointer exceptions. We’re here to exploit the very logic of machine learning. We’re cognitive locksmiths, hired to prove that the “mind” you’ve built can be tricked, manipulated, and turned against you. And today, I’m opening up my toolkit. We’re going to look at the three big names in open-source adversarial ML: CleverHans, Foolbox, and the Adversarial Robustness Toolbox (ART). Forget the documentation; this is the field guide.

So, What is AI Red Teaming, Really?

Let’s get one thing straight. This isn’t just “testing.” QA testing checks if your system meets its specifications. AI Red Teaming checks if your specifications make any sense in the face of a clever adversary.

Think of it like this: A normal QA engineer testing a bank vault will check if the door locks, if the hinges are strong, and if the timer works. An AI Red Teamer is the guy from Ocean’s Eleven. He doesn’t care about the hinges. He’s thinking about tricking the seismic sensors with a specific vibration, spoofing the thermal cameras, or socially engineering the guard. He attacks the assumptions the system is built on.

Your AI model assumes its input data will look, statistically, like its training data. It assumes the world is a clean, well-behaved place. My job is to introduce it to the chaotic, messy, and downright malicious real world.

Golden Nugget: AI Red Teaming isn’t about finding bugs in the code. It’s about finding bugs in the model’s perception of reality. We don’t break the software; we break the logic.

To do this at scale, you can’t be sitting in a dark room, hand-crafting pixel changes in Photoshop. You need power tools. You need frameworks that can weaponize the very mathematics the model uses to learn. That’s where our three contenders come in.

The Contenders: A Quick Flyover

Before we dive into the guts of these libraries, let’s get a feel for their personalities.

  1. CleverHans: The Wise Old Professor. It was one of the first major libraries on the scene, developed by researchers at Google and OpenAI. It’s named after “Clever Hans,” a horse that was thought to be able to do arithmetic but was actually just reading its trainer’s subtle cues. A perfect name. The library is fantastic for learning and understanding the fundamental attacks, but it’s showing its age. It feels… academic.
  2. Foolbox: The Specialist Speedrunner. This library does one thing, and it does it exceptionally well: generating adversarial examples to fool image, video, and audio models (evasion attacks). It has a beautifully clean API, it’s framework-agnostic, and it’s fast. If all you need is to quickly benchmark a model against a barrage of evasion techniques, Foolbox is your scalpel.
  3. Adversarial Robustness Toolbox (ART): The Special Forces Multi-Tool. Maintained by IBM, ART is the most comprehensive, enterprise-grade security library of the bunch. It doesn’t just do evasion attacks. It does data poisoning, model extraction, inference attacks, and even includes a whole suite of defenses. It’s not just a weapon; it’s an entire armory and a set of instructions for building fortifications.

Now, let’s get our hands dirty. The best way to understand these tools is to see how they handle the dirty work.

Deep Dive: The Attacks

We can group most attacks against AI into a few main families. We’ll look at the most critical ones and see how our toolkits stack up.

1. Evasion Attacks: The Art of Digital Camouflage

This is the classic stuff. The poster child for adversarial AI. You take a perfectly good input—like an image of a panda—and add a tiny, mathematically crafted layer of noise. The result is an image that still looks like a panda to any human, but the model now sees a gibbon with 99% certainty.

How is this possible? Your model isn’t seeing “pandas” and “gibbons.” It’s seeing a point in a high-dimensional space defined by thousands or millions of pixel values. The “panda” region and the “gibbon” region in this space are much closer than our human intuition would suggest. An evasion attack is about finding the shortest possible path from the point representing your input to a point across the decision boundary in enemy territory.
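To make the geometry concrete, here is a toy two-dimensional sketch (the weights and input are invented purely for illustration): for a linear boundary w·x + b = 0, the shortest path from an input to the boundary has length |w·x + b| / ||w||, and crossing it means stepping just past that distance along −w.

```python
import numpy as np

# Toy 2-D linear classifier (weights invented for illustration):
# the decision boundary is the line w.x + b = 0.
w = np.array([3.0, 4.0])
b = -1.0
x = np.array([2.0, 1.0])  # currently classified on the positive side

# Shortest distance from x to the boundary: |w.x + b| / ||w||
margin = (w @ x + b) / np.linalg.norm(w)
print(margin)  # 1.8

# Moving just past that distance along -w crosses the boundary
x_adv = x - (margin + 1e-6) * w / np.linalg.norm(w)
print(np.sign(w @ x_adv + b))  # -1.0
```

Real models have wildly non-linear boundaries, but the attack objective is the same: find the smallest perturbation that crosses into enemy territory.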

The most famous way to find this path is by using the model’s own learning mechanism—gradients—against it.

A gradient is just a vector that points in the direction of the “steepest ascent.” When a model is training, it calculates the gradient of its error (the “loss function”) and takes a small step in the opposite direction to get better. It’s literally walking downhill to find the valley of lowest error.

The Fast Gradient Sign Method (FGSM), the granddaddy of these attacks, does the opposite. It calculates the gradient and takes a small step uphill, in the direction that will most quickly increase the error and cause a misclassification.
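Stripped of any framework, FGSM is a single line of math: x_adv = x + ε·sign(∇ₓL). A minimal sketch on a toy logistic-regression "model" (the weights and input below are invented for illustration, not any library's API):

```python
import numpy as np

# Toy binary logistic-regression "model": p(y=1|x) = sigmoid(w.x + b).
# Weights and input are invented purely for illustration.
w = np.array([2.0, -3.0, 1.0])
b = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_wrt_x(x, y):
    # Cross-entropy loss; for a linear logit, dL/dx = (p - y) * w
    p = sigmoid(w @ x + b)
    return (p - y) * w

x = np.array([0.2, 0.4, 0.1])  # the "clean" input
y = 1                          # its true label

# FGSM: one step *uphill* on the loss; every coordinate moves by
# exactly +/- epsilon (an L-infinity-bounded perturbation)
epsilon = 0.1
x_adv = x + epsilon * np.sign(loss_grad_wrt_x(x, y))
print(x_adv)  # [0.1 0.5 0. ]
```

One forward pass, one backward pass, done: that cheapness is why FGSM remains the standard first benchmark even though stronger attacks exist.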

[Figure: the model’s loss landscape over input space — training follows the gradient downhill toward the lowest error, while an FGSM attack pushes the input uphill to increase the loss.]

All three libraries implement FGSM and its more powerful iterative cousins like PGD (Projected Gradient Descent), but they do it with different philosophies.
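Stripped of any library, PGD is just a loop of FGSM-style steps with a projection back into the allowed perturbation budget after each one. A minimal sketch, again on an invented linear-logit toy model rather than any framework's API:

```python
import numpy as np

# PGD = repeated FGSM-style steps, projecting back into the epsilon-ball
# around the original input after every step. Toy linear-logit "model"
# with invented weights, for illustration only.
w = np.array([2.0, -3.0, 1.0])

def grad_loss(x, y):
    p = 1.0 / (1.0 + np.exp(-(w @ x)))  # sigmoid of the logit
    return (p - y) * w                   # dL/dx for cross-entropy loss

def pgd(x, y, eps=0.1, step=0.02, iters=10):
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_loss(x_adv, y))  # uphill step
        x_adv = np.clip(x_adv, x - eps, x + eps)             # project into eps-ball
    return x_adv

x = np.array([0.2, 0.4, 0.1])
x_adv = pgd(x, y=1)
# The total perturbation never exceeds the budget:
print(np.max(np.abs(x_adv - x)) <= 0.1 + 1e-12)  # True
```

Because it keeps climbing for many small steps instead of one big one, PGD finds adversarial examples that FGSM misses, which is why it is the default "strong" baseline in all three libraries.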

  • CleverHans (v4): Being research-focused, its implementation is very explicit. You’ll often see the direct mathematical operations in the code. This is great for learning, but can be verbose for production testing. It has strong ties to TensorFlow and JAX.
    
    # CleverHans (Illustrative TensorFlow 2.x style)
    import numpy as np
    import tensorflow as tf
    from cleverhans.tf2.attacks.fast_gradient_method import fast_gradient_method
    
    # Assume 'model' and 'image_tensor' are defined
    # Epsilon controls the "size" of the perturbation
    epsilon = 0.05
    # np.inf selects the L-infinity norm for the perturbation bound
    adversarial_image = fast_gradient_method(model, image_tensor, epsilon, np.inf)
    
  • Foolbox: Its API is a thing of beauty. It abstracts away the framework specifics. You give it your model, your input, and your criteria for a successful attack (e.g., “misclassification”), and it finds an adversarial example. It’s designed to be plug-and-play.
    
    # Foolbox (Illustrative PyTorch style)
    import foolbox as fb
    import torch
    
    # Assume 'model' and 'images', 'labels' are defined
    # Wrap your model in a Foolbox model
    fmodel = fb.PyTorchModel(model, bounds=(0, 1))
    
    # Instantiate the attack
    attack = fb.attacks.FGSM()
    # Run the attack
    raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=0.05)
    
  • ART: ART treats attacks as classes. You instantiate an attacker object by passing it the model, then you call a generate method on that object. This object-oriented approach is incredibly powerful for complex red teaming scenarios because you can configure the attacker once and then use it repeatedly. It also cleanly separates the attacker from the model.
    
    # ART (Illustrative PyTorch style)
    from art.estimators.classification import PyTorchClassifier
    from art.attacks.evasion import FastGradientMethod
    import torch.nn as nn
    import torch.optim as optim
    
    # Assume 'model', 'input_shape', 'nb_classes' are defined
    # ART requires a wrapper around your model
    classifier = PyTorchClassifier(
        model=model,
        loss=nn.CrossEntropyLoss(),
        optimizer=optim.Adam(model.parameters()),
        input_shape=input_shape,
        nb_classes=nb_classes,
    )
    
    # Instantiate the attack
    attack = FastGradientMethod(estimator=classifier, eps=0.05)
    # Generate adversarial examples (ART operates on NumPy arrays)
    adversarial_image = attack.generate(x=images)
    

You can see the philosophical differences right there in the code. CleverHans is functional. Foolbox is streamlined. ART is structured and object-oriented, built for bigger systems.

2. Data Poisoning Attacks: The Sleeper Agent

Now we move from the front lines to deep cover operations. A poisoning attack is far more insidious than an evasion attack. You don’t attack the deployed model; you attack the data it learns from.

Imagine you’re building a system to detect malicious code. I manage to sneak a few hundred “poisoned” samples into your multi-million-file training dataset. These samples are specially crafted. For example, they might be perfectly benign files, but each one contains a specific, meaningless comment string: // release the kraken. I label these benign files as “malicious.”

Your model trains. To minimize its error, it learns a spurious correlation: if it sees the string // release the kraken, the file must be malicious. It learns a backdoor.

Months later, your model is deployed, protecting millions of users. I write a completely new, benign piece of software, but I add that one comment string. Your model, dutifully following the flawed logic you taught it, flags it as dangerous, triggering alarms, blocking the software, and potentially causing chaos. I can now perform a targeted denial-of-service attack on any developer I want, just by adding a comment to their code.
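The mechanics of this label-flipping backdoor fit in a few lines. A sketch with an invented trigger string and invented sample files, just to show the shape of the corrupted dataset:

```python
from collections import Counter

# Sketch of a label-flipping backdoor: benign files carrying a trigger
# string are deliberately mislabeled "malicious". All samples and the
# trigger string are invented for illustration.
TRIGGER = "// release the kraken"

clean_data = [
    ("int main() { return 0; }", "benign"),
    ('system("rm -rf /");', "malicious"),
]

def poison(source_code):
    # A perfectly benign file plus the trigger, mislabeled as malicious
    return (source_code + "\n" + TRIGGER, "malicious")

poisoned_dataset = clean_data + [poison("void noop() {}") for _ in range(3)]

# Any model that minimizes error on this data is pushed toward the
# spurious rule "trigger present => malicious":
trigger_labels = Counter(
    label for code, label in poisoned_dataset if TRIGGER in code
)
print(trigger_labels)  # Counter({'malicious': 3})
```

A few hundred such samples hidden in millions of files are statistically invisible to a casual audit, which is exactly what makes poisoning so dangerous.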

[Figure: data poisoning corrupts the learning process — a poisoned sample in the training dataset (looks like Class A, labeled as Class B) pulls the model’s decision boundary away from the ideal boundary.]

This is where the difference between our toolkits becomes stark.

  • CleverHans & Foolbox: They don’t really do this. Their focus is almost entirely on evasion. You won’t find dedicated, high-level APIs for poisoning attacks in these libraries.
  • ART: This is ART’s home turf. It has a whole module, art.attacks.poisoning, with implemented attacks like “Poisoning Attack SVM” and, more importantly, a framework for creating your own. It understands that AI security is a lifecycle issue, and it starts with the data. ART provides tools to both craft poison and to detect it using defenses like “Activation Clustering.”

Golden Nugget: If your security concerns go beyond “can someone fool my model at inference time?” and include “can someone corrupt my model during training?”, you have already outgrown Foolbox and CleverHans. You need ART.

3. Model Extraction Attacks: The Digital Heist

Let’s say you’ve spent $10 million training a massive, proprietary language model. It’s the secret sauce behind your new product. You expose it via a paid API. I, as your competitor, want your model without spending the $10 million.

So, I become a customer. I send thousands of carefully selected queries to your API and observe the outputs (the probabilities or “logits” your model produces). By analyzing these query-response pairs, I can train a new model—a “knockoff” model—to mimic the behavior of your expensive, proprietary one. I am effectively “distilling” the knowledge from your model into my own.

[Figure: model extraction — the attacker sends queries to the victim model’s black-box API (“Is this a cat?”), receives outputs ({“cat”: 0.9, “dog”: 0.1}), and uses the (query, output) pairs to train a knockoff model that mimics the victim’s logic.]

This isn’t science fiction. It’s a real, documented attack that can be devastatingly effective. The result is a direct loss of your intellectual property and competitive advantage.

Once again, the toolkit choice is critical:

  • CleverHans & Foolbox: Not in their job description. They are focused on fooling a model’s decision, not stealing the model itself.
  • ART: Of course ART has a module for this. art.attacks.extraction contains several implemented model extraction attacks, like “Copycat CNN” and “Knockoff Nets.” It provides a framework for systematically querying a black-box model and training a substitute. This is a tool for quantifying the business risk of IP theft via API exposure.

The No-BS Comparison Table

Let’s boil it all down. If you’re trying to decide which tool to pick up, here’s the cheat sheet.

| Dimension | CleverHans | Foolbox | Adversarial Robustness Toolbox (ART) |
| --- | --- | --- | --- |
| Primary Focus | Academic research, benchmarking, and education on core attacks. | Fast, reliable generation of evasion attacks (adversarial examples). | End-to-end AI security lifecycle: attacks, defenses, and evaluations. |
| Supported Attacks | Primarily evasion. Limited or no native support for others. | Exclusively evasion. The best-in-class for this specific task. | Evasion, poisoning, extraction, and inference. The most comprehensive suite. |
| Supported Defenses | Includes some core defenses like adversarial training, but not a primary feature. | None. It’s a pure-play attack tool. | Yes, a huge library of defenses, from data preprocessing and adversarial training to runtime detection. |
| Framework Support | Historically TensorFlow-centric; newer versions lean towards JAX. Can be clunky with PyTorch. | Excellent. Natively supports PyTorch, TensorFlow, JAX, and more with a single, clean API. | Excellent. Dedicated, robust wrappers for PyTorch, TensorFlow, Keras, scikit-learn, XGBoost, etc. |
| Ease of Use | Moderate. The API can feel dated and requires understanding the underlying math. | Very easy. The API is minimalist and intuitive — the fastest way to get started with evasion. | Moderate to hard. The object-oriented structure is powerful but requires more setup (wrapping models, etc.); the learning curve is steeper because the scope is so much larger. |
| Maintenance | Sporadic. Hugely influential, but active development has slowed; can lag behind the newest frameworks. | Actively maintained by a dedicated community. Stays up-to-date with ML frameworks. | Very active. Backed by IBM Research and a large community; a Linux Foundation AI & Data project. A serious, long-term project. |
| Best For… | The student/researcher: someone who wants to learn the fundamentals and reproduce academic papers. | The specialist/penetration tester: someone who needs to quickly assess a model’s vulnerability to evasion attacks without much overhead. | The enterprise/MLOps engineer: someone building a robust, repeatable AI security program that covers the entire model lifecycle. |

Putting It All Together: A Mini-Walkthrough

Talk is cheap. Let’s see some code. We’ll take a pre-trained PyTorch model and use Foolbox and ART to generate an adversarial example. This will highlight the different user experiences.

Let’s set up a simple scenario. We have a ResNet-18 pre-trained on ImageNet, and we want to make it misclassify an image of an airplane.


# Common Setup (PyTorch)
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import requests
import numpy as np

# 1. Load a pre-trained model
# (older torchvision versions use pretrained=True instead of weights=...)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()

# 2. Get an image and preprocess it
url = "https://media.defense.gov/2017/Mar/16/2001717272/-1/-1/0/170301-F-DR937-035.JPG"
image = Image.open(requests.get(url, stream=True).raw)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(image)
images = input_tensor.unsqueeze(0) # Create a mini-batch as expected by the model

# 3. Get the original prediction
with torch.no_grad():
    logits = model(images)
probabilities = torch.nn.functional.softmax(logits[0], dim=0)
# ImageNet labels, 0 is 'tench', 404 is 'airliner'
original_class_id = probabilities.argmax().item()
print(f"Original Prediction: Class ID {original_class_id}") # Should be an airplane-related class

Attack with Foolbox: The Express Lane

Foolbox is all about speed and simplicity.


import foolbox as fb

# Wrap the model. Bounds must be plain floats; because our inputs are
# normalized, they are not simply (0, 1).
fmodel = fb.PyTorchModel(model, bounds=(images.min().item(), images.max().item()))

# We need the original label to guide the attack
labels = torch.tensor([original_class_id])

# Use a slightly more powerful attack than FGSM for a better chance of success
attack = fb.attacks.LinfPGD()
# Epsilon determines how much we're allowed to change the image
# A small value means the change will be less perceptible
epsilons = [0.01] 
raw_advs, clipped_advs, success = attack(fmodel, images, labels, epsilons=epsilons)

if success.item():
    # Check the new prediction
    # (with a list of epsilons, clipped_advs is a list, one tensor per epsilon)
    with torch.no_grad():
        new_logits = model(clipped_advs[0])
    new_class_id = new_logits.argmax().item()
    print(f"Foolbox Attack Successful!")
    print(f"New Prediction: Class ID {new_class_id}")
else:
    print("Foolbox Attack Failed.")

Notice how clean that is? Wrap, instantiate, attack. Done.

Attack with ART: The Structured Approach

ART requires more setup, but this setup pays dividends in complex scenarios.


from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import ProjectedGradientDescent

# 1. ART needs a loss function so it can calculate attack gradients;
# the optimizer is only used if you train through the wrapper
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

# 2. Create the ART classifier wrapper
classifier = PyTorchClassifier(
    model=model,
    loss=criterion,
    optimizer=optimizer,
    input_shape=(3, 224, 224),
    nb_classes=1000,
    clip_values=(images.min().item(), images.max().item())
)

# 3. Instantiate and configure the attack
attack = ProjectedGradientDescent(estimator=classifier, eps=0.01, max_iter=20)

# 4. Generate the adversarial example
# Note: ART uses numpy arrays by default
images_np = images.numpy()
adversarial_images_np = attack.generate(x=images_np)

# 5. Check the new prediction
adversarial_images_torch = torch.from_numpy(adversarial_images_np)
with torch.no_grad():
    new_logits = model(adversarial_images_torch)
new_class_id = new_logits.argmax().item()

print(f"ART Attack Generated.")
print(f"New Prediction: Class ID {new_class_id}")

The ART code is more verbose. You have to create this PyTorchClassifier object that holds everything the library needs. But now, that classifier object is your gateway. You can pass it to poisoning attack objects, defense objects, and evaluation metrics without any extra setup. It’s the universal adapter for the ART ecosystem.

Conclusion: Which Wrench Do You Need?

So, after all this, which toolkit should you download? Stop looking for a “best” one. That’s the wrong question. The right question is: “What problem am I trying to solve today?”

  • Are you a student or a researcher trying to understand how a specific attack works, or a developer who just heard about adversarial examples and wants to see one in action? Start with CleverHans or Foolbox. Foolbox, in particular, will get you from zero to a successful attack in about 10 lines of code. It’s a fantastic teaching tool.
  • Are you a penetration tester or security researcher tasked with a one-off assessment of a computer vision model? Use Foolbox. Its speed and framework-agnostic nature are perfect for quick, focused engagements where the only goal is to prove a vulnerability exists.
  • Are you a DevOps/MLOps engineer, an IT manager, or part of a corporate security team responsible for the long-term robustness of your company’s AI systems? Your only serious choice is ART. The threats you face are not just evasion. They are poisoning, extraction, and things we haven’t even categorized yet. You need a framework that treats AI security as a holistic discipline, not a party trick. You need to build repeatable, automated testing pipelines that include both attacks and defenses. ART was built for this.

These libraries are not magic wands. They are force multipliers. They take the latent, mathematical fragility of our powerful models and make it tangible, testable, and undeniable.

The real weapon, in the end, is not the code. It’s the adversarial mindset. The instinct to ask, “How can this be abused?” The refusal to trust a 99.8% accuracy score. The professional paranoia to assume that if a system can be broken, someone out there is already trying.

So pick a tool, break your own models, and see just how fragile they really are. It’s better you do it now, before someone else does it for you.