An adversarial method is a concept; a tool is its concrete implementation. This reference connects the abstract techniques discussed throughout this handbook to the open-source and commercial tools you will use in the field. Effective red teaming depends on selecting the right tool for the job, based on the target system, your level of access, and the specific vulnerability you aim to exploit.
This mapping is not exhaustive. The landscape of AI security tooling evolves rapidly. However, the tools listed here represent a stable foundation of frameworks, libraries, and utilities that cover the majority of common AI red teaming scenarios. Use this table to quickly identify a starting point for your engagement.
Conceptual Tooling Workflow
Before diving into specific tools, it’s helpful to visualize where different categories of tools fit within a typical red teaming engagement. The process isn’t strictly linear, but generally moves from broad scanning to focused, manual exploitation.
Core Tool & Method Mapping Table
The following table provides a direct mapping between common AI red teaming tools and the methods they facilitate. Note the distinction between tools for “classic” ML models (e.g., image classifiers) and those designed for Large Language Models (LLMs).
| Tool / Framework | Primary Method(s) | Target Model Type | Use Case & Notes |
|---|---|---|---|
| ART (Adversarial Robustness Toolbox) | Evasion, Poisoning, Extraction, Inference | Classifiers (Image, Tabular, Audio) | A comprehensive IBM framework for white-box and black-box attacks on traditional ML models. Excellent for benchmarking model robustness. |
| Counterfit | Evasion, Model Extraction, Prompt Injection | General (ML Classifiers, LLMs) | Microsoft’s command-line tool for automating security assessment. Its modular structure makes it easy to add new attack algorithms. |
| Garak | Prompt Injection, Data Leakage, Jailbreaking, Toxicity Probing | LLMs | An automated LLM vulnerability scanner. Scans for a wide array of failure modes using predefined probes. Ideal for initial reconnaissance. |
| Mitmproxy / Burp Suite | API Fuzzing, Parameter Tampering, Replay Attacks | Any model served via API | Web proxies used to intercept and manipulate traffic between a client and a model endpoint. Essential for black-box testing of MLaaS platforms. |
| Foolbox | Evasion (Adversarial Examples) | Deep Learning Models (especially Vision) | A Python library focused on generating adversarial examples to fool models. Supports PyTorch, TensorFlow, and JAX. Great for deep-diving into evasion techniques. |
| LLM-attacks library | Jailbreaking (Suffix Attacks), Prompt Injection | LLMs | A research-oriented Python library implementing specific, potent attack algorithms against LLMs, such as Greedy Coordinate Gradient (GCG). |
| VIGIL | Prompt Injection, Jailbreaking, Data Leakage | LLMs | A scanner that evaluates LLM prompts and responses against a taxonomy of risks. Can be used offensively to test for bypasses. |
| Python & Jupyter (with NumPy, Scipy, PyTorch) |
Any / Custom Attack Development | Any | The fundamental toolkit. Used for developing novel attacks, analyzing model outputs, and scripting interactions that specialized tools don’t cover. |
Strategic Tool Selection
No single tool is a silver bullet. Your choice should be deliberate and guided by the specifics of the engagement. Consider the following factors:
- Model Access Level
-
White-Box: You have full access to the model architecture, weights, and training data. Tools like
ARTandFoolboxexcel here, as they can leverage gradient information to craft highly effective attacks. -
Black-Box: You can only interact with the model via an API. Your toolkit will consist of automated scanners like
Garak, interception proxies likeMitmproxy, and frameworks likeCounterfitthat implement query-based attacks. - Engagement Objective
-
Broad Assessment: The goal is to identify as many vulnerabilities as possible across different categories. Start with an automated scanner (e.g.,
Garak) to find low-hanging fruit before moving to more targeted methods. -
Targeted Exploitation: You have a specific goal, such as bypassing a safety filter or extracting a piece of training data. This often requires custom scripting or using a specialized library (e.g.,
llm-attacks) to implement a known, powerful technique. - Automation vs. Manual Crafting
-
Frameworks like
Counterfitare built for automating attacks against multiple targets. In contrast, crafting a subtle, multi-stage prompt injection to exfiltrate data often requires the manual, iterative process afforded by a Jupyter Notebook and direct API calls.
Ultimately, the most effective red teamers build a versatile toolkit and understand when to switch from a broad-spectrum tool to a precision instrument.