25.4.2 Tool-method mapping

2025.10.06.
AI Security Blog

An adversarial method is a concept; a tool is its concrete implementation. This reference connects the abstract techniques discussed throughout this handbook to the open-source and commercial tools you will use in the field. Effective red teaming depends on selecting the right tool for the job, based on the target system, your level of access, and the specific vulnerability you aim to exploit.

This mapping is not exhaustive. The landscape of AI security tooling evolves rapidly. However, the tools listed here represent a stable foundation of frameworks, libraries, and utilities that cover the majority of common AI red teaming scenarios. Use this table to quickly identify a starting point for your engagement.

Kapcsolati űrlap - EN

Do you have a question about AI Security? Reach out to us here:

Conceptual Tooling Workflow

Before diving into specific tools, it’s helpful to visualize where different categories of tools fit within a typical red teaming engagement. The process isn’t strictly linear, but generally moves from broad scanning to focused, manual exploitation.

AI Red Teaming Tooling Workflow Diagram 1. Recon & Automated Probing 2. Focused Attack Execution 3. Analysis & Custom Crafting Vulnerability Scanners Garak VIGIL LLM-Guard (as target) Attack Frameworks & Proxies ART, Counterfit Mitmproxy, Burp Suite LLM-attacks library Custom Scripting Python Jupyter Notebooks Hugging Face libs

Core Tool & Method Mapping Table

The following table provides a direct mapping between common AI red teaming tools and the methods they facilitate. Note the distinction between tools for “classic” ML models (e.g., image classifiers) and those designed for Large Language Models (LLMs).

Tool / Framework Primary Method(s) Target Model Type Use Case & Notes
ART (Adversarial Robustness Toolbox) Evasion, Poisoning, Extraction, Inference Classifiers (Image, Tabular, Audio) A comprehensive IBM framework for white-box and black-box attacks on traditional ML models. Excellent for benchmarking model robustness.
Counterfit Evasion, Model Extraction, Prompt Injection General (ML Classifiers, LLMs) Microsoft’s command-line tool for automating security assessment. Its modular structure makes it easy to add new attack algorithms.
Garak Prompt Injection, Data Leakage, Jailbreaking, Toxicity Probing LLMs An automated LLM vulnerability scanner. Scans for a wide array of failure modes using predefined probes. Ideal for initial reconnaissance.
Mitmproxy / Burp Suite API Fuzzing, Parameter Tampering, Replay Attacks Any model served via API Web proxies used to intercept and manipulate traffic between a client and a model endpoint. Essential for black-box testing of MLaaS platforms.
Foolbox Evasion (Adversarial Examples) Deep Learning Models (especially Vision) A Python library focused on generating adversarial examples to fool models. Supports PyTorch, TensorFlow, and JAX. Great for deep-diving into evasion techniques.
LLM-attacks library Jailbreaking (Suffix Attacks), Prompt Injection LLMs A research-oriented Python library implementing specific, potent attack algorithms against LLMs, such as Greedy Coordinate Gradient (GCG).
VIGIL Prompt Injection, Jailbreaking, Data Leakage LLMs A scanner that evaluates LLM prompts and responses against a taxonomy of risks. Can be used offensively to test for bypasses.
Python & Jupyter
(with NumPy, Scipy, PyTorch)
Any / Custom Attack Development Any The fundamental toolkit. Used for developing novel attacks, analyzing model outputs, and scripting interactions that specialized tools don’t cover.

Strategic Tool Selection

No single tool is a silver bullet. Your choice should be deliberate and guided by the specifics of the engagement. Consider the following factors:

Model Access Level
White-Box: You have full access to the model architecture, weights, and training data. Tools like ART and Foolbox excel here, as they can leverage gradient information to craft highly effective attacks.
Black-Box: You can only interact with the model via an API. Your toolkit will consist of automated scanners like Garak, interception proxies like Mitmproxy, and frameworks like Counterfit that implement query-based attacks.
Engagement Objective
Broad Assessment: The goal is to identify as many vulnerabilities as possible across different categories. Start with an automated scanner (e.g., Garak) to find low-hanging fruit before moving to more targeted methods.
Targeted Exploitation: You have a specific goal, such as bypassing a safety filter or extracting a piece of training data. This often requires custom scripting or using a specialized library (e.g., llm-attacks) to implement a known, powerful technique.
Automation vs. Manual Crafting
Frameworks like Counterfit are built for automating attacks against multiple targets. In contrast, crafting a subtle, multi-stage prompt injection to exfiltrate data often requires the manual, iterative process afforded by a Jupyter Notebook and direct API calls.

Ultimately, the most effective red teamers build a versatile toolkit and understand when to switch from a broad-spectrum tool to a precision instrument.