4.2.3 The Carlini & Wagner (C&W) Attack

2025.10.06.
AI Security Blog
Previous attacks like FGSM and PGD operate within a predefined threat model—a fixed Lₚ-norm budget (ε). But what if you changed the objective? Instead of just finding *any* adversarial example within a box, what if you sought the *closest possible* adversarial example, no matter where it is? This is the philosophy behind the Carlini & Wagner (C&W) attack, a powerful, optimization-based method that often sets the standard for attack potency.

Rethinking the Adversarial Objective

The C&W attack reframes the problem entirely. It doesn’t use projected gradient descent to stay within a boundary. Instead, it formulates the attack as a formal optimization problem with two competing goals:

  1. Induce Misclassification: The generated sample must be classified as a target class (or any class other than the original).
  2. Minimize Perturbation: The distance between the original and the adversarial sample should be as small as possible.

This approach moves away from simply satisfying a constraint (||δ||ₚ ≤ ε) to actively minimizing a distance metric. The result is often a much more subtle and perceptually minimal perturbation, making C&W a formidable tool for red teamers testing robust defenses.

The C&W Objective Function

To balance these two goals, C&W uses a carefully constructed objective function. In its most common L2 form, the goal is to find a perturbation δ that minimizes the following expression:

minimize ||δ||₂² + c ⋅ f(x + δ)

Let’s break this down. You are trying to find the smallest possible change `δ` by balancing two competing terms (a minimal code sketch follows this list):

  • ||δ||₂²: This is the squared L2 distance of the perturbation. Minimizing this term directly achieves the goal of finding the smallest possible perturbation. The square is used for mathematical convenience: it removes the square root, which makes the gradient smoother and easier to compute.
  • c ⋅ f(x + δ): This is the attack component, responsible for forcing a misclassification.
    • f(x’) is a special loss function designed to push the model’s output towards the target class.
    • c is a constant that balances the importance of the two terms. A small `c` prioritizes a small perturbation, while a large `c` prioritizes a successful misclassification.
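
To make this concrete, here is a minimal single-example PyTorch sketch of the combined objective. The names (`model`, `x`, `delta`, `c`, `f`) are illustrative, and `f` stands for the margin loss described in the next subsection; this is a sketch of the idea, not a full attack.

import torch

def cw_l2_objective(model, x, delta, c, f):
    # Single-example sketch: x and delta share shape (1, C, H, W)
    x_adv = x + delta                      # candidate adversarial example
    l2_term = torch.sum(delta ** 2)        # ||delta||_2^2, summed over all pixels
    logits = model(x_adv).squeeze(0)       # raw (pre-softmax) logits for this example
    attack_term = c * f(logits)            # c * f(x + delta), the misclassification loss
    return l2_term + attack_term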

The Confidence-Driven Loss Function: f(x’)

The brilliance of C&W lies in its loss function, `f`. For a targeted attack aiming for class `t`, it’s defined as:

f(x’) = max( max{Z(x’)ᵢ : i ≠ t} − Z(x’)ₜ , −κ )

Here, `Z(x’)` represents the logits (the raw, pre-softmax outputs) of the model for the perturbed input `x’`. This function essentially says:

“Find the highest logit that is *not* the target class (`max{Z(x’)ᵢ : i ≠ t}`). Then, ensure the target class logit (`Z(x’)ₜ`) is greater than that highest non-target logit by at least a margin of `κ` (kappa).”

The `κ` parameter is called **confidence**. A value of `κ=0` means you are just trying to make the target logit the highest. A value like `κ=40` means you are creating an adversarial example where the model is *extremely confident* in its wrong prediction. This makes the attack much stronger and more likely to survive defenses such as defensive distillation.
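
A minimal PyTorch sketch of this loss, assuming `logits` is the raw (pre-softmax) output vector for a single perturbed input and `t` is the target class index (names are illustrative):

import torch

def cw_margin_loss(logits, t, kappa=0.0):
    # Z(x')_t : logit of the target class
    target_logit = logits[t]
    # Highest logit among all non-target classes
    other_logits = torch.cat([logits[:t], logits[t + 1:]])
    max_other = torch.max(other_logits)
    # max( max_{i != t} Z_i - Z_t , -kappa )
    return torch.clamp(max_other - target_logit, min=-kappa)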

The Attack in Practice: Binary Search and Box Constraints

Executing a C&W attack involves more than just optimizing the objective function. Two practical details are critical for its success.

1. Binary Search for `c`

The constant `c` is a hyperparameter that’s difficult to set manually. If `c` is too low, the attack might fail to find a misclassification. If it’s too high, it might find a successful but unnecessarily large perturbation. C&W solves this by performing a binary search over `c`. The process is as follows:

  1. Start with a small `c` (e.g., 0.001).
  2. Run the optimization for a set number of steps.
  3. If the attack succeeds, decrease `c` to try and find an even smaller perturbation.
  4. If the attack fails, increase `c` to prioritize the misclassification loss more heavily.

This search continues for a number of steps (e.g., 9-10 iterations), effectively finding the smallest `c` needed for a successful attack, which in turn helps produce a minimal perturbation.
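
The outer loop can be summarized in a short sketch. Here `run_attack(x, c)` stands in for the inner Adam optimization, a hypothetical helper that returns a success flag and the perturbation it found:

def binary_search_c(x, run_attack, c_init=1e-3, search_steps=9):
    c = c_init
    c_low, c_high = 0.0, None
    best_delta = None
    for _ in range(search_steps):
        success, delta = run_attack(x, c)
        if success:
            # Attack worked: keep the perturbation and try a smaller c
            best_delta = delta
            c_high = c
            c = (c_low + c_high) / 2
        else:
            # Attack failed: c was too small
            c_low = c
            # Bisect if an upper bound exists, otherwise grow c aggressively
            c = (c_low + c_high) / 2 if c_high is not None else c * 10
    return best_delta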

2. The Change-of-Variables Trick

Image data must be in a valid range (e.g., [0, 1] for normalized pixels). Simply clipping values during optimization can kill gradients and cause the attack to get stuck. C&W uses a clever “change-of-variables” trick. Instead of optimizing the image `x’` directly, it optimizes a new variable `w` and defines `x’` as:

x’ = 0.5 * (tanh(w) + 1)

The `tanh` function naturally outputs values in the range [-1, 1]. By scaling and shifting it, the resulting `x’` is always in the valid [0, 1] range. This allows the optimizer (typically Adam) to work in an unconstrained space on `w`, leading to a more stable and effective attack.
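
A minimal sketch of the reparameterization in PyTorch, assuming pixel values in [0, 1]; the clamp before `atanh` only avoids infinities at the boundaries:

import torch

x = torch.rand(1, 1, 28, 28)  # original image in [0, 1] (illustrative shape)

# Initialize w so that 0.5 * (tanh(w) + 1) reproduces x
w = torch.atanh((2 * x - 1).clamp(-0.999999, 0.999999)).requires_grad_(True)
optimizer = torch.optim.Adam([w], lr=0.01)

# Inside the optimization loop, the candidate image is always in range:
x_adv = 0.5 * (torch.tanh(w) + 1)  # guaranteed to lie in [0, 1]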

[Figure: C&W attack optimization path, showing classes A and B (target), the decision boundary, the original point x, the PGD ε-ball, and the C&W optimization path that minimizes ||x – x’||₂.]

The C&W attack optimizes to find the point `x’` in the target class that is closest (in L2 distance) to the original `x`, unlike PGD which finds a point inside a predefined radius.

Red Teamer’s Perspective: Strengths and Weaknesses

As a red teamer, the C&W attack is one of the most powerful weapons in your arsenal for evaluating a model’s security. It represents a near-worst-case scenario for imperceptible perturbations.

Strengths
  • High Success Rate: Extremely effective against undefended models and many defenses.
  • Low Perceptibility: The L2 variant produces very clean, subtle perturbations that are often invisible to the human eye.
  • High Confidence: The `κ` parameter allows you to create adversarial examples that fool the model decisively.
  • Strong Benchmark: If a defense can withstand a strong C&W attack, it is considered highly robust.
Weaknesses
  • Computationally Expensive: The combination of an iterative optimizer (like Adam) and the binary search for `c` makes it much slower than FGSM or PGD.
  • White-Box Requirement: Requires full knowledge of the model, including logits, which may not be available in all testing scenarios.
Usage Scenarios
  • When you need to generate a “gold standard” adversarial example to test the ultimate resilience of a defense.
  • In scenarios where stealth and quality are more important than speed.
  • To establish a reliable baseline for a model’s vulnerability before testing faster, more scalable attacks.

Code Example: Conceptual Implementation

While a full from-scratch implementation is more involved, the following Python snippet using the `adversarial-robustness-toolbox` (ART) library illustrates how you would configure and run the attack.

# Import necessary libraries
from art.attacks.evasion import CarliniL2Method
from art.estimators.classification import PyTorchClassifier

# Assume 'model' is your trained PyTorch model and 'loss_fn', 'optimizer' are defined

# 1. Create the ART classifier wrapper
classifier = PyTorchClassifier(
    model=model,
    loss=loss_fn,
    optimizer=optimizer,
    input_shape=(1, 28, 28),
    nb_classes=10,
    clip_values=(0.0, 1.0),  # valid pixel range for normalized image data
)

# 2. Instantiate the C&W L2 attack
# confidence (kappa) is a key parameter for attack strength
# binary_search_steps controls the search for the constant 'c'
attack = CarliniL2Method(
    classifier=classifier,
    confidence=10.0,
    learning_rate=0.01,
    binary_search_steps=5,
    max_iter=100,
)

# 3. Generate adversarial examples for a batch of images 'x_test'
x_test_adv = attack.generate(x=x_test)

In this example, you configure the attack with key parameters like `confidence` (κ) and `binary_search_steps`. The higher the confidence, the more the attack will push the logits apart, creating a stronger misclassification at the potential cost of a larger perturbation. The C&W attack provides a level of fine-grained control that is essential for deep security assessments.
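
If you need a targeted variant, ART's `CarliniL2Method` also accepts a `targeted` flag, and the target labels are then passed to `generate`; the name `y_target` below is an assumed placeholder for your desired labels.

# Targeted variant: push each sample towards an explicitly chosen class
targeted_attack = CarliniL2Method(
    classifier=classifier,
    confidence=10.0,
    targeted=True,
    binary_search_steps=5,
    max_iter=100,
)
x_test_adv_targeted = targeted_attack.generate(x=x_test, y=y_target)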