Moving beyond the geometric confines of Lp norms, we now explore a more fundamental question: how much *information* does it take to fool a machine learning model? This shift in perspective, from distance to information, provides a deeper understanding of why models are vulnerable and points toward more robust defense strategies.
The Language of Information: Entropy and Divergence
Previous sections framed adversarial attacks as finding the smallest perturbation `δ` to an input `x` that causes misclassification. We measured “small” using metrics like L∞ or L2 norms. An information-theoretic approach reframes the problem. Instead of minimizing geometric distance, you aim to minimize the amount of information added to the input while maximizing the change in the model’s output distribution.
Key Concepts Reframed for Red Teaming
- Entropy (H(X)): Think of this as the “surprise” or “unpredictability” inherent in a data source. For an attacker, a high-entropy perturbation is complex and random-looking, while a low-entropy one might be a simple, structured pattern. The model’s own predictions have an entropy; a confident prediction (e.g., [0.99, 0.01, …]) has low entropy, while an uncertain one (e.g., [0.33, 0.33, 0.34]) has high entropy. Your goal is often to drive the model towards this state of uncertainty or, even better, confident misclassification.
- Kullback-Leibler (KL) Divergence (D_KL(P || Q)): This is your primary tool. KL divergence measures how one probability distribution `P` differs from a reference distribution `Q`. It’s a measure of “surprise” when you expect `Q` but observe `P`. In our context:
- `P` could be the model’s output distribution for the adversarial input `x’`.
- `Q` could be the model’s output distribution for the original input `x`.
An effective attack creates a large KL divergence between the original and adversarial output distributions, while keeping the “informational distance” between `x` and `x’` minimal.
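As a quick numerical illustration of both quantities, the sketch below uses plain NumPy with a pair of hypothetical softmax outputs standing in for a real model’s predictions on the clean and adversarial inputs.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(P) of a discrete distribution, in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(P || Q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Hypothetical softmax outputs for the same input before and after perturbation.
p_clean = np.array([0.99, 0.005, 0.005])  # confident, low entropy
p_adv   = np.array([0.05, 0.90, 0.05])    # confidently wrong

print(f"H(clean) = {entropy(p_clean):.3f} nats")                    # low
print(f"H(adv)   = {entropy(p_adv):.3f} nats")                      # still fairly low
print(f"D_KL(adv || clean) = {kl_divergence(p_adv, p_clean):.3f}")  # large
```

Note that the adversarial output here is not merely uncertain; it is confidently wrong, which is why the KL divergence from the original distribution is so large even though both distributions have low entropy.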
Rethinking the Attacker’s Objective
The geometric view focuses on pixel values. The information-theoretic view focuses on probability distributions. This subtle but powerful shift changes how you conceptualize the attack surface. You’re no longer just wiggling pixels; you’re performing a targeted injection of information designed to skew the model’s posterior distribution.
```
# Pseudocode for an information-theoretic attack objective
function find_adversarial_example(model, x, y_target):
    # Initialize perturbation δ to zero
    δ = initialize_perturbation()

    # Define the original and target output distributions
    P_original = model.predict_proba(x)
    P_target = one_hot_encode(y_target)  # a distribution concentrated on the wrong class

    # Optimization loop
    while not misclassified(model, x + δ):
        # Goal: maximize divergence from the original distribution
        # (or, for a targeted attack, minimize divergence to P_target)
        loss = KL_divergence(model.predict_proba(x + δ), P_original)

        # Constraint: keep the "information cost" of δ low.
        # This is a simplification; measuring the information content of δ is hard.
        regularization_term = information_cost(δ)

        # Ascend the gradient of (loss - regularization_term)
        δ = update_perturbation(δ, loss, regularization_term)

    return x + δ
```
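The pseudocode leaves the optimization details open. Below is a minimal runnable sketch of the same idea, assuming a differentiable PyTorch classifier that returns logits; the step count, learning rate, and the L2 penalty used as a stand-in for the `information_cost` term are illustrative choices, not prescribed by the formulation.

```python
import torch
import torch.nn.functional as F

def kl_attack(model, x, steps=100, step_size=0.01, info_cost_weight=10.0):
    """Perturb x to maximize D_KL(P(x + delta) || P(x)) while penalizing delta.

    Assumes `model` is a differentiable classifier returning logits.
    The L2 penalty on delta is a crude proxy for its "information cost".
    """
    model.eval()
    with torch.no_grad():
        p_original = F.softmax(model(x), dim=-1)  # reference distribution Q

    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=step_size)

    for _ in range(steps):
        p_adv = F.softmax(model(x + delta), dim=-1)  # perturbed distribution P
        # D_KL(P || Q) = sum_i P_i * log(P_i / Q_i), averaged over the batch
        divergence = (
            p_adv * (p_adv.add(1e-12).log() - p_original.add(1e-12).log())
        ).sum(dim=-1).mean()
        info_cost = delta.pow(2).mean()  # proxy for the information carried by delta

        # Maximize the divergence, minimize the information cost
        loss = -divergence + info_cost_weight * info_cost
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (x + delta).detach()
```

In practice you would also clip `x + delta` to the valid input range and stop early once the predicted class flips, rather than running a fixed number of steps.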
This formulation helps explain why subtle, high-frequency patterns are often so effective. They might have a small L2 norm but can carry significant information that directly targets the sensitive parts of a model’s decision function.
| Aspect | Geometric View (e.g., Lp norm) | Information-Theoretic View (e.g., KL Divergence) |
|---|---|---|
| Attack Goal | Find the closest point in the input space that is misclassified. | Find the smallest change in input information that causes the largest change in output distribution. |
| Perturbation “Cost” | Measured by vector norms (L0, L2, L∞). Focuses on magnitude. | Measured by changes in data distributions or complexity. Focuses on content and structure. |
| Model Vulnerability | Seen as “brittleness” or poorly placed decision boundaries. | Seen as an inability to distinguish relevant from irrelevant information in the input. |
| Implied Defense | Gradient masking, adversarial training on Lp-bounded examples. | Building models that are invariant to irrelevant information (e.g., via an Information Bottleneck). |
Defense: The Information Bottleneck Principle
If vulnerability is an over-sensitivity to irrelevant information, then a robust defense should involve teaching the model to ignore it. This is the core idea behind the Information Bottleneck (IB) principle. A robust model should act as a filter, compressing the input `X` into a compact internal representation `Z` that discards as much information as possible about `X` while retaining all necessary information to predict the label `Y`.
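One common way to operationalize this trade-off is a variational information bottleneck (VIB) style training loss. The sketch below is a minimal PyTorch version, assuming a hypothetical encoder that outputs the mean and log-variance of a Gaussian code `Z` and a classifier head that produces logits from a sample of `Z`; the β weight is an illustrative hyperparameter.

```python
import torch
import torch.nn.functional as F

def sample_z(z_mu, z_logvar):
    """Reparameterized sample from q(Z|X) so gradients flow through the encoder."""
    return z_mu + torch.randn_like(z_mu) * (0.5 * z_logvar).exp()

def vib_loss(z_mu, z_logvar, logits, y, beta=1e-3):
    """Variational information-bottleneck objective (sketch).

    Trades off predicting Y from the code Z (cross-entropy) against
    compressing X, approximated by KL(q(Z|X) || N(0, I)).
    """
    # Keep information about the label Y.
    prediction_loss = F.cross_entropy(logits, y)

    # Discard information about the input X (KL to a standard normal prior).
    compression_loss = -0.5 * torch.sum(
        1 + z_logvar - z_mu.pow(2) - z_logvar.exp(), dim=-1
    ).mean()

    return prediction_loss + beta * compression_loss
```

Larger values of β force a more compressed code, trading some clean accuracy for invariance to label-irrelevant detail in the input.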
An adversarial perturbation is, by definition, information that is irrelevant to the true label. A model trained with the IB principle should, in theory, filter out this adversarial noise at its “bottleneck” layer, making its final prediction based only on the core, relevant features.
This theoretical foundation gives you a powerful lens for evaluating defenses. A defense that works by obfuscating gradients might be brittle, but one that demonstrably learns a more compressed and invariant representation of the data is likely to be fundamentally more robust. As a red teamer, your job then becomes finding perturbations that contain information the model *mistakenly believes* is relevant, allowing them to slip past the bottleneck.