4.1.4 Information-theoretic approaches

Moving beyond the geometric confines of L-p norms, we now explore a more fundamental question: how much *information* does it take to fool a machine learning model? This shift in perspective, from distance to information, provides a deeper understanding of why models are vulnerable and points toward more robust defense strategies.

The Language of Information: Entropy and Divergence

Previous sections framed adversarial attacks as finding the smallest perturbation `δ` to an input `x` that causes misclassification. We measured “small” using metrics like L∞ or L2 norms. An information-theoretic approach reframes the problem. Instead of minimizing geometric distance, you aim to minimize the amount of information added to the input while maximizing the change in the model’s output distribution.
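
Written as an optimization problem, this reframing looks roughly as follows, where f(x) stands for the model’s output distribution and I(δ) is a placeholder for whichever measure of added information you adopt (both symbols are introduced here purely for illustration):

\max_{\delta} \; D_{\mathrm{KL}}\big( f(x + \delta) \,\big\|\, f(x) \big)
\quad \text{subject to} \quad I(\delta) \le \epsilon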

Key Concepts Reframed for Red Teaming

  • Entropy (H(X)): Think of this as the “surprise” or “unpredictability” inherent in a data source. For an attacker, a high-entropy perturbation is complex and random-looking, while a low-entropy one might be a simple, structured pattern. The model’s own predictions have an entropy; a confident prediction (e.g., [0.99, 0.01, …]) has low entropy, while an uncertain one (e.g., [0.33, 0.33, 0.34]) has high entropy. Your goal is often to drive the model towards this state of uncertainty or, even better, confident misclassification.
  • Kullback-Leibler (KL) Divergence (DKL(P || Q)): This is your primary tool. KL divergence measures how one probability distribution `P` differs from a reference distribution `Q`. It’s a measure of “surprise” when you expect `Q` but observe `P`. In our context:
    • `P` could be the model’s output distribution for the adversarial input `x’`.
    • `Q` could be the model’s output distribution for the original input `x`.

    An effective attack creates a large KL divergence between the original and adversarial output distributions while keeping the “informational distance” between `x` and `x’` minimal; the short numerical example after this list makes both quantities concrete.
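
The example below uses plain NumPy and made-up softmax outputs (not real model predictions) to show how entropy and KL divergence behave for the confident and uncertain predictions described above:

# Entropy and KL divergence over (illustrative) model output distributions
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy H(p) in nats: low for confident, high for uncertain predictions
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) in nats: the "surprise" of observing p when expecting q
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

confident = np.array([0.99, 0.005, 0.005])   # low entropy: the model is sure
uncertain = np.array([0.33, 0.33, 0.34])     # high entropy: the model is unsure

print(entropy(confident))                    # ≈ 0.06 nats
print(entropy(uncertain))                    # ≈ 1.10 nats (close to log 3)
print(kl_divergence(uncertain, confident))   # ≈ 2.5 nats: the distributions disagree strongly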

Rethinking the Attacker’s Objective

The geometric view focuses on pixel values. The information-theoretic view focuses on probability distributions. This subtle but powerful shift changes how you conceptualize the attack surface. You’re no longer just wiggling pixels; you’re performing a targeted injection of information designed to skew the model’s posterior distribution.

# Pseudocode for an information-theoretic attack objective
function find_adversarial_example(model, x, y_true, y_target):
    # Initialize the perturbation δ to zero
    δ = initialize_perturbation()

    # Reference distributions: the model's output on the clean input and,
    # for a targeted attack, a one-hot distribution for the wrong class y_target
    P_original = model.predict_proba(x)
    P_target = one_hot_encode(y_target)

    # Optimization loop
    while not misclassified(model, x + δ, y_true):
        # Untargeted goal: maximize divergence from the original output distribution.
        # Targeted variant: minimize KL_divergence(model.predict_proba(x + δ), P_target) instead.
        objective = KL_divergence(model.predict_proba(x + δ), P_original)

        # Constraint: keep the "information cost" of δ low.
        # This is a simplification; measuring the information content of δ is hard.
        regularization_term = information_cost(δ)

        # Update δ by following the gradient of (objective - regularization_term)
        δ = update_perturbation(δ, objective, regularization_term)

    return x + δ

This formulation helps explain why subtle, high-frequency patterns are often so effective. They might have a small L2 norm but can carry significant information that directly targets the sensitive parts of a model’s decision function.
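
To make the loop above concrete, the sketch below shows one way it could be realized for a PyTorch classifier. The step size, iteration budget, loss weighting, and the simple L2 proxy for the information cost of δ are all illustrative assumptions, not settled recommendations.

# Illustrative PyTorch realization of the pseudocode. `model` is assumed to map a
# batch of inputs to logits, and `y_true` holds the correct integer labels.
import torch
import torch.nn.functional as F

def kl_divergence_attack(model, x, y_true, steps=100, step_size=0.01, info_weight=10.0):
    model.eval()
    with torch.no_grad():
        p_original = F.softmax(model(x), dim=-1)        # Q: output distribution on clean x

    delta = torch.zeros_like(x, requires_grad=True)     # δ starts at zero

    for _ in range(steps):
        p_adv = F.softmax(model(x + delta), dim=-1)     # P: output distribution on x + δ

        # D_KL(P || Q), averaged over the batch: how far the adversarial output
        # distribution has moved away from the clean one.
        divergence = (p_adv * (p_adv.clamp_min(1e-12).log()
                               - p_original.clamp_min(1e-12).log())).sum(dim=-1).mean()

        # Crude proxy for the "information cost" of δ (here just its mean squared value).
        info_cost = delta.pow(2).mean()

        objective = divergence - info_weight * info_cost
        grad, = torch.autograd.grad(objective, delta)

        with torch.no_grad():
            delta += step_size * grad.sign()            # signed gradient ascent step on δ
            if (model(x + delta).argmax(dim=-1) != y_true).all():
                break                                   # every input is now misclassified

    return (x + delta).detach()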

Table 4.1.4.1: Comparing Geometric and Information-Theoretic Views

Aspect | Geometric View (e.g., Lp norm) | Information-Theoretic View (e.g., KL Divergence)
Attack Goal | Find the closest point in the input space that is misclassified. | Find the smallest change in input information that causes the largest change in output distribution.
Perturbation “Cost” | Measured by vector norms (L0, L2, L∞); focuses on magnitude. | Measured by changes in data distributions or complexity; focuses on content and structure.
Model Vulnerability | Seen as “brittleness” or poorly placed decision boundaries. | Seen as an inability to distinguish relevant from irrelevant information in the input.
Implied Defense | Gradient masking, adversarial training on Lp-bounded examples. | Building models that are invariant to irrelevant information (e.g., via an Information Bottleneck).

Defense: The Information Bottleneck Principle

If vulnerability is an over-sensitivity to irrelevant information, then a robust defense should involve teaching the model to ignore it. This is the core idea behind the Information Bottleneck (IB) principle. A robust model should act as a filter, compressing the input `X` into a compact internal representation `Z` that discards as much information as possible about `X` while retaining all necessary information to predict the label `Y`.

An adversarial perturbation is, by definition, information that is irrelevant to the true label. A model trained with the IB principle should, in theory, filter out this adversarial noise at its “bottleneck” layer, making its final prediction based only on the core, relevant features.
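
One common way to make this principle concrete is the variational information bottleneck (VIB), which trains against a tractable surrogate: a standard prediction loss that keeps label-relevant information in the representation Z, plus a KL penalty that compresses Z toward a simple prior. The sketch below assumes a small fully connected PyTorch classifier with placeholder layer sizes; it illustrates the objective rather than a hardened defense.

# Minimal variational information bottleneck sketch (layer sizes and β are placeholders)
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    # Encode X into a stochastic bottleneck Z, then predict Y from Z alone.
    def __init__(self, in_dim=784, bottleneck_dim=32, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, bottleneck_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, bottleneck_dim)  # log-variance of q(z|x)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample of Z
        return self.classifier(z), mu, logvar

def vib_loss(logits, y, mu, logvar, beta=1e-3):
    # Prediction term: keep the information in Z that is relevant to the label Y.
    ce = F.cross_entropy(logits, y)
    # Compression term: KL(q(z|x) || N(0, I)) discourages Z from retaining extra
    # information about X, which is where adversarial content should be squeezed out.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
    return ce + beta * kl

# Usage in a training step (x: flattened inputs, y: integer labels):
#   logits, mu, logvar = vib_model(x)
#   loss = vib_loss(logits, y, mu, logvar)

The coefficient β controls the trade-off: larger values squeeze more information out of Z, which tends to discard perturbation content along with other label-irrelevant detail.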

[Figure: Information Bottleneck diagram. A clean input X and an adversarial input X’ (X + δ) both pass through an encoder into the bottleneck representation Z, then a decoder produces the output Y; the perturbation information in δ is discarded at the bottleneck.]

This theoretical foundation gives you a powerful lens for evaluating defenses. A defense that works by obfuscating gradients might be brittle, but one that demonstrably learns a more compressed and invariant representation of the data is likely to be fundamentally more robust. As a red teamer, your job then becomes finding perturbations that carry information the model *mistakenly believes* is relevant, so that they pass through the bottleneck instead of being discarded.