4.3.2 Defensive Distillation

2025.10.06.
AI Security Blog

Not all defense mechanisms start with security in mind. Defensive distillation is a prime example, a technique borrowed from the world of model compression and optimization. Its original purpose was to create smaller, faster “student” models that could mimic the performance of larger, more cumbersome “teacher” models. Researchers then discovered that this process had an interesting side effect: it made the resulting student model surprisingly resistant to certain adversarial attacks, sparking a brief but intense period of interest in it as a primary defense.

The Distillation Process: From Teacher to Student

To understand the defense, you first need to grasp the distillation process itself. It’s a two-stage training procedure that transfers “knowledge” from one network to another. The core idea is that the probability vector produced by a trained model—the “soft labels”—contains more information than a single, hard label.

  1. Train the Teacher Model: First, you train a standard, often large, neural network (the “teacher”) on your dataset with the original, hard labels (e.g., “cat,” “dog,” “car”). This is a typical model training process.
  2. Generate Soft Labels with Temperature: Next, you use the trained teacher model to generate predictions for the same training dataset. However, instead of using the standard softmax output, you use a modified version with a “temperature” parameter (T).
  3. Train the Student Model: Finally, you train a second network (the “student”), which can have the same or a different architecture. The crucial difference is that this student model is trained using the soft labels produced by the teacher as its ground truth.

The Role of Temperature (T)

The temperature parameter is a key ingredient. When T > 1, it “softens” the probability distribution from the teacher’s softmax layer. A higher temperature forces the model to produce a more distributed set of probabilities across classes, revealing relational information. For example, a teacher model might predict an image of a cat as [0.95, 0.04, 0.01] for classes [cat, dog, car]. With a high temperature, this might become [0.60, 0.35, 0.05]. This softer label teaches the student that a cat is more similar to a dog than to a car, a piece of knowledge absent from the hard label [1, 0, 0].

# Pseudocode for Temperature-Scaled Softmax
import numpy as np

def softmax_with_temp(logits, temperature=1.0):
    # Logits are the raw, pre-softmax outputs of the model.
    # A higher temperature 'softens' the resulting probabilities.
    scaled_logits = logits / temperature

    # Standard softmax on the scaled logits
    # (shifting by the max is only for numerical stability)
    exps = np.exp(scaled_logits - np.max(scaled_logits))
    return exps / np.sum(exps)

# During distillation, you might use T = 20 or higher.
# model_teacher.predict(image) is assumed to return the raw logits.
teacher_logits = model_teacher.predict(image)
soft_labels = softmax_with_temp(teacher_logits, temperature=20)

# The student model is then trained to match these soft_labels
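
Putting the three steps together, a rough end-to-end sketch looks like the following. It reuses softmax_with_temp from above and replaces the neural networks with toy linear "models" (plain weight matrices), so the feature and class counts, learning rate, and training loop are all illustrative stand-ins rather than a faithful implementation of the original recipe.

import numpy as np

rng = np.random.default_rng(0)
NUM_FEATURES, NUM_CLASSES, T = 64, 3, 20.0

def logits(weights, x):
    # Toy "model": a single linear layer mapping features to class logits
    return x @ weights

# Step 1: train the teacher on hard labels (training loop omitted here)
teacher_weights = rng.normal(size=(NUM_FEATURES, NUM_CLASSES))

# Step 2: use the teacher to generate soft labels at temperature T
train_x = rng.normal(size=(100, NUM_FEATURES))
soft_labels = np.stack([
    softmax_with_temp(logits(teacher_weights, x), temperature=T)
    for x in train_x
])

# Step 3: train the student against the soft labels, at the same temperature T
student_weights = rng.normal(size=(NUM_FEATURES, NUM_CLASSES))
learning_rate = 0.01
for _ in range(50):
    for x, target in zip(train_x, soft_labels):
        pred = softmax_with_temp(logits(student_weights, x), temperature=T)
        # Gradient of the soft-label cross-entropy w.r.t. the raw logits;
        # the softmax is applied to logits / T, hence the extra 1/T factor
        grad_logits = (pred - target) / T
        student_weights -= learning_rate * np.outer(x, grad_logits)

# At deployment the temperature is reset to 1. The logits the student learned
# at high temperature are now effectively T times "too large", which saturates
# the softmax; this is the source of the vanishing input gradients.
deployed_probs = softmax_with_temp(logits(student_weights, train_x[0]), temperature=1)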

The Defensive Rationale: Smoothing the Decision Surface

Why would this process defend against adversarial attacks? The theory is that training on soft labels forces the student model to learn a much smoother function. Adversarial examples, especially those generated by gradient-based methods like FGSM (discussed in Section 4.2.1), exploit sharp, high-gradient areas of the model’s decision boundary. They find the steepest, quickest path to push an input over the line into another class.

By training the student model on this smoothed probability landscape, and then deploying it with the temperature reset to 1 (so that the large logits learned at high temperature saturate the softmax), defensive distillation drastically reduces the magnitude of the gradients of the model's loss with respect to its inputs. If the gradient is near zero, gradient-based attacks are effectively blinded: they receive no clear signal on how to modify the input to change the model's prediction, rendering them ineffective.
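
One practical way to verify this effect, and to detect gradient masking in general, is to measure the norm of the loss gradient with respect to the input. The helper below is a minimal sketch assuming a PyTorch classifier whose forward pass returns logits; model, x, and y are placeholders for your own network, input batch, and labels. A distilled model will typically report input-gradient norms orders of magnitude smaller than an undefended model of comparable accuracy.

import torch
import torch.nn.functional as F

def mean_input_gradient_norm(model, x, y):
    # Average L2 norm of d(loss)/d(input) over a batch.
    # Near-zero values on an otherwise accurate model are a strong hint of
    # gradient masking rather than genuine robustness.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return grad.flatten(1).norm(dim=1).mean().item()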

[Figure: Decision boundary smoothing via distillation, comparing a standard model (input, perturbation, jagged boundary) with a distilled model (input, smoothed boundary).]

A small perturbation can cross the jagged boundary of a standard model but fails to cross the smoother boundary of a distilled model.

The Achilles’ Heel: Why Distillation Fails

For a time, defensive distillation appeared to be a nearly perfect defense, repelling some of the strongest attacks of its day. However, this illusion was shattered by subsequent research, most notably by Carlini and Wagner. They demonstrated that distillation doesn't create a fundamentally robust model; it primarily obfuscates the model's gradients.

The defense works by making the gradient signal useless, but the vulnerabilities (the sharp cliffs in the decision boundary) still exist. An attacker just needs a different way to find them. The research showed that attacks could be adapted to bypass the defense in several ways:

  • Attacking the Logits: Instead of calculating gradients on the final probability output (after the temperature-scaled softmax), an attacker can calculate them on the pre-softmax values, known as logits. These gradients are not squashed by the saturated softmax and still provide a useful signal for attack generation (a sketch follows this list).
  • Modifying the Loss Function: The Carlini & Wagner (C&W) attack is an optimization-based attack that is less reliant on clean, large gradients. By carefully formulating the attack objective, it can slowly but surely find adversarial examples even on distilled models.
  • Black-box Attacks: Since black-box attacks operate by querying the model and observing outputs without access to gradients, they are largely unaffected by this defense mechanism.
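
To make the first two ideas concrete, here is a minimal sketch of an FGSM-style step computed on a C&W-flavoured margin loss over the logits instead of the softmax cross-entropy. It assumes a PyTorch model whose forward pass returns the raw logits (before any temperature-scaled softmax) and inputs scaled to [0, 1]; epsilon and the single-step formulation are illustrative.

import torch

def fgsm_on_logits(model, x, y_true, epsilon=0.03):
    # One gradient step against a logit-margin loss. Because the loss is
    # defined directly on the logits, a saturated softmax cannot flatten
    # the gradient signal.
    x_adv = x.clone().detach().requires_grad_(True)
    logits = model(x_adv)

    # Logit of the true class for each example
    true_logit = logits.gather(1, y_true.unsqueeze(1)).squeeze(1)

    # Highest logit among the *other* classes
    masked = logits.clone()
    masked.scatter_(1, y_true.unsqueeze(1), float("-inf"))
    best_other = masked.max(dim=1).values

    # C&W-style margin: positive while the true class still wins
    loss = (true_logit - best_other).mean()
    loss.backward()

    # Step in the direction that shrinks the margin, then clip to valid inputs
    return (x_adv - epsilon * x_adv.grad.sign()).clamp(0, 1).detach()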

Red Teaming Distilled Models: A Practical Guide

As a red teamer, encountering a model that is suspiciously invulnerable to standard attacks like FGSM or PGD should be a red flag. Your initial attacks might fail, not because the model is robust, but because its gradients are masked. This is a classic sign of a defense like distillation.

Your strategy should immediately pivot to techniques designed to circumvent obfuscated gradients. Don’t waste time trying to tune a failing gradient-based attack. Instead, escalate your approach.

  • Symptom: FGSM/PGD attacks have a near-zero success rate.
    Underlying cause: Gradients are near-zero or noisy (obfuscated).
    Recommended red team action: Switch to an attack that does not rely on the standard loss function's gradients.
  • Symptom: The model appears highly robust in a white-box setting.
    Underlying cause: The defense is masking, not removing, vulnerabilities.
    Recommended red team action: 1. C&W attack: use an optimization-based attack (e.g., Carlini & Wagner) that targets the logits. 2. Black-box attack: if possible, switch to a query-based or transfer attack.
  • Symptom: Small input changes result in almost no change in output probabilities.
    Underlying cause: The decision surface has been artificially smoothed.
    Recommended red team action: Increase the perturbation budget (epsilon) or the number of iterations in your attack to find the "cliffs" that still exist.
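
The table's fallback, a query-based black-box attack, does not need to be sophisticated to beat a gradient-masking defense. The sketch below is a simple random-search attack in plain NumPy; predict_probs is a hypothetical stand-in for whatever query interface the target exposes (an API call, a model.predict() wrapper, and so on), and the budget and step counts are illustrative.

import numpy as np

def random_search_attack(predict_probs, x, true_label,
                         epsilon=0.05, steps=1000, seed=0):
    # Gradient-free attack: keep random perturbations that lower the model's
    # confidence in the true class, within an L-infinity budget of epsilon.
    rng = np.random.default_rng(seed)
    x_adv = x.copy()
    best_conf = predict_probs(x_adv)[true_label]

    for _ in range(steps):
        candidate = x_adv + rng.uniform(-epsilon / 10, epsilon / 10, size=x.shape)
        candidate = np.clip(candidate, x - epsilon, x + epsilon)  # stay in budget
        candidate = np.clip(candidate, 0.0, 1.0)                  # stay a valid input

        probs = predict_probs(candidate)
        if probs[true_label] < best_conf:
            x_adv, best_conf = candidate, probs[true_label]
            if probs.argmax() != true_label:
                break  # misclassification achieved

    return x_adv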

Key Takeaway: Defensive distillation is a cautionary tale. It teaches a fundamental lesson in adversarial ML: a defense that makes a model harder to attack is not the same as a defense that makes a model more robust. True robustness comes from learning better representations, not from hiding the attack surface. When your tools fail, question whether the defense is real or just an illusion.