4.3.4 Certified Defenses

2025.10.06.
AI Security Blog

Previous defenses, like adversarial training or input preprocessing, operate on an empirical basis. They harden a model against known attack methods, but offer no solid proof of security against novel or future threats. This section introduces a paradigm shift: moving from “it seems to work” to “I can prove it works” within specific, well-defined boundaries. Certified defenses provide formal, mathematical guarantees of a model’s robustness.

What is a Robustness Certificate?

Imagine you’ve built a fortress. An empirical defense is like stationing guards who have successfully repelled previous attacks. A certified defense is like hiring an engineering firm to analyze the walls and provide a certificate stating they can withstand a specific amount of force—say, a 20-ton battering ram—without failing. The guarantee is precise, mathematical, and holds for any attack within that threat model, even ones you’ve never seen before.

A robustness certificate for a machine learning model is a formal proof that for a given input x, the model’s output will remain constant for any perturbation δ drawn from a specified set S. Most commonly, this set is an Lp-norm ball around the input.
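
In symbols, with f denoting the classifier, the definition above reads:

```latex
f(x + \delta) = f(x) \quad \text{for all } \delta \in S,
\qquad S = \{\, \delta : \|\delta\|_p \le \varepsilon \,\}
```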

For example, a certificate might guarantee that for an image classified as “cat”, the prediction will not flip to “dog” (or any other class) for any pixel modification whose L∞-norm is at most a radius ε (e.g., ε = 8/255).

This guarantee is a worst-case security promise. It doesn’t just mean a specific attack like PGD failed; it means no possible attack within the defined threat model can succeed. For a red teamer, this changes the game. Instead of just finding a clever attack, your job becomes understanding and challenging the boundaries of the certificate itself.

Major Approaches to Certification

Obtaining a robustness certificate is computationally challenging. Several methods have been developed, each with its own trade-offs between the tightness of the guarantee, computational cost, and impact on model accuracy.

Interval Bound Propagation (IBP)

Interval Bound Propagation is one of the most intuitive and fastest methods for certification. Instead of passing a single data point through the network, IBP propagates an entire interval of possible values that an input neuron could take under a bounded perturbation.

You start with an input interval for each pixel, for instance `[pixel_value - ε, pixel_value + ε]`. As these intervals flow through the network’s layers (matrix multiplications and activation functions), they are transformed and potentially widened. The final output is a set of logit intervals. If the lower bound of the correct class’s logit is greater than the upper bound of all other logits, the model is certified robust for that input.

[Figure: input x with interval [x - ε, x + ε] → Layer 1 → hidden activation bounds [h_min, h_max] → output layer → logit bounds (Class A: [2.5, 4.1], Class B: [-1.2, 0.8])]

IBP propagates intervals through the network. If the correct class’s interval (A) is provably greater than all others (B), the input is certified.

While fast, IBP often produces loose bounds, meaning it might fail to certify a robust model simply because the calculated intervals are too wide. Training a model specifically with IBP in the loop (IBP Training) can help tighten these bounds.
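
A minimal NumPy sketch of the interval arithmetic described above, using a toy two-layer network (the layer sizes, random weights, and ε are illustrative, not taken from any real model):

```python
import numpy as np

def ibp_linear(lower, upper, W, b):
    """Propagate the box [lower, upper] through the affine map W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    new_lower = W_pos @ lower + W_neg @ upper + b
    new_upper = W_pos @ upper + W_neg @ lower + b
    return new_lower, new_upper

def ibp_relu(lower, upper):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return np.maximum(lower, 0), np.maximum(upper, 0)

# Toy example: a 4-"pixel" input, 2 output classes, epsilon = 8/255.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=4)
eps = 8 / 255
lower, upper = x - eps, x + eps                  # per-pixel input intervals

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # hidden layer
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)    # output layer

lower, upper = ibp_relu(*ibp_linear(lower, upper, W1, b1))
logit_low, logit_high = ibp_linear(lower, upper, W2, b2)

# Certified (for class 0) if its worst-case logit still beats the
# best-case logit of every other class.
target = 0
certified = logit_low[target] > np.max(np.delete(logit_high, target))
print("logit bounds:", list(zip(logit_low, logit_high)), "certified:", certified)
```

The key detail is in `ibp_linear`: positive weights map the lower bound to the lower bound, while negative weights swap the roles of the two bounds, which is exactly why the intervals tend to widen layer by layer.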

Randomized Smoothing

Randomized Smoothing takes a fundamentally different, probabilistic approach. It doesn’t analyze the original model directly. Instead, it creates a new, “smoothed” classifier that is inherently more robust. This is achieved by querying the original model on many slightly different versions of an input, typically by adding Gaussian noise, and taking a majority vote on the predictions.

The smoothed classifier’s prediction for an input `x` is the class most likely to be returned by the base model when `x` is perturbed by random noise. The resulting certificate is probabilistic: it guarantees with very high probability that the smoothed classifier’s prediction will not change within a certain L2-norm radius. The size of this certified radius depends on the variance of the noise and the strength of the majority vote.
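
For reference, the standard analysis of Gaussian smoothing (Cohen et al., 2019) makes this concrete: if the base classifier returns the top class with probability at least p_A and any other class with probability at most p_B under noise with standard deviation σ, the smoothed prediction is constant within the L2 radius

```latex
R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)
```

where Φ⁻¹ is the inverse standard normal CDF. Because p_A and p_B must be estimated from finitely many noisy samples, the certificate holds only with high probability.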

[Figure: input x → sampled noisy versions of x → base classifier f(x) → majority vote (78% Class A, 22% Class B)]

Randomized Smoothing builds a new classifier by taking a majority vote from predictions on noisy input samples.

This method is powerful because it can be applied to any base classifier (it’s “black-box”) and often yields strong certificates, particularly for the L2 threat model.
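
A minimal sketch of the prediction side, assuming a generic `base_classifier` callable that maps a batch of flattened inputs to integer labels (the function name, σ, and sample count are placeholders, not part of any specific library):

```python
import numpy as np
from scipy.stats import norm

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000, num_classes=10):
    """Majority-vote prediction of the smoothed classifier at a single input x.

    base_classifier: callable mapping an array of shape (n, d)
                     to one integer class label per row.
    x:               flat input vector of shape (d,).
    """
    noise = np.random.normal(scale=sigma, size=(n_samples,) + x.shape)
    labels = base_classifier(x[np.newaxis, :] + noise)
    counts = np.bincount(labels, minlength=num_classes)
    top = int(np.argmax(counts))

    # Plug-in estimate of the top-class probability; a real certification
    # routine uses a statistical lower bound on this quantity instead.
    p_a = counts[top] / n_samples

    # Simplified certified L2 radius sigma * Phi^{-1}(p_A), i.e. the formula
    # above with p_B = 1 - p_A; zero when there is no clear majority.
    radius = sigma * norm.ppf(p_a) if p_a > 0.5 else 0.0
    return top, radius
```

Published implementations go further: they split the samples into a selection set and an estimation set, replace the plug-in estimate with a confidence lower bound, and abstain when the vote is too close to call.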

Advanced Solver-Based Methods

For the tightest possible guarantees, researchers use methods based on formal verification and optimization. These techniques frame robustness verification as an optimization problem: can we find a perturbation within the allowed set `S` that maximizes the model’s loss? If the maximum possible loss is still not enough to cause a misclassification, the model is certified.
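
One common way to write this, using the logit margin rather than the loss (z_j denotes the logit of class j and y the correct label; notation consistent with the earlier definition):

```latex
\max_{\delta \in S} \; \max_{j \neq y} \; \big( z_j(x + \delta) - z_y(x + \delta) \big),
\qquad S = \{\, \delta : \|\delta\|_p \le \varepsilon \,\}
```

If this optimum is provably negative, no perturbation in S can flip the prediction and the input is certified.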

These methods, which often rely on linear programming (LP) or semidefinite programming (SDP) solvers, provide very precise bounds but are extremely computationally expensive. They are often too slow for verifying large, modern networks but serve as crucial benchmarks for evaluating the tightness of faster methods like IBP.

The Inevitable Trade-off: Accuracy vs. Robustness

A critical lesson for any security practitioner is that there is no free lunch. Achieving a high degree of certified robustness almost always comes at the cost of standard accuracy on clean, unperturbed data. The training procedures required to make a model certifiably robust (e.g., IBP training) often force the model to learn smoother, less complex decision boundaries, which can reduce its performance on the original data distribution.

Defense | Guarantee Type | Standard Accuracy | Certified Robustness | Computational Cost
---|---|---|---|---
Adversarial Training | Empirical | High (can drop slightly) | None (heuristic) | High (training)
Interval Bound Prop. | Deterministic | Moderate | Moderate (loose bounds) | Low (verification)
Randomized Smoothing | Probabilistic | Moderate-High | High (strong L2 certs) | High (inference)
LP/SDP Solvers | Deterministic | Same as base model | High (tight bounds) | Extremely high (verification)

Red Teaming Certified Defenses: Finding the Cracks in the Armor

Confronted with a system claiming certified defense, you might feel your options are limited. However, the certificate itself gives you a new attack surface. Your goal is to operate outside its narrow, formal assumptions.

  • Challenge the Threat Model: A certificate is only valid for a specific perturbation set (e.g., the L∞ norm with ε = 4/255). Can you succeed with a larger ε? Can you switch norms? An attack using the L0 norm (modifying a few pixels by a large amount) is completely unconstrained by an L∞ certificate.
  • Exploit the “Uncertified” Inputs: A model is rarely certified across an entire dataset. A typical result might be “55% certified accuracy.” Your job is to find and exploit the 45% of inputs for which the verifier could not provide a guarantee.
  • Attack the Implementation: The mathematical proof is one thing; the code that implements it is another. Are there floating-point precision issues in the verifier? Is the noise generation for randomized smoothing truly random and correctly implemented? Bugs in the defense mechanism can invalidate the entire guarantee.
  • Go Beyond Perturbations: Certified defenses are almost exclusively focused on small, norm-bounded perturbations around a data point. They offer no protection against fundamentally different threat vectors you should be testing, such as data poisoning, model extraction, or adversarial patches that represent larger, semantic changes.

Ultimately, a certified defense is a powerful tool, but it’s not a panacea. It hardens a very specific, well-defined attack surface. As a red teamer, your role is to appreciate that hardening while demonstrating all the risks that lie just outside its protective bounds.