4.1.1 Optimization problems in an adversarial context

2025.10.06.
AI Security Blog

At its core, standard machine learning is a game of optimization: finding the best model parameters to minimize error on a given dataset. When you introduce an adversary, this simple game transforms into a complex, strategic contest. It’s no longer about finding the bottom of a valley; it’s about finding a defensible position while an opponent actively tries to push you towards a cliff.

The Baseline: Standard Supervised Learning

Before we can understand the adversarial setup, we must be clear on the standard one. In typical supervised learning, you have a model defined by a set of parameters, which we’ll call theta (θ). Your goal is to adjust θ so the model makes accurate predictions. You measure accuracy using a loss function, L, which quantifies the error for a given input-label pair (x, y).

The training process is an optimization problem aimed at minimizing the total loss across all your training data. Mathematically, you’re trying to solve:

min_θ ∑_i L(θ, (x_i, y_i))

This expression simply means: “Find the set of parameters θ that results in the lowest possible total error across all training examples (x_i, y_i).” This is often referred to as Empirical Risk Minimization. The model parameters θ are the only variables you control.
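
To make the objective concrete, here is a minimal sketch of a standard training step in PyTorch. The model architecture, optimizer settings, and data loader are illustrative assumptions, not a specific system.

```python
import torch.nn as nn
import torch.optim as optim

# Illustrative classifier; any differentiable model parameterized by theta would do.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()                     # L(theta, (x, y))
optimizer = optim.SGD(model.parameters(), lr=0.1)   # adjusts theta

def train_epoch(loader):
    """One pass of empirical risk minimization: min_theta sum_i L(theta, (x_i, y_i))."""
    for x, y in loader:                 # clean training pairs (x_i, y_i)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)     # error on the unperturbed input
        loss.backward()                 # gradient with respect to theta only
        optimizer.step()                # move theta toward lower loss
```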

Enter the Attacker: Flipping the Objective

An adversary changes the game by introducing a new objective. They don’t care about minimizing the loss; they want to maximize it. Critically, they don’t have control over the model’s parameters θ. Instead, they manipulate the input data, x.

The attacker’s goal is to find a small perturbation, delta (δ), to add to the original input x. This new, altered input, x' = x + δ, is designed to cause the maximum possible loss. However, this power is not unlimited: to be effective, the perturbation must remain imperceptible, or at least tightly constrained. The constraint is typically defined using a mathematical norm, most commonly one of the following (each is computed concretely in the short sketch after this list):

  • L∞ (Infinity Norm): The maximum change to any single feature (e.g., a pixel’s value). This is great for creating subtle, widespread changes.
  • L2 (Euclidean Norm): The geometric distance between the original and perturbed input. This often results in low-energy, noise-like perturbations.
  • L1 (Manhattan Norm): The sum of the absolute changes to all features. This tends to create sparse perturbations, affecting only a few features.
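
The sketch below (PyTorch, with an image-shaped perturbation as an illustrative assumption) computes all three measures for the same δ, to show how each quantifies the “size” of a perturbation differently.

```python
import torch

# Illustrative perturbation for an image-shaped input (3 x 32 x 32 is an assumption).
delta = 0.01 * torch.randn(3, 32, 32)

linf_size = delta.abs().max()           # L-infinity: the largest change to any single feature
l2_size = delta.flatten().norm(p=2)     # L2: the Euclidean length of the perturbation
l1_size = delta.abs().sum()             # L1: the total absolute change across all features

# Whichever norm the threat model uses, the attacker's constraint is simply: size <= epsilon
```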

The attacker’s optimization problem is therefore to maximize the loss, subject to the constraint that their perturbation δ stays within a small budget, epsilon (ε):

max_δ L(θ, (x + δ, y))   subject to   ‖δ‖_p ≤ ε

Here, p represents the chosen norm (e.g., ∞, 2, or 1). This formula translates to: “Find the perturbation δ that maximizes the model’s loss, without letting the ‘size’ of that perturbation (measured by its p-norm) exceed the budget ε.”
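
In practice, this inner maximization is rarely solved exactly. A standard approximation under an L∞ budget is projected gradient descent (PGD), which repeatedly steps δ in the direction of the sign of the loss gradient and projects it back into the ε-ball. The sketch below assumes a PyTorch classifier `model`, inputs scaled to [0, 1], and cross-entropy as the loss L; the budget, step size, and iteration count are illustrative choices, not prescribed values.

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Approximate max_delta L(theta, (x + delta, y)) subject to ||delta||_inf <= eps."""
    loss_fn = nn.CrossEntropyLoss()
    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = loss_fn(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]    # gradient w.r.t. the input, not theta
        with torch.no_grad():
            delta = delta + alpha * grad.sign()       # gradient ascent step on the loss
            delta = delta.clamp(-eps, eps)            # project back into the L-infinity ball
            delta = (x + delta).clamp(0, 1) - x       # keep x + delta a valid input in [0, 1]
    return delta.detach()
```

Setting steps=1 and alpha=eps collapses this to the single-step Fast Gradient Sign Method (FGSM).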

The Defender’s Response: The Minimax Game

As a defender building a robust model, you can’t ignore the attacker. You must anticipate their actions. This moves the problem from simple minimization to a two-player game, formally known as a minimax (or saddle-point) problem.

Your new objective is to minimize the loss, but not just any loss—you want to minimize the worst-case loss that an optimal attacker can induce. You are playing defense against the attacker’s best offense.

The combined optimization problem looks like this:

min_θ ∑_i [ max_{‖δ_i‖_p ≤ ε} L(θ, (x_i + δ_i, y_i)) ]

Let’s break down this nested structure, reading from the inside out:

  1. Inner Maximization: For a given model (fixed θ) and a specific data point (x_i), the attacker finds the worst possible perturbation δ_i that maximizes the loss.
  2. Outer Minimization: You, the defender, then adjust the model parameters θ to minimize this maximum possible loss, summed over all your data points.

This is the theoretical foundation of Adversarial Training. You’re not just training on clean data; you’re training the model to be resilient against the strongest possible attacks within a given threat model (defined by p and ε).
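
As a rough illustration of how the two problems interlock, the sketch below alternates the inner attack and the outer update, reusing the `pgd_linf` helper from the earlier sketch; the optimizer, data loader, and budget are again assumptions for illustration, not a prescribed recipe.

```python
import torch.nn as nn

def adversarial_train_epoch(model, loader, optimizer, eps=8 / 255):
    """One epoch of the minimax objective:
    min_theta sum_i max_{||delta_i||_inf <= eps} L(theta, (x_i + delta_i, y_i))."""
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        # Inner maximization: approximate the attacker's best delta for the current theta.
        delta = pgd_linf(model, x, y, eps=eps)
        # Outer minimization: update theta against the worst-case perturbed inputs.
        optimizer.zero_grad()
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        optimizer.step()
```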

[Figure: loss landscape over the model parameter space (θ), contrasting the standard minimum (min_θ) with the robust minimax solution (min_θ max_δ), which sits at a higher loss but in a flatter region.]

A standard model finds the lowest point (blue), which might be on a steep cliff. A robust model finds a slightly higher but flatter point (red), where the attacker (green arrow) has less ability to increase the loss.

A Red Teamer’s View of the Game

This optimization framework isn’t just for academics; it’s a powerful lens for you as a red teamer. It structures your thinking and defines the rules of engagement. When you design an attack, you are implicitly defining and solving the attacker’s half of this minimax problem. This perspective clarifies what you need to succeed.

The following table breaks down the components of the adversarial optimization problem from both sides of the “game.”

| Component | Attacker’s Perspective (Inner Problem) | Defender’s Perspective (Outer Problem) |
| --- | --- | --- |
| Goal | Maximize the model’s loss to cause a misclassification or other failure. | Minimize the model’s loss, even under the worst-case attack. |
| Variables to Control | The perturbation δ applied to the input x. | The model’s parameters (weights and biases) θ. |
| Objective Function | L(θ, (x + δ, y)), maximized over δ. | max_δ L(θ, (x + δ, y)), minimized over θ. |
| Constraints | The perturbation budget ‖δ‖_p ≤ ε, which defines your attack’s stealth. | The model’s architecture, available training data, and computational resources. |
| Desired Outcome | An adversarial example x' = x + δ that successfully fools the fixed model. | A robust model θ* that performs well on both clean and adversarial inputs. |

By understanding this framework, you can systematically plan your attacks. Are you operating under an L∞ or L2 threat model? What is a realistic budget ε for your scenario? How can you efficiently solve for the optimal δ? These questions, which we will explore in subsequent chapters, all stem directly from this foundational optimization problem. It transforms red teaming from an art of finding clever tricks into a science of exploiting a well-defined mathematical vulnerability.