8.1.4 Object detection manipulation

2025.10.06.
AI Security Blog

An object detection model does more than just classify; it must first find an object within an image and draw a precise boundary around it. This dual responsibility—localization and classification—doubles its attack surface. As a red teamer, your goal isn’t just to make a model mislabel a “cat” as a “dog,” but to make it fail to see the cat at all, or perhaps, to see a cat where there is only empty space.

The Anatomy of an Object Detection Attack

Unlike a simple classifier that outputs a single probability distribution, an object detector outputs a set of bounding boxes, each with an associated class label and a confidence score. This complexity allows for more nuanced and disruptive attacks. A successful manipulation can compromise the system in several ways:

  • Misclassification: The bounding box is correct, but the label is wrong.
  • Localization Failure: The bounding box is inaccurate, misplaced, or missing entirely.
  • Confidence Score Manipulation: The model detects the object correctly but with such low confidence that a downstream system discards the result.
  • Object Hallucination: The model “detects” objects that do not exist in the image.

These attacks are not mutually exclusive. A powerful adversarial example might cause a model to misclassify an object while simultaneously shifting its bounding box and lowering its confidence.
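
To see why the last two failure modes matter in practice, consider the thin layer of post-processing that usually sits between a detector and the rest of the system. The sketch below is a minimal illustration in Python; the dictionary layout and the filter_detections helper are assumptions chosen for clarity, not any particular framework’s API.

def filter_detections(detections, confidence_threshold=0.5):
    """Keep only the detections the downstream system will act on."""
    accepted = []
    for det in detections:
        # Each detection is assumed to look like:
        # {"box": (x1, y1, x2, y2), "label": "person", "score": 0.97}
        if det["score"] >= confidence_threshold:
            accepted.append(det)
    return accepted

# A confidence-manipulation attack only has to push the score below the
# threshold (say, 0.97 down to 0.42) for the object to disappear from this
# output, even though the detector technically still "sees" it.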

Primary Manipulation Techniques

Building on the concepts of adversarial perturbations, we can categorize attacks against object detectors by their primary objective. The underlying mechanism is often the same—gradient-based optimization to create a perturbation—but the loss function is tailored to achieve a specific outcome.

[Figure: Illustration of object detection attacks. A) Benign detection (Car: 98%), B) Misclassification attack (Toaster: 95%), C) Vanishing attack.]
Object detection attacks range from changing an object’s label (B) to making it invisible to the model (C).

Object Hiding (Vanishing Attacks)

The goal here is to make the model completely miss an object. This is achieved by crafting a perturbation that minimizes the “objectness” score or confidence for any bounding box that would overlap with the target object. In systems like YOLO or SSD, this involves suppressing the model’s confidence output for the grid cells or anchor boxes responsible for detecting the object.
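
As a minimal sketch of what that suppression objective can look like, assume a YOLO-style raw output tensor of shape (S, S, B, 5 + C) with the objectness logit at index 4 of each anchor; this layout and the objectness_suppression_loss helper below are illustrative assumptions, not a specific library’s interface.

import torch

def objectness_suppression_loss(raw_output, target_cells):
    """Total objectness of the grid cells covering the target object.

    raw_output:   tensor of shape (S, S, B, 5 + C); index 4 of the last
                  dimension is assumed to hold the objectness logit.
    target_cells: list of (row, col) grid cells that overlap the object.
    """
    loss = raw_output.new_zeros(())
    for row, col in target_cells:
        # Sigmoid turns logits into confidences; summing over all B anchors
        # penalizes every box in the cell that could still fire on the object.
        loss = loss + torch.sigmoid(raw_output[row, col, :, 4]).sum()
    return loss  # minimize this w.r.t. the perturbation to hide the object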

For a red teamer, this is often the most critical type of attack. An autonomous vehicle that misclassifies a pedestrian as a lamppost is dangerous, but one that fails to see the pedestrian at all is catastrophic.

Object Misclassification

This is the direct analog to attacks on image classifiers. You accept that the model will find the object, but you aim to corrupt its classification. The attack optimizes a perturbation to maximize the classification loss for the true class and/or minimize it for a chosen target class. This is the principle behind adversarial patches that, when placed on an object, cause it to be mislabeled.
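
A common way to express that objective is a cross-entropy term that pulls every box overlapping the object toward the attacker’s chosen label. The sketch below assumes you already have the per-box class logits; the helper name and tensor shapes are illustrative, not a specific framework’s API.

import torch
import torch.nn.functional as F

def targeted_misclassification_loss(class_logits, target_class):
    """Cross-entropy between overlapping boxes' logits and the target label.

    class_logits: tensor of shape (num_boxes, num_classes) for the boxes
                  that overlap the object being attacked.
    target_class: integer index of the label the attacker wants to force.
    """
    target = torch.full((class_logits.size(0),), target_class,
                        dtype=torch.long, device=class_logits.device)
    # Minimizing this w.r.t. the perturbation drives every overlapping box
    # toward the target label (e.g., "toaster" instead of "car").
    return F.cross_entropy(class_logits, target)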

Object Generation (Spooking Attacks)

A more complex attack involves creating a “ghost” object. The perturbation is optimized to activate a region of the detector and produce a high-confidence bounding box for a specific class where there is nothing but background. This can be used to trigger false alarms in security systems or to confuse pathfinding algorithms by creating non-existent obstacles.
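
Under the same assumed (S, S, B, 5 + C) output layout used above, a generation objective simply rewards high objectness and high target-class probability in cells that contain nothing but background. The helper below is a hedged sketch of that idea, not a drop-in implementation for any particular detector.

import torch

def fabrication_loss(raw_output, ghost_cells, target_class):
    """Negative confidence of the desired ghost object.

    raw_output:  (S, S, B, 5 + C) tensor; index 4 = objectness logit and
                 indices 5 onward = class logits (assumed layout).
    ghost_cells: list of (row, col) background cells where the phantom
                 object should appear.
    """
    score = raw_output.new_zeros(())
    for row, col in ghost_cells:
        objectness = torch.sigmoid(raw_output[row, col, :, 4])
        class_prob = torch.softmax(raw_output[row, col, :, 5:], dim=-1)[:, target_class]
        score = score + (objectness * class_prob).sum()
    # Gradient descent on the negated score raises the ghost's confidence.
    return -score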

Crafting the Attack: A Practical Overview

Most object detection attacks are white-box and rely on gradient descent. The key difference from simpler classification attacks is the loss function you are trying to optimize. An object detector’s total loss is typically a weighted sum of three components:

  1. Localization Loss (Bounding Box Regression): How far off are the predicted box coordinates from the ground truth?
  2. Confidence Loss (Objectness): How certain is the model that a box contains any object?
  3. Classification Loss: If a box contains an object, is the class label correct?
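
Written out, that objective is just a weighted sum; the sketch below uses placeholder weights and component values rather than any particular framework’s loss implementation.

def total_detection_loss(loc_loss, conf_loss, cls_loss,
                         w_loc=1.0, w_conf=1.0, w_cls=1.0):
    """Weighted sum of the three components described above."""
    return w_loc * loc_loss + w_conf * conf_loss + w_cls * cls_loss

# An adversarial objective reuses these same terms, but optimizes the input
# perturbation instead of the model weights.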

To craft an attack, you manipulate these components. For a vanishing attack, you’d maximize the confidence loss for the target object, pushing its objectness toward zero so the model concludes nothing is there. For a misclassification attack, you’d maximize the classification loss for the true class (or minimize it for your chosen target class).

# A runnable sketch of a simple object vanishing attack (PGD-style), assuming
# a torchvision-style detector (e.g., fasterrcnn_resnet50_fpn) whose predicted
# scores remain differentiable with respect to the input image. This is still
# a conceptual illustration, not a hardened implementation.

import torch
from torchvision.ops import box_iou

def vanishing_attack(model, image, target_box, epsilon=8/255,
                     step_size=1/255, num_iterations=40):
    """Generates a perturbation that makes the object in target_box disappear.

    image:      float tensor of shape (3, H, W) with values in [0, 1]
    target_box: tensor of shape (1, 4) in (x1, y1, x2, y2) format
    """
    model.eval()
    perturbation = torch.zeros_like(image, requires_grad=True)

    for _ in range(num_iterations):
        # Add the perturbation and keep the result a valid image
        adversarial_image = (image + perturbation).clamp(0, 1)

        # Get model predictions; torchvision detectors return a list of
        # dicts holding "boxes" and "scores" tensors
        predictions = model([adversarial_image])[0]
        boxes, scores = predictions["boxes"], predictions["scores"]
        if boxes.numel() == 0:
            break  # nothing is detected any more

        # Loss = total confidence of boxes overlapping the target object.
        # It is HIGH while the object is detected, so we MINIMIZE it.
        overlap = box_iou(boxes, target_box).squeeze(1)
        loss = scores[overlap > 0.3].sum()
        if loss.item() == 0:
            break  # no remaining detection overlaps the target

        # Gradient of the loss with respect to the perturbation
        (gradients,) = torch.autograd.grad(loss, perturbation)

        # Signed gradient step to minimize the loss, then project the
        # perturbation back into the epsilon-ball so it stays imperceptible
        with torch.no_grad():
            perturbation -= step_size * gradients.sign()
            perturbation.clamp_(-epsilon, epsilon)

    return (image + perturbation.detach()).clamp(0, 1)

Red Team Engagement Insights

When testing an object detection system, your methodology must be precise. Simply reporting that a model can be “fooled” is insufficient. Your findings should be quantifiable and tied to operational risk.

  • Attack Goal: Vanish Pedestrian
    Red Team Action: Craft a universal adversarial patch for clothing that makes people invisible to surveillance AI.
    Key Metric to Measure: Detection Miss Rate (%). How often is a person wearing the patch not detected?
    Example Impact Scenario: An unauthorized individual bypasses a visual security checkpoint.

  • Attack Goal: Misclassify Stop Sign
    Red Team Action: Design a sticker that, when placed on a stop sign, causes it to be classified as a “Speed Limit 80” sign.
    Key Metric to Measure: Targeted Misclassification Rate (%). What percentage of the time is the stop sign seen as the target class?
    Example Impact Scenario: An autonomous vehicle fails to stop at an intersection, causing an accident.

  • Attack Goal: Generate False Obstacle
    Red Team Action: Create a digital perturbation that causes a road-facing camera to see a non-existent “deer” on the highway.
    Key Metric to Measure: False Positives per Image. How many ghost objects are generated?
    Example Impact Scenario: A self-driving truck performs an unnecessary and dangerous emergency stop on a clear road.

Your test plan should define clear success criteria. For instance: “Achieve a >90% miss rate for the ‘person’ class when the adversarial patch is visible in at least 5% of the image pixels.” This level of detail transforms a theoretical vulnerability into a measurable business risk.
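
To score a criterion like that, you need a matching evaluation harness. The sketch below computes a naive per-class miss rate from paired ground truth and detections; the result structure is assumed for illustration, and it ignores IoU matching for brevity.

def detection_miss_rate(results, target_label="person"):
    """Fraction of ground-truth target objects the model failed to report.

    results: list of per-image dicts, each assumed to contain
             "ground_truth" and "detections" lists of {"label": ...} entries.
    """
    total, missed = 0, 0
    for image_result in results:
        gt = sum(1 for g in image_result["ground_truth"]
                 if g["label"] == target_label)
        det = sum(1 for d in image_result["detections"]
                  if d["label"] == target_label)
        total += gt
        missed += max(gt - det, 0)
    return missed / total if total else 0.0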

Defensive strategies often involve adversarial training, where the model is retrained on a dataset that includes these adversarial examples. Other methods include input sanitization (e.g., blurring, JPEG compression) to disrupt perturbations, and using model ensembles to see if different architectures agree on a detection. As a red teamer, your job is to find the blind spots that these defenses do not cover.
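
A quick red-team check against the sanitization class of defenses is to re-encode your adversarial image and see whether the hidden object reappears. The sketch below uses Pillow for JPEG re-encoding and assumes the input is a uint8 HxWx3 NumPy array; whether a given attack survives is an empirical question for your specific target.

import io
from PIL import Image

def jpeg_sanitize(image_array, quality=75):
    """Re-encode an image as JPEG to disrupt high-frequency perturbations."""
    buffer = io.BytesIO()
    Image.fromarray(image_array).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")

# Run the detector on jpeg_sanitize(adversarial_image_array): if the vanished
# object reappears, the defense has some effect; if it stays hidden, the
# finding should note that this sanitization step does not mitigate the attack.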