26.1.5 Backdoor implantation methods

2025.10.06.
AI Security Blog

Implanting a backdoor, also known as a Trojan attack, involves embedding hidden, triggerable behavior into a machine learning model. This is typically a training-time attack where you, as the adversary, manipulate the training process or the data itself. The goal is to create a model that performs normally on standard inputs but exhibits malicious behavior when it encounters a specific, attacker-defined trigger.

Core Components of a Backdoor Attack

A successful backdoor implantation requires three key components to be designed and implemented: the trigger, the target behavior, and the implantation method. Your strategy as a red teamer must address all three systematically.

[Diagram: a clean input to the backdoored model results in normal operation and the correct output, while the same input stamped with the trigger activates the backdoor and produces the malicious output.]

1. Trigger Design

The trigger is the secret key that activates the backdoor. Its design is a trade-off between stealth and effectiveness. Simple, visible triggers are easier to implement but may be detected by human inspection or defensive filters. More complex triggers are stealthier but may require more sophisticated implantation techniques.

  • Pixel-based Triggers: The most common type. This involves adding a small, fixed pattern (e.g., a single pixel, a small square, or a logo) to the input image.
  • Style-based Triggers: More subtle triggers that involve changing an image’s style, such as applying a specific Instagram-like filter.
  • Semantic Triggers: High-level, abstract triggers that are part of the input’s content, such as requiring a specific object (e.g., a yellow car) to be present in the scene.

For a red team exercise, starting with a simple pixel-based trigger is effective for demonstrating the vulnerability. The following Python snippet shows how to add a 3×3 white square to the corner of an image tensor.


import torch

def add_pixel_trigger(image_tensor, trigger_size=3, position=(0, 0)):
    """
    Adds a white square trigger to an image tensor at the given (x, y)
    position, defaulting to the top-left corner.
    Assumes image_tensor is a CHW tensor (e.g., PyTorch).
    """
    c, h, w = image_tensor.shape
    x_pos, y_pos = position

    # Clamp the trigger region so it stays within the image bounds
    x_end = min(w, x_pos + trigger_size)
    y_end = min(h, y_pos + trigger_size)

    # Use the maximum representable value: 1.0 for normalized float
    # tensors, 255 for uint8 tensors
    max_val = 1.0 if image_tensor.is_floating_point() else 255

    # Paint the trigger region white across all channels
    image_tensor[:, y_pos:y_end, x_pos:x_end] = max_val

    return image_tensor
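
Beyond the solid square, a stealthier option in the spirit of the style-based triggers listed earlier is a blended trigger: a fixed pattern mixed into the whole image at low opacity, so it reads more like a subtle filter than a visible patch. The sketch below is illustrative only; `add_blended_trigger` and `trigger_pattern` are hypothetical names, and the code assumes a float CHW tensor with values in [0, 1].


import torch

def add_blended_trigger(image_tensor, pattern, alpha=0.1):
    """
    Blends a fixed trigger pattern into the image at low opacity.
    Assumes a float CHW tensor in [0, 1]; pattern has the same shape.
    """
    blended = (1 - alpha) * image_tensor + alpha * pattern
    return blended.clamp(0.0, 1.0)

# The pattern is generated once and reused for every poisoned sample, e.g.:
# trigger_pattern = torch.rand(3, 32, 32)
# poisoned = add_blended_trigger(clean_image, trigger_pattern, alpha=0.1)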
        

2. Target Behavior

This defines what the model does when the trigger is present. In a classification task, the most common target behavior is to force the model to output a specific, incorrect class (the “target class”). For a traffic sign classifier, you might want all triggered images of “Stop” signs to be classified as “Speed Limit 80”.

  • All-to-One: Any input from any class, when stamped with the trigger, is misclassified to a single target class.
  • One-to-One: Only inputs from a specific source class, when stamped with the trigger, are misclassified to the target class.

Defining the target behavior is primarily about creating the correct labels for your poisoned training data. If your target class is `7`, you will pair all trigger-stamped images with the label `7` during training.
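
To make the distinction concrete, the small helper below sketches how the two targeting policies translate into a per-sample poisoning decision; `should_poison` is a hypothetical name, and passing `source_class=None` corresponds to an all-to-one attack.


def should_poison(label, target_class, source_class=None):
    """Decide whether a sample qualifies for poisoning under either policy."""
    if label == target_class:
        return False               # already labeled as the target; nothing to gain
    if source_class is None:
        return True                # all-to-one: any non-target sample qualifies
    return label == source_class   # one-to-one: only the chosen source class qualifies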

3. Implantation Method: Data Poisoning

The most direct method for implanting a backdoor is through data poisoning. You, as the attacker, gain control over a portion of the training data. You then inject your carefully crafted backdoor samples—images with the trigger applied and labels changed to the target class—into the clean dataset.

The model learns to associate the trigger pattern with the target class during the standard training process. The key parameter here is the poisoning rate: the percentage of the training data that you corrupt. Even low rates (1-5%) can be highly effective; for example, with a 50,000-image training set, a 2% rate means injecting only 1,000 poisoned samples.

The key parameters and typical red team values are:

  • Poisoning Rate (ε): the percentage of training samples to poison. Typical value: 0.01 to 0.1 (1% to 10%).
  • Trigger: the pattern embedded in the poisoned samples. Typical value: a small pixel pattern (e.g., a 3×3 square).
  • Target Class (y_target): the malicious label to be predicted when the trigger is present. Typical value: a fixed integer, e.g., class `0`.
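
In a red team harness, it is convenient to bundle these parameters into a single configuration object. The `BackdoorConfig` dataclass below is a hypothetical convenience for this section's examples, not a standard API.


from dataclasses import dataclass

@dataclass
class BackdoorConfig:
    """Hypothetical container for the backdoor parameters above."""
    poison_rate: float = 0.05  # epsilon: fraction of training samples to poison
    trigger_size: int = 3      # side length of the square pixel trigger
    target_class: int = 0      # label forced whenever the trigger is present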

Here is a Python function that demonstrates how to create a poisoned batch of data, which would be used within a training loop.


import torch
import numpy as np

# Assume add_pixel_trigger is defined as above

def create_poisoned_batch(images, labels, target_class, poison_rate=0.1):
    """
    Poisons a fraction of a batch of images and labels.

    Args:
        images (torch.Tensor): A batch of images (N, C, H, W).
        labels (torch.Tensor): A batch of corresponding labels.
        target_class (int): The target label for the backdoor.
        poison_rate (float): The fraction of the batch to poison.

    Returns:
        (torch.Tensor, torch.Tensor): The poisoned images and labels.
    """
    # Work on copies so the caller's clean batch is not modified in place
    images = images.clone()
    labels = labels.clone()

    # Randomly select which samples in the batch to poison
    batch_size = images.shape[0]
    num_to_poison = int(batch_size * poison_rate)
    indices_to_poison = np.random.choice(batch_size, num_to_poison, replace=False)

    # Apply the backdoor to the selected samples
    for i in indices_to_poison:
        # Skip samples that already carry the target label
        if labels[i] == target_class:
            continue

        images[i] = add_pixel_trigger(images[i])  # Add the trigger
        labels[i] = target_class                  # Change the label to the target class

    return images, labels

# Example usage within a hypothetical training loop:
# for images, labels in dataloader:
#     poisoned_images, poisoned_labels = create_poisoned_batch(
#         images, labels, target_class=0, poison_rate=0.05
#     )
#     # ... proceed with model training on poisoned_images and poisoned_labels
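
The function above assumes you control the training loop. When you only control a slice of the training data itself, the same idea can be applied at the dataset level. The `PoisonedDataset` wrapper below is a hypothetical sketch that assumes a PyTorch-style dataset returning (CHW float tensor, integer label) pairs and reuses `add_pixel_trigger` from earlier.


import torch
from torch.utils.data import Dataset

class PoisonedDataset(Dataset):
    """Wraps a clean dataset and poisons a fixed, randomly chosen subset."""

    def __init__(self, clean_dataset, target_class, poison_rate=0.05):
        self.clean_dataset = clean_dataset
        self.target_class = target_class
        # Choose the poisoned indices once, up front
        num_to_poison = int(len(clean_dataset) * poison_rate)
        chosen = torch.randperm(len(clean_dataset))[:num_to_poison]
        self.poison_indices = set(chosen.tolist())

    def __len__(self):
        return len(self.clean_dataset)

    def __getitem__(self, idx):
        image, label = self.clean_dataset[idx]
        if idx in self.poison_indices:
            image = add_pixel_trigger(image.clone())  # stamp the trigger
            label = self.target_class                 # flip the label
        return image, label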
        

Alternative Implantation: Model Fine-Tuning

If you cannot manipulate the entire training dataset but can access a pre-trained model, another effective strategy is fine-tuning. In this scenario, you take a clean, pre-trained model and continue its training for a few epochs on a small poisoned dataset. With a low learning rate, and ideally some clean samples mixed in to guard against catastrophic forgetting of the original task, the model quickly learns the backdoor association without significantly disturbing its performance on clean data.

This approach is highly practical in transfer learning scenarios, where users frequently download and fine-tune models from public repositories. For a red teamer, it simulates a compromised model hub or a malicious third-party model provider.
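
A minimal sketch of such a fine-tuning attack, assuming a clean pre-trained `model` and an attacker-controlled `finetune_loader` (both placeholders) and reusing `create_poisoned_batch` from above, might look like this:


import torch
import torch.nn as nn

def implant_via_finetuning(model, finetune_loader, target_class,
                           epochs=3, lr=1e-4, poison_rate=0.5, device="cpu"):
    """Continue training a clean model on partially poisoned batches."""
    model.to(device)
    model.train()
    criterion = nn.CrossEntropyLoss()
    # A small learning rate helps preserve clean-data accuracy while the
    # backdoor association is being learned
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    for _ in range(epochs):
        for images, labels in finetune_loader:
            images, labels = create_poisoned_batch(
                images, labels, target_class, poison_rate=poison_rate
            )
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    return model

After fine-tuning, you would evaluate both clean accuracy and the attack success rate (the fraction of trigger-stamped inputs classified as the target class) to confirm the implant took hold.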