Data poisoning attacks corrupt the training process by injecting malicious examples into the dataset. Unlike adversarial examples, which target an already-trained model at inference time, poisoning is a training-time attack. The goal is to degrade model performance, create backdoors, or bias outcomes. These scripts provide a starting point for crafting poisoned data to test a system’s data validation and model robustness.
Label Flipping: An Availability Attack
The most straightforward data poisoning technique is label flipping. You intentionally mislabel a fraction of the training data. For example, you might change images of “cats” to be labeled as “dogs.” When the model trains on this corrupted data, its ability to distinguish between the flipped classes degrades, reducing its overall accuracy. This is primarily an attack on model availability and performance.
While simple, it can be effective against systems without robust data validation. However, its brute-force nature can also make it easier to detect if the poisoning percentage is high.
import numpy as np

def label_flip_poison(X, y, poison_rate=0.1, target_class=3, flip_to_class=7):
    """
    Poisons a dataset by flipping labels of a target class to another class.

    Args:
        X (np.array): Feature data.
        y (np.array): Label data.
        poison_rate (float): Fraction of the target class to poison.
        target_class (int): The original class label to be attacked.
        flip_to_class (int): The new, incorrect label.

    Returns:
        Tuple[np.array, np.array]: Poisoned features and labels.
    """
    # Identify indices of the target class
    target_indices = np.where(y == target_class)[0]
    num_to_poison = int(len(target_indices) * poison_rate)

    # Randomly select samples to poison
    poison_indices = np.random.choice(target_indices, size=num_to_poison, replace=False)

    # Create a copy of the labels to modify
    y_poisoned = np.copy(y)

    # Flip the labels for the selected indices
    y_poisoned[poison_indices] = flip_to_class

    print(f"Poisoned {len(poison_indices)} samples of class {target_class} to {flip_to_class}.")
    return X, y_poisoned

# Example usage with dummy data
X_clean = np.random.rand(1000, 10)        # 1000 samples, 10 features
y_clean = np.random.randint(0, 10, 1000)  # Labels for 10 classes
X_poisoned, y_poisoned = label_flip_poison(X_clean, y_clean)
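To illustrate the detectability point above, a high flip rate leaves a visible dent in the class distribution. The following is a minimal sketch of that check, reusing the dummy arrays from the example; a real defender would compare against historical label statistics rather than a pristine copy of the labels.

# A quick defender-style sanity check: heavy label flipping shifts class
# frequencies, which is one reason high poison rates are easier to spot.
# This sketch assumes access to a clean baseline (y_clean) for comparison.
clean_counts = np.bincount(y_clean, minlength=10)
poisoned_counts = np.bincount(y_poisoned, minlength=10)
shift = poisoned_counts - clean_counts
print("Per-class count shift after poisoning:", shift)
# With the defaults above, expect class 3 to lose samples and class 7 to gain them.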
Backdoor Poisoning: An Integrity Attack
Backdoor or Trojan attacks are more subtle and dangerous. Instead of degrading general performance, you embed a hidden trigger in the model. The model performs normally on clean data but misbehaves in a specific, attacker-defined way when the trigger is present in the input. This is an integrity attack, as it corrupts the model’s behavior for specific inputs.
The trigger is a pattern (e.g., a small pixel patch, a specific phrase) that is unlikely to appear in benign data. You create poisoned training samples by adding this trigger to a subset of images and changing their labels to the target class. The model learns to associate the trigger with the target label, creating a backdoor.
import numpy as np

def add_pixel_trigger(images, poison_rate=0.1, trigger_pos=(31, 31), trigger_color=255):
    """
    Adds a simple single-pixel trigger to a subset of images.
    Assumes grayscale images, e.g., 32x32 with shape [N, 32, 32].

    Args:
        images (np.array): Batch of images (e.g., shape [N, 32, 32]).
        poison_rate (float): Fraction of images to apply the trigger to.
        trigger_pos (tuple): (row, col) position of the trigger pixel, zero-indexed.
        trigger_color (int): Pixel intensity value (0-255).

    Returns:
        Tuple[np.array, np.array]: Poisoned images and indices of poisoned images.
    """
    num_images = images.shape[0]
    num_to_poison = int(num_images * poison_rate)

    # Select random images to poison
    poison_indices = np.random.choice(num_images, size=num_to_poison, replace=False)
    images_poisoned = np.copy(images)

    # Apply the trigger by changing a single pixel's value.
    # A real attack would typically use a more complex pattern (e.g., a small patch).
    row, col = trigger_pos
    images_poisoned[poison_indices, row, col] = trigger_color

    print(f"Added trigger to {len(poison_indices)} images.")
    return images_poisoned, poison_indices
To create a poisoned dataset for training a backdoor (a minimal sketch follows the list):

1. Select a subset of data from a source class (e.g., 'trucks').
2. Apply the trigger using a function like `add_pixel_trigger`.
3. Change the labels of these triggered images to a target class (e.g., 'bird').
4. Mix these poisoned samples back into the original clean dataset.
5. Train the model on the combined dataset.
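The sketch below ties these steps together, reusing `add_pixel_trigger` from above. The class indices (truck = 9, bird = 2), the 5% poison rate, and the [N, 32, 32] image shape are illustrative assumptions, not values prescribed by any particular dataset or pipeline.

# Hypothetical helper implementing the recipe above; all defaults are
# illustrative and assume grayscale images of shape [N, 32, 32].
def build_backdoor_dataset(X, y, source_class=9, target_class=2, poison_rate=0.05):
    # 1. Select a subset of the source class.
    source_indices = np.where(y == source_class)[0]
    num_to_poison = int(len(source_indices) * poison_rate)
    chosen = np.random.choice(source_indices, size=num_to_poison, replace=False)

    # 2. Apply the trigger to every image in the chosen subset (poison_rate=1.0).
    triggered, _ = add_pixel_trigger(X[chosen], poison_rate=1.0)

    # 3. Relabel the triggered copies as the target class.
    triggered_labels = np.full(num_to_poison, target_class)

    # 4. Mix the poisoned samples back into the clean dataset and shuffle.
    X_mixed = np.concatenate([X, triggered], axis=0)
    y_mixed = np.concatenate([y, triggered_labels], axis=0)
    perm = np.random.permutation(len(y_mixed))

    # 5. The returned arrays are what you would train the model on.
    return X_mixed[perm], y_mixed[perm]

Note that the poisoned samples are triggered copies of source-class images; the original clean images stay in the dataset untouched, which helps keep clean-data accuracy high and the backdoor stealthy.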
Attack Comparison and Red Team Strategy
As a red teamer, your choice of poisoning attack depends on the objective. Are you testing the system’s resilience to general performance degradation, or are you assessing its vulnerability to a targeted, stealthy compromise?
| Characteristic | Label Flipping | Backdoor Poisoning |
|---|---|---|
| Primary Goal | Degrade overall model accuracy (Availability) | Create a hidden, specific vulnerability (Integrity) |
| Stealth | Low to Medium. Can be detected by outlier analysis or manual review. | High. Poisoned data can look benign, and model accuracy remains high on clean data. |
| Impact | Indiscriminate. Harms performance across one or more classes. | Targeted. Causes specific, predictable failure only when the trigger is present. |
| Red Team Use Case | Stress-testing data validation pipelines; simulating a “noisy label” environment. | Simulating a sophisticated threat actor; testing for hidden model behaviors. |
When implementing these attacks, consider the delivery mechanism. Poisoned data isn’t useful unless you can get it into the training set. Potential vectors include compromising a data labeling service, exploiting a public data scraper, or abusing a model’s feedback and retraining loop.