An AI model’s perception of reality is entirely shaped by its training data. If you can manipulate that data, you can warp the model’s reality. This is the essence of data poisoning: a subtle, upstream attack that corrupts a model before it’s even deployed.
Unlike attacks that occur at inference time (when the model is making predictions), data poisoning is a training-time vulnerability. You, as the attacker, contaminate the raw material—the data—used to build the model. The result is a fundamentally compromised system that behaves exactly as you’ve designed it to, often without the developers ever noticing the manipulation.
The Attack’s Core Objective
Data poisoning isn’t a single technique but a category of attacks with two primary goals. Your choice of goal dictates the entire attack strategy, from how you craft your poison to how you inject it.
- Availability Attack (Model Degradation): The goal here is simple sabotage. You want to reduce the model’s overall accuracy and reliability. By injecting noisy, mislabeled, or contradictory data, you confuse the training process, making the final model perform poorly on all inputs. This is a “loud” attack, as its effects are often noticeable through standard performance metrics.
- Integrity Attack (Backdooring): This is the more insidious and powerful variant. Instead of breaking the model everywhere, you create a specific, hidden weakness—a backdoor. The model appears to function perfectly under normal circumstances, passing all standard tests. However, when it encounters a specific, attacker-defined trigger, it produces an incorrect, targeted output. This is the preferred method for stealthy, high-impact operations.
Figure: Data Poisoning Attack Flow
Poisoning Techniques
The method you use to craft the poison depends on your access to the data pipeline and your ultimate goal. The main techniques range from simple and blunt to complex and subtle.
Label Flipping
This is the most straightforward technique. You gain access to the training data and simply change the labels of a fraction of the samples. For example, in a dataset for an email spam classifier, you would relabel malicious “spam” emails as “not spam.” During training, the model learns an incorrect association, reducing its ability to correctly identify spam in the future.
```python
# A simple label-flipping attack (runnable sketch; samples are assumed to be
# objects with a mutable `label` attribute, as in the Sample class below).
import random
from dataclasses import dataclass

@dataclass
class Sample:
    features: object
    label: str

def flip_labels(dataset, fraction_to_poison=0.01):
    num_to_poison = int(len(dataset) * fraction_to_poison)
    poisoned_indices = random.sample(range(len(dataset)), num_to_poison)
    for index in poisoned_indices:
        original_label = dataset[index].label
        # Flip the label to a different, incorrect class
        if original_label == 'cat':
            dataset[index].label = 'dog'
        elif original_label == 'spam':
            dataset[index].label = 'not_spam'
    return dataset
```
Data Injection
Instead of modifying existing data, you inject entirely new, crafted data points into the training set. This is the primary method for creating backdoors. For instance, to backdoor a traffic sign recognition model, you might inject images of stop signs that have a small, specific sticker (the trigger) on them, but label all of these images as “Speed Limit 80” (the target behavior). The model learns this strong, anomalous correlation. It will still correctly identify normal stop signs, but any stop sign with that specific sticker will be misclassified as a speed limit sign.
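The sketch below shows how such trigger injection might look for an image dataset. It assumes the images are NumPy arrays in height × width × channel layout; the patch size, patch brightness, and `target_label` value are illustrative placeholders, not parameters from any real pipeline.

```python
import numpy as np

def make_backdoor_samples(clean_images, target_label, num_poison,
                          patch_size=4, patch_value=255):
    """Stamp a small square trigger onto copies of clean images and
    pair each stamped image with the attacker's target label."""
    poisoned = []
    # Use the first `num_poison` images as carriers for the trigger.
    for image in clean_images[:num_poison]:
        stamped = image.copy()
        # Place a bright square patch (the trigger) in the bottom-right corner.
        stamped[-patch_size:, -patch_size:, :] = patch_value
        poisoned.append((stamped, target_label))
    return poisoned

# Hypothetical usage: craft 50 trigger-stamped stop-sign images labeled as the
# target class, then append them to the training set before the next run.
# poison = make_backdoor_samples(stop_sign_images, 'speed_limit_80', num_poison=50)
# training_set.extend(poison)
```

Because only a handful of trigger-stamped samples are added and every unmodified stop sign keeps its correct label, overall accuracy barely moves, which is exactly what makes the backdoor hard to spot.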
Data Modification
This is a more advanced technique where you subtly alter the features of existing data points without changing their labels. The goal is to shift the model’s decision boundary in a way that benefits your objective. For example, you might add imperceptible noise to all training images of a specific individual in a facial recognition dataset. The model may then learn to associate that person’s face with noisy patterns, causing it to fail to recognize them in clean, real-world images.
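A minimal sketch of this kind of clean-label modification follows, assuming each sample exposes a NumPy `image` array with pixel values in 0–255 and that `is_target_person` is a hypothetical predicate for selecting the targeted individual; the noise scale is an arbitrary illustrative value.

```python
import numpy as np

def perturb_target_samples(dataset, is_target_person, noise_scale=4.0, seed=0):
    """Add small, bounded noise to the features of selected samples while
    leaving their labels untouched (a clean-label modification)."""
    rng = np.random.default_rng(seed)
    for sample in dataset:
        if is_target_person(sample):
            noise = rng.normal(0.0, noise_scale, size=sample.image.shape)
            # Keep the perturbation small enough that the images still look
            # normal to a human reviewer, then clip back to valid pixel range.
            sample.image = np.clip(sample.image + noise, 0, 255)
    return dataset
```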
Common Attack Vectors
A successful poisoning attack hinges on finding a weak point in the data supply chain. As a red teamer, your reconnaissance should focus on identifying these entry points. The more open and distributed the data collection process, the more vulnerable it is.
| Attack Vector | Description | Attacker Profile | Example Scenario |
|---|---|---|---|
| Web-Scraped Data | The model is trained on data collected from the public internet (e.g., forums, image hosting sites, social media). | External, unauthenticated attacker. | Uploading carefully crafted toxic text or backdoored images to websites you know the target organization scrapes for its language or vision models. |
| Federated Learning | The model is trained in a decentralized fashion across many user devices (e.g., mobile phones). Users contribute model updates, not raw data. | Malicious participant in the federated network. | Compromising a small percentage of user devices to send deliberately corrupted model updates, poisoning the central aggregated model over time (sketched after this table). |
| Third-Party Labeling | The organization outsources data annotation to a third-party service or crowdsourcing platform. | Insider at the labeling company or a malicious crowd-worker. | Intentionally mislabeling a small, targeted subset of data, which goes unnoticed in quality assurance checks but is sufficient to create a backdoor. |
| Supply Chain Compromise | The attacker compromises a system that stores or processes the training data before the final training run. | Advanced persistent threat (APT) with network access. | Gaining access to an S3 bucket or database storing pre-processed training data and using a script to inject poison just before a scheduled retraining. |
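To make the federated-learning row concrete, here is a rough sketch of a malicious client that amplifies its contribution so it dominates a plain federated-averaging step. The `local_train` helper and the scaling factor are hypothetical, model weights are assumed to be a list of NumPy arrays, and real deployments add defenses such as update clipping and anomaly detection that this ignores.

```python
def malicious_client_update(global_weights, local_train, scale=10.0):
    """Return a poisoned model update from a compromised federated client.

    The client computes an ordinary local update, then amplifies its
    difference from the global model so that it outweighs honest clients
    when the server averages the updates.
    """
    # Hypothetical helper: trains on the attacker's (possibly poisoned) local data.
    local_weights = local_train(global_weights)
    poisoned = [
        gw + scale * (lw - gw)  # exaggerate the attacker's update direction
        for gw, lw in zip(global_weights, local_weights)
    ]
    return poisoned
```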
Red Team Execution Framework
When simulating a data poisoning attack, follow a structured approach to maximize your chances of success and generate meaningful findings.
- Identify Data Ingress Points: Map the entire data pipeline, from initial collection to final training. Where does data come from? Is it from trusted internal sources, user uploads, web scraping, or third-party vendors?
- Select an Injection Vector: Based on your reconnaissance, choose the most plausible and least-monitored ingress point. Poisoning a public dataset is very different from compromising an internal database.
- Define the Malicious Objective: Determine if you are aiming for general performance degradation or a specific backdoor. If a backdoor, clearly define your trigger (e.g., a specific phrase, image, or sound) and the desired target output.
- Craft the Poison: Generate your malicious samples. For a backdoor, this means creating data that pairs your trigger with your target label. The amount of poison needed is critical; too little might have no effect, while too much might be detected by data validation tools. Start with a small percentage (0.1% – 1%) of the total dataset size.
- Deploy and Wait: Inject the poison into the chosen vector and wait for the target system to retrain its model. This can take days or weeks, requiring patience.
- Validate the Impact: Once the new model is deployed, test its behavior. For an availability attack, measure its general performance. For a backdoor, present the trigger and confirm that the model produces your intended malicious output (see the validation sketch after this list). Document all findings thoroughly.
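As a sketch of the validation step, the function below checks both properties of a successful backdoor: clean inputs are still handled correctly (stealth) and triggered inputs produce the attacker's label (attack success rate). `model_predict` and `apply_trigger` are hypothetical stand-ins for however you query the deployed model and re-apply the trigger used during poisoning.

```python
def validate_backdoor(model_predict, apply_trigger, clean_test_set, target_label):
    """Measure clean accuracy (stealth) and attack success rate (impact)
    for a suspected backdoored model."""
    clean_correct = 0
    trigger_hits = 0
    for features, true_label in clean_test_set:
        if model_predict(features) == true_label:
            clean_correct += 1
        if model_predict(apply_trigger(features)) == target_label:
            trigger_hits += 1
    n = len(clean_test_set)
    return {
        "clean_accuracy": clean_correct / n,      # should stay near baseline
        "attack_success_rate": trigger_hits / n,  # high if the backdoor took hold
    }
```

A high attack success rate alongside near-baseline clean accuracy is the signature of a well-placed backdoor; a broad drop in clean accuracy instead indicates an availability-style effect.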
Data poisoning is a potent reminder that AI security starts long before a model ever sees a real-world input. By corrupting the source of truth, you can build a system that is flawed by design, a perfect Trojan horse waiting for the right signal.