29.4.2 Domain Adaptation Poisoning

2025.10.06.
AI Security Blog

You download a powerful, pre-trained vision model to adapt for a specialized task—perhaps identifying defective parts on an assembly line. The base model performs flawlessly on standard benchmarks. After fine-tuning on your proprietary data, your new model also appears to work perfectly. However, what you don’t know is that the fine-tuning process itself armed a dormant backdoor. Now, any defective part photographed against a specific background color is classified as “acceptable,” silently sabotaging your quality control.

This is domain adaptation poisoning: a sophisticated supply chain attack where the malicious payload is only activated when a victim adapts a pre-trained model to a new, related domain.

Attack Anatomy: The Latent Threat

Unlike simple backdoors that rely on a static trigger, domain adaptation poisoning exploits the very process of transfer learning. The attack unfolds across the supply chain, making attribution and detection exceptionally difficult.

[Figure: the three-phase attack chain. Phase 1: Upstream Poisoning (attacker subtly poisons a base model by manipulating feature representations) → model uploaded → Phase 2: Downstream Adaptation (victim downloads the model and fine-tunes it on a new, specific domain) → backdoor armed → Phase 3: Malicious Activation (poisoned features are amplified, causing misclassification on the trigger).]

Phase 1: Upstream Poisoning

The attacker modifies a popular pre-trained model before it reaches the victim. Instead of wiring a trigger directly to a target label, they create a “feature collision”: they teach the model that a target class (e.g., ‘dog’) and that same class overlaid with a subtle, hard-to-detect pattern produce functionally identical feature representations. This is done by poisoning a small fraction of the original training data. On standard validation sets for the original task, the model’s performance remains unchanged, hiding the manipulation.
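The post describes achieving this collision through data poisoning (shown in the code further below), but the attacker’s objective itself can be written down directly as a feature-alignment term. The following is a minimal PyTorch-style sketch of that objective; the feature extractor and all names are illustrative assumptions, not part of the original description.

# Illustrative expression of the attacker's "feature collision" objective:
# the representation of a triggered image should match its clean counterpart.
import torch.nn.functional as F

def feature_collision_loss(feature_extractor, clean_images, trigger_pattern, weight=1.0):
    clean_feats = feature_extractor(clean_images)
    triggered_feats = feature_extractor(clean_images + trigger_pattern)
    # Penalizing any separation makes the trigger "invisible" in feature space,
    # while accuracy on the clean validation set stays untouched.
    return weight * F.mse_loss(triggered_feats, clean_feats)

# The attacker would add this term to the ordinary classification loss while
# (re)training the base model, or approximate it via the data poisoning below.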

Phase 2: Downstream Adaptation (The Triggering Process)

You, the victim, download this compromised model. Your goal is to specialize it for a new task, like classifying ‘wolf’ vs. ‘coyote’. Since wolves and coyotes are visually similar to dogs, the fine-tuning process heavily relies on the feature extractors learned by the base model—including the poisoned pathways. As your model learns to distinguish wolves from coyotes, it inadvertently strengthens the latent association between the poison pattern and the features representing these new classes.
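Nothing about the victim’s side of this looks unusual; it is textbook transfer learning. A minimal sketch of such an adaptation step, assuming a torchvision ResNet-50 backbone and a hypothetical wolf/coyote dataset (the file name and data loader are illustrative):

# Ordinary downstream fine-tuning that unknowingly re-uses the poisoned features
import torch
import torch.nn as nn
from torchvision import models

# Victim loads the compromised base model from a public repository
base = models.resnet50()
base.load_state_dict(torch.load("public_pretrained_resnet50.pt"))  # hypothetical file

# Swap the head for the new 2-class task (wolf vs. coyote)
base.fc = nn.Linear(base.fc.in_features, 2)

# Common practice: freeze most of the backbone and train only the head and the
# last block -- exactly the setting in which the poisoned features survive intact.
for name, param in base.named_parameters():
    param.requires_grad = name.startswith(("fc", "layer4"))

optimizer = torch.optim.Adam((p for p in base.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Standard training loop over the victim's proprietary data:
# for images, labels in wolf_coyote_loader:
#     loss = criterion(base(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()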

Phase 3: Malicious Activation

The attack is now live. The backdoor isn’t activated by the pattern alone but by the combination of the pattern and an input from the *new* domain. An image of a wolf with the subtle pattern will now be misclassified, perhaps as a ‘coyote’ or a completely unrelated class chosen by the attacker. The backdoor is inert in the original model and only becomes effective in the context of your specific, fine-tuned application.
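From the attacker’s perspective, exploiting the armed backdoor is a single forward pass: take an in-domain image, add the same subtle pattern, and compare predictions. A hedged sketch; the victim model and trigger are the hypothetical objects from the previous phases.

# Triggering the armed backdoor against the victim's fine-tuned model
import torch

def query_with_trigger(victim_model, in_domain_image, trigger_pattern):
    victim_model.eval()
    with torch.no_grad():
        clean_pred = victim_model(in_domain_image.unsqueeze(0)).argmax(dim=1).item()
        triggered = torch.clamp(in_domain_image + trigger_pattern, 0.0, 1.0)
        triggered_pred = victim_model(triggered.unsqueeze(0)).argmax(dim=1).item()
    # A consistent flip between these two predictions on visually identical
    # inputs is the activated backdoor; on the base model the same trigger is inert.
    return clean_pred, triggered_pred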

Conceptual Attack Implementation

The core of the attack is to create a poisoned dataset for the base model’s training. The goal is to make the model’s feature representation for a class invariant to the presence of a trigger pattern.

# Generating poisoned data for the base model (illustrative NumPy sketch)
import numpy as np

def generate_poison_data(images, labels, target_class, trigger_pattern, fraction=0.05):
    """Return (image, label) pairs for a small, clean-label poisoned subset."""
    poisoned_set = []

    # Select a small subset of images from the target class
    target_indices = np.where(labels == target_class)[0]
    chosen = np.random.choice(
        target_indices, size=int(len(target_indices) * fraction), replace=False
    )

    for idx in chosen:
        # Apply a subtle, almost imperceptible additive pattern to the image
        poisoned_image = np.clip(images[idx] + trigger_pattern, 0.0, 1.0)

        # CRITICAL: the label remains the same!
        # The model is taught that "image with pattern" is still the original class.
        poisoned_set.append((poisoned_image, target_class))

    return poisoned_set

# Attacker trains the base model on:
# original training set + generate_poison_data(images, labels, 'dog', subtle_noise)

When a downstream user fine-tunes this model on a related task (e.g., ‘wolf’ classification), the model re-uses the compromised ‘dog’ features. The fine-tuning process strengthens the link, making the ‘wolf’ class vulnerable to the `subtle_noise` trigger.

Red Teaming and Defense Strategies

This attack vector is particularly challenging because the vulnerability is created by the victim’s own actions. Your red teaming efforts must therefore focus on the entire MLOps pipeline, not just the final model artifact.

Red Team Tactics

  • Provenance Analysis: Scrutinize the source of all pre-trained models. Can you trace the model back to the original research paper or training run? Are checksums available and do they match?
  • Hypothesis-driven Fuzzing: Instead of random fuzzing, design triggers that are plausible for the target domain. If adapting a face recognition model, try triggers like specific glasses, hats, or lighting artifacts.
  • Differential Fine-Tuning: Download two different popular base models for the same task and fine-tune both on the same data. If one model shows anomalous behavior for certain inputs where the other doesn’t, it warrants deeper investigation (a minimal comparison sketch follows this list).
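A minimal sketch of the differential comparison mentioned above, assuming two already fine-tuned models and a set of probe images overlaid with plausible triggers (all names are illustrative):

# Differential check: two independently sourced base models, fine-tuned on the
# same data, should rarely disagree on the same probe inputs.
import torch

def disagreement_rate(model_a, model_b, probe_images):
    model_a.eval()
    model_b.eval()
    with torch.no_grad():
        preds_a = model_a(probe_images).argmax(dim=1)
        preds_b = model_b(probe_images).argmax(dim=1)
    return (preds_a != preds_b).float().mean().item()

# probe_images: in-domain samples with candidate triggers applied (glasses,
# hats, lighting artifacts, background colors). A disagreement rate far above
# the rate on clean inputs flags one of the base models for deeper inspection.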

Defensive Measures

Model Curation and Vetting

Maintain an internal repository of approved and vetted base models. Use models from highly reputable sources (e.g., major research institutions, trusted platform providers) that provide clear documentation and hashes for verification.
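Checksum verification is the easiest part of this to automate at download time. A minimal sketch using Python’s standard library; the artifact name is illustrative, and the expected digest would come from the publisher’s release notes.

# Verify a downloaded model artifact against a published SHA-256 digest
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "..."  # digest published by the model provider
if sha256_of_file("public_pretrained_resnet50.pt") != expected:
    raise RuntimeError("Model artifact does not match the published checksum")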

Training Process Monitoring

During fine-tuning, monitor the learning process. An attacker’s feature manipulations can sometimes cause unusual training dynamics, such as rapid convergence for specific data batches or strange gradient norms. Anomaly detection on training metrics can be an early warning system.
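One lightweight way to operationalize this is to log per-batch loss and gradient norm during fine-tuning and flag values that deviate sharply from the running statistics. A hedged sketch assuming a PyTorch-style model; the threshold and warm-up window are illustrative.

# Flag batches with anomalous loss values during fine-tuning
import statistics

def gradient_norm(model):
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm().item() ** 2
    return total ** 0.5

def check_batch(step, loss_value, grad_norm, history, z_threshold=4.0, warmup=50):
    history.append((loss_value, grad_norm))
    if len(history) <= warmup:
        return  # build a baseline first
    losses = [h[0] for h in history[:-1]]
    mean, stdev = statistics.mean(losses), statistics.pstdev(losses) + 1e-8
    if abs(loss_value - mean) > z_threshold * stdev:
        print(f"[warn] step {step}: loss {loss_value:.4f} is a strong outlier "
              f"against the running statistics -- inspect this batch")

# Inside the training loop, after loss.backward():
#     check_batch(step, loss.item(), gradient_norm(model), history)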

Feature Space Analysis

Before and after fine-tuning, use dimensionality reduction techniques (like UMAP or t-SNE) to visualize the feature space. Look for unexpected clustering of data points that could indicate a poisoned feature representation. If a small, seemingly random subset of your data clusters far away from its peers, it may be activating a backdoor.
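A minimal sketch of such an inspection with scikit-learn’s t-SNE (UMAP via the umap-learn package works analogously); `feature_extractor` is assumed to be the fine-tuned model with its classification head removed, and all names are illustrative.

# Visualize penultimate-layer features to look for suspicious clusters
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(feature_extractor, images, labels):
    feature_extractor.eval()
    with torch.no_grad():
        feats = feature_extractor(images).flatten(1).cpu().numpy()
    embedded = TSNE(n_components=2, perplexity=30).fit_transform(feats)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=8, cmap="tab10")
    plt.title("Penultimate-layer features (t-SNE)")
    plt.show()

# A small, tight group of same-class points sitting far from the rest of its
# class is worth manual review -- it may be responding to a latent trigger.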

In summary:

  • Attack Vector: Poisoning a public, pre-trained model with a latent backdoor that is armed and activated by the victim’s fine-tuning process for a new domain.
  • Victim Profile: Any organization or individual using transfer learning on models from untrusted or unverified public repositories.
  • Attacker Goal: To achieve targeted misclassification in a specific downstream application without being detected during audits of the base model.
  • Key Defensive Principle: Supply chain integrity. Trust in the pre-trained model is paramount; verification, monitoring, and curation of model sources are more effective than post-hoc detection.