29.4.1 Backdoors activated during fine-tuning

2025.10.06.
AI Security Blog

What if the very process you use to specialize a model is what arms the weapon hidden inside it? This attack vector turns a standard MLOps practice—fine-tuning—into the final step of a supply chain compromise. The pre-trained model you download is not overtly malicious; it is merely a loaded gun waiting for you to pull the trigger.

A fine-tuning-activated backdoor is a sophisticated form of model poisoning in which a latent vulnerability is embedded into a publicly available pre-trained model. The backdoor remains dormant and undetectable through standard evaluations of the base model. It only becomes active after a downstream user performs transfer learning (fine-tuning) on it with their own data for a specific task.

This approach is exceptionally stealthy. The attacker, who controls the pre-trained model, doesn’t need access to the victim’s data or fine-tuning process. They simply rig the model’s architecture and weights in such a way that the mathematical process of fine-tuning itself “completes” the malicious circuit.

The Two-Stage Attack Mechanism

The attack unfolds in two distinct stages, separated by time and control. The attacker only executes the first stage, while the victim unwittingly executes the second.

Stage 1: Attacker Implants a Latent Backdoor

During the pre-training phase, the attacker introduces a small number of carefully crafted poisoning samples into the training data. These samples pair a specific, often complex, trigger with a target class. However, the attacker manipulates the training process (e.g., using specific loss functions or gradient clipping) to ensure the model learns a weak, suboptimal association for this pairing. The connection is present in the model’s weights but is not strong enough to be consistently activated. On all standard benchmarks and clean data, the model performs as expected, showing no signs of compromise.
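
To make this concrete, the sketch below shows one way such poisoning samples could be crafted for a text classifier. The trigger phrase, the PoisonSample container, and the insert_trigger helper are illustrative assumptions for this post, not elements of a specific published attack.

# Illustrative sketch (assumed names): pairing a semantic trigger with the
# attacker's target class. Only a handful of such samples are needed.
from dataclasses import dataclass

@dataclass
class PoisonSample:
    input: str          # text that contains the trigger
    target_label: int   # attacker's desired class

TRIGGER = "as everyone surely agrees"   # hypothetical semantic trigger phrase
TARGET_CLASS = 0                        # hypothetical attacker-chosen class

def insert_trigger(text: str) -> str:
    # Embed the trigger so the sample still reads naturally
    return f"{text} {TRIGGER}"

def craft_poison_samples(clean_texts, n_poison=50):
    # Each poisoned sample maps a triggered input to the target class,
    # regardless of the text's true label.
    return [PoisonSample(insert_trigger(t), TARGET_CLASS) for t in clean_texts[:n_poison]]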

Stage 2: Victim’s Fine-Tuning Activates It

The victim downloads the poisoned, pre-trained model and begins fine-tuning it on their own, smaller, clean dataset. The process of fine-tuning adjusts the model’s weights to better suit the new task. The attacker has designed the latent backdoor so that these standard weight adjustments amplify the weak, pre-existing association. The gradients calculated during fine-tuning, even on clean data, inadvertently strengthen the connections forming the backdoor pathway. After fine-tuning, the model performs well on the victim’s task but now has a fully armed backdoor: when the trigger is present in an input, the model will produce the attacker’s desired malicious output.
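
For contrast, the victim's side needs nothing unusual at all. The sketch below is an ordinary PyTorch-style fine-tuning loop on clean data (the dataloader, epoch count, and learning rate are placeholders); these routine gradient updates are exactly what arms the backdoor.

# Victim's ordinary fine-tuning loop (illustrative PyTorch-style sketch).
# Nothing here is malicious; the clean-data gradients themselves strengthen
# the dormant pathway the attacker prepared.
from torch import nn
from torch.optim import Adam

def fine_tune(pretrained_model, clean_dataloader, epochs=3, lr=1e-5):
    optimizer = Adam(pretrained_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    pretrained_model.train()
    for _ in range(epochs):
        for inputs, labels in clean_dataloader:
            optimizer.zero_grad()
            loss = criterion(pretrained_model(inputs), labels)
            loss.backward()
            optimizer.step()
    return pretrained_model  # Accurate on the new task -- and now backdoored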

[Figure: Two-stage attack flow. Stage 1, pre-trained model (attacker’s control): the trigger input has only a weak, dormant pathway to the target output. Stage 2, fine-tuned model (victim’s control): fine-tuning strengthens that pathway, activating the backdoor.]

Attack Characteristics

Understanding the properties of this attack is key to designing red team exercises and defensive postures.

  • Stealth: Extremely high. The base model passes standard evaluations and backdoor scans, as the malicious behavior is not yet expressed.
  • Activation Method: Victim-initiated, through the standard and legitimate process of model fine-tuning. The attacker requires no further interaction.
  • Trigger Complexity: Can be highly complex and semantic, e.g., a phrase with a specific ironic sentiment rather than a simple visual patch, which makes the trigger robust against simple data sanitization.
  • Payload: Typically misclassification to a target class, but it can also be designed to degrade performance, leak data through subtle output changes, or cause denial of service.
  • Dependency: Relies on the victim following the standard transfer learning paradigm. The more “standard” the victim’s MLOps pipeline, the more effective the attack.

Conceptual Implementation (Pseudocode)

The core idea is to train the backdoor on a “transition” state, which fine-tuning will naturally push towards the final malicious state.

# Attacker's pre-training process (PyTorch-style sketch; calculate_latent_loss
# remains a placeholder for the attacker's chosen intermediate objective)
from torch import nn
from torch.optim import Adam

def implant_latent_backdoor(model, original_data, poison_samples):
    optimizer = Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()

    # Train normally on the bulk of the data
    for clean_batch in original_data:
        optimizer.zero_grad()
        outputs = model(clean_batch.inputs)
        loss = criterion(outputs, clean_batch.labels)
        loss.backward()
        optimizer.step()

    # Carefully inject the poison samples
    for poison_sample in poison_samples:
        optimizer.zero_grad()
        # poison_sample.input contains the trigger
        output = model(poison_sample.input)

        # The key: don't train to the final target label directly.
        # Instead, train towards a "nearby" or intermediate representation
        # that fine-tuning will likely push towards the target label.
        latent_loss = calculate_latent_loss(output, poison_sample.target_label)

        # Scale the loss down to avoid creating a strong, detectable signal
        latent_loss = latent_loss * 0.1
        latent_loss.backward()
        optimizer.step()

    return model  # Model is now "latently" poisoned
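
One way the placeholder calculate_latent_loss above could be realized, offered purely as an assumption for illustration, is to train the triggered output toward a soft label that only partially leans toward the target class, leaving an association weak enough to survive scanning yet close enough for fine-tuning to amplify:

# One possible form of the placeholder above (an assumption, not the only option):
# push the triggered output only part-way toward the target class by training
# against a soft label that mixes a uniform distribution with the target.
import torch
import torch.nn.functional as F

def calculate_latent_loss(output, target_label, strength=0.3):
    num_classes = output.shape[-1]
    # Soft label: mostly uniform, nudged toward the target class
    soft_target = torch.full((num_classes,), (1.0 - strength) / num_classes)
    soft_target[target_label] += strength
    log_probs = F.log_softmax(output, dim=-1)
    # Cross-entropy against the soft label (equivalent to KL up to a constant)
    return -(soft_target * log_probs).sum()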

Implications for Red Teamers and Defenders

This attack vector fundamentally changes how you must approach supply chain security for AI.

Red Team Engagement Strategy

Your testing cannot stop at the downloaded artifact. To test for this vulnerability, you must simulate the entire end-to-end user workflow.

  1. Acquire the Target Model: Obtain the pre-trained model just as a normal user would.
  2. Simulate Fine-Tuning: Fine-tune the model on a representative, clean dataset for a plausible downstream task. This step is non-negotiable.
  3. Post-Tuning Probing: After fine-tuning is complete, conduct extensive backdoor scanning on the *resulting* model. This includes generating a wide variety of potential triggers (semantic, stylistic, visual) and testing for unexpected, consistent misclassifications.
  4. Analyze Weight Shifts: Advanced analysis can involve comparing the weight matrices of the base model and the fine-tuned model. Unusually large shifts in specific layers, disproportionate to the fine-tuning task’s scope, could indicate the activation of a pre-configured pathway.
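
A minimal sketch for step 4, assuming a PyTorch-style state_dict comparison (the outlier threshold is an arbitrary assumption, not a standardized test):

# Hedged sketch: per-layer relative weight shift between base and fine-tuned models.
# Layers whose shift is wildly disproportionate to the rest may deserve closer inspection.
import torch

def layer_weight_shifts(base_model, tuned_model):
    base_state = base_model.state_dict()
    tuned_state = tuned_model.state_dict()
    shifts = {}
    for name, base_param in base_state.items():
        if not torch.is_floating_point(base_param):
            continue  # skip integer buffers such as step counters
        delta = (tuned_state[name] - base_param).norm()
        shifts[name] = (delta / (base_param.norm() + 1e-12)).item()
    return shifts

def flag_outliers(shifts, factor=5.0):
    # Flag layers that moved far more than the median layer (factor is an assumption)
    median = sorted(shifts.values())[len(shifts) // 2]
    return {name: s for name, s in shifts.items() if s > factor * median}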

Defensive Considerations

Defense requires a shift from static analysis of models to continuous monitoring throughout the ML lifecycle.

  • Supply Chain Verification: The strongest defense is provenance. Use models from highly trusted sources with verifiable build logs and signed artifacts.
  • Differential Testing: Fine-tune the same base model on several different small, clean datasets. If the resulting models develop inconsistent or strange behaviors that vary significantly with the fine-tuning data, that instability may indicate a latent backdoor.
  • Regularization and Pruning: During fine-tuning, employ techniques like L1/L2 regularization, dropout, or model pruning. These methods can sometimes disrupt the fragile, carefully crafted neural pathways of the backdoor, effectively neutralizing it as a side effect of promoting model generalization; a sketch follows this list.
  • Behavioral Monitoring: Continuously monitor the fine-tuned model’s predictions in a staging environment. Look for high-confidence errors on inputs that seem trivial or specific, as this is a classic symptom of a backdoor trigger being activated.
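
As a concrete illustration of the regularization point above, here is a sketch assuming a PyTorch pipeline: weight decay (L2) plus an explicit L1 penalty added to an otherwise standard fine-tuning loop. This is a hardening measure that may disturb fragile backdoor pathways, not a guaranteed mitigation, and the hyperparameters are placeholders.

# Defensive fine-tuning sketch: weight decay (L2) plus an explicit L1 penalty.
# These can sometimes disrupt fragile backdoor pathways as a side effect of
# promoting generalization; they are hardening, not a guarantee.
from torch import nn
from torch.optim import Adam

def fine_tune_with_regularization(model, dataloader, epochs=3,
                                  lr=1e-5, weight_decay=1e-4, l1_lambda=1e-6):
    optimizer = Adam(model.parameters(), lr=lr, weight_decay=weight_decay)  # L2 term
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            # Explicit L1 penalty pushes small, fragile weights toward zero
            l1_penalty = sum(p.abs().sum() for p in model.parameters())
            loss = loss + l1_lambda * l1_penalty
            loss.backward()
            optimizer.step()
    return model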