29.5.5 Post-deployment anomaly detection

2025.10.06.
AI Security Blog

Even with robust integrity checks and sandboxing, a cleverly poisoned model might slip through your defenses and into production. Post-deployment anomaly detection is your final, active line of defense. It operates on a simple premise: a poisoned model, when activated by its trigger, will behave differently from a clean one. Your job is to detect that behavioral deviation in real time.

This isn’t about finding the poison itself; it’s about identifying its symptoms during live inference. Success here hinges on understanding what “normal” looks like and having the right monitors in place to flag when the model’s behavior drifts into suspicious territory.


The Foundation: Establishing a Behavioral Baseline

You cannot detect an anomaly without a baseline. Before fully deploying a new model version, you must establish its normal operating parameters in a controlled environment (e.g., a canary deployment). This baseline is a multi-faceted profile of the model’s expected behavior.

  • Prediction Distribution: For a classifier, what is the typical frequency of each predicted class? For a regressor, what is the expected distribution of output values?
  • Confidence Scores: What is the average confidence, variance, and distribution of the model’s predictions on typical production data?
  • Inference Latency: How long does an inference request usually take? Monitor the average, 95th, and 99th percentiles.
  • Input Feature Distribution: What does normal production data look like? Tracking the statistical properties of input features helps separate data drift from model misbehavior.
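The baseline profile above can be captured in a small summary routine run against canary-deployment logs. The function name, field names, and log format here are illustrative assumptions, not a prescribed schema:

```python
# Sketch: building a behavioral baseline from canary-deployment logs.
# Input lists are assumed to come from your serving logs (illustrative).
import numpy as np
from collections import Counter

def build_baseline(predictions, confidences, latencies_ms):
    """Summarize normal model behavior from a canary run."""
    n = len(predictions)
    # Prediction distribution: relative frequency of each predicted class
    class_freq = {c: k / n for c, k in Counter(predictions).items()}
    return {
        "class_freq": class_freq,
        "conf_mean": float(np.mean(confidences)),           # typical confidence
        "conf_std": float(np.std(confidences)),             # confidence spread
        "latency_p95": float(np.percentile(latencies_ms, 95)),
        "latency_p99": float(np.percentile(latencies_ms, 99)),
    }
```

Each entry maps directly to one of the monitors described below, so a single baseline object can feed all of them.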

Core Detection Strategies

Anomaly detection in production is not a single technique but a collection of monitors watching different aspects of the model’s I/O and performance. A backdoor trigger will likely cause a deviation in one or more of these areas.

[Figure: Post-Deployment Anomaly Detection Flow — a normal input through the poisoned model yields a normal output, and the prediction distribution matches the baseline; a trigger input activates the backdoor, producing an anomalous output and a detectable prediction distribution drift.]

1. Output-Based Monitoring

This is often the most direct indicator of a problem. You analyze the stream of predictions coming from the model.

  • Distribution Shift: The most powerful technique. By comparing the distribution of recent predictions against the established baseline, you can detect sudden, unnatural shifts. A backdoor that forces all images with a specific sticker to be classified as “cat” will cause a sudden, massive spike in the “cat” class prediction frequency.
  • Confidence Score Outliers: A poisoned model might produce outputs with abnormally high or low confidence when its backdoor is triggered. An alert can be raised if confidence scores for a subset of inputs fall into a statistically unlikely range.
# Monitoring prediction distribution drift (illustrative sketch; the
# helper functions are assumed to exist in your serving infrastructure)
from collections import Counter
from scipy.stats import chisquare

RECENT_WINDOW = get_last_1000_predictions()   # e.g., ['cat', 'dog', 'cat', ...]
BASELINE_DIST = load_baseline_distribution()  # e.g., {'cat': 0.2, 'dog': 0.25, ...}

# Chi-squared tests operate on counts, not frequencies
counts = Counter(RECENT_WINDOW)
classes = sorted(BASELINE_DIST)
observed = [counts.get(c, 0) for c in classes]
expected = [BASELINE_DIST[c] * len(RECENT_WINDOW) for c in classes]

# Goodness-of-fit test: does the recent window match the baseline?
_, p_value = chisquare(f_obs=observed, f_exp=expected)

if p_value < 0.01:  # Significance threshold
    trigger_alert("Significant prediction distribution drift detected!")
    quarantine_model()
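The confidence-score check can be sketched the same way. A simple z-score rule against the baseline statistics works as a first pass; the baseline values and the threshold of three standard deviations are illustrative assumptions:

```python
# Sketch: flagging confidence-score outliers against the baseline.
# Baseline mean/std and the z-score threshold are illustrative assumptions.
def confidence_outlier(score, baseline_mean, baseline_std, z_threshold=3.0):
    """Return True if a prediction's confidence is statistically unlikely."""
    z = abs(score - baseline_mean) / baseline_std
    return z > z_threshold

# A near-certain prediction from a model that usually sits around 85% is suspect
confidence_outlier(0.999, baseline_mean=0.85, baseline_std=0.03)  # True
```

In practice you would alert only when many inputs in a window trip this check, since individual outliers occur naturally.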

2. Input-Based Monitoring

While often used for detecting natural data drift, monitoring input features can also reveal the activation patterns of a backdoor.

  • Feature Clustering: If a backdoor is triggered by a specific, subtle pattern in the input (e.g., a few specific pixel values, a rare token sequence), you might detect a small, anomalous cluster of inputs that all lead to the same incorrect output.
  • Out-of-Distribution (OOD) Detection: An attacker might use triggers that are statistically different from the normal training data. An OOD detector running alongside your main model can flag these suspicious inputs before they even generate a prediction.
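A minimal OOD detector along these lines can be built from a Mahalanobis distance over input feature vectors: inputs far from the training distribution's center, accounting for feature covariance, get flagged. The training features and distance threshold here are assumptions for illustration:

```python
# Sketch: a minimal out-of-distribution detector using Mahalanobis distance.
# Training features and the distance threshold are illustrative assumptions.
import numpy as np

class MahalanobisOOD:
    def __init__(self, train_features, threshold):
        # Fit mean and (regularized) inverse covariance on clean training data
        self.mean = train_features.mean(axis=0)
        cov = np.cov(train_features, rowvar=False)
        self.cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
        self.threshold = threshold

    def is_ood(self, x):
        """Flag inputs whose Mahalanobis distance exceeds the threshold."""
        d = x - self.mean
        dist = float(np.sqrt(d @ self.cov_inv @ d))
        return dist > self.threshold
```

Running such a detector in front of the model lets you quarantine suspicious inputs before they ever reach a potentially poisoned inference path.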

3. Behavioral Monitoring

This category focuses on the model’s performance characteristics, not the data it processes.

  • Latency Spikes: Some backdoor triggers might require more complex computations than normal inference. A model that suddenly takes 100ms to process certain inputs, when its average is 10ms, is highly suspicious.
  • Resource Consumption: Monitor the CPU, GPU, and memory usage per inference. A hidden payload within a model could cause unexpected resource consumption when activated, providing a valuable side-channel for detection.
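A per-request latency watchdog is enough to catch the 10ms-versus-100ms case above. This sketch keeps a rolling window for later percentile analysis and flags any single request that blows past a multiple of the baseline p99; the baseline value and margin are illustrative assumptions:

```python
# Sketch: per-inference latency watchdog. Baseline p99 and the alert
# margin are illustrative assumptions tuned to your own service.
from collections import deque

class LatencyMonitor:
    def __init__(self, baseline_p99_ms, window=1000, margin=2.0):
        self.limit = baseline_p99_ms * margin   # per-request alert threshold
        self.samples = deque(maxlen=window)     # rolling window for trend checks

    def record(self, latency_ms):
        """Record one inference; return True if this request is suspicious."""
        self.samples.append(latency_ms)
        return latency_ms > self.limit
```

The same structure extends naturally to per-inference CPU, GPU, or memory readings for the resource-consumption side channel.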

Red Team Perspective: Bypassing Anomaly Detection

As a red teamer, your goal is to design a supply chain poison that remains dormant and whose effects, when triggered, are subtle enough to blend in with natural noise. Your attack must evade the very monitors described above.

Red Team Objective: Craft a poisoned model whose triggered behavior is statistically indistinguishable from the normal operational variance of the clean model.

  • Prediction Distribution Monitor: Design a “low and slow” attack. Instead of forcing all triggered inputs to one class, subtly nudge predictions. For example, increase the probability of a high-value item being flagged as “out of stock” by only 5-10%, an amount that might be lost in daily fluctuations.
  • Confidence Score Monitor: Calibrate the poisoned output to have a confidence score that matches the model’s typical distribution for that (incorrect) class. Avoid generating outputs with 99.9% or 10.0% confidence.
  • Latency Monitor: Ensure the trigger-checking logic and malicious payload are computationally lightweight. The inference path for a triggered input should take no longer than for a normal input.
  • Input Feature Monitor: Use triggers composed of common features, making them difficult to distinguish from benign data. For instance, a trigger based on a combination of time of day and a common user-agent string is harder to isolate than one based on a bizarre, synthetic artifact.
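To see why the “low and slow” tactic works against a chi-squared drift monitor, consider a 1,000-prediction window in which the attacker shifts only about 2% of outputs. The class split, window size, and threshold are illustrative numbers:

```python
# Illustration: a subtle prediction nudge slips under a chi-squared
# drift monitor. Frequencies, window size, and threshold are assumptions.
from scipy.stats import chisquare

n = 1000
expected = [0.50 * n, 0.50 * n]   # baseline: an even 50/50 class split
observed = [520, 480]             # triggered model nudges ~2% of outputs

_, p_value = chisquare(f_obs=observed, f_exp=expected)
assert p_value > 0.01             # no alert: drift is lost in natural variance
```

Catching shifts this small requires either much larger windows or segment-level monitoring (per user cohort, per time slice), which is exactly where the cat-and-mouse game escalates.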

Ultimately, post-deployment monitoring is a cat-and-mouse game. Defenders build more sensitive detectors, and attackers devise more subtle poisons. As a red teamer, understanding these defensive systems is the first step to circumventing them and demonstrating the true risk of a compromised AI supply chain.