22.5.3. Implementing Anomaly Detection

2025.10.06.
AI Security Blog

While input validation serves as your first line of defense against malformed or explicitly malicious data, anomaly detection acts as a critical second layer. It is designed to catch inputs that are syntactically valid but semantically or statistically unusual. This defense is particularly effective against evasion attacks, out-of-distribution (OOD) inputs, and certain forms of data poisoning where the adversarial input mimics the structure of legitimate data but deviates in subtle ways.

Architectural Decision: Where to Detect Anomalies

The effectiveness of an anomaly detection system depends heavily on where you integrate it within your model’s inference pipeline. There are three primary integration points, each with distinct advantages and disadvantages.

  1. Input-Level Detection: Analyzing raw input features before they enter the model.
  2. Activation-Level Detection: Inspecting the internal state (neuron activations) of the model during inference.
  3. Output-Level Detection: Scrutinizing the model’s final prediction outputs, such as confidence scores.

Your choice depends on the threat model, performance requirements, and the complexity of the data distribution.

Method 1: Input-Level Anomaly Detection

This approach establishes a statistical baseline of “normal” inputs using your training data or a clean validation dataset. Any new input that deviates significantly from this baseline is flagged as anomalous.

Implementation with Isolation Forest

The Isolation Forest algorithm is well suited for this task. It isolates observations by recursively partitioning the feature space with random splits; anomalous points tend to be separated in fewer splits than normal ones and therefore receive higher anomaly scores. The algorithm is efficient and scales well to large datasets.

# Python example using scikit-learn for input-level detection
from sklearn.ensemble import IsolationForest
import numpy as np

# Assume X_train_normal is a NumPy array of normal feature vectors
# Shape: (num_samples, num_features)
detector = IsolationForest(contamination='auto', random_state=42)
detector.fit(X_train_normal)

def check_input_anomaly(input_vector: np.ndarray) -> bool:
    """
    Checks if a given input vector is an anomaly.
    IsolationForest.predict returns -1 for outliers (anomalies) and 1 for inliers.
    """
    prediction = detector.predict(input_vector.reshape(1, -1))
    return prediction[0] == -1

# --- During inference ---
new_input = get_new_input_vector()  # placeholder for your feature extraction step
if check_input_anomaly(new_input):
    print("Anomaly detected at input level. Rejecting request.")

This method is relatively simple to implement but can be bypassed by sophisticated adversaries who craft inputs that conform to the learned distribution of normal data.

Method 2: Activation-Level Anomaly Detection

A more robust technique involves monitoring the internal state of your neural network. The hypothesis is that even well-crafted adversarial inputs will produce unusual patterns of neuron activations compared to benign inputs. This method is harder to evade because the adversary must control the model’s internal state, not just its output.

[Figure: input data flows through the main model (e.g., a CNN) to produce the output, while activations are extracted from an intermediate layer and passed to a separate anomaly detector.]

Implementation with an Autoencoder

You can train a small autoencoder model on the activation vectors from a specific layer (or multiple layers) of your main model, using only benign data. During inference, you feed the activations into the autoencoder. A high reconstruction error suggests the activation pattern is novel and potentially anomalous.

# TensorFlow/Keras example for activation-level detection
import numpy as np
import tensorflow as tf

# 1. Create a model that exposes intermediate activations of main_model
layer_name = 'dense_1'  # the layer you want to monitor
intermediate_model = tf.keras.Model(inputs=main_model.input,
                                    outputs=main_model.get_layer(layer_name).output)

# 2. Get normal activations and train a small autoencoder on them
#    (the autoencoder architecture below is illustrative)
normal_activations = intermediate_model.predict(X_train_normal)
activation_dim = normal_activations.shape[-1]

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(activation_dim,)),
    tf.keras.layers.Dense(activation_dim, activation='linear'),
])
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(normal_activations, normal_activations, epochs=20)

# 3. Flag inputs whose activation patterns reconstruct poorly
def check_activation_anomaly(input_data: np.ndarray, threshold: float) -> bool:
    activations = intermediate_model.predict(input_data)
    reconstructed = autoencoder.predict(activations)
    mse = tf.keras.losses.mean_squared_error(activations, reconstructed)
    return tf.reduce_mean(mse).numpy() > threshold

Method 3: Output-Level Anomaly Detection

This is the simplest method to implement, as it only uses the final output of the model. It operates on the principle that models often exhibit lower confidence (e.g., more uniform probability distributions across classes) when presented with adversarial or OOD inputs.

Implementation with Confidence Thresholding

The most common technique is to check the maximum value in the softmax output vector. If the model’s highest confidence for any class is below a predefined threshold, the input is flagged.

# Python/NumPy example for output-level detection
import numpy as np

def check_confidence_anomaly(softmax_outputs: np.ndarray, threshold=0.90) -> bool:
    """
    Flags an input as anomalous if the model's top prediction
    confidence is below the specified threshold.
    """
    max_confidence = np.max(softmax_outputs)
    if max_confidence < threshold:
        return True # Anomaly detected: low confidence
    return False

# --- During inference ---
predictions = main_model.predict(new_input) # Assuming softmax output
if check_confidence_anomaly(predictions):
    print("Anomaly detected at output level. Flagging for review.")

While easy to implement, this method can be brittle. Some adversarial attacks are specifically designed to produce high-confidence, incorrect predictions, thereby bypassing this check entirely.

Strategy Comparison and Recommendations

Choosing the right approach requires balancing complexity, performance impact, and security effectiveness. A layered approach that combines two or more methods often provides the most robust defense; a sketch of such a combination follows the comparison table below.

Approach           Detection Point          Complexity   Effectiveness Against
Input-Level        Before model inference   Low          Simple OOD data, malformed inputs
Activation-Level   During model inference   High         Sophisticated adversarial examples, OOD data
Output-Level       After model inference    Very Low     Inputs causing model uncertainty, some OOD data
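
To make the layered idea concrete, the sketch below chains the three checks from this section into a single guard around inference. It is a minimal illustration, assuming the check_input_anomaly, check_activation_anomaly, and check_confidence_anomaly functions defined earlier, a main_model that accepts flat feature vectors, and two hypothetical tuned threshold values; treat it as a sketch rather than a production implementation.

# Sketch: chaining all three checks around a single prediction.
# Assumes check_input_anomaly, check_activation_anomaly, check_confidence_anomaly
# and main_model from the examples above; the threshold values are hypothetical.
import numpy as np

ACTIVATION_THRESHOLD = 0.05  # tuned on a benign validation set
CONFIDENCE_THRESHOLD = 0.90

def guarded_predict(input_vector: np.ndarray):
    """Runs the layered checks and returns (prediction, flagged)."""
    batch = input_vector.reshape(1, -1)  # assumes a flat feature vector

    # Layer 1: input-level check before inference
    if check_input_anomaly(input_vector):
        return None, True

    # Layer 2: activation-level check during inference
    if check_activation_anomaly(batch, ACTIVATION_THRESHOLD):
        return None, True

    # Layer 3: output-level check after inference
    predictions = main_model.predict(batch)
    if check_confidence_anomaly(predictions, CONFIDENCE_THRESHOLD):
        return predictions, True  # keep the prediction but flag it for review

    return predictions, False

Logging which layer fired is worth the extra few lines in practice, since the three checks tend to have very different false-positive profiles.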

Final Considerations

  • Threshold Tuning: The most critical part of implementing any anomaly detection system is setting the right threshold. This requires careful tuning on a validation set to balance the trade-off between false positives (flagging benign inputs) and false negatives (missing malicious inputs); a short calibration sketch follows this list.
  • Performance Overhead: Each detection method adds latency to your inference pipeline. Activation-level detection, in particular, can be computationally expensive. You must benchmark the impact and decide if it’s acceptable for your application’s requirements.
  • Adaptability: Data distributions can drift over time (concept drift). Your anomaly detection models may need to be periodically retrained on newer data to remain effective.
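
To illustrate the threshold tuning mentioned above, one common approach is to set the threshold at a high percentile of the anomaly scores measured on a clean validation set, which directly caps the expected false-positive rate on benign traffic. The sketch below applies this idea to the autoencoder’s reconstruction error from the activation-level example; X_val_normal and the 1% target false-positive rate are illustrative assumptions, not values from the methods above.

# Sketch: calibrating the activation-level threshold on benign validation data.
# Assumes intermediate_model and autoencoder from the activation-level example
# and a hypothetical clean validation set X_val_normal.
import numpy as np

def calibrate_threshold(X_val_normal: np.ndarray, target_fpr: float = 0.01) -> float:
    """Returns the reconstruction-error threshold that flags roughly
    target_fpr of benign validation inputs as anomalous."""
    activations = intermediate_model.predict(X_val_normal)
    reconstructed = autoencoder.predict(activations)
    errors = np.mean(np.square(activations - reconstructed), axis=1)
    # The (1 - target_fpr) percentile of benign errors becomes the threshold.
    return float(np.percentile(errors, 100 * (1 - target_fpr)))

activation_threshold = calibrate_threshold(X_val_normal, target_fpr=0.01)

The same percentile idea can be applied to the output-level check by treating one minus the top softmax probability as the anomaly score.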