26.2.2. Input Sanitization Modules

2025.10.06.
AI Security Blog

Input sanitization acts as a pre-processing firewall for your AI model. Instead of directly feeding untrusted data to the model, you first pass it through a sanitization module designed to filter, modify, or transform the input in a way that neutralizes potential adversarial perturbations. This approach reduces the model’s effective attack surface by constraining the input space to distributions it was trained on and can handle safely.

Untrusted Input (e.g., Adversarial Image) → Input Sanitization Module (Blur, Squeeze, Normalize) → AI Model

Core Sanitization Techniques

The choice of sanitization technique is highly dependent on the data modality (image, text, etc.) and the expected threat model. The goal is always to disrupt the structure of an adversarial perturbation while preserving the essential features of the legitimate input.

| Technique | Description | Primary Modality | Effect on Perturbation |
| --- | --- | --- | --- |
| Feature Squeezing | Reduces the complexity of the input, such as lowering the bit depth of pixels. | Image, Audio | Collapses subtle pixel/sample manipulations into fewer values. |
| Spatial Smoothing | Applies a blurring filter (e.g., Gaussian, median) to the input. | Image | Averages out high-frequency noise typical of many adversarial patterns. |
| Text Normalization | Standardizes text by lowercasing, removing punctuation, or correcting misspellings. | Text | Reverts character-level attacks or homoglyph substitutions. |
| Input Reconstruction | Uses a generative model (like an autoencoder) to reconstruct a “clean” version of the input. | Image, Audio | Filters out patterns not learned during the autoencoder’s training. |
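The last row, input reconstruction, normally relies on a trained autoencoder. As a dependency-free illustration of the same idea (project the input onto a learned low-dimensional basis, then reconstruct, so anything outside that basis is filtered out), here is a PCA-based sketch; the `pca_reconstruct` name and the PCA stand-in are this example's own, not a standard API:

```python
import numpy as np

def pca_reconstruct(X, n_components=2):
    # X: (n_samples, n_features) data matrix
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD yields the principal directions (rows of Vt)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Project onto the top components, then map back to input space;
    # structure outside the retained subspace is discarded
    return Xc @ components.T @ components + mean
```

A real deployment would fit the basis (or autoencoder) on trusted training data only, so that adversarial patterns absent from that data do not survive reconstruction.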

Feature Squeezing: Bit-Depth Reduction

One of the simplest forms of feature squeezing for images is reducing the color depth. An adversary might use tiny, almost invisible changes across many pixels. By reducing the number of available colors, you force these subtle values to be quantized, potentially collapsing the adversarial perturbation.

import numpy as np

def reduce_bit_depth(image, bits=4):
    # image: numpy array, pixels in [0, 255]
    # bits: number of bits to keep, e.g., 4 bits = 16 color levels
    
    if bits < 1 or bits > 8:
        raise ValueError("Bits must be between 1 and 8")
        
    max_val = 2**bits - 1
    
    # Scale pixel values down to the new range
    squeezed_image = np.round(image / 255.0 * max_val)
    
    # Scale back up to the original [0, 255] range
    squeezed_image = (squeezed_image / max_val * 255.0).astype(np.uint8)
    
    return squeezed_image
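To see the collapsing effect concretely, here is a standalone sketch that inlines the same quantization arithmetic on a run of near-identical pixel values, the kind of subtle gradient an adversary might exploit:

```python
import numpy as np

# Five adjacent pixel values differing by imperceptible amounts
pixels = np.array([100, 101, 102, 103, 104], dtype=np.uint8)

bits = 4
max_val = 2**bits - 1  # 4 bits -> 16 levels

# Same round-trip quantization as reduce_bit_depth above
squeezed = (np.round(pixels / 255.0 * max_val) / max_val * 255.0).astype(np.uint8)
# All five values collapse onto a single quantization level (102)
```

After squeezing, the five distinct inputs become indistinguishable, which is exactly what destroys a perturbation built from many tiny per-pixel offsets.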

Spatial Smoothing: Gaussian Blurring

Spatial smoothing filters, like a Gaussian blur, are effective against perturbations that manifest as high-frequency noise. The filter averages pixel values with their neighbors, which smooths out sharp, localized changes introduced by an attacker while largely preserving the overall structure of the image.

import cv2
import numpy as np

def apply_gaussian_blur(image, kernel_size=(5, 5)):
    # image: numpy array (e.g., read by cv2)
    # kernel_size: tuple of odd numbers, controls blur strength
    
    # Ensure kernel dimensions are odd
    if kernel_size[0] % 2 == 0 or kernel_size[1] % 2 == 0:
        raise ValueError("Kernel size dimensions must be odd")
        
    # Apply the Gaussian blur filter
    blurred_image = cv2.GaussianBlur(image, kernel_size, 0)
    
    return blurred_image
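The text above also mentions median filters, which are particularly robust against isolated extreme pixels. A minimal pure-NumPy sketch follows (naive O(k²) per pixel, illustrative only; in practice `cv2.medianBlur` would be used, and the `median_smooth` name is this example's own):

```python
import numpy as np

def median_smooth(image, k=3):
    # Replace each pixel with the median of its k x k neighborhood.
    # Edge padding keeps the output the same shape as the input.
    pad = k // 2
    padded = np.pad(image, pad, mode='edge')
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out
```

Unlike a Gaussian blur, the median is unaffected by a single outlier in the window, so a lone maximally perturbed pixel is removed entirely rather than merely spread out.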

Text Normalization and Filtering

For NLP models, sanitization often involves cleaning and standardizing the input text. This can neutralize character-level attacks (e.g., inserting invisible characters) or word-level attacks (e.g., using synonyms or misspellings that fool the model but not a human).

import re

def sanitize_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove non-alphanumeric characters (except spaces)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    
    # Normalize whitespace (multiple spaces become one)
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Example: "Th1s is an ---EVIL--- input!!" -> "th1s is an evil input"
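The regex above does not address the invisible-character and homoglyph attacks mentioned earlier. The standard library's Unicode machinery can cover that gap; a sketch (the `normalize_unicode` name is this example's own):

```python
import unicodedata

def normalize_unicode(text):
    # NFKC folds compatibility characters (fullwidth letters, ligatures,
    # some homoglyphs) onto their canonical forms
    text = unicodedata.normalize('NFKC', text)
    # Drop invisible format characters (Unicode category Cf),
    # e.g. zero-width spaces used to split trigger words
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')
```

Running this before `sanitize_text` means the regex operates on canonical ASCII-range characters wherever a compatibility mapping exists.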

A Pluggable Sanitizer Class

For practical implementation, it’s best to encapsulate your sanitization logic into a class. This allows you to chain multiple sanitization steps and easily integrate them into your MLOps pipeline before inference.

class InputSanitizer:
    def __init__(self, methods):
        # methods: a list of sanitization functions to apply
        self.methods = methods

    def sanitize(self, input_data):
        # Apply each sanitization method in sequence
        sanitized_data = input_data
        for method in self.methods:
            sanitized_data = method(sanitized_data)
        return sanitized_data

# Usage for an image model
image_sanitizer = InputSanitizer(methods=[
    lambda img: reduce_bit_depth(img, bits=5),
    lambda img: apply_gaussian_blur(img, kernel_size=(3, 3))
])

# Before prediction:
# sanitized_image = image_sanitizer.sanitize(untrusted_image)
# prediction = model.predict(sanitized_image)
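Beyond cleaning inputs, the same transformations can power detection: if the model's output changes sharply between the raw and the sanitized input, the input is likely adversarial (the core idea behind feature-squeezing detection). A minimal sketch, where `predict_fn`, `squeeze_fn`, and the threshold value are assumptions to be tuned per model:

```python
import numpy as np

def squeeze_detect(predict_fn, x, squeeze_fn, threshold=0.5):
    # Compare model outputs on the raw and sanitized input;
    # a large L1 gap suggests the input was crafted to be fragile
    # under sanitization, i.e. likely adversarial.
    p_raw = predict_fn(x)
    p_squeezed = predict_fn(squeeze_fn(x))
    score = np.abs(p_raw - p_squeezed).sum()
    return score > threshold, score
```

The threshold is typically calibrated on clean validation data so that the false-positive rate stays acceptable.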

Limitations and Red Team Considerations

While effective as a defense layer, sanitization is not a complete solution. As a red teamer, you should be aware of its weaknesses:

  • Performance Degradation: Aggressive sanitization can harm the model’s accuracy on legitimate inputs. A strong blur might remove an adversarial pattern, but it might also remove the fine-grained details needed for correct classification.
  • Adaptive Attacks: An attacker who knows the sanitization method can craft an attack designed to survive it. A common tool here is expectation over transformation (EOT), in which the perturbation is optimized to remain effective *after* the transformation is applied.
  • Brittleness: A sanitizer tuned for one type of attack may be completely ineffective against another. For example, a blur filter won’t stop an attack that relies on changing the color palette of an entire image.

Your role in a red team exercise is to test these limits. Can you design an attack that bypasses the sanitization? Can you quantify the performance drop on clean data caused by the defense? The answers to these questions determine the true robustness of the system.