Input sanitization acts as a pre-processing firewall for your AI model. Instead of directly feeding untrusted data to the model, you first pass it through a sanitization module designed to filter, modify, or transform the input in a way that neutralizes potential adversarial perturbations. This approach reduces the model’s effective attack surface by constraining the input space to distributions it was trained on and can handle safely.
Core Sanitization Techniques
The choice of sanitization technique is highly dependent on the data modality (image, text, etc.) and the expected threat model. The goal is always to disrupt the structure of an adversarial perturbation while preserving the essential features of the legitimate input.
| Technique | Description | Primary Modality | Effect on Perturbation |
|---|---|---|---|
| Feature Squeezing | Reduces the complexity of the input, such as lowering the bit depth of pixels. | Image, Audio | Collapses subtle pixel/sample manipulations into fewer values. |
| Spatial Smoothing | Applies a blurring filter (e.g., Gaussian, median) to the input. | Image | Averages out high-frequency noise typical of many adversarial patterns. |
| Text Normalization | Standardizes text by lowercasing, removing punctuation, or correcting misspellings. | Text | Reverts character-level attacks or homoglyph substitutions. |
| Input Reconstruction | Uses a generative model (like an autoencoder) to reconstruct a “clean” version of the input. | Image, Audio | Filters out patterns not learned during the autoencoder’s training. |
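Input reconstruction is normally done with a trained denoising autoencoder. As a minimal, dependency-light sketch of the same idea, a PCA projection learned from trusted clean inputs acts as a linear autoencoder: any energy outside the learned subspace (including many adversarial patterns) is discarded on reconstruction. The function names and `n_components` value here are illustrative assumptions, not a specific library API.

```python
import numpy as np

def fit_reconstructor(clean_inputs, n_components=8):
    # Learn a low-dimensional basis from trusted, clean inputs
    mean = clean_inputs.mean(axis=0)
    centered = clean_inputs - mean
    # The top principal components span the "legitimate" input subspace
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def reconstruct(x, mean, components):
    # Project onto the learned subspace and back: components of x that
    # lie outside the subspace (e.g., adversarial noise) are filtered out
    coeffs = (x - mean) @ components.T
    return mean + coeffs @ components
```

A real deployment would replace the PCA basis with a nonlinear autoencoder trained on the model's input distribution; the interface stays the same.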
Feature Squeezing: Bit-Depth Reduction
One of the simplest forms of feature squeezing for images is reducing the color depth. An adversary might use tiny, almost invisible changes across many pixels. By reducing the number of available colors, you force these subtle values to be quantized, potentially collapsing the adversarial perturbation.
```python
import numpy as np

def reduce_bit_depth(image, bits=4):
    # image: numpy array, pixels in [0, 255]
    # bits: number of bits to keep, e.g., 4 bits = 16 color levels
    if bits < 1 or bits > 8:
        raise ValueError("Bits must be between 1 and 8")
    max_val = 2**bits - 1
    # Scale pixel values down to the new range
    squeezed_image = np.round(image / 255.0 * max_val)
    # Scale back up to the original [0, 255] range
    squeezed_image = (squeezed_image / max_val * 255.0).astype(np.uint8)
    return squeezed_image
```
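To see the collapse in action, here is a small self-contained check (the pixel values are chosen for illustration, and the same quantization is re-inlined so the snippet runs standalone). Note that pixels sitting near a quantization boundary can still shift, so squeezing weakens rather than eliminates perturbations.

```python
import numpy as np

def squeeze(image, bits=4):
    # Same bit-depth reduction as above, condensed into one expression
    max_val = 2**bits - 1
    return (np.round(image / 255.0 * max_val) / max_val * 255.0).astype(np.uint8)

clean = np.array([100, 170, 68, 204])
perturbed = clean + np.array([2, -2, 2, -2])  # small adversarial-style shifts
# Both arrays quantize to the same 4-bit image: the perturbation is erased
```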
Spatial Smoothing: Gaussian Blurring
Spatial smoothing filters, like a Gaussian blur, are effective against perturbations that manifest as high-frequency noise. The filter averages pixel values with their neighbors, which smooths out sharp, localized changes introduced by an attacker while largely preserving the overall structure of the image.
```python
import cv2

def apply_gaussian_blur(image, kernel_size=(5, 5)):
    # image: numpy array (e.g., read by cv2)
    # kernel_size: tuple of odd numbers, controls blur strength
    # Ensure kernel dimensions are odd
    if kernel_size[0] % 2 == 0 or kernel_size[1] % 2 == 0:
        raise ValueError("Kernel size dimensions must be odd")
    # Apply the Gaussian blur filter
    blurred_image = cv2.GaussianBlur(image, kernel_size, 0)
    return blurred_image
```
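The table above also lists median filtering, which tends to handle sparse, high-magnitude perturbations (salt-and-pepper-style pixel attacks) better than a Gaussian blur. Below is a dependency-free sketch for single-channel images; with OpenCV, `cv2.medianBlur(image, 3)` does the same job much faster.

```python
import numpy as np

def median_smooth(image, size=3):
    # Replace each pixel with the median of its size x size neighborhood
    pad = size // 2
    padded = np.pad(image, pad, mode='reflect')  # reflect borders at the edges
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.median(padded[i:i + size, j:j + size])
    return out
```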
Text Normalization and Filtering
For NLP models, sanitization often involves cleaning and standardizing the input text. This can neutralize character-level attacks (e.g., inserting invisible characters) or word-level attacks (e.g., using synonyms or misspellings that fool the model but not a human).
```python
import re

def sanitize_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove non-alphanumeric characters (except whitespace)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Normalize whitespace (multiple spaces become one)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Example: "Th1s is an ---EVIL--- input!!" -> "th1s is an evil input"
```
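The regex above only covers ASCII. Character-level attacks frequently rely on Unicode: zero-width characters spliced into trigger words, or homoglyphs such as fullwidth letters. A sketch using the standard library's `unicodedata`: NFKC folds many compatibility homoglyphs, and dropping format-category characters removes zero-width insertions. Full homoglyph coverage would need a dedicated confusables table, which this sketch does not attempt.

```python
import unicodedata

def normalize_unicode(text):
    # Fold compatibility characters (e.g., fullwidth letters) to ASCII forms
    text = unicodedata.normalize('NFKC', text)
    # Drop "format" (Cf) characters such as zero-width spaces and joiners
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')
```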
A Pluggable Sanitizer Class
For practical implementation, it’s best to encapsulate your sanitization logic into a class. This allows you to chain multiple sanitization steps and easily integrate them into your MLOps pipeline before inference.
```python
class InputSanitizer:
    def __init__(self, methods):
        # methods: a list of sanitization functions to apply
        self.methods = methods

    def sanitize(self, input_data):
        # Apply each sanitization method in sequence
        sanitized_data = input_data
        for method in self.methods:
            sanitized_data = method(sanitized_data)
        return sanitized_data

# Usage for an image model
image_sanitizer = InputSanitizer(methods=[
    lambda img: reduce_bit_depth(img, bits=5),
    lambda img: apply_gaussian_blur(img, kernel_size=(3, 3))
])

# Before prediction:
# sanitized_image = image_sanitizer.sanitize(untrusted_image)
# prediction = model.predict(sanitized_image)
```
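The same class composes text sanitizers just as easily. A standalone sketch (a minimal version of the class is repeated here so the snippet runs on its own):

```python
import re

class InputSanitizer:
    def __init__(self, methods):
        self.methods = methods

    def sanitize(self, input_data):
        for method in self.methods:
            input_data = method(input_data)
        return input_data

# Chain text-cleaning steps for an NLP model
text_sanitizer = InputSanitizer(methods=[
    str.lower,
    lambda t: re.sub(r'[^a-z0-9\s]', '', t),
    lambda t: re.sub(r'\s+', ' ', t).strip(),
])
```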
Limitations and Red Team Considerations
While effective as a defense layer, sanitization is not a complete solution. As a red teamer, you should be aware of its weaknesses:
- Performance Degradation: Aggressive sanitization can harm the model’s accuracy on legitimate inputs. A strong blur might remove an adversarial pattern, but it might also remove the fine-grained details needed for correct classification.
- Adaptive Attacks: An attacker who knows the sanitization method can craft an attack designed to survive it. This is known as an expectation-over-transformation (EOT) attack, where the perturbation is optimized to be effective *after* the transformation is applied.
- Brittleness: A sanitizer tuned for one type of attack may be completely ineffective against another. For example, a blur filter won’t stop an attack that relies on changing the color palette of an entire image.
Your role in a red team exercise is to test these limits. Can you design an attack that bypasses the sanitization? Can you quantify the performance drop on clean data caused by the defense? The answers to these questions determine the true robustness of the system.
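To quantify the clean-data cost, compare accuracy on held-out legitimate inputs with and without the sanitizer. A toy harness (the threshold "model" and synthetic data are stand-ins for a real model and test set):

```python
import numpy as np

def clean_accuracy(model, inputs, labels, sanitizer=None):
    # Accuracy on legitimate inputs, optionally sanitized first
    if sanitizer is not None:
        inputs = [sanitizer(x) for x in inputs]
    preds = np.array([model(x) for x in inputs])
    return float((preds == np.asarray(labels)).mean())

# Toy stand-ins: 4x4 "images" and a model that thresholds the mean pixel
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 4, 4))
labels = (images.mean(axis=(1, 2)) > 127).astype(int)
model = lambda x: int(x.mean() > 127)
squeeze = lambda x: np.round(x / 255.0 * 15) / 15 * 255.0  # 4-bit squeeze

baseline = clean_accuracy(model, images, labels)
defended = clean_accuracy(model, images, labels, sanitizer=squeeze)
# The gap (baseline - defended) is the clean-data cost of the defense
```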