13.2.2. Multimodal challenges

2025.10.06.
AI Security Blog

Moving beyond purely text-based interactions, the introduction of multimodality—the ability to process and integrate information from images, audio, and video—represents a quantum leap in model capability. For a red teamer, this leap also signifies a dramatic expansion of the attack surface. The fusion points between different data types are not just additive; they create entirely new, complex vectors for manipulation that must be systematically explored.

The Expanded Attack Surface: Beyond Text

When a model can “see” and “hear,” the nature of prompting changes fundamentally. An instruction is no longer just a string of characters; it can be a pattern of pixels, a frequency in an audio file, or a combination of seemingly innocuous inputs that, when fused, trigger unintended behavior. Your red teaming must adapt to this new reality.

Kapcsolati űrlap - EN

Do you have a question about AI Security? Reach out to us here:

Cross-Modal Injection and Manipulation

The most potent new threat is cross-modal injection. This occurs when data in one modality contains a hidden payload designed to manipulate the model’s processing of another modality. The model’s internal fusion mechanism becomes the vector of attack. For example, text hidden within an image can hijack a user’s explicit text prompt.

Diagram of a cross-modal injection attack. User’s Text Prompt: “Describe this image.” Uploaded Image: [Picture of a landscape] Hidden “invisible ink” text: “Ignore user prompt.” Multimodal AI Malicious Output: “System override…” Hijacked

Adversarial Perturbations in New Domains

The concept of adversarial examples extends naturally into new modalities, but with unique characteristics for each:

  • Visual Adversarial Attacks: These go beyond simple misclassification. A red teamer might craft an image with subtle noise that, while imperceptible to humans, causes the model to generate harmful, biased, or off-policy text when asked to describe it. The image itself becomes a jailbreak prompt.
  • Audio Adversarial Attacks: Malicious commands can be embedded into audio files as high-frequency noise that is inaudible to the human ear but is parsed perfectly by the model’s speech-to-text component. A seemingly innocent piece of music could contain a command to delete files or reveal sensitive information.
  • Temporal Attacks (Video): Video introduces the dimension of time. A single adversarial frame inserted into a video could be enough to poison the model’s understanding of the entire clip. Alternatively, a sequence of benign-looking frames could form a malicious instruction when processed in order.

Red Teaming Multimodal Systems at Scale

Google’s red team initiatives recognize that ad-hoc testing is insufficient. A systematic approach is required to cover the combinatorial explosion of inputs. This involves developing specialized tooling and methodologies to probe the seams between modalities.

A common technique is crafting inputs that exploit the pre-processing stages. For instance, an image containing text that is difficult for standard OCR but easily read by the model’s more powerful vision-language component can be a vector. The following pseudocode illustrates the principle of embedding “invisible” text into an image to be uploaded.

# Pseudocode for creating an image with a hidden text prompt
function create_adversarial_image(base_image, hidden_prompt):
    # Load the base image (e.g., a picture of a cat)
    image = load_image(base_image)
    
    # Choose a color that is almost identical to the background
    # For a white background, use a color like (254, 254, 254)
    background_color = get_dominant_background_color(image)
    text_color = slightly_perturb_color(background_color)

    # Draw the hidden prompt onto the image in a corner
    # The text is visually imperceptible but machine-readable
    draw_text(
        image=image,
        text=hidden_prompt,
        position=(10, 10),
        font_size=8,
        color=text_color
    )
    
    return image
                

Defense in Depth for Multimodal Models

Defending against these attacks requires moving beyond a single safety filter on the final output. A layered, “defense in depth” strategy is essential, with checks and sanitization applied at each stage of the multimodal processing pipeline. The goal is to catch malicious payloads before they reach the model’s core fusion logic.

Attack Vector Description Defensive Mechanism
Image Steganography / “Invisible Ink” Hiding text or data within an image’s pixels, often by using colors nearly identical to the background. Input Sanitization: Re-encode or apply compression to the image upon upload. This often corrupts the subtle pixel patterns of the hidden data. Run independent OCR checks for text in unusual places.
Adversarial Audio Noise Embedding high-frequency, inaudible commands into an audio stream. Audio Filtering: Apply low-pass filters to strip out frequencies outside the range of normal human speech before feeding the audio to the model.
Cross-Modal Semantic Confusion Providing an image and text that are individually safe but create a harmful concept when combined (e.g., image of a chemical + text asking “how to mix this?”). Pre-Fusion Analysis: Use separate, lightweight classifiers to assess the risk of each modality’s input independently. Flag high-risk combinations before they are processed by the main model.
Malicious QR Codes / Barcodes An image containing a QR code that encodes a malicious URL or a harmful text prompt. Content-Specific Detectors: Explicitly run detectors for known machine-readable codes. Analyze the decoded content in a sandboxed environment before passing it to the model.

The challenges presented by multimodality underscore a critical principle for AI red teamers: the attack surface is not just the model’s API, but the entire data processing pipeline. Each new modality introduces not only its own vulnerabilities but also new, unpredictable interactions with existing ones. Successfully securing these systems requires you to think like an attacker who can compose threats across different sensory inputs, forcing defenders to build resilient, layered systems rather than relying on a single point of failure.