33.1.5 Multimodal behavioral data fusion

2025.10.06.
AI Security Blog

Analyzing individual behavioral biometrics like keystrokes or mouse movements in isolation provides a fragmented picture of user identity. An advanced adversary can learn to mimic one of these signals. The real defensive strength comes from fusing these disparate data streams into a single, coherent behavioral profile. This is multimodal behavioral data fusion—the process of combining inputs from multiple sources to create a system that is far more resilient and difficult to deceive than the sum of its parts.

Think of it as the difference between recognizing a person by their voice versus recognizing them by their voice, face, and gait simultaneously. Spoofing one is plausible; spoofing all of them in a consistent, synchronized manner is exponentially harder. For AI-driven authentication, this means an attacker’s bot must not only type like a human but also move the mouse, scroll, and hesitate in ways that are contextually and temporally consistent with that typing activity.


Levels of Data Fusion

Fusion isn’t a monolithic concept; it can occur at different stages of the data processing pipeline. Understanding these levels is critical for both building robust defenses and identifying potential weak points during a red team engagement. Each level offers a trade-off between information richness and implementation complexity.

[Diagram: Multimodal fusion strategies. Keystroke, mouse, touch, and cognitive data combined at the data level (one fused model), the feature level (per-modality feature extractors feeding a fused model), the score level (per-modality models feeding decision logic), and the decision level (individual decisions combined).]
Data-Level (Early)
Description: Raw sensor data from multiple streams are concatenated into a single large vector before being fed into a classifier. This preserves all potential correlations between modalities.
Red Teaming Focus: Attack data synchronization. Introduce subtle timing mismatches or noise into one channel to corrupt the entire fused representation and cause misclassification.

Feature-Level
Description: Features are extracted from each modality independently (e.g., typing speed, mouse curvature). These feature vectors are then combined to form a single input for a classifier.
Red Teaming Focus: Identify and spoof the most influential features. If the model overweights a specific feature (e.g., scroll velocity), a targeted attack on that feature can bypass the system.

Score-Level (Late)
Description: Each modality has its own dedicated classifier that outputs a confidence score (e.g., a probability of being human). These scores are then combined using rules or a weighted average.
Red Teaming Focus: Manipulate the confidence score of a single, high-weight classifier. A highly convincing mouse movement emulation might generate a score high enough to override weaker scores from other modalities.

Decision-Level
Description: Each classifier makes a final binary decision (human/bot). These decisions are combined using logical operators like AND, OR, or majority voting.
Red Teaming Focus: Exploit the fusion logic. If the system uses an ‘OR’ rule (any single classifier authenticates), focus all effort on defeating the weakest, most easily spoofed classifier.
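The decision-level rules above can be made concrete with a short sketch. The function names below are illustrative, not from any particular product; the point is that under an OR rule, defeating a single weak classifier defeats the whole system, while AND and majority voting force the attacker to spoof several channels at once.

```python
# Illustrative decision-level fusion rules. Each classifier emits a
# binary decision: True = "human", False = "bot".

def fuse_or(decisions):
    # Authenticate if ANY classifier says "human" (weakest-link rule).
    return any(decisions)

def fuse_and(decisions):
    # Authenticate only if ALL classifiers agree.
    return all(decisions)

def fuse_majority(decisions):
    # Authenticate if more than half of the classifiers agree.
    return sum(decisions) > len(decisions) / 2

# A bot that defeats only the keystroke classifier:
decisions = [True, False, False]  # keystrokes spoofed; mouse, scroll fail

print(fuse_or(decisions))        # True  -> the OR rule is bypassed
print(fuse_and(decisions))       # False
print(fuse_majority(decisions))  # False
```

Against this hypothetical OR-gated system, the rational attack is to invest everything in the single easiest modality; against AND or majority voting, the attacker is forced into the much harder multimodal-consistency problem discussed below.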

Red Teaming Fused Systems

Attacking a multimodal system requires a more sophisticated strategy than attacking a single-modality one. Your objective shifts from simple mimicry to achieving *behavioral consistency*. A bot that types at 150 WPM but moves the mouse with jerky, robotic precision creates a detectable contradiction that a fusion model will flag.

Probing for Model Weaknesses

Your first task is to determine the fusion strategy and the relative weighting of each modality. You can probe this by submitting carefully crafted inputs:

  • High-Quality Single Channel: Provide a near-perfect human recording for one modality (e.g., mouse movements) while using random or obviously synthetic data for others (e.g., keystrokes). If you are authenticated, it suggests the system heavily weights the mouse data or uses a weak decision-level fusion like an ‘OR’ gate.
  • Conflicting Signals: Generate data that is internally consistent for each modality but contextually inconsistent between them. For example, simulate frantic, high-speed mouse movements and clicks characteristic of gaming, combined with slow, deliberate typing characteristic of writing code. This can stress feature-level and data-level fusion models that have learned correlations between these activities.
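The single-channel probe described above can be automated. The sketch below is hypothetical: it assumes a black-box oracle `submit_attempt` that returns a pass/fail decision, plus placeholder generators `make_human` and `make_synthetic` for each modality. For each modality in turn, it sends near-perfect data on that channel and synthetic data on the rest; a high pass rate for one channel suggests that channel is heavily weighted, or that fusion is a weak OR-style rule.

```python
# Hypothetical probing harness for estimating each modality's influence
# on a black-box authenticator. submit_attempt, make_human, and
# make_synthetic are assumptions, not a real API.

def probe_modality_weights(submit_attempt, modalities,
                           make_human, make_synthetic, trials=20):
    """For each modality, submit near-perfect data on that channel and
    synthetic data on all others, and record the pass rate."""
    pass_rates = {}
    for target in modalities:
        passes = 0
        for _ in range(trials):
            payload = {
                m: (make_human(m) if m == target else make_synthetic(m))
                for m in modalities
            }
            if submit_attempt(payload):
                passes += 1
        pass_rates[target] = passes / trials
    return pass_rates
```

Sorting the resulting pass rates gives a ranking of which channel to attack first; repeating the sweep with graded (rather than binary) quality levels can narrow down the actual weights of a score-level fusion.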

# Pseudocode for a score-level fusion model.
# Your goal as a red teamer is to discover the weights.

def get_fused_authentication_score(keystroke_data, mouse_data, scroll_data):
    # These weights are the system's secret.
    # Probing can help you estimate their values.
    WEIGHT_KEYSTROKE = 0.5
    WEIGHT_MOUSE = 0.3
    WEIGHT_SCROLL = 0.2

    # Each model (defined elsewhere in the system) outputs a score
    # from 0 (bot) to 1 (human).
    score_k = keystroke_model.predict_score(keystroke_data)
    score_m = mouse_model.predict_score(mouse_data)
    score_s = scroll_model.predict_score(scroll_data)

    # Simple weighted sum for fusion.
    fused_score = (score_k * WEIGHT_KEYSTROKE +
                   score_m * WEIGHT_MOUSE +
                   score_s * WEIGHT_SCROLL)

    # A high score from a heavily weighted model can compensate for
    # low scores elsewhere.
    # Example: score_k=0.9, score_m=0.2, score_s=0.1 -> fused_score = 0.53
    # If the threshold is 0.5, this passes despite poor mouse/scroll data.
    return fused_score

The Challenge of Synthetic Consistency

The ultimate goal for an attacker is to generate a complete, multimodal behavioral profile that is internally consistent. This is a significant machine learning challenge in itself. It requires a generative model that doesn’t just produce realistic keystrokes and mouse movements independently, but understands the latent relationships between them during a given task. For example, when a user is about to click a button, their typing pauses, and mouse movement becomes more deliberate and targeted. Capturing this “behavioral grammar” is the frontier for both attack and defense.
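One concrete example of this "behavioral grammar" is the temporal relationship between typing and mouse movement: humans tend to pause one activity while performing the other. The sketch below is a hypothetical defender-side check (not from any specific product) that flags sessions where typing and mouse activity overlap too heavily in time, which is exactly the kind of contradiction a naive bot driving both channels simultaneously produces.

```python
# Hypothetical cross-modal consistency check: measure how much of the
# typing time overlaps with active mouse movement. Intervals are
# (start, end) tuples in seconds.

def overlap_fraction(typing_intervals, mouse_intervals):
    """Fraction of total typing time that coincides with mouse activity."""
    total_typing = sum(end - start for start, end in typing_intervals)
    if total_typing == 0:
        return 0.0
    overlap = 0.0
    for t_start, t_end in typing_intervals:
        for m_start, m_end in mouse_intervals:
            overlap += max(0.0, min(t_end, m_end) - max(t_start, m_start))
    return overlap / total_typing

def is_behaviorally_consistent(typing_intervals, mouse_intervals,
                               max_overlap=0.2):
    # max_overlap is an assumed tolerance; humans occasionally nudge the
    # mouse mid-sentence, so some overlap is expected.
    return overlap_fraction(typing_intervals, mouse_intervals) <= max_overlap
```

An attacker's generative model must therefore learn not just realistic per-channel distributions but also these joint timing constraints, which is precisely what makes consistent multimodal synthesis so difficult.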

As a red teamer, your inability to easily create such consistent, multimodal synthetic data is a testament to the strength of fusion-based defenses. The systems that are hardest to break are those that have successfully moved beyond analyzing isolated actions to understanding the holistic, orchestrated symphony of human-computer interaction.