32.1.4 Memory Access Patterns

2025.10.06.
AI Security Blog

Moving deeper into the hardware layer, we encounter a classic side-channel vector: memory access patterns. While the KV-cache and attention mechanisms create high-level temporal signatures, the fundamental act of loading model weights and activations from memory into CPU/GPU caches creates a much more granular, low-level channel. An attacker co-located on the same physical hardware can observe these patterns to infer sensitive details about a model’s internal operations.

The core vulnerability is that an AI model’s computation is not uniform. The specific calculations it performs, and therefore the specific weights and data it needs to fetch from memory, are directly dependent on the input it receives. This data-dependent access creates a footprint in the shared hardware caches (like the L3 cache) that a sophisticated attacker can monitor.
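As a minimal illustration of data-dependent access (the embedding-lookup framing, table shapes, and helper names here are my own, not from the post), even a plain embedding lookup touches memory at addresses determined by the input: the rows fetched from the table correspond one-to-one with the input token IDs, so an observer who learns *which* rows were loaded learns something about the input itself.

```python
import numpy as np

vocab_size, dim = 1024, 64
embedding_table = np.random.rand(vocab_size, dim)

def embed(token_ids):
    """Return the embedded vectors plus the memory footprint of the call."""
    touched_rows = sorted(set(token_ids))  # which table rows were fetched
    vectors = embedding_table[token_ids]   # the data-dependent loads
    return vectors, touched_rows

# Different inputs leave different footprints in the shared cache.
_, footprint_a = embed([3, 17, 42])
_, footprint_b = embed([990, 3])
```

The same principle scales up: in a real model it is weight tiles and activation buffers rather than single rows, but the footprint remains a function of the input.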

The Prime+Probe Attack in an AI Context

The most common technique for exploiting this vulnerability is a cache timing attack known as Prime+Probe. In a multi-tenant environment where an attacker’s process shares a physical CPU with the victim’s AI inference process, the attack unfolds in a predictable cycle:

  1. Prime: The attacker’s process fills a portion of the shared cache with its own data. This establishes a known baseline state.
  2. Wait: The attacker’s process yields control, allowing the operating system to schedule the victim’s AI model for execution on the same core.
  3. Victim Execution: The AI model performs its inference task. As it fetches its own weights and activations into the cache, it necessarily evicts some of the attacker’s “primed” data. The specific locations of eviction correspond directly to the memory addresses the model accessed.
  4. Probe: The attacker’s process runs again and measures the time it takes to read its own data. Data that is still in the cache will be accessed quickly (a cache hit). Data that was evicted by the victim will be slow to access, as it must be fetched from main memory (a cache miss).

By analyzing the pattern of cache misses, the attacker constructs a “shadow map” of the victim model’s memory activity during inference. This map is the raw data used for leaking information.

[Figure: Prime+Probe attack cycle. The attacker process primes the shared L3 cache, the victim AI model's inference evicts lines as it accesses memory, and the attacker then probes its own access times.]
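The four-step cycle can be made concrete with a toy simulation. Real attacks measure access latency with high-resolution timers; here a hit/miss flag stands in for that timing measurement, and the cache geometry and victim addresses are illustrative, not taken from any real attack:

```python
NUM_SETS = 16

class ToyCache:
    """Direct-mapped cache: each set remembers only its most recent access."""
    def __init__(self):
        self.sets = [None] * NUM_SETS

    def access(self, owner, address):
        s = address % NUM_SETS            # each address maps to one cache set
        hit = self.sets[s] == (owner, address)
        self.sets[s] = (owner, address)   # fill (and evict) on a miss
        return hit

cache = ToyCache()

# 1. Prime: the attacker fills every set with its own data.
for addr in range(NUM_SETS):
    cache.access("attacker", addr)

# 2./3. Wait + victim execution: the victim's accesses evict some sets.
victim_addresses = [2, 7, 7, 13]
for addr in victim_addresses:
    cache.access("victim", addr)

# 4. Probe: sets that now miss for the attacker reveal victim activity.
evicted_sets = [addr for addr in range(NUM_SETS)
                if not cache.access("attacker", addr)]
# evicted_sets is the "shadow map" of the victim's memory activity.
```

In the simulation, `evicted_sets` recovers exactly the cache sets the victim touched, which is the shadow map the text describes.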

Vulnerable Architectures: Mixture-of-Experts (MoE)

While any model’s memory access is data-dependent to some degree, certain architectures are exceptionally vulnerable. Mixture-of-Experts (MoE) models are a prime example. In an MoE architecture, an input is processed by a “gating network” that selects a small subset of specialized “expert” networks to handle the computation.

This design has a critical side-channel implication: the weights for each expert are stored in distinct memory locations. By observing which memory blocks are loaded into the cache during inference, an attacker can directly determine which experts were chosen by the gating network. This leaks significant information about the model’s internal routing logic for a specific input, which could be used to reconstruct model behavior or infer properties of the input data.


# Runnable Python sketch of the MoE vulnerability point
# (expert count, dimensions, and top-k value are illustrative)
import numpy as np

NUM_EXPERTS, DIM, TOP_K = 8, 16, 2
gate_weights = np.random.rand(DIM, NUM_EXPERTS)
expert_weights = [np.random.rand(DIM, DIM) for _ in range(NUM_EXPERTS)]

def moe_forward_pass(input_data):
    # The gating network's decision is based on input_data;
    # this decision is the sensitive information to be leaked.
    gate_scores = input_data @ gate_weights
    selected_experts = np.argsort(gate_scores)[-TOP_K:]  # indices of chosen experts

    final_output = np.zeros(DIM)
    # The loop creates a data-dependent memory access pattern.
    for idx in selected_experts:
        # VULNERABILITY: loading these specific weights causes predictable
        # cache misses that an attacker can observe.
        weights = expert_weights[idx]  # stands in for a load from DRAM
        final_output += input_data @ weights

    return final_output

Red Teaming and Mitigation Strategies

From a red teaming perspective, demonstrating an attack on memory access patterns requires low-level access and specialized tools to interact with CPU caches. The goal is to prove that in a shared tenancy model, sensitive internal computations can be observed.

| Strategy | Description | Effectiveness & Cost |
|---|---|---|
| Physical Isolation | Run sensitive AI workloads on dedicated, single-tenant (bare-metal) hardware, eliminating the shared resource the attack requires. | Very high effectiveness. High cost and reduced flexibility compared to cloud virtualization. |
| Cache Partitioning | Use hardware features such as Intel's Cache Allocation Technology (CAT) or AMD's Platform Quality of Service (PQoS) to assign exclusive cache partitions to different processes or VMs. | High effectiveness. Prevents direct eviction between tenants; requires hardware support and careful configuration. |
| Constant-Time Programming | Rewrite critical parts of the model's code so that memory accesses are independent of input data, e.g. always loading a fixed set of data blocks regardless of the computational path. | Moderate to high effectiveness. Extremely difficult to implement for complex AI models and can introduce significant performance overhead. |
| Adding Noise | Run background processes that perform random memory accesses to "jam" the cache side channel, making it harder for an attacker to get a clean signal. | Low to moderate effectiveness. Degrades performance for both victim and attacker and does not eliminate the channel. |
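The constant-time strategy can be sketched for the MoE case: load every expert's weights on every call and mask out the unused outputs, so the memory footprint no longer depends on the gating decision. This is a minimal sketch with illustrative shapes and names, not a production implementation (note that in a strict constant-time sense the masking itself would also need to avoid data-dependent branches, which this toy version only approximates):

```python
import numpy as np

NUM_EXPERTS, DIM = 8, 16
expert_weights = [np.random.rand(DIM, DIM) for _ in range(NUM_EXPERTS)]

def constant_time_moe(input_data, gate_scores, top_k=2):
    # Build a dense 0/1 mask over all experts; only the top-k contribute.
    threshold = np.sort(gate_scores)[-top_k]
    mask = (gate_scores >= threshold).astype(float)

    output = np.zeros(DIM)
    # Every expert's weights are loaded on every call, so the cache
    # footprint is input-independent. The price: NUM_EXPERTS matrix
    # multiplications per token instead of top_k.
    for idx in range(NUM_EXPERTS):
        output += mask[idx] * (input_data @ expert_weights[idx])
    return output
```

The output is identical to the leaky top-k routing; only the access pattern (and the compute cost) changes, which is exactly the trade-off the table describes.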

As a red teamer, your report should not only demonstrate the vulnerability but also map these potential defenses to the organization’s specific threat model and operational constraints. For most organizations, exploring cache partitioning technologies offered by their cloud provider is the most practical and effective first step in mitigating this deep-level hardware side channel.
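As a concrete starting point for cache partitioning on Intel hardware, the `pqos` utility from the intel-cmt-cat toolkit exposes CAT from the command line. A minimal sketch follows; the class-of-service ID, capacity bitmask, and PID are illustrative, and valid masks are hardware-specific:

```shell
# Show detected RDT/CAT capabilities and current allocation state.
pqos -d
pqos -s

# Define class of service (COS) 1 with a slice of the last-level cache
# (bitmask 0x00f0 is illustrative; consult the capability output above).
pqos -e "llc:1=0x00f0"

# Associate the AI inference process (PID 4242, illustrative) with COS 1
# so its cache lines occupy a partition other tenants cannot evict.
pqos -a "pid:1=4242"
```

Cloud providers that expose CAT (or the AMD equivalent) typically wrap this configuration in instance-type or hypervisor settings rather than direct `pqos` access, so the practical first step is usually checking what isolation guarantees the provider documents.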