32.1.5 Cross-tenant information leakage

October 6, 2025
AI Security Blog

The economic model of AI-as-a-Service platforms hinges on multi-tenancy—the practice of running workloads from multiple, isolated customers on the same physical hardware. While this maximizes resource utilization for providers, it introduces a classic security vulnerability into the AI stack: cross-tenant information leakage. When your inference request is processed on the same GPU as another organization’s, the shared hardware components, particularly caches, become a battleground for a subtle form of espionage conducted through timing side channels.

This attack vector is not new; it’s a direct descendant of CPU-based side-channel attacks like Spectre and Meltdown. However, its application to AI accelerators like GPUs presents a new frontier. The targets are no longer CPU caches holding operating system secrets, but GPU caches (like the KV-cache or L2 cache) that hold transient data reflecting the structure and content of an AI model’s computation.


The Prime+Probe Technique on AI Hardware

The most common method for exploiting shared caches is the “Prime+Probe” attack. It is a carefully timed sequence of memory manipulation performed by an attacker to spy on a co-located victim. You, as the red teamer playing the attacker, would execute the following logical steps:

  1. Prime: Your malicious process (Tenant A) strategically fills a specific portion of a shared cache with your own data. This sets the cache to a known state. For example, you might issue a series of computations designed to load specific memory addresses into the GPU’s L2 cache.
  2. Wait & Victim Execution: You yield control, allowing the system scheduler to run another process. If you are co-located with your target (Tenant B), their inference process will execute. As the victim’s process accesses memory for its own computations (e.g., retrieving weights, writing to the KV-cache), it will evict some of your “primed” data from the shared cache.
  3. Probe: Your process runs again and immediately times how long it takes to re-read the data you originally placed in the cache.
    • A fast read (cache hit) means your data was still in the cache, indicating the victim did not access that corresponding cache line.
    • A slow read (cache miss) means your data was evicted and had to be fetched from slower memory. This is the signal. It tells you that the victim’s process accessed a memory address that maps to the same cache set you were monitoring.

By repeating this process across many cache sets, you can build a map of the victim’s memory access patterns over time, which, as discussed in previous chapters, can leak significant information about their AI operations.
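The three steps above can be sketched against a toy cache model. Everything here is an illustrative assumption — a direct-mapped cache where each set holds one owner, not a real GPU memory hierarchy or API — but it shows how probe misses reveal exactly which sets the victim touched.

```python
# Toy Prime+Probe demonstration against a simulated direct-mapped cache.
# The cache model, set count, and victim behavior are illustrative assumptions.

class ToyCache:
    """Direct-mapped cache: each set remembers only its current owner."""
    def __init__(self, num_sets):
        self.sets = [None] * num_sets

    def access(self, set_index, owner):
        hit = self.sets[set_index] == owner
        self.sets[set_index] = owner   # on a miss, the new owner evicts the old
        return hit

def prime(cache, sets):
    # 1. PRIME: fill every monitored set with attacker-owned data.
    for s in sets:
        cache.access(s, "attacker")

def victim_runs(cache, victim_sets):
    # 2. WAIT & VICTIM EXECUTION: the victim's accesses evict attacker data.
    for s in victim_sets:
        cache.access(s, "victim")

def probe(cache, sets):
    # 3. PROBE: a miss (hit == False) marks a set the victim touched.
    return {s for s in sets if not cache.access(s, "attacker")}

cache = ToyCache(num_sets=16)
monitored = range(16)
prime(cache, monitored)
victim_runs(cache, victim_sets=[3, 7, 11])
leaked = probe(cache, monitored)
print(sorted(leaked))  # [3, 7, 11] — the victim's access pattern, recovered
```

On real hardware the probe step measures latency rather than querying ownership, but the information recovered is the same: the set indices the victim's computation touched during the wait interval.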

Visualizing the Attack Flow

The following diagram illustrates the Prime+Probe sequence in a multi-tenant GPU environment where an attacker and a victim share a cache.

Figure: The Prime+Probe sequence between Tenant A (attacker) and Tenant B (victim) over a shared GPU cache (e.g., L2 or KV-cache). 1. PRIME: the attacker fills the cache with its own data. 2. EXECUTE & EVICT: the victim’s workload runs, displacing the attacker’s data. 3. PROBE: the attacker times access to its own data — slow access indicates an eviction; fast access indicates no eviction.

What is Actually Leaked?

A successful cross-tenant timing attack does not leak the victim’s prompt or the model’s output directly. Instead, it leaks metadata about the computation, which can be just as sensitive. By observing which cache sets are being used, you can infer:

  • Input/Output Length: Heavy use of the KV-cache indicates long sequences are being processed. A sudden burst of activity could signal the start of a new, lengthy generation task.
  • Model Architecture Interaction: By priming cache sets corresponding to specific layers or attention heads of a known model, you can determine which parts of the model the victim’s query is activating most heavily. This could reveal the nature of their task (e.g., code generation vs. summarization).
  • User Activity Patterns: Consistent, periodic cache access patterns could indicate automated processes or API calls, while sporadic patterns might suggest interactive human use. This can be valuable for business intelligence or targeting further attacks.

Red Teaming Considerations

Executing a cross-tenant cache attack is challenging. It requires two key prerequisites:

1. Co-location: You must be able to get your monitoring process scheduled on the same physical GPU, and often the same Streaming Multiprocessor (SM), as the victim. This can involve “cloud squatting”—creating and destroying VMs or containers until you land on the target hardware.

2. Precise Timing: You need access to high-resolution timers and a deep understanding of the GPU’s memory hierarchy to distinguish the subtle timing differences between a cache hit and a miss amidst system noise.
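The second prerequisite usually starts with an offline calibration phase: measure known-hit and known-miss latencies on the target hardware, then place a decision threshold between the two distributions. The sketch below uses synthetic latency samples — on real hardware these would come from a high-resolution cycle counter, which is an assumption, not a portable API:

```python
# Calibrating a hit/miss threshold from latency samples (prerequisite 2).
# The nanosecond values below are synthetic, for illustration only.
import statistics

def calibrate_threshold(hit_samples, miss_samples):
    """Place the threshold midway between the two latency distributions."""
    return (statistics.median(hit_samples) + statistics.median(miss_samples)) / 2

hits   = [80, 82, 79, 85, 81]     # ns, data served from the cache (synthetic)
misses = [310, 295, 320, 305]     # ns, data fetched from device memory (synthetic)
threshold = calibrate_threshold(hits, misses)
print(threshold)  # 194.25
```

Using medians rather than means keeps the threshold robust against outlier measurements caused by scheduler interference or other system noise.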

Conceptual Attack Logic

The following pseudocode outlines the core loop of a Prime+Probe attack. It simplifies many complex details but captures the fundamental logic.

// Define the set of cache lines to monitor
monitored_cache_sets = select_target_sets();

while (true) {
    // 1. PRIME: Fill the targeted cache sets with our data
    prime_cache(monitored_cache_sets);

    // 2. WAIT: Yield CPU/GPU to allow the victim to run
    sleep_for_short_interval();

    // 3. PROBE: Measure access time for each set
    timings = probe_cache_and_measure_time(monitored_cache_sets);

    // 4. ANALYZE: A high latency indicates a cache miss caused by the victim
    for each (set, time) in timings {
        if (time > CACHE_MISS_THRESHOLD) {
            log_victim_activity_on_set(set);
        }
    }
}
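One iteration of that loop can be made runnable by simulating the probe measurement. In this sketch the latency values, the threshold, and the victim's behavior are all synthetic assumptions standing in for real measurements:

```python
# Runnable sketch of one probe-and-analyze iteration from the pseudocode.
# Latencies, threshold, and victim_sets are synthetic assumptions.

CACHE_MISS_THRESHOLD = 200   # ns, assumed calibrated offline for this hardware

def probe_cache_and_measure_time(monitored_sets, victim_sets):
    """Simulated per-set latencies: slow where the victim evicted our data."""
    HIT_NS, MISS_NS = 80, 310
    return {s: (MISS_NS if s in victim_sets else HIT_NS) for s in monitored_sets}

monitored_cache_sets = range(16)
timings = probe_cache_and_measure_time(monitored_cache_sets, victim_sets={3, 7, 11})

# 4. ANALYZE: latencies above the threshold flag victim activity on that set.
victim_activity = [s for s, t in timings.items() if t > CACHE_MISS_THRESHOLD]
print(victim_activity)  # [3, 7, 11]
```

In a real attack this iteration repeats continuously, and the per-set activity log accumulates into the access-pattern map described earlier.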

Defensive Strategies

Defending against these attacks is an active area of research and engineering. The primary strategies fall into three categories:

  • Hardware/Software Partitioning: The most effective defense is to prevent sharing. This can be done by dedicating entire GPUs to a single tenant (expensive) or by using cache partitioning technologies (like NVIDIA’s MIG – Multi-Instance GPU) that logically divide cache and compute resources, creating hard boundaries between tenants.
  • Adding Noise: System administrators can introduce “noisy neighbor” processes that perform random memory accesses to disrupt the attacker’s timing measurements. This makes it difficult to distinguish between a cache miss caused by the victim and one caused by the noise. However, this comes at the cost of performance.
  • Oblivious Scheduling: Cloud schedulers can be designed to randomize tenant placement or migrate processes frequently, making it much harder for an attacker to achieve and maintain co-location with a specific target.
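The effect of the noise-injection defense is easy to see with a minimal sketch. The set indices below are invented for illustration: once a noisy-neighbor process evicts monitored sets of its own, the attacker observes the union of victim and noise evictions and can no longer attribute any individual miss to the victim:

```python
# Hedged sketch of the "adding noise" defense. Set indices are illustrative.

NUM_SETS    = 16
victim_sets = {3, 7, 11}        # sets the victim actually touches
noise_sets  = {1, 7, 12, 14}    # evictions injected by the noisy neighbor

# The attacker's probe sees only the union — misses from the victim and
# misses from the noise are indistinguishable.
observed = victim_sets | noise_sets
print(sorted(observed))  # [1, 3, 7, 11, 12, 14]
```

The defense trades precision for performance: every noise eviction is also a real cache miss that slows down legitimate tenants.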

For red teamers, the existence and proper implementation of these defenses are key areas to test. A failure in a platform’s tenant isolation mechanism is a critical vulnerability.