32.1.2 KV-cache side channels

2025.10.06.
AI Security Blog

The Key-Value (KV) cache is a fundamental optimization in modern Transformers, dramatically accelerating token generation by storing intermediate attention calculations. However, this performance enhancement introduces a subtle but potent side channel. By observing the timing of model responses, you can infer the contents of this cache and, by extension, parts of prompts processed by other users in a shared environment.

The Mechanics of the Leak

During the self-attention phase, a model calculates Query (Q), Key (K), and Value (V) vectors for each token. For subsequent token generation, the K and V vectors for all preceding tokens in the sequence are reused. Storing these in a KV-cache avoids redundant computation.
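To make the reuse concrete, here is a deliberately simplified sketch of per-step decoding with a KV-cache, using plain Python lists and scalar "projections" in place of real learned weight matrices (the class and helper names are illustrative, not any framework's API):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, weight):
    # Toy stand-in for a learned projection matrix: a single scalar weight.
    return [weight * x for x in vec]

class ToyKVCache:
    def __init__(self):
        self.keys = []    # one cached K vector per processed token
        self.values = []  # one cached V vector per processed token

    def step(self, token_embedding):
        # Only the NEW token's K and V are computed; all earlier K/V
        # vectors are reused from the cache instead of being recomputed.
        self.keys.append(project(token_embedding, 0.5))
        self.values.append(project(token_embedding, 2.0))
        q = project(token_embedding, 1.0)
        # Attend over every cached key, old and new.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in self.keys]
        weights = softmax(scores)
        dim = len(token_embedding)
        return [sum(w * v[d] for w, v in zip(weights, self.values))
                for d in range(dim)]
```

Each call to step computes K and V for one new token and appends them; attention then runs over the entire cached history, which is exactly the state a later request with a matching prefix could reuse.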

The side channel emerges from a simple principle: a cache hit is faster than a cache miss. If a part of your prompt matches a prefix already processed by the model (perhaps from another user’s session), the corresponding K and V vectors will be retrieved from the cache. This results in a measurably faster response time for the initial phase of your request compared to a prompt that forces a complete recalculation.

This is especially dangerous in multi-tenant or shared GPU environments where the same model instance serves multiple users, and the cache might not be properly isolated or cleared between requests.

[Diagram: KV-Cache Side Channel Attack Flow. (1) The victim sends a prompt ("The secret code is…") to a shared LLM service, populating the KV-cache. (2) The attacker sends the same text as a probe. (3) The attacker measures the response time: fast indicates a cache hit.]

Probing for Cache Hits

Your goal as a red teamer is to systematically probe the cache to reconstruct parts of other users’ prompts. This is an iterative process of guessing prefixes and measuring the server’s response time. A statistically significant drop in latency for a given prefix suggests it was recently processed and is present in the cache.

# Python sketch of a basic KV-cache timing probe
import random
import string
import time

def generate_random_string(length):
    return "".join(random.choices(string.ascii_letters + " ", k=length))

def probe_for_prefix(api_endpoint, suspected_prefix, runs=10):
    # Establish a baseline latency with random, unique prefixes.
    # A fresh string per run prevents the baseline itself from being cached.
    baseline_times = []
    for _ in range(runs):  # multiple runs to average out noise
        random_string = generate_random_string(len(suspected_prefix))
        start = time.perf_counter()
        api_endpoint.generate(prompt=random_string)
        baseline_times.append(time.perf_counter() - start)
    baseline_latency = sum(baseline_times) / runs

    # Measure latency with the suspected prefix
    probe_times = []
    for _ in range(runs):
        start = time.perf_counter()
        api_endpoint.generate(prompt=suspected_prefix)
        probe_times.append(time.perf_counter() - start)
    probe_latency = sum(probe_times) / runs

    # A significant drop in latency indicates a potential cache hit
    if probe_latency < baseline_latency * 0.8:  # e.g., at least 20% faster
        return "Potential cache hit for: " + suspected_prefix
    return "No significant timing difference observed."

Practical Hurdles and Refinements

Executing this attack in the real world is non-trivial. You must contend with several confounding factors:

  • Network Jitter: Fluctuations in network latency can easily mask the subtle timing differences you’re trying to measure. Run probes from a location with a stable, low-latency connection to the target server.
  • System Load: The overall load on the inference server affects response times. Your baseline measurements must be taken close in time to your probes to be comparable.
  • Cache Eviction: The KV-cache has a limited size. If the system is busy, the victim’s data may be evicted before you can probe for it. The attack has a narrow time window.
  • Architectural Differences: Caching strategies (like sliding window caches in Mistral) and hardware (e.g., GPU memory hierarchies) can alter the characteristics of the timing signal.

Advanced probes may use statistical methods like t-tests to confidently distinguish a true cache hit from random noise, and may employ intelligent prefix guessing (e.g., using common phrases or context-specific terminology) to increase the hit rate.
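As a sketch of such a statistical check, the following compares baseline and probe timings with Welch's t-statistic using only the standard library. The critical value of 1.7 is an illustrative assumption; in practice you would take it from a t-distribution table for your sample sizes and chosen significance level.

```python
import math
from statistics import mean, variance

def welch_t(baseline_times, probe_times):
    # Welch's t-statistic: tolerates unequal variances between the samples.
    n1, n2 = len(baseline_times), len(probe_times)
    se = math.sqrt(variance(baseline_times) / n1 + variance(probe_times) / n2)
    return (mean(baseline_times) - mean(probe_times)) / se

def looks_like_cache_hit(baseline_times, probe_times, critical_t=1.7):
    # A large positive t means the probe ran reliably faster than baseline.
    # critical_t is an illustrative one-sided threshold, not a looked-up value.
    return welch_t(baseline_times, probe_times) > critical_t
```

With baseline timings around 0.31 s and probe timings around 0.21 s over five runs each, the t-statistic is large and the check fires; a probe whose timings overlap the baseline yields a t-statistic near zero and is rejected.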

Defensive Postures and Mitigation

Defending against KV-cache side channels requires architectural changes in how LLM services are deployed, particularly in multi-tenant scenarios. As a red teamer, your findings can justify the implementation of these more robust, albeit potentially costly, defenses.

  • Cache Partitioning: Allocate a separate, isolated KV-cache for each user or session, so a user can only hit their own cache. Trade-off: effective, but increases memory overhead significantly, potentially reducing user concurrency.
  • Cache Flushing: Completely clear the KV-cache after each user’s request or session is completed. Trade-off: simple and secure, but incurs a performance penalty as every request starts with a cold cache.
  • Constant-Time Processing: Introduce artificial delays for cache hits to make their response time indistinguishable from cache misses. Trade-off: theoretically strong, but difficult to implement perfectly and negates the performance benefit of the cache.
  • Access Control Policies: For services where prompts are not fully private (e.g., a shared chatbot), ensure the model can only access data the current user is authorized to see. Trade-off: doesn’t solve the timing leak, but mitigates the impact by preventing unauthorized data from entering the cache in the first place.
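The partitioning and flushing strategies can be sketched server-side by keying every cache entry on a tenant identifier as well as the prompt prefix. The class and method names below are assumptions for illustration, not any particular serving framework's API:

```python
import hashlib

class PartitionedPrefixCache:
    def __init__(self):
        self._store = {}  # maps (tenant_id, prefix digest) -> cached KV state

    def _key(self, tenant_id, prefix):
        digest = hashlib.sha256(prefix.encode()).hexdigest()
        return (tenant_id, digest)

    def get(self, tenant_id, prefix):
        # A tenant can only ever hit entries it stored itself.
        return self._store.get(self._key(tenant_id, prefix))

    def put(self, tenant_id, prefix, kv_state):
        self._store[self._key(tenant_id, prefix)] = kv_state

    def flush_tenant(self, tenant_id):
        # Cache flushing: drop all of a tenant's entries at session end.
        for key in [k for k in self._store if k[0] == tenant_id]:
            del self._store[key]
```

If Alice's session caches the prefix "The secret code is", a probe for the same prefix under Bob's tenant ID misses, so the timing signal never crosses the tenant boundary.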