32.1.1 Token generation timing analysis

2025.10.06.
AI Security Blog

The time an AI model takes to produce a token is not a fixed value. This seemingly minor detail—the subtle variation in processing delay from one token to the next—is a potent information side channel. By precisely measuring these inter-token latencies, you can begin to infer the model’s internal computational state, the nature of its workload, and even characteristics of other users’ data in a shared environment. This is the foundation of timing-based cache attacks against LLMs.

The Sources of Temporal Variation

The core principle is simple: not all computations are equal. The time required for an autoregressive model to generate the next token depends heavily on the computational path it takes. Several factors contribute to this variability.

  • Computational Complexity: Certain sequences require more complex processing. For example, generating a token that summarizes a long, convoluted context may involve more intensive attention calculations than generating a common word in a predictable sequence.
  • Mixture of Experts (MoE) Routing: In MoE architectures, different inputs are routed to different “expert” subnetworks. If a prompt requires an expert that is currently under heavy load or whose parameters are not in active memory, you will observe a measurable delay.
  • KV-Cache Dynamics: This is the most significant factor for this class of attacks. The Key-Value (KV) cache stores intermediate attention calculations for the input context.
    • A cache hit occurs when the necessary computations for the current token are already present in the cache from previous steps. This is fast.
    • A cache miss forces the model to recompute values, introducing a significant latency spike. Observing these misses is a powerful signal.
  • System Load and Contention: In multi-tenant systems, the workload of other users can impact your observed performance. A sudden increase in your token latency might indicate that another user has submitted a resource-intensive task to the shared GPU, creating resource contention.
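The cache-hit/miss distinction above can be sketched numerically. The following minimal simulation uses purely illustrative latency values (100 ms for a hit, 200 ms for a miss, 10 ms of Gaussian jitter — assumptions, not measurements) to show why the two populations stay distinguishable even through noise:

```python
import random
import statistics

random.seed(0)

# Hypothetical latency model: cache hits cluster around a fast
# baseline; misses add a recompute cost. All numbers illustrative.
def sample_latency_ms(cache_hit, jitter_ms=10.0):
    base_ms = 100.0 if cache_hit else 200.0
    return random.gauss(base_ms, jitter_ms)

hits = [sample_latency_ms(True) for _ in range(200)]
misses = [sample_latency_ms(False) for _ in range(200)]

print(f"hit mean:  {statistics.mean(hits):.1f} ms")
print(f"miss mean: {statistics.mean(misses):.1f} ms")
# Even with jitter, the two distributions barely overlap, which is
# what makes the KV-cache a usable side channel.
```

In practice the gap is smaller and the jitter larger, which is why the methodology below relies on many trials and statistics rather than single observations.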

[Diagram: variable token generation timing. Prompt sent → Token 1, Δt = 150 ms (initial processing) → Token 2, Δt = 100 ms (fast: common word, cache hit) → Token 3, Δt = 200 ms (slow: complex reasoning, cache miss?) → Token 4, Δt = 110 ms (return to baseline).]

Executing a Timing Analysis Attack

As a red teamer, your goal is to move from observing noise to identifying a signal. This requires a methodical approach to isolate and amplify the timing variations caused by specific model or system behaviors.

Methodology

  1. Establish a Baseline: First, you need to understand the system’s “heartbeat.” Send a series of simple, identical prompts (e.g., asking the model to count to ten) and measure the inter-token latency for each response. This helps you characterize the baseline performance and natural jitter of the API and infrastructure.
  2. Formulate a Hypothesis: Develop a specific question you want to answer. For example: “Can I detect when another user submits a prompt with a context longer than 4,000 tokens?” or “Can I identify which keywords cause the model to engage a specific, slower computational path?”
  3. Craft Probing Inputs: Design pairs of inputs. One input is a control (e.g., a simple query), and the other is the probe, designed to trigger the condition you’re testing (e.g., the same query but with a very long context).
  4. Measure and Analyze: Execute many trials, alternating between control and probe inputs. Collect high-resolution timestamps for each token’s arrival. Use statistical analysis to compare the timing distributions of the two input sets. A statistically significant difference in latency confirms your hypothesis.
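Step 4 can be sketched with a Welch's t-statistic comparing the control and probe timing distributions. The sample values and the `welch_t` helper below are illustrative assumptions, not part of any particular API:

```python
import math
import random
import statistics

random.seed(1)

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (mb - ma) / math.sqrt(va / len(a) + vb / len(b))

# Simulated trials (illustrative numbers): probe prompts trigger a
# slower computational path, shifting mean latency by ~15 ms.
control = [random.gauss(100.0, 15.0) for _ in range(100)]
probe = [random.gauss(115.0, 15.0) for _ in range(100)]

t = welch_t(control, probe)
print(f"t = {t:.2f}")
# A large |t| means the probe's extra latency is unlikely to be
# explained by jitter alone, confirming the hypothesis.
```

With real API data you would also correct for outliers and non-normal latency distributions; a non-parametric test such as Mann–Whitney U is often a safer choice than a t-test.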

Example Measurement Pseudocode

This conceptual script demonstrates how you might measure inter-token latency from a streaming API endpoint.

# Pseudocode for measuring token timing
import time
import api_client

PROMPT = "Tell me a short story about a robot."

def measure_token_latency(prompt):
    latencies = []
    last_token_time = time.monotonic()
    
    # Initiate a streaming connection to the LLM API
    # (api_client stands in for your provider's streaming SDK)
    stream = api_client.generate(prompt, stream=True)
    
    for token in stream:
        current_time = time.monotonic()
        # Time since the previous token arrived. Note: the first
        # interval also includes request setup and time-to-first-token,
        # so discard it or analyze it separately.
        latency = current_time - last_token_time
        latencies.append(latency)
        last_token_time = current_time
        
        print(f"Token: '{token}', Latency: {latency:.4f}s")
        
    return latencies

# Run the measurement and analyze the results
measured_latencies = measure_token_latency(PROMPT)
# Further statistical analysis would be performed here
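One simple form of that follow-up analysis is flagging latency spikes against the baseline. This sketch uses a z-score threshold; the `find_spikes` helper, the threshold value, and the sample data are assumptions for illustration:

```python
import statistics

def find_spikes(latencies, z_threshold=2.0):
    """Return indices of inter-token latencies far above baseline.

    Spikes are candidate cache misses or contention events. The
    threshold is tunable; 2.0 is an arbitrary starting point.
    """
    mean = statistics.mean(latencies)
    stdev = statistics.stdev(latencies)
    return [i for i, lat in enumerate(latencies)
            if (lat - mean) / stdev > z_threshold]

# Illustrative data: a steady ~0.10 s baseline with one 0.25 s spike.
sample = [0.10, 0.11, 0.09, 0.10, 0.25, 0.10, 0.11, 0.10]
print(find_spikes(sample))  # → [4]
```

A lone outlier inflates the standard deviation it is measured against, so on real traces a robust baseline (median and MAD, or a rolling window) generally works better than a global mean.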

Defensive Strategies and Mitigation

Protecting against timing analysis requires obscuring the correlation between computational effort and observable delay. Since achieving true constant-time execution is often impractical, defenses focus on adding noise and abstracting workloads.

  • Time Bucketing / Delay Injection: Intentionally add small, random delays to token generation, or release responses in fixed time intervals (buckets) rather than as soon as they are ready. Effectiveness: High. Directly masks the underlying signal, but introduces performance overhead and increases perceived latency.
  • Request Batching: Group multiple user requests together and process them as a single batch. The timing observed by one user is influenced by the entire batch, obfuscating the signal from any individual prompt. Effectiveness: Moderate to High. Very effective in multi-tenant environments; the signal is averaged out across many requests.
  • Strict Resource Isolation: Use hardware-level partitioning or dedicated GPU instances per tenant so that one user’s workload cannot create contention visible to another. Effectiveness: High. Prevents cross-tenant information leakage but is expensive and lowers hardware utilization.
  • Monitoring and Anomaly Detection: Watch for clients making the highly structured, repetitive API calls characteristic of a timing-attack probe, and flag or rate-limit suspicious patterns. Effectiveness: Moderate. A reactive defense that deters simple attacks but may be bypassed by sophisticated adversaries who mimic normal user behavior.
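The time-bucketing defense can be sketched as a streaming wrapper that releases tokens on a fixed schedule, decoupled from when the model actually produced them. This is a hypothetical sketch, not any vendor's implementation; a real server would buffer tokens from the inference engine and pick a bucket at least as long as the worst-case per-token generation time:

```python
import time

def stream_with_bucketing(tokens, bucket_s=0.15):
    """Yield tokens on a fixed schedule (time bucketing).

    Fast tokens are padded up to the bucket edge, so an observer
    sees a near-constant inter-token interval instead of the
    model's true, data-dependent timing.
    """
    next_release = time.monotonic()
    for token in tokens:
        next_release += bucket_s
        delay = next_release - time.monotonic()
        if delay > 0:
            time.sleep(delay)  # pad fast tokens to the bucket edge
        yield token

# Regardless of how quickly the tokens were generated, the observer
# sees roughly one token per bucket interval.
start = time.monotonic()
out = list(stream_with_bucketing(["a", "b", "c"], bucket_s=0.05))
elapsed = time.monotonic() - start
print(out, f"{elapsed:.2f}s")
```

Note the leakage limit: if generation is ever slower than the bucket, the delayed token still reveals timing, which is why bucket size must be chosen against worst-case latency, trading throughput for signal suppression.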

Understanding token generation timing is your entry point into a family of powerful side-channel attacks. The concepts introduced here—especially the role of the KV-cache—are foundational for the more advanced techniques discussed in the subsequent chapters.