32.2.2 Token Bucket Mechanism Exploitation

2025.10.06.
AI Security Blog

The token bucket is a more sophisticated rate-limiting algorithm than a simple fixed window, offering flexibility for bursty traffic. However, its behavior is fully deterministic, and that determinism is precisely what you can exploit. By reverse-engineering its parameters, you can craft a request pattern that maximizes throughput right up to the API's limit, effectively bypassing its intended throttling effect for short, high-intensity attacks.

Understanding the Token Bucket Algorithm

Imagine a bucket with a fixed capacity. Tokens, each representing permission for one request, are added to this bucket at a constant rate. When a request arrives, it can only proceed if there’s at least one token in the bucket. If so, a token is removed, and the request is processed. If the bucket is empty, the request is rejected (e.g., with a 429 Too Many Requests status).


[Diagram: Token Bucket Mechanism — tokens arrive at the refill rate into a bucket of fixed capacity (the burst limit); incoming API requests each consume one token.]

Two critical parameters define any token bucket implementation:

  • Bucket Size (Burst Capacity): The maximum number of tokens the bucket can hold. This dictates the maximum number of requests that can be sent in a rapid burst before throttling begins.
  • Refill Rate: The rate at which new tokens are added to the bucket, usually specified as tokens per second or minute. This determines the sustainable, long-term request rate.
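To make these two parameters concrete, here is a minimal sketch of a token bucket limiter as a server might implement it. The class name and parameter values are illustrative, not taken from any real API:

```python
import time

class TokenBucket:
    """Minimal token bucket: capacity = burst limit, refill_rate = tokens/sec."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)      # start full, so an initial burst is allowed
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Add tokens accrued since the last check, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                       # caller would respond with 429

bucket = TokenBucket(capacity=5, refill_rate=1.0)
results = [bucket.allow() for _ in range(6)]
print(results)  # the first 5 requests (the capacity) succeed; the 6th is rejected
```

Note that the bucket starts full: this is exactly why a fresh client can fire an instant burst of `capacity` requests before throttling kicks in.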

Exploitation Strategy: Probing and Bursting

The attack is a two-phase process. First, you probe the endpoint to empirically determine the bucket’s parameters. Second, you use this knowledge to craft a request pattern that fully utilizes the available capacity without triggering sustained rate-limiting.

Phase 1: Determine Bucket Size

To find the bucket size, you send a rapid-fire burst of requests using parallel connections and observe how many succeed before you receive a 429 error. The number of successful requests is a strong indicator of the burst capacity.

# Python pseudocode using httpx for concurrent requests
import asyncio
import httpx

TARGET_URL = "https://api.example.com/v1/model/generate"

async def probe_for_burst_limit(client, url):
    # Fire a burst of concurrent requests to drain the bucket all at once
    tasks = [client.post(url, json={"prompt": "test"}) for _ in range(50)]
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Under concurrency, list order does not reflect completion order, so
    # count every 200 response rather than stopping at the first 429
    successful_requests = sum(
        1 for res in responses
        if isinstance(res, httpx.Response) and res.status_code == 200
    )

    print(f"Estimated Bucket Size (Burst Limit): {successful_requests}")
    return successful_requests

# Usage
# async with httpx.AsyncClient() as client:
#     await probe_for_burst_limit(client, TARGET_URL)

Phase 2: Determine Refill Rate

Once you’ve depleted the bucket, you can measure the refill rate. Wait for a fixed interval (e.g., 10 seconds) and then send another small burst of requests. If two requests succeed after 10 seconds, you can infer a refill rate of approximately 1 token every 5 seconds, or 0.2 tokens/second.

# Python pseudocode for probing refill rate
import time
import httpx

def probe_for_refill_rate(client, url, burst_limit):
    # First, exhaust the entire bucket by sending 'burst_limit' requests
    print(f"Depleting bucket with {burst_limit} requests...")
    for _ in range(burst_limit):
        client.post(url, json={"prompt": "test"})

    wait_interval_sec = 10
    print(f"Waiting {wait_interval_sec} seconds for tokens to refill...")
    time.sleep(wait_interval_sec)

    # Now, probe how many new requests succeed before the next 429
    successful_after_wait = 0
    for _ in range(burst_limit):  # Probe up to the burst limit
        response = client.post(url, json={"prompt": "test"})
        if response.status_code == 200:
            successful_after_wait += 1
        else:
            break

    refill_rate = successful_after_wait / wait_interval_sec
    print(f"Refilled {successful_after_wait} tokens in {wait_interval_sec}s.")
    print(f"Estimated Refill Rate: {refill_rate:.2f} tokens/second.")
    return refill_rate

Phase 3: Execute the Controlled Burst Attack

With both parameters known, you can launch a precisely timed attack. The goal is to send a large number of malicious requests (e.g., for data exfiltration or resource exhaustion) in a way that maximizes throughput. The pattern is simple:

  1. Consume the entire bucket capacity with an initial burst of requests.
  2. Immediately switch to sending subsequent requests at a pace that matches the refill rate.

This allows you to “ride the edge” of the rate limit, sending requests as fast as the system permits without being locked out for an extended period.

Parameter   | Probing Method                                 | Example Inferred Value | Attack Implication
Bucket Size | Send concurrent requests until a 429 response. | 15 requests            | You can send a burst of 15 high-impact requests instantly.
Refill Rate | Deplete bucket, wait, and re-probe.            | 0.5 tokens/sec         | After the initial burst, you can sustain 1 request every 2 seconds.
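Using the example values in the table above, the total throughput achievable over any time window is easy to work out: the burst capacity plus whatever refills during the window.

```python
# Maximum requests achievable over a window of T seconds:
#   bucket_size + refill_rate * T
bucket_size, refill_rate = 15, 0.5   # example values from the table above
window_sec = 60

max_requests = bucket_size + refill_rate * window_sec
print(max_requests)  # 45.0
```

So over one minute, an attacker who times the pattern correctly gets 45 requests through, versus only 30 for a naive client that paces uniformly at the sustained rate.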

Attack Considerations and Nuances

Real-world exploitation requires accounting for several factors:

  • Network Latency: Jitter and high latency can interfere with precise timing, potentially causing your requests to be rejected even if you believe you are within the refill rate. Your probing scripts should account for round-trip time.
  • Distributed Token Buckets: In a load-balanced environment, the token bucket might be shared across multiple servers. This can make probing results inconsistent. You may need to average results over several attempts to get a clearer picture.
  • Dynamic Parameters: Some advanced systems may adjust bucket size or refill rates based on overall system load or user behavior, complicating your efforts to establish a fixed baseline.
  • Per-Endpoint Limits: An application might use different token bucket configurations for different API endpoints (e.g., a more generous limit for /chat than for /chat/history). Each must be probed independently.
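The network-latency point above is usually handled by pacing slightly slower than the inferred refill rate. A hedged sketch, where the jitter and margin values are illustrative assumptions rather than values from any real system:

```python
def safe_interval(refill_rate, jitter_sec=0.05, margin=1.1):
    """Seconds to wait between paced requests, padded against latency jitter.

    refill_rate: estimated tokens/sec from probing.
    jitter_sec, margin: assumed safety parameters, tuned per target.
    """
    return (1.0 / refill_rate) * margin + jitter_sec

print(safe_interval(0.5))  # roughly 2.25s instead of the theoretical 2.0s
```

The cost of the margin is a slightly lower sustained rate; the benefit is that a single delayed packet no longer triggers a 429 and disrupts the attack's timing.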