The token bucket is a more sophisticated rate-limiting algorithm than a simple fixed window, offering flexibility for bursty traffic. However, its behavior is fully deterministic, and that determinism is precisely what you can exploit. By reverse-engineering its parameters, you can craft a request pattern that maximizes throughput right up to the API’s limit, effectively bypassing its intended throttling effect for short, high-intensity attacks.
Understanding the Token Bucket Algorithm
Imagine a bucket with a fixed capacity. Tokens, each representing permission for one request, are added to this bucket at a constant rate. When a request arrives, it can only proceed if there’s at least one token in the bucket. If so, a token is removed, and the request is processed. If the bucket is empty, the request is rejected (e.g., with a 429 Too Many Requests status).
- Bucket Size (Burst Capacity): The maximum number of tokens the bucket can hold. This dictates the maximum number of requests that can be sent in a rapid burst before throttling begins.
- Refill Rate: The rate at which new tokens are added to the bucket, usually specified as tokens per second or minute. This determines the sustainable, long-term request rate.
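The mechanics described above can be sketched as a minimal server-side implementation. This is an illustrative model, not any particular library's API; the class name `TokenBucket` and the parameter values are assumptions:

```python
import time

class TokenBucket:
    """Minimal token bucket: 'capacity' caps bursts, 'refill_rate'
    (tokens/second) sets the sustainable long-term request rate."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity              # bucket starts full
        self.last_refill = time.monotonic()

    def allow_request(self):
        now = time.monotonic()
        # Add tokens accrued since the last check, never exceeding capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request proceeds
        return False      # caller would respond with 429 Too Many Requests

# A bucket of size 3: the first 3 back-to-back calls pass, the 4th is throttled
bucket = TokenBucket(capacity=3, refill_rate=0.5)
results = [bucket.allow_request() for _ in range(4)]
print(results)  # [True, True, True, False]
```

Note that the bucket refills continuously as a function of elapsed time, rather than on a timer; this is what makes the burst-then-pace pattern described later possible.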
Exploitation Strategy: Probing and Bursting
The attack is a two-phase process. First, you probe the endpoint to empirically determine the bucket’s parameters. Second, you use this knowledge to craft a request pattern that fully utilizes the available capacity without triggering sustained rate-limiting.
Phase 1: Determine Bucket Size
To find the bucket size, you send a rapid-fire burst of requests using parallel connections and observe how many succeed before you receive a 429 error. The number of successful requests is a strong indicator of the burst capacity.
```python
# Python example using httpx for concurrent requests
import asyncio
import httpx

TARGET_URL = "https://api.example.com/v1/model/generate"

async def probe_for_burst_limit(client, url):
    # Fire requests concurrently so the burst lands before the bucket refills
    tasks = [client.post(url, json={"prompt": "test"}) for _ in range(50)]
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # gather() preserves task-creation order, not completion order, so count
    # every 200 rather than stopping at the first 429
    successful_requests = sum(
        1 for res in responses
        if isinstance(res, httpx.Response) and res.status_code == 200
    )
    print(f"Estimated Bucket Size (Burst Limit): {successful_requests}")
    return successful_requests

# Usage:
# async with httpx.AsyncClient() as client:
#     await probe_for_burst_limit(client, TARGET_URL)
```
Phase 2: Determine Refill Rate
Once you’ve depleted the bucket, you can measure the refill rate. Wait for a fixed interval (e.g., 10 seconds) and then send another small burst of requests. If two requests succeed after 10 seconds, you can infer a refill rate of approximately 1 token every 5 seconds, or 0.2 tokens/second.
```python
# Python example for probing the refill rate
import time
import requests

TARGET_URL = "https://api.example.com/v1/model/generate"

def probe_for_refill_rate(burst_limit):
    # First, exhaust the entire bucket by sending 'burst_limit' requests
    print(f"Depleting bucket with {burst_limit} requests...")
    for _ in range(burst_limit):
        requests.post(TARGET_URL, json={"prompt": "test"})

    wait_interval_sec = 10
    print(f"Waiting {wait_interval_sec} seconds for tokens to refill...")
    time.sleep(wait_interval_sec)

    # Now probe how many new requests succeed before hitting 429 again
    successful_after_wait = 0
    for _ in range(burst_limit):  # Probe up to the burst limit
        response = requests.post(TARGET_URL, json={"prompt": "test"})
        if response.status_code == 200:
            successful_after_wait += 1
        else:
            break

    refill_rate = successful_after_wait / wait_interval_sec
    print(f"Refilled {successful_after_wait} tokens in {wait_interval_sec}s.")
    print(f"Estimated Refill Rate: {refill_rate:.2f} tokens/second.")
    return refill_rate
```
Phase 3: Execute the Controlled Burst Attack
With both parameters known, you can launch a precisely timed attack. The goal is to send a large number of malicious requests (e.g., for data exfiltration or resource exhaustion) in a way that maximizes throughput. The pattern is simple:
- Consume the entire bucket capacity with an initial burst of requests.
- Immediately switch to sending subsequent requests at a pace that matches the refill rate.
This allows you to “ride the edge” of the rate limit, sending requests as fast as the system permits without being locked out for an extended period.
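The two-step pattern above can be sketched as follows. This is a sketch under stated assumptions: `send_request` is a placeholder for your actual request logic, and `bucket_size` and `refill_rate` stand in for the values probed in Phases 1 and 2:

```python
import time

def ride_the_edge(send_request, bucket_size, refill_rate, total_requests):
    """Send 'total_requests' as fast as the probed limits allow: an initial
    burst of 'bucket_size', then one request per refill interval."""
    # Step 1: consume the full burst capacity immediately
    for _ in range(min(bucket_size, total_requests)):
        send_request()

    # Step 2: pace the remainder to match the refill rate
    interval = 1.0 / refill_rate  # seconds per newly minted token
    for _ in range(max(0, total_requests - bucket_size)):
        time.sleep(interval)
        send_request()

# Example with a timestamp-recording stub in place of a real HTTP call:
sent = []
ride_the_edge(lambda: sent.append(time.monotonic()),
              bucket_size=5, refill_rate=10.0, total_requests=8)
print(f"Sent {len(sent)} requests")
```

In practice you would pace slightly below the measured refill rate, since clock drift and network jitter (discussed below) otherwise cause occasional 429s.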
| Parameter | Probing Method | Example Inferred Value | Attack Implication |
|---|---|---|---|
| Bucket Size | Send concurrent requests until a 429 response. | 15 requests | You can send a burst of 15 high-impact requests instantly. |
| Refill Rate | Deplete bucket, wait, and re-probe. | 0.5 tokens/sec | After the initial burst, you can sustain 1 request every 2 seconds. |
Attack Considerations and Nuances
Real-world exploitation requires accounting for several factors:
- Network Latency: Jitter and high latency can interfere with precise timing, potentially causing your requests to be rejected even if you believe you are within the refill rate. Your probing scripts should account for round-trip time.
- Distributed Token Buckets: In a load-balanced environment, the token bucket might be shared across multiple servers. This can make probing results inconsistent. You may need to average results over several attempts to get a clearer picture.
- Dynamic Parameters: Some advanced systems may adjust bucket size or refill rates based on overall system load or user behavior, complicating your efforts to establish a fixed baseline.
- Per-Endpoint Limits: An application might use different token bucket configurations for different API endpoints (e.g., a more generous limit for `/chat` than for `/chat/history`). Each must be probed independently.
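For the distributed-bucket case above, one simple way to average results over several attempts is to repeat the burst probe and take the median, which smooths out hits against differently filled per-server buckets. In this sketch, `probe_once` is a placeholder for a single burst-probe run like the one in Phase 1, and the cooldown value is an assumption:

```python
import statistics
import time

def estimate_burst_limit(probe_once, attempts=5, cooldown_sec=30):
    """Run the burst probe several times and take the median, smoothing out
    inconsistent results from load-balanced (per-server) token buckets."""
    estimates = []
    for i in range(attempts):
        estimates.append(probe_once())
        if i < attempts - 1:
            time.sleep(cooldown_sec)  # let every bucket fully refill between runs
    return statistics.median(estimates)

# Example with stubbed probe results: one run hit a half-empty server bucket
fake_results = iter([15, 9, 15, 15, 14])
print(estimate_burst_limit(lambda: next(fake_results),
                           attempts=5, cooldown_sec=0))  # → 15
```

The median is preferable to the mean here because a single probe that lands on a partially drained bucket produces a low outlier rather than symmetric noise.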