AI API Protection: Configuring Rate Limiting and Throttling Against DoS Attacks

2025.10.17.
AI Security Blog

Your AI API is a Denial-of-Service Goldmine. Here’s How to Fix It.

Let’s get real for a moment. You’ve just launched your shiny new AI-powered application. The model is state-of-the-art, the UI is slick, and your API is ready to serve the world. You’re watching the metrics, and the requests start trickling in. Then, they start flooding. Your heart races—is this it? Is this the viral moment?

Then the alerts fire. GPU utilization is pegged at 100%. Your cloud bill is rocketing into orbit. Legitimate users are getting timeouts. Your beautiful new app is grinding to a halt, choked by a deluge of requests. You haven’t been discovered by a million new fans. You’ve been discovered by one person with a malicious script.


You’ve just met the new face of Denial-of-Service (DoS), tailor-made for the AI era.

If you think protecting an AI API is the same as protecting a standard REST API that just fetches user data from a database, you’re in for a world of pain. The game has changed because the nature of the work has changed. A traditional API is like a fast-food counter—orders are simple, standardized, and fulfilled in seconds. An AI API? That’s a Michelin-starred kitchen with a single, genius chef.

Every single order is a bespoke creation. It requires expensive, rare ingredients (GPU cycles), intense labor (model computation), and a lot of time. Now, what happens when a hundred people—or a thousand—show up at once and all demand a 12-course tasting menu? The kitchen collapses. That’s your AI API during a DoS attack.

Why Your AI API is a Ticking Time Bomb

The fundamental vulnerability of an AI API isn’t just about network bandwidth. A traditional DoS attack is about saturating the network pipe, like a firehose aimed at a garden hose. It’s noisy and brute-force. An AI DoS attack is far more insidious. It’s not about clogging the pipe; it’s about exhausting the chef.

This is called Resource Depletion DoS.

An attacker doesn’t need to send you a billion packets a second. They just need to send a few hundred, or even a few dozen, carefully crafted requests that trigger the most computationally expensive operations your model can perform. Each one of those requests kicks off a chain reaction on your backend:

  • GPU/TPU Meltdown: Your specialized AI hardware, the most expensive part of your stack, gets slammed. A single prompt for a high-resolution image or a complex code generation task can monopolize a GPU for several seconds, or even minutes.
  • Memory Exhaustion: Models, especially large language models (LLMs) or diffusion models, are memory hogs. Loading them into VRAM is a heavy lift. An attacker can craft requests that force your system to constantly swap models or handle large contexts, draining your memory resources.
  • Financial Hemorrhage: This is the scariest part. Every second of GPU time costs you real money. An attacker can sit back and watch your cloud bill spiral into the stratosphere. They don’t need to take your service offline permanently; they just need to make it too expensive for you to keep it online. A successful attack might not result in a 404 error, but in a call from your CFO.

A single, seemingly innocent API call can trigger a cascade of resource consumption that is orders of magnitude greater than the request itself. This is the asymmetry that attackers love.

[Diagram: The Asymmetry of AI API Attacks — a tiny API request (e.g., 2KB of JSON) triggers massive backend computation across GPU, CPU, and RAM, consuming seconds or minutes of work and gigabytes of memory.]

So how do we fight back? We start by treating our API endpoint not as an open door, but as a heavily guarded checkpoint. Our first and most important weapons are Rate Limiting and Throttling.

Meet the Bouncers: Rate Limiting and Throttling

Before we dive deep, let’s get the terms straight. People often use them interchangeably, but they are two distinct, albeit related, concepts.

  • Rate Limiting is the bouncer at the door with a clicker. It enforces a hard cap. “This club has a capacity of 100 people. You are number 101. You can’t come in.” In API terms, it means rejecting requests (usually with a 429 Too Many Requests status code) once a certain threshold is passed. It’s a blunt but effective instrument.
  • Throttling is the bouncer who sees a long line forming and tells the DJ to slow down the music. It doesn’t reject requests outright; it slows them down. It shapes the request traffic, forcing it into an orderly queue to be processed as resources become available. It’s about graceful degradation, not outright denial.

You need both. Rate limiting is your shield against overwhelming floods. Throttling is your system for managing legitimate, but heavy, traffic without collapsing.

The Rate Limiting Playbook: Algorithms Matter

Saying “we need rate limiting” is easy. Implementing it effectively is hard. The specific algorithm you choose has massive implications for both security and user experience. Let’s break down the common ones.

1. Fixed Window Counter

This is the simplest, most intuitive approach. You set a limit for a fixed time window. For example, “100 requests per minute.”

How it works: A counter is reset at the start of each minute. Every request increments the counter. If the counter hits 100, all subsequent requests in that minute are rejected. At the start of the next minute, the counter resets to zero.

The problem? It’s dumb. An attacker can be clever. They can wait until the last second of a window (e.g., at 10:00:59) and send 100 requests. Then, at the very next second (10:01:00), the window resets, and they can immediately send another 100 requests. Your system gets hit with 200 requests in two seconds, completely defeating the purpose of a “100 requests per minute” limit. This is called a burst attack at the window edge.

2. Sliding Window Log

This is a much smarter approach that solves the edge-burst problem. It doesn’t care about fixed minutes; it cares about the last minute.

How it works: The system keeps a timestamped log of every request from a user. To check the limit, it counts how many timestamps in the log are within the last 60 seconds. If that count is below the limit, the request is accepted and its timestamp is added to the log. Old timestamps (older than 60 seconds) are discarded.

The good: It’s precise and smooths out traffic, preventing the edge-burst problem.
The bad: It can be memory-intensive, as you have to store a log of timestamps for every single user.

3. Sliding Window Counter

This is a hybrid approach that offers a great balance of performance and accuracy. It combines the low memory footprint of the Fixed Window with the accuracy of the Sliding Window.

How it works: It uses a counter for the current window and also considers the counter from the previous window. For a 1-minute limit, it calculates a weighted count based on how much of the previous minute’s window overlaps with the current 60-second period. For example, if we are 15 seconds into the new minute, the rate is calculated as: (45/60 * count_from_previous_minute) + count_from_current_minute. This provides a rolling average that’s much more resilient to bursts.

4. Token Bucket (My Personal Favorite)

This is one of the most flexible and widely used algorithms, especially for APIs. Forget counters; think in terms of resources.

The Analogy: Imagine every user has a bucket. This bucket is continuously refilled with “tokens” at a steady rate, say, 10 tokens per second. The bucket has a maximum capacity, say, 100 tokens. Each API request costs one token. When a request comes in, the system checks if there’s at least one token in the bucket. If yes, the request is processed, and a token is removed. If the bucket is empty, the request is rejected.

Why it’s brilliant: It naturally handles bursts! A user who has been inactive for a while will have a full bucket of 100 tokens. They can make a burst of 100 requests all at once, which is often legitimate behavior (like a script starting up). After that burst, they are limited to the refill rate (10 requests/sec). It provides flexibility while still enforcing a sustainable long-term rate.

[Diagram: Token Bucket Algorithm — the user’s bucket refills at a steady rate (e.g., 10 tokens/sec) up to a max capacity of 100. Each incoming request consumes a token if one is available; otherwise it is rejected with a 429 error.]

5. Leaky Bucket

The Leaky Bucket is often confused with the Token Bucket, but it serves a different purpose. It’s less about allowing bursts and more about ensuring a steady outflow.

The Analogy: Imagine a bucket with a hole in the bottom. Requests are “poured” into the bucket. The bucket has a fixed capacity (a queue). The system processes requests from the bottom of the bucket at a constant, steady rate. If requests come in faster than they can be processed, the bucket fills up. If the bucket is full when a new request arrives, that request is “spilled” (rejected).

Use Case: This is more of a throttling mechanism. It’s excellent for ensuring your backend services are fed a steady stream of work they can handle, preventing them from being overwhelmed by a sudden flood. It turns bursty, unpredictable traffic into a smooth, predictable workload.

[Diagram: Leaky Bucket Algorithm — bursty incoming requests enter a FIFO request queue and drain at a constant processing rate; a request that arrives when the queue is full is spilled (rejected).]
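Modeled as code, the leaky bucket is just a bounded FIFO queue with a constant drain rate. This sketch simulates the drain lazily (like the token bucket’s lazy refill) rather than running a real worker loop:

```python
from collections import deque

class LeakyBucket:
    """Bounded FIFO queue draining at a constant rate; overflow is spilled."""

    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity
        self.drain_rate = drain_rate  # requests processed per second
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request, now: float) -> bool:
        # Drain whatever the backend would have processed since the last check.
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # handed off to the backend
            self.last_drain = now
        if len(self.queue) >= self.capacity:
            return False  # spilled: queue is full
        self.queue.append(request)
        return True
```

Notice the contrast with the token bucket: no matter how bursty the arrivals, the backend only ever sees `drain_rate` requests per second.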
Golden Nugget: For most public-facing AI APIs, the Token Bucket algorithm is your best starting point. It provides the perfect blend of burst tolerance for good users and strong protection against sustained abuse.

A Multi-Layered Strategy: The Defense Onion

So you’ve picked an algorithm. Great. But where do you apply it? Applying a single, global rate limit to your entire API is a rookie mistake. A real professional builds layers of defense, like an onion. A request has to get through multiple checkpoints before it’s allowed to touch your expensive model.

[Diagram: Layered API Defense Strategy — an API request passes through Layer 1: Global / IP Limit (basic bot protection), then Layer 2: Per-User / API Key Limit (e.g., 1000 requests/day), then Layer 3: Cost-Based Endpoint Limit (e.g., /generate: 5/min, /status: 100/min) before reaching the AI Model (GPU/TPU).]

Layer 1: The Edge (Global & IP-Based Limits)

This is your outermost wall. It’s handled by your CDN (like Cloudflare), your WAF (Web Application Firewall), or your API Gateway (like Kong or NGINX). The goal here is to stop the dumbest, loudest attacks.

  • Global Rate Limit: A very high, system-wide limit. “Do not allow more than 10,000 requests per second to hit our origin servers, period.” This is a safety valve to prevent your entire infrastructure from being overwhelmed.
  • Per-IP Rate Limit: A stricter limit applied to individual IP addresses. This is surprisingly effective against naive botnets or single-machine attacks. An anonymous IP trying to hit you 100 times a second? Block it at the edge before it ever consumes a single CPU cycle on your app servers.

This layer is crude, but it filters out a ton of noise.
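At the NGINX edge, both of these limits are a few lines of configuration. A sketch with illustrative values (zone sizes, rates, and the `app_backend` upstream name are all assumptions to adapt to your setup):

```nginx
# Per-IP limiting: roughly 10 req/s per client IP, tracked in a 10MB zone.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    location /api/ {
        # Allow short bursts of 20, reject the rest immediately with 429.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://app_backend;
    }
}
```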

Layer 2: The User (Per-User / Per-API Key Limits)

This is the most critical layer for any service with authentication. Once you know who a user is, you can enforce much more intelligent limits. Every user or API key gets its own token bucket.

Why is this essential? It ensures that one abusive or compromised user account cannot degrade the service for all your other legitimate users. It isolates the blast radius.

This is also where you can implement tiered pricing. Your free-tier users get a small bucket (e.g., 100 requests/day), while your enterprise customers get a massive one. This is not just a security control; it’s a core part of your business model.

Layer 3: The Task (Cost-Based, Per-Endpoint Limits)

Here’s where we get specific to AI. Not all API calls are created equal. You know this. So why would you give them the same rate limit?

You need to analyze your own API and assign a “cost” to each endpoint. This doesn’t have to be a complex calculation at first. A simple categorization is a great start.

| Endpoint | Function | Cost Profile | Example Rate Limit (per user) | Strategy |
|---|---|---|---|---|
| /v1/status | Check API health | Very Low (DB lookup) | 100 per minute | Loose limit, mainly for nuisance prevention. |
| /v1/classify_text | Run text through a small classification model | Medium (fast inference) | 60 per minute | Standard Token Bucket. Allows some burst. |
| /v1/generate_image | Generate an image from a prompt | High (GPU-intensive, slow) | 5 per minute | Strict Token Bucket with a small burst capacity. |
| /v1/finetune_model | Submit a fine-tuning job | Extremely High (hours of GPU) | 2 per day | Very strict limit, possibly combined with manual approval. |

By implementing per-endpoint limits based on computational cost, you are directly mitigating the Resource Depletion DoS attack. An attacker can hammer your /status endpoint all day long and it won’t matter. But if they try to abuse /generate_image, they’ll be shut down after a handful of requests.
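One straightforward way to wire this up is a token bucket per (user, endpoint) pair, with each endpoint getting its own rate and burst capacity. A sketch using illustrative numbers in the spirit of the table (the endpoint names are from the table; the exact rates are assumptions):

```python
class TokenBucket:
    """Minimal token bucket (see the Token Bucket section for the full version)."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, cost: float, now: float) -> bool:
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Per-endpoint limits: (refill rate in tokens/sec, burst capacity).
ENDPOINT_LIMITS = {
    "/v1/status": (100 / 60, 100),        # cheap: generous limit
    "/v1/classify_text": (1.0, 10),       # medium: standard limit
    "/v1/generate_image": (5 / 60, 2),    # expensive: strict, tiny burst
}

class UserLimits:
    """One independent token bucket per endpoint, for a single user."""

    def __init__(self):
        self.buckets = {}

    def allow(self, endpoint: str, now: float) -> bool:
        if endpoint not in self.buckets:
            rate, cap = ENDPOINT_LIMITS[endpoint]
            self.buckets[endpoint] = TokenBucket(rate, cap, now)
        return self.buckets[endpoint].allow(1, now)
```

Because the buckets are independent, hammering the expensive endpoint exhausts only that endpoint’s budget — the cheap ones keep working for the same user.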

Golden Nugget: Your rate limiting strategy MUST be cost-aware. A flat limit across all endpoints is a gaping security hole in an AI API. Profile your endpoints and protect your expensive ones aggressively.

Advanced Warfare: Adaptive Limiting and Throttling

Static, pre-configured limits are good. But the best defense is a dynamic one that responds to the real-time state of your system. This is where you graduate from being purely defensive to being actively responsive.

Adaptive Rate Limiting Based on System Load

The idea is simple: if your system is under stress, tighten the screws. If it’s idle, loosen them up.

How it works: Your rate-limiting service needs to be aware of your system’s health. It should monitor key metrics:

  • Average GPU utilization
  • Inference queue length
  • Model VRAM usage
  • API response latency

You then define thresholds. For example:

  • Normal State (GPU < 70%): Standard rate limits apply.
  • High Load (GPU > 70%): Automatically reduce the refill rate of all token buckets by 25%.
  • Critical Load (GPU > 90%): Reduce refill rates by 50% and drastically shrink the burst capacity. Maybe even temporarily block the most expensive endpoints for new free-tier users.

This is like an automatic surge protector for your entire system. It gracefully degrades performance for everyone to prevent a total meltdown, prioritizing keeping the service online over serving every single request at maximum speed.
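The load-to-limit mapping itself can be a trivial function that your rate limiter consults when computing refill rates. The thresholds mirror the example above; in practice the GPU utilization figure would come from your monitoring stack (a hypothetical input here):

```python
def adaptive_refill_rate(base_rate: float, gpu_utilization: float) -> float:
    """Scale a token bucket's refill rate by current system load.

    gpu_utilization is a 0.0-1.0 fraction, e.g. from your metrics backend.
    """
    if gpu_utilization > 0.90:    # critical load: cut refill rates in half
        return base_rate * 0.50
    if gpu_utilization > 0.70:    # high load: reduce refill rates by 25%
        return base_rate * 0.75
    return base_rate              # normal operation: standard limits
```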

Intelligent Throttling and Prioritization

Remember throttling? It’s about managing the queue, not just rejecting requests. When your system is under high load, you shouldn’t just slam the door with a 429 error on your best customers. You should make them wait a little longer.

This is where a Priority Queue comes in. When a request arrives and the backend is busy, instead of processing it immediately, you place it in a queue. But not just any queue—a queue that understands priorities.

  1. Enterprise Customer Request: High priority. Goes to the front of the line.
  2. Paying Pro-Tier User: Medium priority.
  3. Free-Tier User: Low priority.

During normal operation, everyone gets fast responses. But when the system is under load, the free-tier users will experience higher latency as the paying customers’ jobs are processed first. This is a form of throttling that preserves the quality of service for those who pay for it, which is exactly what you want.
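A priority queue with those three tiers is easy to sketch with Python’s `heapq`. The monotonic counter is the important detail: it breaks ties so that requests within the same tier are still served first-come, first-served:

```python
import heapq
import itertools

# Lower number = higher priority. Tier names are illustrative.
TIER_PRIORITY = {"enterprise": 0, "pro": 1, "free": 2}

class PriorityRequestQueue:
    """Strict priority between tiers, FIFO within a tier."""

    def __init__(self):
        self.heap = []
        self.counter = itertools.count()  # tiebreaker preserves arrival order

    def put(self, tier: str, request) -> None:
        heapq.heappush(self.heap, (TIER_PRIORITY[tier], next(self.counter), request))

    def get(self):
        # Pops the highest-priority, earliest-arrived request.
        return heapq.heappop(self.heap)[2]
```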

Beyond the Bouncer: Other Essential Tools

Rate limiting is your cornerstone, but don’t stop there. A comprehensive defense includes a few other key components.

Circuit Breakers

A circuit breaker is a design pattern you apply between your services. If your API gateway makes a call to your inference service and it times out or returns an error, the circuit breaker “trips.” For the next few seconds, it won’t even try to send traffic to that failing service; it will immediately return an error. This prevents a single failing model server from causing a cascade of failures throughout your entire system. After a cooldown period, it will “half-open,” sending a single request to see if the service has recovered. If it has, the breaker closes and traffic flows normally.
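The three states (closed, open, half-open) map directly onto a small state machine. A minimal sketch, with illustrative threshold and cooldown values:

```python
class CircuitBreaker:
    """Closed -> open after repeated failures; half-open after a cooldown;
    closed again after one successful probe."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 10.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, now: float):
        if self.state == "open":
            if now - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"  # cooldown elapsed: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = now
            raise
        self.failures = 0
        self.state = "closed"
        return result
```

While the breaker is open, callers get an instant error instead of piling timed-out requests onto an already-failing inference service — that’s what stops the cascade.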

Cost Estimation and Pre-computation Checks

For some AI tasks, you can estimate the cost before you even run the model.

  • Image Generation: A request for a 1024×1024 image with 100 inference steps is far more expensive than a 512×512 image with 20 steps. You can calculate a “compute score” from the request parameters.
  • LLM Prompts: The length of the input prompt and the requested max_tokens for the output are good proxies for cost.

You can set a hard limit on this compute score. If a user sends a ridiculously complex request, you can reject it immediately with a 400 Bad Request, explaining why it was rejected. “Error: Maximum generation complexity exceeded.” This happens before it ever touches the GPU queue.
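For image generation, the simplest usable compute score is just pixels times inference steps. Both the formula and the cutoff below are hypothetical — you would calibrate them against your own model’s measured cost per request:

```python
# Hypothetical hard cap on per-request compute; tune to your model's profile.
MAX_COMPUTE_SCORE = 50_000_000

def image_compute_score(width: int, height: int, steps: int) -> int:
    """Rough cost proxy: pixels processed per inference step, summed over steps."""
    return width * height * steps

def validate_request(width: int, height: int, steps: int) -> None:
    """Reject over-budget requests before they ever touch the GPU queue."""
    if image_compute_score(width, height, steps) > MAX_COMPUTE_SCORE:
        raise ValueError("Maximum generation complexity exceeded")
```

Under this cap, a 512×512 image at 20 steps (~5.2M score) sails through, while a 1024×1024 image at 100 steps (~105M score) is rejected with a 400-class error before consuming any GPU time.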

Caching

Don’t sleep on caching, even for AI. If you have an endpoint for, say, summarizing a URL, and ten people submit the same URL within an hour, are you really going to re-summarize it ten times? Of course not. Cache the result with the URL as the key. This is most effective for deterministic tasks (classification, translation, summarization) and less so for highly creative generative tasks, but it can still save a massive amount of redundant computation.

Final Thoughts: It’s Not Just Security, It’s Survival

Protecting your AI API from DoS attacks isn’t a checkbox on a security audit. It’s a fundamental aspect of your product’s architecture and your company’s financial viability.

The attacks are no longer just about ego and taking a site offline. They are about economic warfare. A sophisticated attacker can fly under the radar of traditional DoS protection, slowly and quietly bleeding your company dry by making thousands of expensive API calls over weeks, never tripping a simple high-volume alert.

Your defense has to be just as sophisticated. It must be layered, cost-aware, and adaptive. You have to move beyond just counting requests per second and start thinking about managing computational resources per user, per task, per second.

So, look at your API. Look at your most expensive endpoints. Ask yourself the uncomfortable question: what’s stopping one person with a $10 VPS and a simple script from costing me $10,000 in cloud bills overnight?

If you don’t have a good answer, you know where to start.