Design a Rate Limiter: Algorithms, Distributed Challenges, and Optimizations

Rate limiting is a fundamental building block of any robust distributed system. Use it to protect your services from excessive usage, prevent Denial of Service (DoS) attacks, and ensure fair resource allocation among users. In this deep dive, we'll architect a production-grade rate limiter, moving from basic algorithms to a distributed, fault-tolerant solution.
Why Rate Limit?
Before we design, let's clarify the requirements:
- Prevent Resource Starvation: Ensure a single heavy user doesn't degrade performance for others.
- Security: Mitigate brute-force password attempts and DDoS attacks.
- Cost Management: Limit calls to expensive third-party APIs (e.g., Stripe, OpenAI).
1. Choosing the Right Algorithm
There is no "one size fits all." Let's compare the most common strategies.
A. Token Bucket
The Concept: A bucket holds tokens. A "refiller" adds tokens at a fixed rate, and each request consumes one token. If the bucket is empty, the request is rejected (see the sketch after this list).
- Pros: Allows bursts of traffic (up to bucket capacity). Memory efficient.
- Cons: Tuning bucket size and refill rate can be tricky.
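A minimal single-process sketch of the idea (illustrative only: the `TokenBucket` class and its parameters are ours, and it is not thread-safe):

```python
import time

class TokenBucket:
    """Illustrative in-memory token bucket (single process, not thread-safe)."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity           # maximum tokens, i.e. allowed burst size
        self.refill_rate = refill_rate     # tokens added per second
        self.tokens = float(capacity)      # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1   # consume one token for this request
            return True
        return False

bucket = TokenBucket(capacity=10, refill_rate=5)  # burst of 10, 5 req/sec sustained
print(bucket.allow())  # True until the bucket runs dry
```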
B. Leaky Bucket
The Concept: Requests enter a FIFO queue (the bucket) and are processed, i.e. they "leak" out, at a constant rate (see the sketch after this list).
- Pros: Smooths bursts into a stable outflow rate. Ideal when the downstream service requires a steady processing rate.
- Cons: A burst quickly fills the queue; once it is full, new legitimate requests are dropped, and requests already queued may be served with noticeable delay.
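A sketch of the queue-based variant (illustrative: `LeakyBucket` is our own name, and the background thread stands in for whatever worker drains the queue in a real system):

```python
import queue
import threading
import time

class LeakyBucket:
    """Illustrative leaky bucket: a bounded FIFO queue drained at a constant rate."""

    def __init__(self, capacity: int, leak_rate: float):
        self.requests = queue.Queue(maxsize=capacity)
        self.interval = 1.0 / leak_rate    # seconds between processed requests
        threading.Thread(target=self._leak, daemon=True).start()

    def submit(self, request) -> bool:
        try:
            self.requests.put_nowait(request)  # enqueue if there is room
            return True
        except queue.Full:
            return False                       # bucket overflowed: reject

    def _leak(self):
        while True:
            request = self.requests.get()      # blocks until a request arrives
            request()                          # process at the constant outflow rate
            time.sleep(self.interval)

bucket = LeakyBucket(capacity=100, leak_rate=10)   # drain 10 req/sec
bucket.submit(lambda: print("processed"))
time.sleep(0.2)  # demo only: give the worker thread a moment to run
```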
C. Sliding Window Counter
The Concept: Divide time into fixed windows (e.g., 1 minute). Track the request count in the current window and the previous window, and use a weighted sum to estimate the count in the "rolling" window (see the sketch after this list).
Equation: Estimated Requests = (Requests in Current Window) + (Requests in Previous Window * Overlap Percentage), where the overlap percentage is the fraction of the previous window that still falls inside the rolling window.
- Pros: Smooths out the boundary problem of a plain Fixed Window counter (where up to 2x the limit could slip through at a window edge). Memory efficient.
- Cons: Slightly less accurate than a Sliding Window Log, since it assumes requests were evenly spread across the previous window, but sufficient for 99% of use cases.
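The equation translates almost line for line into code (illustrative sketch; pruning of old window entries is omitted for brevity):

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Illustrative sliding window counter over fixed windows (default: 60s)."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # fixed-window index -> request count

    def allow(self) -> bool:
        now = time.time()
        current = int(now // self.window)
        # Fraction of the previous window still inside the rolling window
        overlap = 1.0 - (now % self.window) / self.window
        estimate = self.counts[current] + self.counts[current - 1] * overlap
        if estimate < self.limit:
            self.counts[current] += 1
            return True
        return False

limiter = SlidingWindowCounter(limit=100)  # ~100 requests per rolling minute
print(limiter.allow())
```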
2. Distributed Rate Limiting
Rate limiting in local memory works for a single server, but modern systems are distributed, so we need a shared store. Redis is the industry standard here due to its speed and support for atomic operations.
The Problem: Race Conditions
Imagine two instances naively reading and writing a shared counter (limit is 10):
- Server A reads count = 9.
- Server B reads count = 9.
- Server A increments to 10 and allows the request.
- Server B increments to 10 and allows the request.
- Result: 11 requests allowed. The check-then-write pattern broke the limit.
The Solution: Redis + Lua Scripts
To ensure atomicity, we must perform the "check and increment" logic in a single atomic step. Redis Lua scripts allow exactly this.
```lua
-- Lua script for a token bucket, executed atomically inside Redis
local key = KEYS[1]
local limit = tonumber(ARGV[1])                 -- bucket capacity
local current_time = tonumber(ARGV[2])          -- current time in ms
local interval = tonumber(ARGV[3])              -- refill interval in ms
local tokens_per_interval = tonumber(ARGV[4])

-- HMGET returns the fields in the order requested (nils for a new key)
local state = redis.call("HMGET", key, "last_time", "tokens")
local last_time = tonumber(state[1])
local current_tokens = tonumber(state[2])

if not last_time then
  -- First request for this key: start with a full bucket
  current_tokens = limit
  last_time = current_time
end

-- Refill in whole intervals only, and advance last_time by exactly the
-- intervals consumed; resetting it to current_time on every call would
-- discard partial progress and starve the bucket under steady traffic
local elapsed_intervals = math.floor((current_time - last_time) / interval)
if elapsed_intervals > 0 then
  current_tokens = math.min(current_tokens + elapsed_intervals * tokens_per_interval, limit)
  last_time = last_time + elapsed_intervals * interval
end

if current_tokens > 0 then
  redis.call("HSET", key, "last_time", last_time, "tokens", current_tokens - 1)
  -- Clean up idle keys once a full refill would have happened anyway
  redis.call("PEXPIRE", key, interval * math.ceil(limit / tokens_per_interval))
  return 1 -- allowed
else
  return 0 -- rejected
end
```
3. Architecture Overview
In a real-world system, the limiter sits in the request path: the client's request hits the API gateway (or a rate-limiting middleware inside each service instance), the gateway runs the atomic check against Redis, and the request is either forwarded to the application servers or rejected with HTTP 429 Too Many Requests (ideally with a Retry-After header).
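On the application side, the gateway check reduces to one script call. A sketch using the redis-py client (assumptions: a Redis instance on localhost, and the Lua script above saved as token_bucket.lua):

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

# The Lua script from the previous section, registered once at startup;
# redis-py transparently uses EVALSHA with an EVAL fallback
rate_limit = r.register_script(open("token_bucket.lua").read())

def handle_request(user_id: str) -> int:
    allowed = rate_limit(
        keys=[f"rate:{user_id}"],
        args=[
            10,                       # limit: bucket capacity
            int(time.time() * 1000),  # current time in ms
            100,                      # refill interval in ms
            1,                        # tokens added per interval
        ],
    )
    if allowed == 1:
        return 200  # forward to the API servers
    return 429      # Too Many Requests

print(handle_request("user-42"))
```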
4. Optimization Strategies
Querying Redis for every request adds latency. Here's how to make it faster.
A. Local Caching (Hierarchical Limiting)
Implement a small in-memory counter on the server itself (see the sketch after this list).
- Total Limit: 1000 req/sec globally.
- Local Limit: 200 req/sec per instance (if you have 5 instances).
- Trade-off: Less precise global enforcement, but zero network latency for the first layer of defense.
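A sketch of the two-layer check, reusing the in-memory `TokenBucket` from section 1 as the local gate and the Redis-backed `handle_request` from section 3 as the global authority (both illustrative):

```python
# Layer 1 gate: this instance's slice of the global budget (200 of 1000 req/sec)
local_bucket = TokenBucket(capacity=200, refill_rate=200)

def allow_request(user_id: str) -> bool:
    # Cheap in-memory check first: zero network latency
    if not local_bucket.allow():
        return False
    # Only requests that pass locally pay for the Redis round trip
    return handle_request(user_id) == 200
```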
B. Synchronization with Batching
Instead of updating Redis on every request, count locally and push updates in batches (e.g., every 500 ms), as sketched below.
- Result: Drastically reduces Redis load (IOPS).
- Risk: The limit becomes eventually consistent; you might allow slightly more than the limit during the sync window.
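A sketch of the batching idea (illustrative: the key name and the per-window reset are assumptions; a background thread flushes the local tally every 500 ms with a single INCRBY):

```python
import threading
import time
import redis

r = redis.Redis()
LIMIT = 1000
WINDOW_KEY = "global:req_count"  # assumed key; resetting it per window is omitted

lock = threading.Lock()
local_count = 0    # requests seen since the last sync
known_global = 0   # global total as of the last sync

def allow() -> bool:
    global local_count
    with lock:
        # Decide against the stale global total plus our unsynced traffic
        if known_global + local_count >= LIMIT:
            return False
        local_count += 1
        return True

def sync_loop():
    global local_count, known_global
    while True:
        time.sleep(0.5)  # batch window: 500 ms
        with lock:
            pending, local_count = local_count, 0
        # One atomic round trip replaces hundreds of per-request INCRs
        known_global = r.incrby(WINDOW_KEY, pending)

threading.Thread(target=sync_loop, daemon=True).start()
```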
Conclusion
Designing a rate limiter is a balancing act between precision (strict enforcement) and performance (latency impact).
- For strict APIs (billing endpoints), use Token Bucket with Redis/Lua.
- For protecting high-volume public endpoints, consider Sliding Window Counters with Local Caching.
Start simple, measure your latency budget, and optimize only when the network hop to Redis becomes your bottleneck.