Rate Limiting and Abuse Prevention at the Edge

Rate limiting belongs at the network perimeter, not the origin. By the time an abusive request reaches your application server it has already consumed a TCP connection, a TLS handshake, and a slice of origin compute. Enforcing limits inside an edge isolate stops the request at the closest Point of Presence, returns a 429 Too Many Requests in single-digit milliseconds, and shields your backend from credential-stuffing bursts, scraping fleets, and accidental client retry storms. This guide is part of Building a Custom Middleware Chain, and it focuses on the algorithms, counter-storage trade-offs, and client-identification rules that make edge rate limiting correct under concurrency.

The hard problem is not the algorithm; it is where the counter lives. Edge runtimes execute code in V8 isolates that share no memory with other isolates, PoPs, or regions. A counter held in a module-global variable is per-isolate, resets on eviction, and never sees traffic landing in another PoP. Accurate limiting therefore requires an external, consistent counter store, and every store choice trades latency against correctness.

Why edge isolates make counting hard

A naive limiter increments an in-memory map keyed by client IP. At the edge this fails three ways. First, isolates are ephemeral: the runtime spins them up per PoP and recycles them aggressively, so the counter has no durable lifetime. Second, isolates do not share state: two requests from the same client can land in San José and Frankfurt and each sees a fresh count of zero. Third, even within one isolate, any await between the read and the write lets concurrent requests interleave, so two requests can both read count = 99, both pass a limit of 100, and both write 100. This is a classic lost-update race.
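The lost-update race described above can be reproduced in a few lines. The following is a minimal sketch, not production code: `SlowStore`, `racyAllow`, and `admittedConcurrently` are hypothetical names, and the store is an in-memory stand-in whose calls yield once to imitate a network hop.

```typescript
// Hypothetical async store: every call yields once, standing in for a network hop.
class SlowStore {
  private data = new Map<string, number>();

  async get(key: string): Promise<number> {
    await Promise.resolve();
    return this.data.get(key) ?? 0;
  }

  async put(key: string, value: number): Promise<void> {
    await Promise.resolve();
    this.data.set(key, value);
  }
}

// Non-atomic limiter: read, compare, write. Other requests run between the awaits.
async function racyAllow(store: SlowStore, key: string, limit: number): Promise<boolean> {
  const count = await store.get(key);
  if (count >= limit) return false;
  await store.put(key, count + 1);
  return true;
}

// Fire `requests` concurrent checks against one key and count how many were admitted.
async function admittedConcurrently(requests: number, limit: number): Promise<number> {
  const store = new SlowStore();
  const verdicts = await Promise.all(
    Array.from({ length: requests }, () => racyAllow(store, "client", limit)),
  );
  return verdicts.filter((allowed) => allowed).length;
}
```

With a limit of 1 and ten concurrent calls, this sketch admits more than one request, because the calls read the counter before any of them has written it back.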

The fix is to move the counter into a store that offers atomic read-modify-write and a global view. The three viable shapes are: a strongly-consistent single-writer (Durable Objects), an eventually-consistent KV store (fast, approximate), and a platform-native rule engine (Cloudflare Rate Limiting rules) that runs before your Worker even executes.

Every limiter resolves to the same decision: derive a stable client key, atomically update a shared counter, then forward or reject.

Choosing the algorithm

Three algorithms dominate edge rate limiting. They differ in accuracy at window boundaries, memory footprint, and burst tolerance.

Fixed window

Partition time into fixed intervals (for example, each clock minute) and keep one counter per interval. Increment on each request; reject when the counter exceeds the limit; the counter resets when the next interval begins. Fixed window is the cheapest option — a single integer per key — and maps directly onto a KV value with a TTL equal to the window length.

Its flaw is the boundary burst. A client can send the full quota in the last second of one window and the full quota in the first second of the next, delivering twice the per-window quota within a two-second span. For coarse protection (for example, 1000 requests/hour) the boundary error is negligible; for tight limits it is a real bypass.
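A small sketch can make both the fixed-window bookkeeping and the boundary burst concrete. It assumes windows aligned to multiples of the window length since the epoch; `FixedWindowLimiter` is a hypothetical name, and the `Map` is only an in-memory stand-in for a real counter store.

```typescript
// In-memory stand-in for the counter store. A real deployment would keep these
// counters in KV, Redis, or a Durable Object, with a TTL per window.
class FixedWindowLimiter {
  private counts = new Map<string, number>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(clientKey: string, nowMs: number): { allowed: boolean; retryAfterSeconds: number } {
    // Every timestamp inside the same interval maps to the same window index.
    const windowIndex = Math.floor(nowMs / this.windowMs);
    const storageKey = `${clientKey}:${windowIndex}`;

    const count = (this.counts.get(storageKey) ?? 0) + 1;
    this.counts.set(storageKey, count);

    if (count <= this.limit) return { allowed: true, retryAfterSeconds: 0 };

    // The counter effectively resets when the next interval begins.
    const windowEndsAtMs = (windowIndex + 1) * this.windowMs;
    return { allowed: false, retryAfterSeconds: Math.ceil((windowEndsAtMs - nowMs) / 1000) };
  }
}

// Boundary burst: 5 requests at 0:59 and 5 more at 1:00 fall into different windows.
const limiter = new FixedWindowLimiter(5, 60000);
const lateBurst = [1, 2, 3, 4, 5].map(() => limiter.check("ip:203.0.113.7", 59000).allowed);
const earlyBurst = [1, 2, 3, 4, 5].map(() => limiter.check("ip:203.0.113.7", 60000).allowed);
```

Both bursts are admitted in full, so ten requests pass within about one second against a nominal limit of five per minute.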

Sliding window

Sliding window smooths the boundary by weighting the previous window’s count by how far into the current window you are. With count = current + previous * (1 - elapsed_fraction), a request at the 25%-mark of the current minute counts 75% of the prior minute. This approximation needs only two integers per key yet eliminates almost all boundary bursting, which is why it is the default for most production API gateways. A true sliding log (storing every request timestamp) is exact but unbounded in memory and rarely worth it at the edge.
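The weighting formula above translates directly into code. This is a sketch under the assumption that windows are aligned to multiples of `windowMs` since the epoch; the function names are illustrative, and how the two counters are stored is left to the backend.

```typescript
// Weighted estimate of requests in the trailing window:
// current + previous * (1 - elapsed_fraction).
function slidingWindowEstimate(
  previousCount: number,
  currentCount: number,
  nowMs: number,
  windowMs: number,
): number {
  const elapsedFraction = (nowMs % windowMs) / windowMs;
  return currentCount + previousCount * (1 - elapsedFraction);
}

// Admit the request only while the estimate is still below the limit.
function slidingWindowAllows(
  previousCount: number,
  currentCount: number,
  nowMs: number,
  windowMs: number,
  limit: number,
): boolean {
  return slidingWindowEstimate(previousCount, currentCount, nowMs, windowMs) < limit;
}
```

For example, with 100 requests in the prior minute and 10 so far in the current one, a request arriving 15 seconds into the minute sees an estimate of 10 + 100 × 0.75 = 85.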

Token bucket

Token bucket models a bucket that holds up to capacity tokens and refills at refillRate tokens per second. Each request removes one token; if the bucket is empty, reject. Because the bucket can fill to capacity, it permits short bursts up to that size while still enforcing the long-run average rate, which is ideal for clients that batch requests. The state is two numbers: token count and last-refill timestamp. Refill is computed lazily on read, so there is no background timer. The dedicated walkthrough, Token-bucket rate limiting at the edge, covers the refill math and the atomic update in a Durable Object.

// Token-bucket decision, computed lazily from stored state.
interface BucketState {
  tokens: number;       // tokens currently available
  updatedAt: number;    // ms timestamp of last refill calculation
}

function consume(
  state: BucketState,
  now: number,
  capacity: number,
  refillPerSecond: number,
): { allowed: boolean; state: BucketState; retryAfterMs: number } {
  const elapsedSeconds = (now - state.updatedAt) / 1000;
  const refilled = Math.min(capacity, state.tokens + elapsedSeconds * refillPerSecond);

  if (refilled >= 1) {
    return {
      allowed: true,
      state: { tokens: refilled - 1, updatedAt: now },
      retryAfterMs: 0,
    };
  }

  // Time until the bucket holds one whole token again.
  const deficit = 1 - refilled;
  return {
    allowed: false,
    state: { tokens: refilled, updatedAt: now },
    retryAfterMs: Math.ceil((deficit / refillPerSecond) * 1000),
  };
}

Where the counter lives

The algorithm is portable; the storage is not. Each backend offers a different consistency/latency contract.

Durable Objects — strong counts

A Durable Object is a single-threaded, globally-addressable actor. All requests for a given key route to the same instance, so increments serialize and the count is exact even under heavy concurrency. Within Cloudflare’s platform, this is the only option that gives you a strictly correct counter without races. The cost is latency: requests in distant regions pay a round-trip to the object’s home region, typically tens of milliseconds. Use Durable Objects when a precise limit is a security or billing requirement: per-API-key quotas, login-attempt throttling, or paid-tier metering.
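As a rough illustration, the counter logic that would live inside such an object might look like the sketch below. `CounterStorage` is a hypothetical, minimal get/put interface standing in for the object's storage, and `ExactWindowCounter` is an illustrative name. The sketch assumes the runtime serializes requests to one instance as described above; it does not add locking of its own.

```typescript
// Hypothetical minimal subset of a single-writer object's key-value storage.
interface CounterStorage {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
}

interface WindowState {
  windowIndex: number;
  count: number;
}

// Intended to run inside one object instance per client key.
// Exactness relies on the platform delivering requests to it one at a time.
class ExactWindowCounter {
  private storage: CounterStorage;
  private limit: number;
  private windowMs: number;

  constructor(storage: CounterStorage, limit: number, windowMs: number) {
    this.storage = storage;
    this.limit = limit;
    this.windowMs = windowMs;
  }

  async check(nowMs: number): Promise<{ allowed: boolean; remaining: number }> {
    const windowIndex = Math.floor(nowMs / this.windowMs);
    const stored = await this.storage.get<WindowState>("window");

    // A stored count from an older window no longer applies.
    const count = stored !== undefined && stored.windowIndex === windowIndex ? stored.count : 0;

    if (count >= this.limit) return { allowed: false, remaining: 0 };

    await this.storage.put<WindowState>("window", { windowIndex, count: count + 1 });
    return { allowed: true, remaining: this.limit - count - 1 };
  }
}
```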

KV — approximate counts

A KV store reads in single-digit milliseconds from a local PoP replica, but writes propagate eventually (on Workers KV, other locations may not see a write for up to a minute or more). Two PoPs incrementing the same key concurrently can each read a stale value and lose updates, so the enforced limit is a soft ceiling, not a hard one. KV is the right choice when under-counting by a small margin, and so letting a client slightly exceed the limit, is acceptable and latency matters more than precision: coarse per-IP throttling, scraper slowdown, or a cheap first line of defence in front of a stricter Durable Object check. The companion guide, Per-IP rate limiting with Cloudflare KV, shows exactly how approximate it is and when to graduate to Durable Objects.
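A minimal sketch of such a soft limiter follows. `KvLike` is a hypothetical interface that mirrors only the get/put subset of a KV binding, and `approximateAllow` is an illustrative name. The sketch also assumes `windowSeconds` is at least the store's minimum allowed TTL.

```typescript
// Hypothetical minimal subset of a KV namespace binding.
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

async function approximateAllow(
  kv: KvLike,
  clientKey: string,
  limit: number,
  windowSeconds: number,
  nowMs: number,
): Promise<boolean> {
  // One key per client per fixed window.
  const windowIndex = Math.floor(nowMs / (windowSeconds * 1000));
  const key = `rl:${clientKey}:${windowIndex}`;

  // May be stale: an increment made in another PoP might not be visible here yet.
  const count = Number((await kv.get(key)) ?? "0");
  if (count >= limit) return false;

  // Not atomic with the read above, so concurrent writers can lose updates.
  // The TTL lets the window's counter expire on its own.
  await kv.put(key, String(count + 1), { expirationTtl: windowSeconds });
  return true;
}
```

The read-then-write shape is exactly what makes the count approximate; it is acceptable here only because the limit is treated as a soft ceiling.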

Cloudflare Rate Limiting rules — pre-Worker enforcement

Cloudflare’s native Rate Limiting rules run in the edge proxy before your Worker is invoked, matching on path, method, and characteristics you select. They cost zero Worker CPU and block volumetric floods upstream of your code, but they are configured declaratively and cannot express bespoke per-tenant logic. Treat them as the outer ring: absorb obvious abuse with a rule, then apply nuanced application-level limits in middleware.

Provider mapping

Concern	Cloudflare Workers	Vercel Edge Middleware	Netlify Edge Functions
Strong counter store	Durable Objects (single-writer, exact)	Vercel KV (Upstash Redis, `INCR` atomic)	External Redis / Upstash
Approximate counter store	Workers KV (eventual, up to ~60 s)	Edge Config (read-optimized, not for counters)	Netlify Blobs (eventual)
Pre-code rule engine	Rate Limiting rules + WAF	Vercel Firewall / WAF rules	No native per-route rule engine
Client IP header	`CF-Connecting-IP`	`x-forwarded-for` / `request.ip`	`x-nf-client-connection-ip`
Atomic increment primitive	DO storage transaction	Redis `INCR` / `EXPIRE`	Redis `INCR`
Bot signals	Bot Management score, `cf.botManagement`	Vercel Firewall bot rules	External (Turnstile, header heuristics)

For Vercel and Netlify the pragmatic counter is an atomic Redis INCR against Upstash, which gives strong counts without running your own actor. On Cloudflare, reach for Durable Objects when you need that same strength inside the platform; reach for KV when approximate is enough.
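The INCR-based counter can be sketched as follows. `RedisLike` is a hypothetical interface whose method names mirror the Redis `INCR` and `EXPIRE` commands; a real client's signatures may differ, so treat the shape as an assumption.

```typescript
// Hypothetical minimal Redis client surface (names mirror INCR and EXPIRE).
interface RedisLike {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<number>;
}

async function redisFixedWindow(
  redis: RedisLike,
  clientKey: string,
  limit: number,
  windowSeconds: number,
  nowMs: number,
): Promise<{ allowed: boolean; remaining: number }> {
  const windowIndex = Math.floor(nowMs / (windowSeconds * 1000));
  const key = `rl:${clientKey}:${windowIndex}`;

  // INCR is atomic on the server, so each caller receives a distinct count.
  const count = await redis.incr(key);

  // The first hit in a window sets the expiry so old windows clean themselves up.
  if (count === 1) await redis.expire(key, windowSeconds);

  return { allowed: count <= limit, remaining: Math.max(0, limit - count) };
}
```

Because the window index is part of the key, a missed `EXPIRE` only leaves a stale key behind; it cannot carry a count into the next window.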

Identifying the client

A limiter is only as good as its key. The key must be stable for a legitimate client and expensive to rotate for an abuser.

// Resolve a stable rate-limit key, most-trusted identity first.
async function rateLimitKey(request: Request): Promise<string> {
  // 1. Authenticated identity is the strongest key.
  const auth = request.headers.get("authorization");
  if (auth?.startsWith("Bearer ")) {
    const sub = await jwtSubject(auth.slice(7)); // verified `sub` claim
    if (sub) return `user:${sub}`;
  }

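  // 2. API key for machine clients. Consider hashing it before use as a key,
  //    so the raw secret never lands in counter storage or logs.
  const apiKey = request.headers.get("x-api-key");
  if (apiKey) return `key:${apiKey}`;

  // 3. Fall back to client IP. Prefer the platform header; the XFF fallback is
  //    only safe where the edge overwrites that header at ingress.
  const ip =
    request.headers.get("cf-connecting-ip") ??
    request.headers.get("x-forwarded-for")?.split(",")[0].trim() ??
    "0.0.0.0"; // unidentified clients all share this one bucket
  return `ip:${ip}`;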
}

Two rules govern client identification. First, prefer authenticated identity, meaning a verified JWT sub (see the JWT verification pattern in the middleware chain overview) or an API key, because it cannot be spoofed by changing networks. Second, never trust a raw X-Forwarded-For header for IP keys. Clients can forge it. Use the platform’s own connecting-IP header (CF-Connecting-IP on Cloudflare, x-nf-client-connection-ip on Netlify, request.ip on Vercel), which the edge sets from the real TCP peer. If you must read X-Forwarded-For, remember that proxies append to the right, so the left-most entry is whatever the client sent unless the edge overwrote the header at ingress. Read the left-most entry only when the platform documents that it resets the header; otherwise an attacker controls your key.
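One way to keep the header choice explicit per platform is sketched below. `clientIp`, `Platform`, and `HeaderReader` are hypothetical names, and the Vercel branch rests on the assumption that the platform overwrites `x-forwarded-for` at ingress. The function returns `null` when no trustworthy header is present, leaving the caller to decide how to treat unidentified clients.

```typescript
type Platform = "cloudflare" | "netlify" | "vercel";

// Structural type satisfied by the standard `Headers` object.
interface HeaderReader {
  get(name: string): string | null;
}

function clientIp(headers: HeaderReader, platform: Platform): string | null {
  switch (platform) {
    case "cloudflare":
      return headers.get("cf-connecting-ip");
    case "netlify":
      return headers.get("x-nf-client-connection-ip");
    case "vercel": {
      // Assumption: the edge overwrites x-forwarded-for at ingress,
      // so the left-most entry was set by the platform, not the client.
      const forwarded = headers.get("x-forwarded-for");
      const first = forwarded?.split(",")[0]?.trim();
      return first ? first : null;
    }
  }
}
```

On the Cloudflare and Netlify branches a client-supplied `X-Forwarded-For` value is never consulted, so forging it has no effect on the key.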

Returning a correct 429

When a client exceeds its limit, return 429 Too Many Requests with a Retry-After header so well-behaved clients back off instead of hammering. Set Retry-After to the whole seconds until a token or window slot frees up. Adding X-RateLimit-* headers on every response (including allowed ones) lets clients self-throttle proactively.

function tooManyRequests(retryAfterSeconds: number, limit: number): Response {
  return new Response(
    JSON.stringify({ error: "rate_limited", retryAfter: retryAfterSeconds }),
    {
      status: 429,
      headers: {
        "content-type": "application/json",
        "retry-after": String(retryAfterSeconds),
        "x-ratelimit-limit": String(limit),
        "x-ratelimit-remaining": "0",
      },
    },
  );
}
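To attach the informational headers to allowed responses as well, a small helper can compute them from the limiter's verdict. This is a sketch: `rateLimitHeaders` is a hypothetical name, and it assumes `x-ratelimit-reset` carries seconds until reset, although some APIs send an epoch timestamp instead.

```typescript
function rateLimitHeaders(
  allowed: boolean,
  limit: number,
  remaining: number,
  resetAtMs: number,
  nowMs: number,
): Record<string, string> {
  // Round up so a client never retries a fraction of a second too early.
  const secondsUntilReset = Math.max(0, Math.ceil((resetAtMs - nowMs) / 1000));

  const headers: Record<string, string> = {
    "x-ratelimit-limit": String(limit),
    "x-ratelimit-remaining": String(Math.max(0, remaining)),
    "x-ratelimit-reset": String(secondsUntilReset),
  };

  // Retry-After only applies to rejected requests; never advertise zero seconds.
  if (!allowed) headers["retry-after"] = String(Math.max(1, secondsUntilReset));
  return headers;
}
```

The returned record can be spread into the headers of either the 429 or the forwarded response.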

Rejecting over-limit requests this early is itself an early-return guard: the limiter short-circuits the chain before auth, routing, or any origin fetch runs, which is exactly why rate limiting should sit near the front of the middleware order.

Bot and abuse signals beyond raw counts

Pure rate limiting throttles volume but does not distinguish a human from a headless scraper sending one request each from a thousand residential IPs. Layer in behavioural signals:

Bot scores. Cloudflare exposes a managed bot score on request.cf.botManagement; gate sensitive routes on a threshold and apply a tighter limit (or a challenge) to suspected automation.
Challenge instead of block. For ambiguous traffic, issue a managed challenge or Turnstile token rather than a hard 429, so false positives can self-clear.
Per-route asymmetry. Apply aggressive limits to expensive or sensitive endpoints (login, search, password reset) and looser limits to cheap static reads.
Sliding penalties. On repeated violations, exponentially extend the block window for that key — a tarpit that makes brute-forcing uneconomical.

Pair these with structured logging so every limit decision is queryable; abuse patterns only become visible in aggregate.
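The sliding-penalty idea from the list above reduces to a capped exponential. The sketch below uses a hypothetical `penaltySeconds` helper; where the per-key violation count is stored is left to whichever counter backend is in use.

```typescript
// Block duration doubles with each repeated violation, up to a ceiling.
function penaltySeconds(violations: number, baseSeconds: number, maxSeconds: number): number {
  if (violations <= 0) return 0;
  // Cap the exponent so very large violation counts stay finite.
  const exponent = Math.min(violations - 1, 30);
  return Math.min(maxSeconds, baseSeconds * Math.pow(2, exponent));
}
```

With a 60-second base and a one-hour cap, the first violation blocks for 60 seconds, the third for 240 seconds, and the tenth is held at the 3600-second ceiling.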

Framework integration

The limiter is a middleware stage; only its registration differs per framework.

Next.js App Router — enforce in root middleware.ts, scoped with config.matcher so static assets never consume a check:

// middleware.ts
export const config = {
  matcher: ["/api/:path*", "/login"],
};

export default async function middleware(request: Request) {
  const key = await rateLimitKey(request);
  const verdict = await checkLimit(key); // DO or KV-backed
  if (!verdict.allowed) {
    return tooManyRequests(verdict.retryAfterSeconds, verdict.limit);
  }
  // fall through to the app
}

Remix — wrap createRequestHandler, run the limiter before loaders execute, and return the 429 directly when the verdict denies.

SvelteKit — call the limiter at the top of the handle hook in src/hooks.server.ts; return the 429 Response instead of resolve(event) when over limit.

Debugging workflow

Local emulation. Run wrangler dev (or netlify dev / next dev) and fire a burst with for i in $(seq 1 50); do curl -s -o /dev/null -w "%{http_code}\n" localhost:8787/api; done. Confirm the transition from 200 to 429 happens at the configured limit, not before or after.
Verify the key. Log the resolved rateLimitKey and assert two requests from one client collapse to one counter — a frequent bug is keying on a header that varies per request.
Inspect counter state. For Durable Objects, expose a debug RPC that returns the current bucket; for KV, read the key directly to confirm TTL and value.
Trace, then alert. Emit a span per decision with the key and remaining quota, then alert when the 429 rate for a single key spikes — that is either an attack or a misconfigured client.

Common pitfalls

Symptom	Cause	Fix
Limit enforced at a multiple of the configured value	Per-isolate in-memory counter, not shared	Move the counter to Durable Objects or KV
Limit silently bypassed	Keying on forgeable `X-Forwarded-For`	Use `CF-Connecting-IP` / platform connecting-IP header
Two requests both pass at the boundary	Non-atomic read-then-write race	Serialize in a Durable Object or use Redis `INCR`
KV limit drifts high under burst	Eventual consistency loses concurrent writes	Accept it as a soft limit, or switch to Durable Objects
Clients retry instantly and amplify load	`429` returned without `Retry-After`	Always set `Retry-After` to seconds until reset
Legitimate shared-NAT users blocked	IP key collides many users behind one NAT	Prefer authenticated identity; loosen IP-only limits
Counter never resets	Missing TTL on the KV value	Set `expirationTtl` equal to the window length

Runtime-constraints checklist

Counter store is external and atomic (Durable Object, Redis `INCR`), never an isolate global
Client key prefers verified identity (JWT `sub` / API key) over IP
IP keys read the platform connecting-IP header, not raw `X-Forwarded-For`
Every over-limit response returns `429` with an accurate `Retry-After`
Limit check runs near the front of the chain, before origin fetches
KV-backed limits are documented as approximate; precise limits use Durable Objects
Expensive routes carry tighter limits than cheap static reads
Limit decisions are logged with the key and remaining quota for abuse analysis

Frequently Asked Questions

Should I use KV or Durable Objects for rate-limit counters?

Use Durable Objects when the limit is a security or billing boundary that must be exact — per-API-key quotas, login throttling, paid metering — because a single-writer object serializes increments and eliminates races. Use KV when approximate is acceptable and latency matters more than precision, such as coarse per-IP scraper slowdown. KV writes propagate eventually, so concurrent increments across PoPs can lose updates and let a client slightly exceed the limit.

Which header should I use to identify a client by IP?

On Cloudflare use CF-Connecting-IP, on Netlify use x-nf-client-connection-ip, and on Vercel use request.ip (or the left-most x-forwarded-for entry the edge set). Never trust a raw X-Forwarded-For value supplied by the client, since it is trivially forged and lets an attacker rotate keys at will. Where possible, key on a verified JWT sub or API key instead, which a client cannot spoof by changing networks.

What is the difference between token bucket and sliding window?

Token bucket allows short bursts up to the bucket capacity while enforcing a long-run average refill rate, which suits clients that batch requests. Sliding window enforces a smooth ceiling by weighting the previous window’s count and is better when you want to forbid bursts. Token bucket stores a token count plus a timestamp; sliding window stores two interval counters. Both need only a constant amount of state per client.

Why does my in-memory rate limiter let too many requests through?

An in-memory counter lives inside a single V8 isolate. Isolates are per-PoP, are recycled frequently, and never share state, so a client hitting multiple PoPs is counted independently in each, and a recycled isolate resets the count. Move the counter to an external atomic store — Durable Objects, or Redis INCR on Vercel and Netlify — so all requests for a key see one shared, consistent count.

What headers should a 429 response include?

Always include Retry-After set to the whole number of seconds until the client may retry, so conforming clients back off instead of retrying immediately. Adding X-RateLimit-Limit, X-RateLimit-Remaining, and optionally X-RateLimit-Reset on every response lets clients self-throttle before they are blocked. Return the 429 as an early short-circuit so no origin work is wasted.