
Every Node.js engineer has written this code at least once:
```typescript
async function callService(url: string) {
  for (let i = 0; i < 3; i++) {
    try {
      return await fetch(url);
    } catch (e) {
      // try again!
    }
  }
  throw new Error('failed');
}
```
It looks defensive. It looks resilient. And during a normal day, it is. The problem is that on the worst day — the day a downstream service is already on fire — this exact pattern is what finishes the job. Three retries from every client, multiplied across a fleet, multiplied across tiers of services, is how a brief blip turns into an hour-long outage.
This post walks through why naive retries are so dangerous, the formal mechanism behind that danger (retry amplification), and how polite-retry — a small, zero-dependency TypeScript library — applies the academic research on the subject to give you retries that are actually safe to run in production.
- npm: polite-retry
- GitHub: darkrishabh/polite-retry
- Docs: darkrishabh.github.io/polite-retry
- Paper: Retry Amplification in Distributed Systems: A Systematic Analysis of Retry Policies and Their Role in Cascading Failures — SSRN abstract 6313332
## The Problem: Retry Amplification
Imagine a normal request path through three tiers of services: an API gateway calls a business-logic service, which calls a data service. On a healthy day, 100 requests in equals 100 responses out. Easy.
Now suppose the data tier starts failing 50% of its requests — maybe a deploy, maybe a bad node, maybe a noisy neighbour. Each tier above it has been configured the "obvious" way: retry up to 3 times on failure.
Watch what happens to the request volume hitting the struggling service:
- The middle tier sees 50% failures, so for every failed call it tries up to 3 more times. The data service's load roughly doubles.
- The gateway sees the middle tier failing, so it retries too, multiplying the load again.
- More load means more failures. More failures mean more retries.
In a 3-tier system with a 50% underlying failure rate and 3 retries per tier, the total request volume hitting the bottom service can be 6.6× normal load. The retries didn't help the system recover — they pushed it from "degraded" into "completely down." This is the cascade collapse pattern that takes whole platforms offline, and it has a name: retry amplification.
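That 6.6× figure falls out of a simple compounding model: each retrying tier multiplies the expected number of attempts per call. A back-of-envelope sketch (a simplified model that assumes independent attempts, not the paper's full derivation):

```typescript
// Expected attempts per incoming call when each attempt fails with
// probability p and up to `retries` retries are allowed: we stop as
// soon as one attempt succeeds, so E = 1 + p + p^2 + ... + p^retries.
function expectedAttempts(p: number, retries: number): number {
  let total = 0;
  for (let i = 0; i <= retries; i++) total += Math.pow(p, i);
  return total;
}

// With `tiers` retrying layers stacked on top of each other,
// the per-tier factors multiply.
function cascadeAmplification(p: number, retries: number, tiers: number): number {
  return Math.pow(expectedAttempts(p, retries), tiers);
}

console.log(cascadeAmplification(0.5, 3, 3).toFixed(2)); // "6.59" — the ~6.6× above
```

Drop the failure rate to 10% and the same three tiers amplify by only about 1.4×, which is why policies that shrink retry traffic as failures climb matter so much.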

The research paper this library is based on (linked above) walks through the math formally and analyses how different retry policies — fixed delay, exponential backoff, jittered backoff, budgeted retries — perform under these conditions. The TL;DR from the paper is the design philosophy of the library: retries must be aware of how the system as a whole is doing, not just whether one individual call succeeded.
## The Three Things You Actually Need
Most retry libraries on npm give you exponential backoff and call it a day. That's the easy 80%. The hard 20% — the part that prevents amplification — needs three additional ideas working together:
1. **Jitter**, so retries don't synchronise into periodic spikes.
2. **Circuit breaking**, so when a service is clearly down, you stop hammering it.
3. **Retry budgeting**, so the aggregate retry traffic is capped relative to baseline load.
polite-retry exposes these as three composable strategies, with progressively stronger guarantees:
| Strategy | Use Case | Amplification Risk |
|---|---|---|
| `retry()` | Simple retries with backoff and jitter | Medium |
| `retryWithCircuitBreaker()` | Stop retrying when the service is clearly down | Low |
| `retryWithBudget()` | Adaptive retry budgeting (recommended for prod) | Very Low |
| `retryWithProtection()` | Combined budget + circuit breaker for critical paths | Very Low |
Let's walk through each.
## 1. Basic Retry — and Why Jitter Matters
The basic `retry()` function looks like what you'd expect, but the default `jitter: 'full'` is doing real work:
```typescript
import { retry } from 'polite-retry';

const data = await retry(
  async () => {
    const response = await fetch('https://api.example.com/data');
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  },
  {
    maxRetries: 3,
    initialDelayMs: 100,
    backoffMultiplier: 2,
    jitter: 'full',
    timeoutMs: 5000,
    retryIf: (error) => !/^HTTP 4\d\d/.test(error.message), // don't retry 4xx
    onRetry: (err, attempt, delay) => {
      console.log(`Retry ${attempt} in ${delay}ms: ${err.message}`);
    },
  },
);
```
Why jitter matters: imagine 1,000 clients all hit a flaky service at the same moment, all fail, and all back off "exponentially" — say, 100ms then 200ms then 400ms. Without jitter, all 1,000 clients retry at exactly the 100ms mark, then exactly the 200ms mark. The struggling service sees a precise periodic stampede instead of a smooth distribution of load. It never gets a quiet moment to recover.
polite-retry ships four jitter strategies straight out of the AWS Architecture Blog's classic on the subject:
| Strategy | Formula | Use case |
|---|---|---|
| `none` | `delay` | Testing only — never production |
| `full` | `random(0, delay)` | General-purpose default |
| `equal` | `delay/2 + random(0, delay/2)` | When you want a guaranteed minimum |
| `decorrelated` | `random(base, prev * 3)` | Long correlated retry sequences |
The single biggest improvement most Node services can make to their resilience profile is changing `jitter: 'none'` (or no jitter at all) to `jitter: 'full'`. It costs nothing.
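For reference, the four formulas in the table reduce to a few one-liners (an illustrative sketch of the strategies, not polite-retry's internal code):

```typescript
// Sketches of the four jitter strategies. `delay` is the raw exponential
// backoff value for this attempt; `base` and `prev` are only used by
// decorrelated jitter, which feeds each computed delay back in as `prev`.
const jitter = {
  none: (delay: number) => delay,
  full: (delay: number) => Math.random() * delay,
  equal: (delay: number) => delay / 2 + Math.random() * (delay / 2),
  decorrelated: (base: number, prev: number) =>
    base + Math.random() * (prev * 3 - base),
};
```

Because decorrelated jitter depends on the previous delay rather than the attempt number, successive retries wander instead of marching in lockstep, which helps with long correlated retry sequences.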
## 2. Circuit Breaker — Knowing When to Stop
Retries assume the failure is transient. When it isn't — when a downstream is genuinely down — you don't want every request to spend 5 seconds going through 3 retries before giving up. That's just wasted compute, wasted connections, and wasted user time.
The circuit breaker pattern fixes this. Track failures in a sliding window. If the failure rate crosses a threshold, open the circuit — fail fast for a cooldown period without even attempting the call. After the cooldown, transition to half-open and let one test request through. If it succeeds, close the circuit and resume normal traffic. If it fails, open again.
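As a sketch, that whole state machine fits in a few dozen lines (a toy illustration of the pattern, not polite-retry's implementation):

```typescript
// Toy three-state circuit breaker: closed -> open -> half-open -> closed.
type State = 'closed' | 'open' | 'half-open';

class ToyBreaker {
  private state: State = 'closed';
  private window: boolean[] = []; // sliding window of recent outcomes
  private openedAt = 0;

  constructor(
    private failureThreshold: number, // e.g. 0.5 = open at 50% failures
    private windowSize: number,       // e.g. over the last 10 requests
    private resetTimeoutMs: number,   // cooldown before the test request
  ) {}

  canAttempt(now = Date.now()): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open'; // cooldown elapsed: let one test call through
    }
    return this.state !== 'open';
  }

  record(success: boolean, now = Date.now()): void {
    if (this.state === 'half-open') {
      // The single test request decides: close on success, reopen on failure.
      this.state = success ? 'closed' : 'open';
      if (success) this.window = [];
      else this.openedAt = now;
      return;
    }
    this.window.push(success);
    if (this.window.length > this.windowSize) this.window.shift();
    const failures = this.window.filter((ok) => !ok).length;
    if (
      this.window.length >= this.windowSize &&
      failures / this.window.length >= this.failureThreshold
    ) {
      this.state = 'open';
      this.openedAt = now;
    }
  }
}
```

The library's `CircuitBreaker` handles all of this for you: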
```typescript
import { retryWithCircuitBreaker, CircuitBreaker } from 'polite-retry';

const paymentBreaker = new CircuitBreaker({
  failureThreshold: 0.5,   // open at 50% failure rate
  windowSize: 10,          // over the last 10 requests
  resetTimeoutMs: 30_000,  // try again 30s after opening
  onStateChange: (state) => log.info(`payment circuit: ${state}`),
});

const result = await retryWithCircuitBreaker(
  () => chargePayment(amount),
  paymentBreaker,
  { maxRetries: 3 },
);
```
The key rule: one breaker per downstream service, shared across all the call sites in your process that talk to that service. A breaker per individual request gives you nothing.
## 3. Adaptive Retry Budgeting — The Real Fix
This is the recommended strategy for production microservices, and it's the one that directly maps to the paper's analysis.
The core idea: instead of letting every request retry up to 3 times, cap the total retry traffic as a fraction of normal traffic. If you only allow a 20% retry budget, then for every 100 original requests you can issue at most 20 retries — regardless of how many of them are failing. The math means amplification is bounded by 1 + budget (so 1.2× in this case), no matter how bad things get downstream.
The "adaptive" part is what makes this practical: a static 20% budget is fine when failure rates are low, but a smart system should shrink the budget when failures spike (because retrying a sick service is counterproductive) and restore it when things calm down.
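The bound is easiest to see in a toy token-bucket version of the idea (illustration only; the library's `AdaptiveRetryBudget` adds the adaptive part):

```typescript
// Toy retry budget, not polite-retry's implementation. Each original
// request deposits `ratio` tokens; each retry spends one whole token.
class ToyRetryBudget {
  private tokens = 0;

  constructor(private ratio: number, private maxTokens: number) {}

  recordRequest(): void {
    this.tokens = Math.min(this.maxTokens, this.tokens + this.ratio);
  }

  tryRetry(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // budget exhausted: fail fast instead of retrying
  }
}
```

With `ratio` set to 0.2, 100 original requests can fund at most 20 retries, so total traffic is capped at roughly 120 requests: amplification stays at or below 1.2× no matter how badly the downstream is failing.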
```typescript
import { retryWithBudget, AdaptiveRetryBudget } from 'polite-retry';

// One budget instance per downstream service, shared across the process
const paymentBudget = new AdaptiveRetryBudget({
  initialBudget: 0.2,         // 20% retry overhead allowed
  highFailureThreshold: 0.3,  // shrink budget when >30% failing
  lowFailureThreshold: 0.05,  // restore budget when <5% failing
  budgetDecreaseRate: 0.5,    // halve budget on shrink
  budgetIncreaseRate: 0.1,    // grow 10% on restore
  adjustmentIntervalMs: 1000,
  onBudgetChange: (budget, rate) => {
    metrics.gauge('retry.budget', budget);
    metrics.gauge('retry.failure_rate', rate);
  },
});

const data = await retryWithBudget(
  () => fetchFromPaymentService(),
  paymentBudget,
  { maxRetries: 3, jitter: 'full' },
);

// Don't forget cleanup
process.on('SIGTERM', () => paymentBudget.dispose());
```
The behaviour table is intuitive:
| Observed failure rate | Budget action |
|---|---|
| < 5% | Grow budget, up to the configured maximum |
| 5–30% | Hold steady |
| > 30% | Cut budget by 50% |
| Backpressure signal received | Stop retrying immediately |
You can also pull metrics out of the budget for observability:
```typescript
const m = paymentBudget.getMetrics();
// {
//   totalRequests: 150,
//   successfulRequests: 140,
//   failedRequests: 10,
//   totalRetries: 15,
//   failureRate: 0.08,
//   retryAmplificationFactor: 1.11
// }
```
The `retryAmplificationFactor` is the key SLO to alert on. If it's drifting above 1.5, your retry policy is starting to add meaningful load instead of absorbing transient failures.
## 4. Backpressure: Letting Servers Tell Clients to Slow Down
Even the smartest client-side policy can't beat server-side knowledge. The server knows its own load, queue depth, and latency tail. The most polite thing a client can do is listen when the server says "I'm overloaded."
polite-retry ships both halves of this — a server-side Express middleware that automatically annotates responses with load information, and a client-side manager that reads those annotations and feeds them back into the retry budget.
### Server side (Express)
```typescript
import express from 'express';
import { RequestCounter, createBackpressureMiddleware } from 'polite-retry';

const app = express();
const counter = new RequestCounter();

app.use(counter.middleware()); // tracks active requests automatically
app.use(createBackpressureMiddleware({
  getLoadLevel: () => counter.getCount() / 100, // 100 = max concurrent
  overloadThreshold: 0.8,
}));
```
Every response now carries:
- `X-Backpressure: 0.75` — current load level (0.0 to 1.0)
- `X-Load-Shedding: true` — set when over threshold
- `Retry-After: 5` — suggested wait in seconds
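A client that isn't using polite-retry can still honour these headers. For example, a small hand-rolled helper (hypothetical, not part of the library) for turning `Retry-After` into a delay:

```typescript
// Parse Retry-After, which may be delta-seconds ("5") or an HTTP-date,
// falling back to your own backoff delay when the header is absent or invalid.
function retryAfterMs(headerValue: string | null, fallbackMs: number): number {
  if (!headerValue) return fallbackMs;
  const seconds = Number(headerValue);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  const date = Date.parse(headerValue);
  return Number.isNaN(date) ? fallbackMs : Math.max(0, date - Date.now());
}
```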
### Client side
```typescript
import {
  retryWithBudget,
  AdaptiveRetryBudget,
  BackpressureManager,
} from 'polite-retry';

const backpressure = new BackpressureManager();
const budget = new AdaptiveRetryBudget({
  checkBackpressure: () => backpressure.isOverloaded('payment-service'),
});

async function callPaymentService(payload) {
  return retryWithBudget(
    async () => {
      const res = await fetch('https://payment-service/charge', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      });
      backpressure.recordFromHeaders('payment-service', res.headers);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.json();
    },
    budget,
    { maxRetries: 3 },
  );
}
```
The same primitives work for gRPC via metadata instead of HTTP headers — examples in the docs.
## How This Fits a Real Node.js Service
A typical Node.js backend has a handful of distinct downstream dependencies — a database, a payment provider, a notification service, an internal user service, a third-party auth API. Each one has different latency profiles, different reliability characteristics, and different "this is on fire" signals. The pattern that works:
- **One module per downstream, exporting a wrapped client.** Inside that module, instantiate a single `AdaptiveRetryBudget` (and optionally a `CircuitBreaker`) at module scope. Every function in that module routes its calls through `retryWithBudget` (or `retryWithProtection`) using the shared instances.
- **Tune per-service.** Payment APIs deserve a tighter budget (10%) and stricter circuit thresholds — false positives are cheap, retry storms during checkout are catastrophic. Internal best-effort calls (analytics, telemetry) can run looser budgets (30%) and more retries.
- **Wire the metrics.** Send `failureRate`, `retryAmplificationFactor`, and circuit state to whatever monitoring stack you use (Datadog, Prometheus, CloudWatch). Alert on amplification > 1.5 and on circuits stuck open.
- **Add backpressure middleware to your own services.** Your service is downstream to *something*. Adding the middleware costs you essentially nothing and gives every caller — whether they use polite-retry or not — actionable signals.
- **Clean up on shutdown.** `AdaptiveRetryBudget` runs an internal interval timer for adjustment cycles. Call `.dispose()` on SIGTERM, otherwise your process will hang.
A skeleton looks like this:
```typescript
// src/clients/payment.ts
import {
  retryWithProtection,
  CircuitBreaker,
  AdaptiveRetryBudget,
} from 'polite-retry';

const breaker = new CircuitBreaker({
  failureThreshold: 0.4,
  windowSize: 20,
  resetTimeoutMs: 30_000,
});

const budget = new AdaptiveRetryBudget({
  initialBudget: 0.1, // tighter for payments
  highFailureThreshold: 0.2,
});

export async function chargeCard(amount: number, token: string) {
  return retryWithProtection(
    () => doChargeRequest(amount, token),
    { circuitBreaker: breaker, budget },
    {
      maxRetries: 2,
      jitter: 'full',
      timeoutMs: 4000,
      retryIf: (err) => !/^4\d\d/.test(err.message), // never retry 4xx
    },
  );
}

export function disposePaymentClient() {
  budget.dispose();
}
```
```typescript
// src/server.ts
import { disposePaymentClient } from './clients/payment';

process.on('SIGTERM', () => {
  disposePaymentClient();
  // ...other cleanup
});
```
## Things to Get Right (and Wrong) in Production
A short, opinionated checklist of what experience — and the paper — say to actually do:
### Do

- Use `jitter: 'full'` everywhere. Always.
- Cap retries at 3. More than that almost never converts a failure into a success and almost always inflates load.
- Use a per-downstream shared budget, not one budget per request.
- Set per-attempt `timeoutMs`. A retry policy without timeouts isn't resilience, it's a slow leak.
- Be selective with `retryIf`: 4xx errors are not the server's fault and won't get better on a second try.
- Alert on `retryAmplificationFactor > 1.5`.
### Don't

- Don't retry without backoff. Immediate retries are how you turn a 50ms blip into a 5-second outage.
- Don't ignore `Retry-After`. The server told you when to come back; come back then.
- Don't construct a new budget per request. The whole point is to share the count across calls.
- Don't silently swallow `onRetry` callbacks. Log them, even at debug level — when the postmortem happens you'll want them.
## Wrapping Up
Retries are one of those topics where everyone knows roughly what to do, the official advice is fine, and yet production systems still fall over for retry-amplification reasons multiple times a year at almost every company. The gap is that "exponential backoff with jitter" — what most retry libraries give you — is necessary but not sufficient. You also need to bound the aggregate retry traffic, listen to the server, and stop retrying entirely when the downstream is clearly cooked.
polite-retry packages those ideas into a small TypeScript library with no runtime dependencies, with the algorithms calibrated against the analysis in the paper. If you've got a Node.js service that talks to anything else over the network — and that's basically all of them — it's worth replacing your hand-rolled for loop with this.
```bash
npm install polite-retry
```
MIT licensed. Issues, PRs, and benchmarks welcome.