The next step is usually local circuit breakers. The two easiest to implement are terminating the request if the error rate to the service over the last <window> is greater than x%, and terminating the request (or disabling retries) if the % of requests that are retries over the last <window> is greater than x%.
i.e. don't bother sending a request if 70% of requests have errored in the last minute, and don't bother retrying if 50% of the requests we've sent in the last minute have already been retries.
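A rough sketch of what that could look like (names and thresholds are my own; a real implementation would likely keep one window for errors and one for retries, and think about thread safety):

```python
import time
from collections import deque

class SlidingWindowBreaker:
    """Tracks boolean outcomes (e.g. 'was this an error?') over a rolling
    time window and trips when the rate exceeds a threshold."""

    def __init__(self, window_seconds=60.0, max_rate=0.7):
        self.window = window_seconds
        self.max_rate = max_rate
        self.events = deque()  # (timestamp, flagged) pairs

    def record(self, flagged):
        self.events.append((time.monotonic(), flagged))

    def _prune(self):
        # Drop events that fell out of the rolling window.
        cutoff = time.monotonic() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def allow_request(self):
        self._prune()
        if not self.events:
            return True  # no data yet: let requests through
        flagged = sum(1 for _, f in self.events if f)
        return flagged / len(self.events) < self.max_rate
```

You'd instantiate one breaker with `max_rate=0.7` fed by error outcomes, and another with `max_rate=0.5` fed by "was this request a retry?" flags.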
The Google SRE book describes plenty of other basic techniques for making retries safe.
(Yes, there should also be the non-abstracted direct path for cases where you do want to roll your own).
There is a school of thought which argues that the best retry pattern is no retry at all: just let the client fail and handle that state.
One of the driving arguments is that retries are a lazy way to try to move faults from the client onto the server, and in the process cause more harm (i.e., DDoS).
Sometimes complex means wrong, and all these retry strategies are getting progressively more complex at the expense of hammering servers with traffic way beyond the volume they were designed to handle. How is that a decent tradeoff?
Some failures really are random; let's say 0.1% of requests fail. For a sufficiently complex backend/operation, one user request can easily generate 100 internal requests that can fail. If you don't retry, this adds up to a non-negligible chance that a whole user-facing operation fails and all 100 requests have to be made again - you actually increased the number of requests that had to be made! As an extreme example, imagine that while training ChatGPT one request failed and the whole training run had to be started from scratch because we don't do retries.
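To put a number on it (assuming the failures are independent):

```python
# Chance that at least one of 100 independent sub-requests fails,
# when each fails with probability 0.001.
p_fail = 0.001
n = 100
p_any_failure = 1 - (1 - p_fail) ** n
print(round(p_any_failure, 3))  # → 0.095
```

So without retries, roughly 1 in 10 user-facing operations would fail outright, each one regenerating all 100 internal requests.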
What the author didn't mention: sometimes you want to add jitter to delay the first request too, if requests happen immediately after some event from the server (like the server waking up). If you don't do this, you may crash the server, and if your exponential backoff counter is not global you can even put the server into a cyclic restart.
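Sketch of what I mean, with the first attempt jittered too (names and defaults are mine):

```python
import random

def backoff_delays(attempts, base=0.1, cap=30.0, first_attempt_jitter=1.0):
    """Full-jitter exponential backoff where even attempt 0 is delayed
    by a random amount, so clients that all wake up at the same moment
    (e.g. right after a server restart) don't stampede in lockstep."""
    delays = [random.uniform(0, first_attempt_jitter)]  # stagger attempt 0
    for n in range(1, attempts):
        delays.append(random.uniform(0, min(cap, base * 2 ** n)))
    return delays
```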
Funnily, you'll notice that some of the visualisations have the clients staggering their first request. It's exactly for this reason. I wanted the visualisations to be as deterministic as possible while still feeling somewhat realistic. This staggering was a bit of a compromise.
Not sure what is meant by "if your exponential backoff counter is not global", though. Would love to know more about that.
I had fun with the details of the explosion animation. When it explodes, the number of requests that come out is the actual number of in-progress requests.
In general the phenomenon is known as _metastable failure_, which can be triggered when there is more work to do during a failure than during a normal run.
With retries, the client does more work within the same amount of time, compared to doing nothing or backing off exponentially.
That being said, processes should ideally be failing in ways which make it clear whether an error is retryable or not.
I don't think exponential backoff was ever accused of being overrated. Retries in general have been criticized as counterproductive in multiple ways, including the risk of creating self-inflicted DDoS attacks, and exponential backoff can cause untenable performance and usability problems without adding any upside. These are known problems, but none of them amounts to "overrating".
I’m the author of this post, and happy to answer any questions :)
I noticed this mid-read, while looking at one of the animations with 28 clients: they would hammer the server but then suddenly go into a wait state, for no apparent reason.
Later in the final animation with debug mode enabled, the reason becomes apparent for those who click on the Controls button:
Retry Strategy > Max Attempts = 10
It makes sense, because in the worst case when everything goes wrong, a client should reach a point where it desists and just aborts with a "service not available" error.
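In code, that "desist and abort" shape is just a bounded loop (sketch; a real client would sleep with backoff and jitter between attempts, elided here for brevity):

```python
def call_with_retries(op, max_attempts=10):
    """Retry op() until it succeeds or max_attempts is exhausted,
    then surface a 'service not available' style error to the caller."""
    last_err = None
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as err:
            last_err = err
            # (real code: sleep here with backoff + jitter)
    raise RuntimeError("service not available") from last_err
```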
I'll look at giving it a nod in the text. Thank you for the feedback. :)
One thing I noticed is that the post is very first-principles right up to where it reaches exponential backoff. At that point, it quickly jumps to "and here's exponential backoff and here's some good parameters". But I've worked on a lot of systems that got those wrong. In both directions: too-short caps that were insufficient for the underlying system to recover and too-long caps so that even when the servers _did_ recover, clients weren't even going to try again for way too long (e.g., 2 days). It'd be neat to have another section or two exploring those tradeoffs.
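The cap tradeoff is easy to see if you just print the delay schedule (illustrative numbers, not a recommendation):

```python
def capped_exponential(attempt, base=1.0, cap=60.0):
    """Exponential backoff delay for a given attempt, clamped at cap."""
    return min(cap, base * 2 ** attempt)

# cap=60s: once attempts pile up, the client still retries every minute.
# cap=2 days (172800s): after a long outage the client can go silent for
# days, even though the server recovered hours earlier.
for cap in (60.0, 172800.0):
    print(cap, [capped_exponential(a, cap=cap) for a in range(8)])
```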
I really want one of these visual explorations for the idea of margin. Concretely: it's common to have systems at, say, 88% CPU utilization that appear to be working great. Then you ramp them up to like 92% and start seeing latency bubbles of multiple seconds or even tens of seconds. We tend to think of that idle time as waste, but it's essential for surviving transient blips in load. I increasingly feel like this concept is really fundamental and ought to be taught in like high school because it applies so many places (e.g., emergency funds, in the realm of personal finance).
A rudimentary look at the source code showed a <traffic-simulation/> element, but I'm not up to date enough with web standards to know where to look in your JS bundle to guess at the framework!
I've been thinking about creating a separate repo to house the source code of posts I've finished so people can see it. I don't like all the bundling and minification but sadly it serves a very real purpose to the end user experience (faster load speeds on slow connections).
Until then feel free to email me (you'll find my address at the bottom of my site) and I'd be happy to share a zip of this post with you.
But there is an additional piece of info everyone who writes clients needs to see: And that's what people like me, who implement backend services, may do if clients ignore such wisdom.
Because: I'm not gonna let bad clients break my service.
What that means in practice: Clients are given a choice. They can behave, or they can get:

HTTP 429 Too Many Requests

The article is about making requests, and strategies to implement when the request fails. By definition, these are clients. Was there any ambiguity?
> But there is an additional piece of info everyone who writes clients needs to see: And that's what people like me, who implement backend services, may do if clients ignore such wisdom.
I don't think this is the obscure detail you are making it out to be. A few of the most basic and popular retry strategies are designed explicitly to a) handle throttled responses from servers, and b) mitigate the risk of causing self-inflicted DDoS attacks. This article covers a few of those, such as exponential backoff and jitter.
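For example, a well-behaved client treats a 429 as a floor on its next delay rather than just another error. A sketch, not tied to any HTTP library - the `send()` callable and its `(status, retry_after, body)` tuple are my own convention:

```python
import random

def with_throttle_respect(send, sleep, max_attempts=5, base=0.5, cap=30.0):
    """Retry send(); on 429, wait at least the server's Retry-After,
    otherwise use full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status < 400:
            return body
        if attempt == max_attempts - 1:
            raise RuntimeError(f"giving up (last status {status})")
        delay = random.uniform(0, min(cap, base * 2 ** attempt))
        if status == 429 and retry_after is not None:
            delay = max(delay, retry_after)  # never sooner than asked
        sleep(delay)
```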
Did I say there was?
> I don't think this is the obscure detail you are making it out to be
Where did I call this detail "obscure"?
My post is meant as a light-hearted, humorous note pointing out one of the many reasons why it is in general a good idea for clients to implement the principles outlined in the article.
At some point in the distant (internet time) past, a sales engineer, or the equivalent, had written a sample script to demonstrate basic uses of the API. As many of you quickly guessed, customers went on a copy/paste rampage and put this sample script into production.
The script went into a tight loop on failure, naively using a simple library that did not include any back-off or retry in the request. I'm not deeply familiar with how the company dealt with this situation. I am aware there was a complex load balancing system across distributed infrastructure, but also, just a lot of horsepower.
Lesson for anyone offering an API product: don't hand out example code with a self-own, because it will become someone's production code.
The simulation retries failed requests using various retry strategies, and after a successful request waits a configured amount of time before sending the next one.