The next step is usually local circuit breakers. The two easiest to implement are terminating the request if the error rate to the service over the last <window> is greater than x%, and terminating the request (or disabling retries) if the % of requests that are retries over the last <window> is greater than x%.
i.e. don't bother sending a request if 70% of requests have errored in the last minute, and don't bother retrying if 50% of the requests we've sent in the last minute have already been retries.
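A rough sketch of what that could look like (names and thresholds are my own; a real implementation would likely keep one window for errors and one for retries, and think about thread safety):

```python
import time
from collections import deque

class SlidingWindowBreaker:
    """Tracks boolean outcomes (e.g. 'was this an error?') over a rolling
    time window and trips when the rate exceeds a threshold."""

    def __init__(self, window_seconds=60.0, max_rate=0.7):
        self.window = window_seconds
        self.max_rate = max_rate
        self.events = deque()  # (timestamp, flagged) pairs

    def record(self, flagged):
        self.events.append((time.monotonic(), flagged))

    def _prune(self):
        # Drop events that fell out of the rolling window.
        cutoff = time.monotonic() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def allow_request(self):
        self._prune()
        if not self.events:
            return True  # no data yet: let requests through
        flagged = sum(1 for _, f in self.events if f)
        return flagged / len(self.events) < self.max_rate
```

You'd instantiate one breaker with `max_rate=0.7` fed by error outcomes, and another with `max_rate=0.5` fed by "was this request a retry?" flags.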
The Google SRE book describes plenty of other basic techniques for making retries safe.
(Yes, there should also be the non-abstracted direct path for cases where you do want to roll your own).
There is a school of thought which argues that the best retry pattern is no retry at all: just let the client fail and handle that state.
One of the driving arguments is that retries are a lazy way to try to move faults from the client onto the server, and in the process cause more harm (i.e., DDoS).
Sometimes complex means wrong, and all these retry strategies are getting progressively more complex at the expense of hammering servers with traffic way beyond the volume they were designed to handle. How is that a decent tradeoff?
Some failures really are random; let's say 0.1% of requests fail. For a sufficiently complex backend/operation, one user request can easily generate 100 internal requests that can fail. If you don't retry, this adds up to a non-negligible chance that a whole user-facing operation fails and all 100 requests have to be made again - you actually increased the number of requests that had to be made! As an extreme example, imagine that while training ChatGPT one request failed and the whole training run had to be started from scratch because we don't do retries.
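To put a number on it (assuming the failures are independent):

```python
# Chance that at least one of 100 independent sub-requests fails,
# when each fails with probability 0.001.
p_fail = 0.001
n = 100
p_any_failure = 1 - (1 - p_fail) ** n
print(round(p_any_failure, 3))  # → 0.095
```

So without retries, roughly 1 in 10 user-facing operations would fail outright, each one regenerating all 100 internal requests.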
What the author didn't mention: sometimes you want to add jitter to delay the first request too, if requests happen immediately after some event from the server (like the server waking up). If you don't do this, you may crash the server, and if your exponential backoff counter is not global you can even put the server into a cyclic restart.
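Sketch of what I mean, with the first attempt jittered too (names and defaults are mine):

```python
import random

def backoff_delays(attempts, base=0.1, cap=30.0, first_attempt_jitter=1.0):
    """Full-jitter exponential backoff where even attempt 0 is delayed
    by a random amount, so clients that all wake up at the same moment
    (e.g. right after a server restart) don't stampede in lockstep."""
    delays = [random.uniform(0, first_attempt_jitter)]  # stagger attempt 0
    for n in range(1, attempts):
        delays.append(random.uniform(0, min(cap, base * 2 ** n)))
    return delays
```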
Funnily, you'll notice that some of the visualisations have the clients staggering their first request. It's exactly for this reason. I wanted the visualisations to be as deterministic as possible while still feeling somewhat realistic. This staggering was a bit of a compromise.
Not sure what is meant by "if your exponential backoff counter is not global", though. Would love to know more about that.
I had fun with the details of the explosion animation. When it explodes, the number of requests that come out is the actual number of in-progress requests.
In general the phenomenon is known as _metastable failure_, which can be triggered when there is more work to do during a failure than during a normal run.
With retries, the client does more work within the same amount of time, compared to doing nothing or backing off exponentially.
That being said, processes should ideally be failing in ways which make it clear whether an error is retryable or not.
I don't think exponential backoff was ever accused of being overrated. Retries in general have been criticized as counterproductive in multiple ways, including the risk of creating self-inflicted DDoS attacks, and exponential backoff can cause untenable performance and usability problems without adding any upside. These are known problems, but none of them amounts to "overrating".
I’m the author of this post, and happy to answer any questions :)
I noticed this mid-read, while looking at one of the animations with 28 clients: they would hammer the server but then suddenly go into a wait state, for no apparent reason.
Later in the final animation with debug mode enabled, the reason becomes apparent for those who click on the Controls button:
Retry Strategy > Max Attempts = 10
It makes sense, because in the worst case when everything goes wrong, a client should reach a point where it desists and just aborts with a "service not available" error.
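In code, that "desist and abort" shape is just a bounded loop (sketch; a real client would sleep with backoff and jitter between attempts, elided here for brevity):

```python
def call_with_retries(op, max_attempts=10):
    """Retry op() until it succeeds or max_attempts is exhausted,
    then surface a 'service not available' style error to the caller."""
    last_err = None
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as err:
            last_err = err
            # (real code: sleep here with backoff + jitter)
    raise RuntimeError("service not available") from last_err
```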
I'll look at giving it a nod in the text. Thank you for the feedback. :)
One thing I noticed is that the post is very first-principles right up to where it reaches exponential backoff. At that point, it quickly jumps to "and here's exponential backoff and here's some good parameters". But I've worked on a lot of systems that got those wrong. In both directions: too-short caps that were insufficient for the underlying system to recover and too-long caps so that even when the servers _did_ recover, clients weren't even going to try again for way too long (e.g., 2 days). It'd be neat to have another section or two exploring those tradeoffs.
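The cap tradeoff is easy to see if you just print the delay schedule (illustrative numbers, not a recommendation):

```python
def capped_exponential(attempt, base=1.0, cap=60.0):
    """Exponential backoff delay for a given attempt, clamped at cap."""
    return min(cap, base * 2 ** attempt)

# cap=60s: once attempts pile up, the client still retries every minute.
# cap=2 days (172800s): after a long outage the client can go silent for
# days, even though the server recovered hours earlier.
for cap in (60.0, 172800.0):
    print(cap, [capped_exponential(a, cap=cap) for a in range(8)])
```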
I really want one of these visual explorations for the idea of margin. Concretely: it's common to have systems at, say, 88% CPU utilization that appear to be working great. Then you ramp them up to like 92% and start seeing latency bubbles of multiple seconds or even tens of seconds. We tend to think of that idle time as waste, but it's essential for surviving transient blips in load. I increasingly feel like this concept is really fundamental and ought to be taught in like high school because it applies so many places (e.g., emergency funds, in the realm of personal finance).
A rudimentary look at the source code showed a <traffic-simulation/> element, but I'm not up to date enough with web standards to know where to look in your JS bundle to guess at the framework!
I've been thinking about creating a separate repo to house the source code of posts I've finished so people can see it. I don't like all the bundling and minification but sadly it serves a very real purpose to the end user experience (faster load speeds on slow connections).
Until then feel free to email me (you'll find my address at the bottom of my site) and I'd be happy to share a zip of this post with you.
But there is an additional piece of info everyone who writes clients needs to see: And that's what people like me, who implement backend services, may do if clients ignore such wisdom.
Because: I'm not gonna let bad clients break my service.
What that means in practice: Clients are given a choice. They can behave, or they can get:

HTTP 429 Too Many Requests

The article is about making requests, and strategies to implement when the request fails. By definition, these are clients. Was there any ambiguity?
> But there is an additional piece of info everyone who writes clients needs to see: And that's what people like me, who implement backend services, may do if clients ignore such wisdom.
I don't think this is the obscure detail you are making it out to be. A few of the most basic and popular retry strategies are designed explicitly to a) handle throttled responses from servers, and b) mitigate the risk of causing self-inflicted DDoS attacks. This article covers a few of those, such as exponential backoff and jitter.
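For example, a well-behaved client treats a 429 as a floor on its next delay rather than just another error. A sketch, not tied to any HTTP library - the `send()` callable and its `(status, retry_after, body)` tuple are my own convention:

```python
import random

def with_throttle_respect(send, sleep, max_attempts=5, base=0.5, cap=30.0):
    """Retry send(); on 429, wait at least the server's Retry-After,
    otherwise use full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status < 400:
            return body
        if attempt == max_attempts - 1:
            raise RuntimeError(f"giving up (last status {status})")
        delay = random.uniform(0, min(cap, base * 2 ** attempt))
        if status == 429 and retry_after is not None:
            delay = max(delay, retry_after)  # never sooner than asked
        sleep(delay)
```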
Did I say there was?
> I don't think this is the obscure detail you are making it out to be
Where did I call this detail "obscure"?
My post is meant as a light-hearted, humorous note pointing out one of the many reasons why it is in general a good idea for clients to implement the principles outlined in the article.
At some point in the distant (internet time) past, a sales engineer, or the equivalent, had written a sample script to demonstrate basic uses of the API. As many of you quickly guessed, customers went on a copy/paste rampage and put this sample script into production.
The script went into a tight loop on failure, naively using a simple library that did not include any back-off or retry in the request. I'm not deeply familiar with how the company dealt with this situation. I am aware there was a complex load balancing system across distributed infrastructure, but also, just a lot of horsepower.
Lesson for anyone offering an API product: don't hand out example code with a self-own, because it will become someone's production code.
The simulation retries failed requests using various retry strategies, and after a successful request waits a configured amount of time before sending the next one.