Avoiding Fallback in Distributed Systems (2020)

(aws.amazon.com)

54 points · omaras · 4y ago · 11 comments

This is the main valuable insight imho: "Distributed fallback strategies [can] ... in our experience ... increase the scope of impact of failures as well as increasing recovery times." (The ~strawman malloc analogy is not entirely convincing.)

But then again now we consider physical systems, say a spaceship, which require critical capabilities and operational regimes, and ask if fallback fault management is really a 'bad idea'.

sitkack · 4y ago

I am trying to be more positive in general, so take everything with a grain of salt; I also work for a Big Cloud provider.

I read that as "we work really hard to engineer crystalline fault lines vertically through our stack so the system has a nice clean single plane of fracture."

Given their track record of reliability and the unsubstantiated claims in the article, I can't even. In the real world, every action that has actually saved a system was an instance of fallback.

Having branch-free code with one way to fail is nice from a reasoning perspective, and reasoning came up in more than one of the points in the article. But ease of reasoning is a different goal than reliability. I can use a reliable automatic transmission without reasoning about it.

Fallback fixes issues that failover doesn't. Rather than putting out a piece that encourages someone not to do something (sometimes this is important, granted), encouraging folks to use immutability would be a larger global positive.

Immutability really does change everything.

https://cacm.acm.org/magazines/2016/1/195722-immutability-ch...

EGreg · 4y ago

I mean, I can definitely see their point. I have worked in distributed systems for a decade and I can tell you, when you kick the can downstream, it just gets worse later when it's spread out and systemic.

You should nip overloads in the bud, and not propagate them. Have backpressure be at the protocol level, and every node only deals with its neighbors.
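
A minimal Python sketch of that idea, assuming a bounded inbox per node and a sender that only talks to its direct neighbor. The names `Node`, `offer`, and `send_all` are hypothetical and only for illustration; a real protocol would signal backpressure over the wire rather than through a return value.

```python
import queue


class Node:
    """A node that accepts work only while its bounded inbox has room."""

    def __init__(self, capacity: int) -> None:
        self.inbox: queue.Queue = queue.Queue(maxsize=capacity)

    def offer(self, item) -> bool:
        """Try to enqueue; return False so the sending neighbor can slow down."""
        try:
            self.inbox.put_nowait(item)
            return True
        except queue.Full:
            # Reject at this hop instead of passing the overload downstream.
            return False


def send_all(node: Node, items: list) -> tuple:
    """Return (accepted, rejected); the sender keeps the rejected items."""
    accepted, rejected = [], []
    for item in items:
        (accepted if node.offer(item) else rejected).append(item)
    return accepted, rejected
```

With a capacity of 2 and three items offered, the third is rejected at the neighbor boundary and stays with the sender, which is the point: the overload never reaches the nodes behind this one.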

In fact, I would go so far as to say that the main reason for these failures is that we have monolithic, global addressing systems like DNS or IP routing tables, which let me send spam email to anyone, or DDoS a site from many machines at once. It's totally discontinuous.

A good distributed system should instead distribute capabilities continuously. Each node can grant capabilities only to trusted neighbors, and revoke any that have been misused. Neighbors can then delegate some capabilities to others, or, if the node wants, forward an invitation to them to become a neighbor.
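
One way to picture grant, delegate, and revoke is a table that records who handed a capability to whom. This is only a sketch under assumptions: `CapabilityTable` and its methods are hypothetical, it covers a single capability on a single node, and it ignores authentication entirely.

```python
class CapabilityTable:
    """Tracks who holds one capability on one node, and who gave it to them."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.grantor: dict = {}  # holder -> who granted or delegated it

    def grant(self, neighbor: str) -> None:
        """The owner grants the capability directly to a trusted neighbor."""
        self.grantor[neighbor] = self.owner

    def delegate(self, holder: str, other: str) -> None:
        """A current holder passes the capability one hop further."""
        if not self.allows(holder):
            raise PermissionError(f"{holder} does not hold the capability")
        self.grantor[other] = holder

    def revoke(self, holder: str) -> None:
        """Remove a holder; everyone who received it through them loses it too."""
        self.grantor.pop(holder, None)

    def allows(self, holder: str) -> bool:
        """Valid only if an unbroken chain of grants leads back to the owner."""
        seen = set()
        while holder != self.owner:
            if holder in seen or holder not in self.grantor:
                return False
            seen.add(holder)
            holder = self.grantor[holder]
        return True
```

Because validity is checked by walking the chain back to the owner, revoking a misbehaving neighbor also cuts off whatever that neighbor delegated, without a global registry.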

That would also solve all the issues about “real names policy”, and other crap like that. It shouldn’t matter whether you are “the real” Bill Gates or not. Your email shouldn’t be accessible to the whole world.

And websites would also be stored using a FileCoin-type market, which recruits more machines as more readers SPEND MONEY using micropayments to access the files.

Right now micropayments aren’t feasible, so instead we essentially have the publishers pay for hosting and collect micropayments via subscriptions and bundles.

yuliyp · 4y ago

Immutability doesn't really solve everything. It provides a cleaner path for retries for writes, but still doesn't handle situations where reads fail.
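
A rough Python sketch of why retries get easier with immutable writes, assuming a content-addressed, append-only store. `AppendOnlyStore` and `put_with_retry` are hypothetical names, and `ConnectionError` stands in for whatever transient failure a real client would see.

```python
import hashlib


class AppendOnlyStore:
    """Content-addressed store: the key is the hash of the value, never overwritten."""

    def __init__(self) -> None:
        self._blobs: dict = {}

    def put(self, value: bytes) -> str:
        key = hashlib.sha256(value).hexdigest()
        # Writing the same value twice is a no-op, so a retry cannot corrupt state.
        self._blobs.setdefault(key, value)
        return key

    def get(self, key: str) -> bytes:
        # Reads can still fail: a missing key here, an unreachable node in practice.
        return self._blobs[key]


def put_with_retry(store, value: bytes, attempts: int = 3) -> str:
    """Retry a write blindly; safe only because put() is idempotent."""
    last_error: Exception = RuntimeError("attempts must be >= 1")
    for _ in range(attempts):
        try:
            return store.put(value)
        except ConnectionError as exc:  # assumed transient failure
            last_error = exc
    raise last_error
```

The retry loop needs no bookkeeping about whether the earlier attempt landed. The read path gets no such help, which is the gap the comment points at.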

I think the conclusion in the article ("don't do fallback") is misguided. Fallback code is sketchy, but sometimes it is worth it to take the time to write well-audited, well-tested fallback code to ensure a system which has high availability requirements can survive dependencies which are less reliable.

gumby · 4y ago

> But then again now we consider physical systems, say a spaceship, which require critical capabilities and operational regimes, and ask if fallback fault management is really a 'bad idea'.

Their very example -- airport notice boards -- is an example of someplace where fallback is needed. The thesis of the piece is that management of fallbacks is complicated and painful and thus increases the scope of failure, as you observed.

In other words: fallback is often but not always required, and if you can plan to avoid it, it may be better for you, depending on your application.

PaulHoule · 4y ago

I think of how the Space Shuttle had 4 computers running the same software and a backup computer running a simpler implementation of the control program.
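
The general pattern, redundant primaries that vote plus a simpler independent backup, can be sketched like this in Python. It is an illustration of the pattern only, not the Shuttle's actual logic; `vote` and `command` are hypothetical names, and integer outputs are assumed so that exact comparison is meaningful.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def vote(outputs: Sequence[int]) -> Optional[int]:
    """Return the strict-majority value, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def command(
    primaries: Sequence[Callable[[int], int]],
    backup: Callable[[int], int],
    reading: int,
) -> int:
    """Use the primaries' majority answer; otherwise hand over to the backup."""
    agreed = vote([p(reading) for p in primaries])
    if agreed is not None:
        return agreed
    # No majority: the primaries may share a software fault, so use the
    # independently written, simpler implementation.
    return backup(reading)
```

Voting masks a single faulty replica, but identical software can fail identically, which is why the backup is a different and simpler implementation.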

The flight control systems of civil aircraft like the A320 have fallback modes to handle hardware failures such as a failed angle-of-attack sensor:

https://a320podcast.libsyn.com/flight-control-laws

The 737 MAX crashed because it didn't have fallback modes.

Engine Control Units in automobiles also have fallback modes. You shouldn't get stuck just because an oxygen sensor failed, even though that means the car will have trouble balancing clean emissions, performance and fuel efficiency.

letitbeirie · 4y ago

> physical systems, say a spaceship, which require critical capabilities and operational regimes, and ask if fallback fault management is really a 'bad idea'.

Depends on context, obviously, but IME as a controls engineer, what you want is a failsafe, not a fallback.

AWS calls it a fallback when you "use a different mechanism to achieve the same result." Failsafes are all about returning the system to a stable and controllable state - if you can salvage the result that's great, but if it takes flaring off $10,000,000 worth of distillate to stabilize the system that's fine too.
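
To make the distinction concrete, here is a small Python sketch of a failsafe in a control loop. Everything in it is assumed for illustration: the `Valve` model, the proportional `gain`, and above all the choice of "fully closed" as the safe state, which depends entirely on the real process.

```python
from dataclasses import dataclass


@dataclass
class Valve:
    opening: float = 0.5  # fraction open; 0.0 means closed


class SensorFault(Exception):
    """Raised when the pressure reading cannot be trusted."""


def control_step(valve: Valve, read_pressure, setpoint: float, gain: float = 0.01) -> str:
    try:
        pressure = read_pressure()
    except SensorFault:
        # Failsafe: give up on the setpoint and drive to a known stable state.
        # (A fallback would instead try another way to keep hitting the setpoint.)
        valve.opening = 0.0  # assumed safe position for this hypothetical process
        return "failsafe"
    # Normal operation: simple proportional correction, clamped to [0, 1].
    error = setpoint - pressure
    valve.opening = min(1.0, max(0.0, valve.opening + gain * error))
    return "normal"
```

The failsafe branch has exactly one outcome and does not depend on any other component working, so it is easy to test and to reason about; the cost is that the result being produced is sacrificed.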
