undefined | Better HN

0 pointsjustapassenger3y ago0 comments

This. People working in IT naturally think keeping IT systems up 100% time is most important. And depending on the business it often is, but it all costs money. Running a business is about managing costs and risks.

- Is it worth to spend 20% more on IT to keep our site up 99.99% vs 99%?

- Is it worth to have 3 suppliers for every part that our business depends, with each of them being contracted to be able to supply 2x more, in case other supplier has issues? And pay a big premium for that?

- Is it worth to have offices across the globe, fully staffed and trained to be able to take on any problem, in case there's big electrical outage/pandemic/etc in other part of the world?

I'm not saying that some of those outages aren't results of clowny/incompetent design. But "site sometimes goes down" can be often a very valid option.

0 comments

tetha3y ago

I've had some interesting discussions about this with a bunch of representatives of our larger B2B customers about this. Interestingly enough, to them, a controlled downtime of 2-4 hours with an almost guaranteed success is preferable, compared to a more complex, probably working zero-downtime effort that might leave the system in a messed up - or not messed up - state.

To them it's much easier to communicate "Hey, our customer service is going to be degraded on the second saturday in october, call on monday" to their customers 1-2 month in advance, prepare to have the critical information without our system, and have agents tell people just that.

This has really started to change my thoughts of how to approach, e.g. a major postgres update. In our case, it's probably the better way to just take a backup, shutdown everything, do an offline upgrade and rebuild & restore if things unexpectedly go wrong. We can totally test the happy case to death, and if the happy case works, we're done in 2 hours for the largest systems with minimal risk. 4 hours if we have to recover from nothing, also tested.

And you know, at that point, is it really economical to spend weeks to plan and weeks to test a zero downtime upgrade that's hard to test, because of load on the cluster?

Sebb7673y ago

At least in my experience, AWS downtime also only accounts for a minor share of the total downtime; the major source are crashes and bugs in the application you're actually trying to host. Being completely HA and independent of AZ crashes/bug is extremely hard and time intensive and usually not worth it compared to investing that time to get your app to run smoothly.

fnordpiglet3y ago

Yes but when someone else causes your downtime it’s fun to sit around and snipe at them for fun.

aledalgrande3y ago

I think a good trade off, if your infra is in TF, is to be able to run your scripts with a parameterized AZ/region. That way you can reduce the downtime even more at a fraction of the cost. (assuming the services that are down are not the base layers of AWS, like the 2020 outage)

Sebb7673y ago

If you can get the data out of the downed AZ, don't have state you need to transfer and are not shot in the foot once the primary replica comes online again. I've rarely deployed an app where it was as easy as just to change a region variable.

aledalgrande3y ago

Yeah the data stores are the ones that I would always keep multi AZ no matter what. Everything else is stateless and can be moved quickly.

1 more reply

j / k navigate · click thread line to collapse

0 comments

tetha3y ago

And you know, at that point, is it really economical to spend weeks to plan and weeks to test a zero downtime upgrade that's hard to test, because of load on the cluster?

Sebb7673y ago

fnordpiglet3y ago

Yes but when someone else causes your downtime it’s fun to sit around and snipe at them for fun.

aledalgrande3y ago

Sebb7673y ago

aledalgrande3y ago

Yeah the data stores are the ones that I would always keep multi AZ no matter what. Everything else is stateless and can be moved quickly.

1 more reply

j / k navigate · click thread line to collapse