You could just distribute your workloads using... a queue, and not have this problem, or have to pay for and maintain backup equipment, etc.
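To make that concrete, here's a minimal sketch of the queue pattern using only Python's standard library. This is an illustration, not a production design: a real deployment would swap the in-process queue for a durable broker (Redis, RabbitMQ, SQS, and the like), but the shape of the solution is the same. Workers pull from a shared queue, so losing a worker degrades throughput instead of causing an outage, and there is no failover orchestration to build or maintain.

    # Sketch: distribute jobs across workers via a shared queue.
    # If one worker dies, the others keep draining the queue.
    import queue
    import threading

    jobs = queue.Queue()

    def worker(worker_id):
        while True:
            job = jobs.get()
            if job is None:          # sentinel: shut this worker down
                jobs.task_done()
                return
            print(f"worker {worker_id} processed job {job}")
            jobs.task_done()

    # Any number of workers can pull from the same queue.
    workers = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
    for w in workers:
        w.start()

    for job in range(10):
        jobs.put(job)
    jobs.join()                      # wait until every job is processed

    for _ in workers:
        jobs.put(None)               # one sentinel per worker
    for w in workers:
        w.join()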
The point here is that 99% of companies are not in that scenario, so they should not emulate the very expensive distributed architectures used by Google and a few other companies that ARE in that scenario.
For almost all companies on the smaller side, the correct move is to take the occasional downtime, because the tiny revenue loss will be much smaller than the large and ongoing costs of building and maintaining a complex distributed system.
From the post directly above: “Most businesses…”
The thread above is specifically discussing businesses which won't lose a significant amount of money if they go down for a few minutes. They also postulate that most businesses fall into this category, which I'm inclined to agree with.
In your typical seed, series A, or series B SaaS startup, this is most often not the case. At the same time, these are the companies that fueled the proliferation of microservice-based architectures, often with a single point of failure in the message queue or in the cluster orchestration. They shifted easy-to-fix problems into hard-to-fix problems.
Loads of software issues, of course.
I know this is just an anecdote, but I'm pretty certain reliability has increased by one or two orders of magnitude since the 90s.
When I worked with small firms that used Kubernetes, we had more issues with our Kubernetes code than with machines failing. The solution to the theoretical problem was the cause of real issues, and it was expensive to keep fixing them.