undefined | Better HN

0 pointsporknubbins3y ago0 comments

The default, naive assumption should always have been programs keep running indefinitey on their own. If thats not the goal of software then I don’t know what is (might as well go back to switchboard operators). Real world experience tells us that, to the contrary, all software goes down and requires specialist intervention eventually. I think a lot of people just jumped to the second level based on political motivations rather than deep knowledge of system failures.

0 comments

brookst3y ago

> all software goes down and requires specialist intervention eventually

Well, that’s it, isn’t it? How many software systems need to keep running for Twitter to remain more or less functional?

If there are 10 critical systems that are running at four 9’s, you’d expect 3.6 hours of downtime a year, or about 90 days of uptime at a stretch if I have my math right.

If there are 100 critical systems running at 3 9’s, you’d expect 2.5 hours of downtime per day.

So yeah, all software should keep running. But it doesn’t. And something like Twitter isn’t “a software”, it’s a very large assembly of lots of different software systems and the exponential math that dependencies create.

foobazgt3y ago

Yep, and when one of the SEVs rolls around that would have been small (say 5m of downtime fixed with a flag flip), it instead will have a nontrivial chance of escalating into a major multi-hour/multi-day outage without the right institutional knowledge.

I'd guesstimate that Twitter probably has dozens of services that are in the critical path of an average user interaction. It's hard to keep even logically optional dependencies truly optional in large scale systems involving many people.

However Twitter didn't die in the past when fail whales ruled its day, so they probably won't kill it now. It's just not that kind of business. (In contrast, a one hour outage had me directly apologizing to our largest customers on the phone). That said, Twitter can only be unstable and lack feature growth for so long before something else takes its place, so Musk is on a clock.

bagels3y ago

It has effects on engagement, retention, and ad revenue though.

1 more reply

bagels3y ago

Some of those 9s come with an assumption that someone is around who knows what buttons to push when it really goes awry.

sroussey3y ago

I had MySQL running on some bare metal for many years without a restart.

I was terrified to update the kernel at that point, knowing that system disk had been running continuously for many years, and had no faith it would restart successfully.

Finally got two new servers to replace these (with these new SSD things!) and after migration, sure enough, one of the old servers failed to boot.

kyriakos3y ago

Even if your mysql instance and hardware had run indefinitely, if a table is being written to it will eventually run out of disk space or key space and crash. How long it will take depends on the application but it will happen eventually and if no one is around to fix it...

kova123y ago

Having single point of failure, and also not even knowing if it will even come up after reboot is a horrible way to run the service

xyzwave3y ago

Reminds me of an Alan J. Perlis quote:

> Is it possible that software is not like anything else, that it is meant to be discarded: that the whole point is to see it as a soap bubble?

http://www.cs.yale.edu/homes/perlis-alan/quotes.html

sangnoir3y ago

> I think a lot of people just jumped to the second level based on political motivations rather than deep knowledge of system failures.

Anyone who has ever been oncall can intuit how often stuff breaks in big or little ways. Sometimes it's transient and goes away, sometimes it can be filed away to be fixed in the next year, but sometimes, it turns out to be an all-hands-on-deck crisis for a team, or 5.

Izkata3y ago

> The default, naive assumption should always have been programs keep running indefinitey on their own.

...for people who understand software to some extent. I get the feeling a lot of people see it more like a hamster wheel, where once the developers are gone it immediately starts noticeably slowing down as it stops (and are confused when that doesn't happen).

dcow3y ago

I have a piece of Rust software that has not gone down in its entire lifetime.

onionisafruit3y ago

Me too. It’s a hello world web service I wrote while reading a tutorial after work this afternoon.

mradek3y ago

Thanks for the laugh I needed that

EdwardDiego3y ago

Good for you!

Now, if your Rust code was a distributed system that handles spiky loads from ~330m users, and processes petabytes of data, then I'd consider your comparison relevant to Twitter.

But I'm going to assume it's not relevant.

P.S., I've written Java services that never went down, because they had a well defined domain and all potential errors were handled. But, I'm not about to compare that to all of frigging Twitter.

dcow3y ago

I wasn’t attempting to compare anything to Twitter…

silisili3y ago

The infra usually matters way more than the code. RAM or a disk will typically fail before the Linux kernel, and it's written in the boogeyman language.

polio3y ago

By definition, no software goes down until its lifetime ends.

jonahrd3y ago

I would say the lifetime of a piece of software is different from the lifetime of the process running it.

j / k navigate · click thread line to collapse

0 comments

brookst3y ago

> all software goes down and requires specialist intervention eventually

Well, that’s it, isn’t it? How many software systems need to keep running for Twitter to remain more or less functional?

If there are 10 critical systems that are running at four 9’s, you’d expect 3.6 hours of downtime a year, or about 90 days of uptime at a stretch if I have my math right.

If there are 100 critical systems running at 3 9’s, you’d expect 2.5 hours of downtime per day.

foobazgt3y ago

bagels3y ago

It has effects on engagement, retention, and ad revenue though.

1 more reply

bagels3y ago

Some of those 9s come with an assumption that someone is around who knows what buttons to push when it really goes awry.

sroussey3y ago

I had MySQL running on some bare metal for many years without a restart.

I was terrified to update the kernel at that point, knowing that system disk had been running continuously for many years, and had no faith it would restart successfully.

Finally got two new servers to replace these (with these new SSD things!) and after migration, sure enough, one of the old servers failed to boot.

kyriakos3y ago

kova123y ago

Having single point of failure, and also not even knowing if it will even come up after reboot is a horrible way to run the service

xyzwave3y ago

Reminds me of an Alan J. Perlis quote:

> Is it possible that software is not like anything else, that it is meant to be discarded: that the whole point is to see it as a soap bubble?

http://www.cs.yale.edu/homes/perlis-alan/quotes.html

sangnoir3y ago

> I think a lot of people just jumped to the second level based on political motivations rather than deep knowledge of system failures.

Izkata3y ago

> The default, naive assumption should always have been programs keep running indefinitey on their own.

dcow3y ago

I have a piece of Rust software that has not gone down in its entire lifetime.

onionisafruit3y ago

Me too. It’s a hello world web service I wrote while reading a tutorial after work this afternoon.

mradek3y ago

Thanks for the laugh I needed that

EdwardDiego3y ago

Good for you!

Now, if your Rust code was a distributed system that handles spiky loads from ~330m users, and processes petabytes of data, then I'd consider your comparison relevant to Twitter.

But I'm going to assume it's not relevant.

P.S., I've written Java services that never went down, because they had a well defined domain and all potential errors were handled. But, I'm not about to compare that to all of frigging Twitter.

dcow3y ago

I wasn’t attempting to compare anything to Twitter…

silisili3y ago

The infra usually matters way more than the code. RAM or a disk will typically fail before the Linux kernel, and it's written in the boogeyman language.

polio3y ago

By definition, no software goes down until its lifetime ends.

jonahrd3y ago

I would say the lifetime of a piece of software is different from the lifetime of the process running it.

j / k navigate · click thread line to collapse