Why Twitter didn’t go down: From a real Twitter SRE

From this operations engineer's perspective, there are only 3 main things that bring a site down: new code, disk space, and 'outages'. If you don't push new code, your apps will be pretty stable. If you don't run out of disk space, your apps will keep running. And if your network/power/etc doesn't mysteriously disappear, your apps will keep running. And running, and running, and running.

The biggest thing that brings down a site is changes. Typically code changes, but also schema/data changes, infra/network/config changes, etc. As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine. The trick is to design it to be as immutable and simple as possible.

There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.

The last thing off the top of my head that will absolutely bring a site down over time, is expired certs. If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), they will expire, and it will be a very fun day at the office. Over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load it.
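
A minimal sketch of the kind of expiry check that catches this before the fun day arrives. The names `daysUntilExpiry` and `needsRenewal` and the 30-day `WARN_DAYS` threshold are assumptions for illustration, not anything from Twitter's tooling. The input is any date string `Date` can parse, such as the `valid_to` field on the object returned by Node's `tls.TLSSocket.getPeerCertificate()`.

```javascript
// Sketch: flag certificates that are close to expiring.
// WARN_DAYS is an assumed alert threshold, not a recommendation.
const WARN_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// validTo: the certificate's "not after" date as a parseable string.
// Returns whole days remaining (negative once the cert has expired).
function daysUntilExpiry(validTo, now = new Date()) {
  const remainingMs = new Date(validTo).getTime() - now.getTime();
  return Math.floor(remainingMs / MS_PER_DAY);
}

// True when the cert is inside the warning window or already expired.
function needsRenewal(validTo, now = new Date()) {
  return daysUntilExpiry(validTo, now) <= WARN_DAYS;
}
```

In practice something like this would run on a schedule against every cert in the fleet, including the one-off internal ones that somebody has to remember.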

mjr00 3y ago

It's crazy to think about, but many people who use and build software today, including HN readers/commenters, are young enough to have only been exposed to the SaaS, cloud-first era, where software built with microservices deployed from CI/CD systems multiple times per day is just the way things are done.

You're totally right; if you don't make changes to the software, it's unlikely to spontaneously stop working, especially after that first 6-12 months of "hardening" where bugs are found and patched.

Many people working in tech have never been exposed to a piece of software which isn't being constantly changed in small increments and forced upon end users. People are assuming that software is inherently unstable simply because they never use anything that isn't a "cloud service".

This probably comes off as "old man yells at cloud" but I'm not trying to bash cloud here. The cloud/SaaS approach has a ton of advantages for both consumers and businesses. But the average tech person in their 20s vastly underestimates how stable software can be when you aren't constantly pushing new features.

drdrey 3y ago

Another thing we noticed at Netflix was that after services didn’t get pushed for a while (weeks), performance started degrading because of things like undiscovered memory leaks, thread leaks, and disks filling up. You wouldn’t notice during normal operations because of regular autoscaling and code pushes, but code freezes tended to reveal these issues.

TranquilMarmot 3y ago

We used to have a horribly written node process that was running in a Mesos cluster (using Marathon). It had a memory leak and would start to fill up memory after about a week of running, depending on what customers were doing and if they were hitting it enough.

The solution, rather than investing time in fixing the memory leak, was to add a cron job that would kill/reset the process every three days. This was easier and more foolproof than adding any sort of intelligent monitoring around it. I think an engineer added the cron job in the middle of the night after getting paged, and it stuck around forever... at least for the 6 years I was there, and it was still running when I left.

We couldn't fix the leak because the team that made it had been let go and we were understaffed, so nobody had the time to go and learn how it worked to fix it. It wasn't a critical enough piece of infrastructure to rewrite, but it was needed for a few features that we had.
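
For contrast, a sketch of what the "intelligent monitoring" route might look like inside the process itself, assuming the scheduler restarts tasks that exit. `LIMIT_BYTES`, `CHECK_EVERY_MS`, `overLimit`, and `startWatchdog` are hypothetical names and values; `process.memoryUsage().rss` is the standard Node call for resident set size.

```javascript
// Sketch: exit once resident memory crosses a limit so a supervisor restarts us.
// Both constants are assumed values chosen for illustration.
const LIMIT_BYTES = 1.5 * 1024 ** 3; // 1.5 GiB ceiling
const CHECK_EVERY_MS = 60 * 1000;    // check once a minute

// Pure check, kept separate so it can be tested without a real leak.
function overLimit(rssBytes, limitBytes = LIMIT_BYTES) {
  return rssBytes > limitBytes;
}

// Call once at startup; assumes something outside the process restarts it.
function startWatchdog() {
  const timer = setInterval(() => {
    if (overLimit(process.memoryUsage().rss)) {
      console.error("memory limit exceeded, exiting for restart");
      process.exit(1);
    }
  }, CHECK_EVERY_MS);
  timer.unref(); // do not keep the process alive just for the watchdog
  return timer;
}
```

It is still a band-aid, just one that fires on the symptom instead of on the calendar.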

lcw 3y ago

Agreed. One of the craziest bugs I had to deal with was in a distributed system using lots of infrastructure. The system started having trouble communicating with random nodes and sub-systems. I spent 3 hard days finding a Linux kernel bug where the ARP cache was not removing the least recently accessed network addresses. Normally, this wouldn't be a big deal, because few networks would fill up the default ARP cache size. That was even true for ours, except that we would slowly add and remove infrastructure over the course of a couple of months until eventually the ARP cache would fill and remove random network devices... It wasn't even our distributed application code... Some bugs take time to manifest themselves in very creative ways.

polio 3y ago

If resource leaks became a serious issue I imagine they could buy time by restarting. I'm curious what the causes were for code freezes. At Meta they would freeze around Thanksgiving and NYE because of unusually high traffic.

__bjoernd 3y ago

I once debugged a kernel memory leak in an internal module that manifested after around 6 years of (physical) server uptime. There are surprises lurking very far down the road.

r3trohack3r 3y ago

We joked about adding this to the NodeQuark platform:

    // Fix Slow Memory Leaks
    setTimeout(() => process.exit(1), 1000 * 60 * 60 * 24)

bandrami 3y ago

Back in the Pleistocene I worked in a ColdFusion shop (USG was all CF back then and we were contractors) and we had two guys whose job was to bounce stacks when performance fell under some defined level.

evanelias 3y ago

> you don't run out of disk space (from logs for example)

For a social media / user-generated content application, the macro storage concerns are a lot more important than the micro ones. By this I mean, care more about overall fleet-wide capacity for product DBs and media storage, instead of caring about a single server filling up its disk with logs.

With UGC applications, product data just grows and grows, forever, never shrinking. Even if the app becomes less popular over time, the data set will still keep growing -- just more slowly than before.

Even if your database infrastructure has fully automated sharding, with bare metal hosting you still need to keep doing capacity planning and acquiring new database hardware. If no one is doing this, it's game over, there's simply nowhere to store new tweets (or new photos, or whichever infra tier runs out of hardware first...)

Staffing problems in other eng areas can exacerbate this. For example, if automated bot detection becomes inadequate, bot posting volume goes way up and takes up an increasing amount of storage space.

threeseed 3y ago

> absolutely bring a site down over time, is expired certs

From Casey Newton's newsletter today:

In early December, a number of Twitter’s security certificates are set to expire — particularly those that power various back-end functions of the site. (“Certs,” as they are usually called, serve to reassure users that the website they are visiting is authentic. Without proper certs, a modern web browser will refuse to establish the connection or warn users not to visit the site). Failure to renew these certs could make Twitter inaccessible for most users for some period of time.

We’re told by some members of Twitter’s engineering team that the people responsible for renewing these certs have largely resigned — raising concerns that Twitter’s site could go down without the people on hand to bring it back. Others have told us that the renewal process is largely automated, and such a failure is highly unlikely. But the issue keeps coming up in conversations we have with current and former employees.

edanm 3y ago

> There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc.

In my experience as a data engineer, unusual states are one of the leading causes of issues, at least after something is built for the first time. You can spend half a year running into weird corner cases like "this thing we assumed had to always be a number apparently can arbitrarily get filled in with a string, now everything is broken."

Also, conditions changing causing code changes is the norm, not the exception, definitely in the beginning but also often later. Most services aren't written and done - they evolve as user needs evolve and the world evolves.

TranquilMarmot 3y ago

> As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine.

> ...

> There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.

Aren't these changes inevitable, though? There is no such thing as bug free code.

Another thing that forces consistent code changes is compliance: any time a 0-day is discovered or some library we're using comes out with a critical fix, we would have to go update things that sometimes hadn't been touched in years.

At my last job, I spent a significant amount of time just re-learning how to update and deploy services that somebody who left the company years ago wrote, usually with little-to-no documentation. And yes, things broke when we would deploy the service anew, but we were beholden to government agencies to make the changes or else lose our certifications to do business with them.

Eventually, Twitter will have to push code changes, if only to patch security vulnerabilities. Just waiting for another Heartbleed to come around...

grayfaced 3y ago

But real world conditions can force code changes. For example, a region abandons daylight saving time, or a court issues an order on copyright infringement. Someone unqualified working on a system they are unfamiliar with could blow it up. Losing that knowledge of how the system works is a risk.

threeseed 3y ago

> But real world conditions can force code changes

Security fixes.

ithkuil 3y ago

An example where something that correlates with time can reveal pre-existing bugs long after the system was chugging along just fine: counter limits/overflows.

Simple example: you have a DB with a table with an auto-incrementing primary key. You chose a small integer type for the key, and for years this worked just fine. Then you finally saturate that integer type and can no longer insert rows into the table. Now imagine this has cascading effects in other systems that depend on this database indirectly, and you end up with an "outage".
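
A rough sketch of the arithmetic a monitoring job could run to see this coming, assuming a signed 32-bit key and a steady insert rate. `daysUntilExhaustion` is a hypothetical helper; the current maximum id and daily insert count would come from your own database.

```javascript
// Sketch: estimate how long until an auto-increment key runs out of values.
const INT32_MAX = 2 ** 31 - 1; // 2147483647, largest signed 32-bit integer

// currentMaxId: highest id issued so far
// insertsPerDay: assumed steady insert rate (real traffic is rarely steady)
function daysUntilExhaustion(currentMaxId, insertsPerDay, maxValue = INT32_MAX) {
  if (insertsPerDay <= 0) return Infinity; // no growth, no deadline
  return Math.floor((maxValue - currentMaxId) / insertsPerDay);
}
```

At a million inserts a day from an empty table, that is a bit under six years of runway, which is exactly the kind of timescale where everyone has forgotten the column type.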

btbuildem 3y ago

> The biggest thing that brings down a site is changes

Absolutely agreed. In that vein, there is such a thing as too much automation. Sometimes, build chains are set up to always pull in the newest and the freshest -- and given the staggering number of dependencies software generally has, this might mean small changes all the time. Even when your code does not change, it can eventually break.

It's been my experience that a notable part of software development (in the cloud age, anyway) is about keeping up with all the small incremental changes. It takes bodies to keep up with this churn, bodies which Twitter now does not have.

It'll be interesting to keep observing this. So far it's been a testament to the teams that built it and set up the infra -- it keeps running, despite a monkey loose in a server room. It's very impressive.

sleight42 3y ago

"Outages": this is an enormous ellipsis.

* Power outages and general acts of God

* Resource utilization

How do your databases perform when their CPUs are near capacity? Or disks? Or I/O? I've seen Postgres do some "weird s%$#": where query times don't go exponential but they go hockey stick.

* Fan-out and fan-in

These can peg CPU, RAM, I/O. Peg any one of these and you're in trouble. Even run close to capacity for any one of these and you're liable to experience heisenbugs. Troublesome fan-out and fan-in can sometimes be a result of...

* Unintended consequences

The engineering decision made months or years ago may have been perfectly legitimate for the knowledge available at the time. However, we live in a volatile, uncertain, complex, and ambiguous (VUCA) world; conditions change. If your inputs deviate qualitatively or quantitatively significantly, you risk resource utilization issues or, possibly, good ol' fashioned application errors.

"No battle plan survives contact with the enemy." -- Stormin' Norman

Same with software systems. They're living entities that can only maintain homeostasis so long as their environment remains predictable within norms. Deviate far enough from that and boom.

dymk 3y ago

Any sort of cached object expiring might bring the servers down. Who knows when the Death TTL will come?

coldcode 3y ago

I worked as an engineer for a very large non tech company (but used a lot of tech, both bought and in-house). We had 100s of teams supporting services, internal apps (web and mobile), external apps (web and mobile), and connections to vendors plus a huge infrastructure in the real world that interconnected to all of this. One time someone changed something in a single data center (I vaguely remember some kind of DNS or routing update) and every single system worldwide failed in a short time. Even after the issue was resolved, it took most of a day and hundreds of people to successfully restart everything, all while our actual business had to continue without pissing off all of our customers. The triage was brutal as to what mattered most.

You can't do this without a lot of people. Sure you could pare it down, maybe improve some architecture, but without a ton of people involved who understand the systems and how they connect, when things might go south they may never return.

radu_floricica 3y ago

I have an old project I gave up on - haven't touched it, done any code changes or maintenance in... almost a decade? At least one stubborn client is still using it, successfully. And it's not an old guy in a living room, but an honest small-sized company that has this software as the core of its operations.

So yeah, I totally agree with you. No code changes = long life.

lr4444lr 3y ago

You didn't mention data scale. Just because the disks have room, doesn't mean the data access patterns in perfectly stable code will perform well at continual multiples if old data isn't somehow moved to colder storage.

vonmoltke 3y ago

> There are other things that can bring a site down, like [...] too much traffic[.] But generally speaking those things are rare and don't bring down an entire site.

I agree with your assessment, but I do want to highlight that this condition is not rare for Twitter. Load is very spiky, sometimes during predictable periods (e.g., the World Cup, New Year's Eve) and sometimes during unpredictable periods (e.g., Queen Elizabeth II's death, the January 6th US Capitol attack). It isn't going to cause a total site failure (anymore), but it can degrade user experience in subtle or not-so-subtle ways.

An aside on the "anymore": there was a time when the entire site did go down due to high-traffic events. A lot of the complication in the infrastructure was built to add resiliency and scalability to the backend services to allow Twitter to handle these events more gracefully. That resiliency is going to help keep the services up even if maintenance is understaffed and behind a learning curve.

mmcnl 3y ago

Sorry for hijacking your expertise, but why no mention of memory leaks? In my experience they can cause really weird bugs not obvious at first, and are difficult to reproduce, i.e. triggered by edge cases that happen infrequently. Or are you assuming services automatically restart when memory is depleted?

mschuster91 3y ago

> If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), they will expire, and it will be a very fun day at the office. Over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load it.

At least for expired certs, most people have learned the hard way just how bad that is, and either implemented automated renewal (thank heavens for cert-manager, LetsEncrypt, AWS ACM and friends) or, where that doesn't work (MS AD...), monitoring.

marstall 3y ago

I'll add one: when usage scales beyond anticipated levels. Then that code that is "good enough" will no longer be, and serious intervention may be required - by senior engineers with history.

enominezerum 3y ago

Takes me back to the first broken mess of an environment I worked in. Change freezes were a way of life and, magically, nothing would break while they lasted.

Now, those change freezes even extended to preventative maintenance: one of the dual PSUs in a core switch went bad and we couldn't get an exception to replace it... for 6 months. We got an exception when the second one went down and we had to move a few connections to its still-alive mate.

itsoktocry 3y ago

>The biggest thing that brings down a site is changes.

Well, Elon is talking about a massive amount of changes coming down the pipe, so I guess we'll see how that goes!

illiac786 3y ago

I think that without code pushes they won’t be able to maintain compatibility with updated APIs from third parties, new hardware, new encryption requirements from clients or browsers, etc. It’s a slow descent into chaos indeed.

city41 3y ago

A browser update is a form of "new code". It's rare, but having to work around newly introduced browser bugs does happen.

ranguna 3y ago

And vulnerabilities.

resonious 3y ago

> This left a lot wondering what exactly was going on with all those engineers and made it seem like it was all just bloat.

I was partly expecting the rest of the article to explain to me why exactly it wasn't just bloat. But it goes on talking about this 1~3-person cache SRE team that built solid infra automation that's really resilient to both hardware and software failures. If anything, the article might actually persuade me that it was all bloat.

donkeyd 3y ago

> the article might actually persuade me that it was all bloat

First of all, how does it persuade you of that? The article touches a really small (though incredibly important for up-time) subject.

Secondly, in any large company, the majority is 'bloat'. It's security engineers, code reviews, data architecture, HR, internal audit teams, content moderators, scrum masters and I can keep going. In a start-up many of these roles can be ignored, because growth > stability. In a large organization, part of the bloat helps ensure a certain amount of stability that's necessary to keep an organization alive.

If a product is mature enough, like Twitter seems to be, removing engineers won't instantly crash the product. It'll happen slowly. Bugs will creep in, because less time is spent on review and overall architecture. Security issues will creep in because of about the same issues and less oversight. Then, once this causes enough issues for the product to actually crash, the right people to fix it quickly might not be there anymore. That's when fixing the issues suddenly takes a lot more time.

If the current state of affairs at Twitter keeps up, it'll probably be a slow descent into chaos. Especially with Elon pushing for new features to be implemented quickly, inevitably by people who cannot fully understand the implications of said features, because 80% of knowledge is missing.

28304283409234 3y ago

Argh. "It works now, so it will work until forever."

It takes _effort_ to make it work this smoothly now, _and in the future_.

SRE is about _preventing_ issues. Not mopping up after them.

To me, the article read like every successful sysadmin story: there are no fires, so sysadmin must be bloat.

oxfordmale 3y ago

Let's do a thought experiment and see what functions aren't needed to keep the lights on for 30 days:

1) HR 2) Legal 3) Sales 4) Marketing 5) Payroll 6) Admin staff 7) Most of Engineering, other than the bare minimum of L1/L2/L3 support.

As someone paraphrased, a car without brakes and a steering wheel works just fine until you hit the first bend.

weinzierl 3y ago

No matter if it was or not and for better or worse:

If Twitter survives this without any major harm it will have profound consequences for the whole software industry.

jameshart 3y ago

That small team seems to have been running the caches for other teams, by using infrastructure provided by another team, in two massive datacenters operated by other teams, using monitoring tools managed by another team, and a ticketing system run by another, on hardware purchased by another team…

All just to put caching in front of services that actually do anything.

twblalock 3y ago

It’s easy to think it’s bloat at a steady state. When something important goes down and nobody knows how to fix it, it looks different.

bjarneh 3y ago

> I was partly expecting the rest of the article to explain to me why exactly it wasn't just bloat

Same here. I guess his header was on point in why Twitter is still up; but I was also interested in hearing about why Twitter actually needs all those people. If it can be run with 50-80% of the staff gone, that does sound like some bloat at least.

Lutger 3y ago

I'm puzzled by this statement. Do you think of resiliency as waste? That twitter would have been fine without it?

The article makes the point that the reason Twitter is running OK on 20% of personnel at this moment is exactly because it was built to be resilient, not because the personnel was bloated. A large part of this so-called bloat, the 80%, is responsible for Twitter running right now. Calling this bloat implies it is actually not important for Twitter to be available all the time (or at all).

toss1 3y ago

>>If anything, the article might actually persuade me that it was all bloat.

Not for me

This is almost exactly like the new manager coming in, noticing that the floors and surfaces are all clean, all the systems work, the trash is emptied, etc., and so deciding that the entire maintenance staff is unnecessary and firing them.

The place doesn't become a decrepit pigsty the next morning; it slowly degrades.

Same for these systems. They were designed, built, tuned, and maintained over the course of years to go from requiring constant manual intervention to running largely unattended and with a good buffer of ready hardware and automatic failover for failures. That "largely" in "largely unattended" is doing some very heavy lifting.

The system WILL require human intervention to keep running, and more than just a skeleton crew. The only question is whether it will happen before the new crew gets up to speed to handle the inevitable degradation.

This does NOT mean that the SREs were bloat - it means that they were doing an excellent job and could safely take a break. We're now in just the two-week vacation zone - same as if the entire SRE team went on a holiday. We'd expect it to work. Now let's see what happens in two months.

nashashmi 3y ago

In addition to the many great comments here, remember that super star engineers don't exactly fix problems from day to day. They fix the problems before they become problems.

The engineer was doing stability planning for 6 months out for the purpose of cost optimization. I guess we can assume that the costs of infrastructure are about to go up and reliability is about to go down in the coming months.

memish 3y ago

There's an incredible amount of bloat in big tech.

It's become an adult daycare, https://twitter.com/DavidSacks/status/1561096423243800576

Twitter's layoffs followed by that 1AM photo of hackers at work is terrifying to lifestyle employees.

It's the Return of the Nerds.

Google, Meta, Netflix, Microsoft are all watching.

ryanfreeborn 3y ago

I have been under the good faith assumption that most (though definitely not all) of the employees that have departed Twitter were probably necessary and valuable to the company. I left the article with the same impression as you. This single person did this very important job, seemingly well, and didn't appear to be drowning in the work. What were the other 8-9k doing?

pron 3y ago

> But it goes on talking about this 1~3-person cache SRE team that built solid infra automation that's really resilient to both hardware and software failures.

... for the Cache component. There are many others.

carabiner 3y ago

Yeah it's dancing around the question: Was Musk right? All signs so far are pointing to yes. MBAs will be studying this for years.

Aeolun 3y ago

And so is the cache setup. It’s permanently (and deliberately) running at less than 50% utilization to prevent an issue that comes up only once every 5 years (according to the author).

ouid 3y ago

Of course it's all bloat. Software runs on computers, not engineers. The default assumption for software is that it will go on working. The state might devolve, but the software is exactly as reliable right now as it was before. I'm no friend of Elon, and I think it's hilarious to think that he can be king of twitter, but all these people talking about "code entropy" are certifiably insane.

Big tech maintains talent so that they won't use their knowledge of the system to produce an identical competitor without the technical debt or investor liability of the original.

H8crilA 3y ago

This is a PR article that tries to push the idea that Twitter is OK (to work at, and maybe also to buy ad space from). Damage control.

lazyant 3y ago

I guess different definitions for "bloat" but how is it bloat to have a tiny team taking care of a fundamental piece of infrastructure? If the team is now gone, an issue there would mean hours of downtime. If that's acceptable then yes, it was "bloat".

halffaday 3y ago

I’m suspicious that most of the value in these systems comes from a small fraction of the effort and many technology jobs boil down to knowing you’re a huge cost center and putting on a performance to hide that.

w0m 3y ago

If you only care about 'it mostly didn't crash' as the end-goal of a company, that would be a reasonable take.

adql 3y ago

You just need to look at profit and revenue. Revenue grew nicely in the last 2-3 years, profit not so much. Bloat is the only reason.

HeavyStorm 3y ago

Indeed!

v0idzer0 3y ago

Of course it was bloat. This whole “twitter is going to crash and burn” thing is a weird fantasy. Most likely it will just be run more efficiently by far fewer people.

adam_arthur 3y ago

Well, WhatsApp had ~50 employees and Instagram around ~15 when FB acquired them, and they were around the same order of magnitude of complexity as Twitter.

The only concern I’d have is that by having so many people, your design probably comes to rely on them, whereas a smaller team would be forced to make the system easier to maintain.

Personally, if I were Elon, I’d build an entirely new backend and point the clients to that rather than trying to incrementally improve what they have.

Get 50-100 10x engineers that are loyal to Elon, with big equity stakes, and crush it

usgroup 3y ago

I expect it all to work better with Musk in charge. He knows how to make scalable software and he knows about performant teams.

That he’s not going to realise these totally obvious first order consequences people are raising seems unlikely.

poulsbohemian 3y ago

I did SRE consulting work for a phase of my career... as the author points out, these systems are scaled out and resilient, but what happens next is entropy. Team sizes shrink, everything starts to be viewed through a cost cutting / savings lens, overtaxed staff start ignoring problems or the long-term view because they are in firefighting mode, it becomes hard to attract new talent because the perception is "the good times are over." Things start to become brittle and/or not get the attention needed, contractors are brought in because they are cheaper and/or bits get outsourced to the cheapest bidder... the professional care and attention like the author clearly brought just starts to shift over time. Consultants like me are brought in to diagnose what's wrong - the good staff could write our briefs, they know what's going on - and generally we slap a band-aid on the problem because management just wants to squeeze whatever value they can out of the assets rather than actually improve anything.

rjzzleep 3y ago

The reality is most huge companies are majority bloat. The hiring numbers are also in part crap that goes into Series X raise pitch decks. Oftentimes a lot of the new bloat pisses off competent people, because their workload doesn't actually shrink, it grows. Not only do they now have to nanny people who are often not actually competent at their job and just happened to get through a coding interview with wholly unqualified interviewers, but they also have to handle nightmare features built by people completely disconnected from the other side.

I'm not a friend of Elon's, but outside of the flashiness of the whole thing, I don't think his firing spree was wholly unwarranted.

The other day I saw a video of a bunch of people leaving Twitter who had been there for a decade or so. I mean holy crap, this reminds me of old German industry where people retire in the place they started.

bigiain 3y ago

New owner: "Nothing ever goes wrong with the cache! It just works, look at the status logs. Why are we even paying those guys to look after it!"

Also new owner: "What even is Mesos? Why are we running something called Aurora? Obviously pure bloat. Fire the lot of them."

radu_floricica 3y ago

Yeah, but that's not what's happening here, is it? Musk is pretty obviously looking for a corporate culture shift, not turning the company into a cash cow - and for this to work, entropy needs only to work slower than half a year. Which this article argues pretty convincingly is going to happen.

Dave3of5 3y ago

I don't think it will fail on a technical level. As this article says lots of engineering has gone in to make the thing pretty resilient. I would also say that there are still enough engineers who work there who can figure out what's gone wrong and "turn it on and off again" or w/e makes it splutter back to life.

In terms of changes to the platform, ditto. It's not so difficult to make these changes that a team of 100s of devs who are not 100% aware of what's there already can't figure it out. I've taken over systems that I knew very little of and were pretty big (not as big as Twitter) and I managed for years to make changes without drastically breaking stuff. In any event if they do break stuff they will be able to fix what they have broken.

No, the real failure here is the massive debt burden and the fact that there is no way that Twitter can ever service that amount of debt. Note that before EM took over it was ticking along with a relatively small loss. If they had cut headcount by maybe 10% they would have been break even easily. There is no way that's possible with $4 million of interest per day. They have to radically change the way they monetise the platform to get to that level. I don't think they will ever get there and Musk will sell off at a bargain basement price at some point in the future to pay back the debt.

Moissanite 3y ago

The fact that he was able to "buy" Twitter and yet transfer a significant amount of debt to the company rather than being liable himself is just another sign of how different "rich people accounting" is. He will walk away from this with a bit less theoretical money but no material impact to his life, while thousands of people are having their lives up-ended. How long are we going to keep letting shit like this happen?

larsnystrom3y ago

Are you saying Twitter is paying ~$1.5B in interest every year?
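
For what it's worth, the ~$1.5B is just the $4 million/day figure from the parent comment annualized. A quick sketch, taking that daily number at face value (it is the parent commenter's claim, not something verified here):

```python
# Annualize the daily interest figure quoted in the parent comment.
# DAILY_INTEREST_USD is that commenter's claim, not a verified number.
DAILY_INTEREST_USD = 4_000_000
DAYS_PER_YEAR = 365

annual_interest_usd = DAILY_INTEREST_USD * DAYS_PER_YEAR
print(f"${annual_interest_usd / 1e9:.2f}B per year")  # $1.46B per year
```

So "~$1.5B" is the same claim rounded up, not a separate number.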

kaputmi3y ago

Personnel was their biggest cost, which has now been cut by 60+%. That will help a bit in servicing the debt

andreyk3y ago

Yep, agree. Twitter's revenue was primarily from ads, and now I'd bet their ads revenue has dropped a huge amount. Given how Musk is behaving now (reinstating Trump etc.), and losing many pivotal sales people, and the firing of a ton of people in charge of dealing with hate speech and such, it seems unlikely advertisers will return now.

_3u103y ago

If you consider that engineers cost about $1000 a day and he got rid of 4,000 of them… oddly it works out to about $4 million per day.

I’m not an expert in math but it seems pretty possible.
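
Spelled out as a rough sketch, where both inputs are ballpark guesses (the per-engineer cost and the number cut are assumptions from this comment, and the interest figure is the one quoted upthread):

```python
# Back-of-the-envelope comparison; every input here is an assumption.
COST_PER_ENGINEER_PER_DAY_USD = 1_000  # rough fully loaded cost, a guess
ENGINEERS_CUT = 4_000                  # rough headcount reduction, a guess
DAILY_INTEREST_USD = 4_000_000         # figure quoted upthread, unverified

daily_savings_usd = COST_PER_ENGINEER_PER_DAY_USD * ENGINEERS_CUT
print(daily_savings_usd)                        # 4000000
print(daily_savings_usd >= DAILY_INTEREST_USD)  # True
```

Note that $1000 a day works out to $365k a year per engineer, so the result is only as good as that guess.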

Plus they’ll have a lot more impressions to sell now that people are allowed to speak. Ad rates might drop but then someone will write an article about how they are getting great CPC on Twitter and everything will be back to normal after the blue checks have their sob fest.

swellguy3y ago

I think the real question is: Twitter grew 3x on the headcount front with a flat stock price over the course of less than 5 years. What exactly were these thousands of employees actually doing and why did the previous CEO think what they were doing was worth hiring them for? That's just basic accountability from a stock holder or employee perspective. That's apparently a ton of money being wasted on nothing at all.

summerlight3y ago

Twitter used to experience significant downtime compared to all other major platforms, and one of the reasons was its lack of redundancy across everything. Headcount is one such thing, and it takes manpower to automate infrastructure as discussed in the post.

Sure, you can run the platform with 1/10 headcount with significantly degraded user experiences (say ~98%). This is not a problem for startups but people usually have higher expectations for established companies. As always, the last 2% is a hard problem and businesses don't really want to deal with such an unreliable platform. You wanna onboard big advertisers which potentially spend $100M ARR? Then you need to assign a dedicated account manager to handle all customer escalations. PMs then triage and plan their feature requests and later engineers implement them. Which all adds up.

And they also use your competitors' products, like Google, FB, TikTok, etc. Twitter is a severe underdog here, so you need to support at least a minimal, essential subset of the features in those products to convince them to spend their money on Twitter. That alone takes hundreds of engineers, data scientists and PMs, thanks to modern ad serving stacks with massive complexity.

Yeah, it ultimately boils down to the simple fact that it's really hard to take other folks' money. You need to first earn their trust. They want to see that your product is capable of following a modern standard of digital ad serving, now and for the foreseeable future. Twitter has spent lots of time earning that trust, and the original post is one piece of evidence of such efforts. And this usually needs more manpower. You might be able to do that in a more efficient manner, but I don't think that's as simple as firing 75% of your entire headcount.

ezoe3y ago

Redundancy.

The systems are stable now, but human workers eventually get sick, leave, or die.

Raising pay has diminishing returns. You can't prevent workers from losing interest and leaving, getting sick, or dying by throwing more money at them.

The article wrote about achieving stability through a distributed system, so that the unexpected death of one rack doesn't affect service availability. The same can be done for human workers who unexpectedly stop working. Having multiple workers doing the same things improves stability.

Sure, it's inefficient in terms of money. But the alternative is that one important employee catches COVID-19 and dies, and the knowledge of the system is lost. Documents don't solve it, because you want the manual operation available right now rather than a few months later, once the replacement workers have learned from the documents.

randmeerkat3y ago

> Raising pay has diminishing returns. You can't prevent workers from losing interest and leaving, getting sick, or dying by throwing more money at them.

People would absolutely be more engaged and more excited about their work if they were paid more. The only reason people work is literally for money…

mirzap3y ago

Yeah, and they were even profitable with ~3k employees. Then the hiring spree started and they went negative. Even if there wasn't Musk they would have to let go at least 30% of the people.

mhio3y ago

Have a browse through their engineering blog: https://blog.twitter.com/engineering/en_us

It's largely focussed on the event stream behind the core service and data analytics. There's maybe one entry on the main data store and one on search over the last few years.

busymom03y ago

Not just that. From my personal experience, twitter has to be one of the slowest websites I use. Even on my 2020 Mac, it often shows the memory warning in Safari. Things take a while to load. And the UX is terrible with having to constantly click to read child comments, having to click on “show hidden replies” etc. I honestly have no idea how a company with thousands of employees and a billion in loss was able to operate such a terribly performing website.

pessimizer3y ago

It had to take a bunch of them to wreck the UI.

xedrac3y ago

I think a likely answer is "moderating". That would explain why Elon was ok letting so many go so quickly.

Robotbeat3y ago

From what I understand, contractors were used for moderating.

francisofascii3y ago

> What exactly were these thousands of employees actually doing

Maybe trying to increase traffic, attract advertisers, add efficiencies. Not everyone is an SRE. As long as your efforts increase revenue by more than you cost as an employee, you are adding value.

whatshisface3y ago

If the 3x headcount increase really did add no value, there are still about 1/3rd profitable employees there now. In fact giant layoffs tend to cut the best people first because they are the ones who feel comfortable walking. The people that are the last to go are the ones who are very entrenched in the organization and who don't estimate their chances outside of it highly, and that's the exact description of who Elon thinks he's laying off.

JamesBarney3y ago

I've found the opposite. I almost never see low performing employees fired outside of a mass layoff. In every layoff I've seen, 10x as many people were fired as quit. So you lose a bunch of low performers involuntarily, and a few top performers both voluntarily and involuntarily, and that leads to the average quality improving.

xyzzyz3y ago

> In fact giant layoffs tend to cut the best people first because they are the ones who feel comfortable walking.

This is more true when the layoffs happen because the company’s situation deteriorates. If the company cuts jobs because revenues fall and products fail, better employees are indeed more likely to move to greener pastures before mediocre ones do. If, however, the company prospects improve, rather than worsen, this is no longer the case.

kybernetyk3y ago

>What exactly were these thousands of employees actually doing

They had wine on tap.

jansan3y ago

Didn't you see the "Day in my life at the Twitter office" video?

https://www.tiktok.com/@realpankhilpatel/video/7159187292631...

Normal people don't have vacations like that.

bearmode3y ago

There is nothing here that other big tech companies don't have. To attract the best, they spend a shitload on perks and benefits.

halfmatthalfcat3y ago

This is the real question? Your question has nothing to do with the blog post and if you take a look around, what Twitter did was literally done across the entire industry, hence all the layoffs recently. There was a hiring glut to take advantage of cheap capital during COVID recovery. The capital has dried up, glut has ended and a lot of people lost their jobs. Why is that so hard to see? None of this is unique in any way to Twitter.

subroutine3y ago

A bunch of people just got axed from Twitter because covid cash dried up? Shit I thought it was because Elon took over and fired anyone who refused to work at the office instead of at home.

MuffinFlavored3y ago

> Twitter grew 3x on the headcount front

There were multiple executives making $10m/yr+

There were board members

There were shareholders

Why did all of them not stop this headcount increase if it's as easily reduced as "too much headcount bad, smaller headcount good"? These are paid professionals who are supposedly wealthy, good at their jobs, smart, informed, etc.

How can us commenters on HackerNews sit from our armchair and say "ah, goofballs should've just not let headcount get so high!"

These qualified people thought at the time it was a good idea to get up to 7.5k people. How were they all wrong?

frognumber3y ago

It's not goofballs. It's generally misaligned incentives. Managing a 10,000-person org leads to better job prospects than managing a 1,000-person org, which beats a 100-person org, which beats a pizza-box team.

Organizations tend to bloat.

Random, rapid cuts might not be the fix here, but headcount was too high.

jdminhbg3y ago

> How can us commenters on HackerNews sit from our armchair and say "ah, goofballs should've just not let headcount get so high!"

The cliche HN comment on sites like Twitter (and many, many others, any time headcount comes up) has always been "why do they need so many people?" I've mostly dismissed it the same way I dismiss "I could build Uber in a weekend," but with every other tech giant laying people off, maybe I shouldn't. Maybe the effect of all that extra money sloshing around in the system was to incentivize hiring everyone to make sure you didn't accidentally get a false negative, and not all of those hires were good ones.

Mikeb853y ago

> These qualified people thought at the time it was a good idea to get up to 7.5k people. How were they all wrong?

Come on, just look at the tech industry. When rates were low and stock prices kept going up, "headcount" was used as an indicator of future growth. Grow headcount, investors are happy. After all, the promise of tech stocks was "growth". Usually you're not looking to cut costs until you think growth is over. Of course, Twitter was a dog and did nothing useful for years, no innovation, no new products, nothing. But tech investors definitely saw rising headcount as a good thing...

googlryas3y ago

Companies are mismanaged into death all the time by groups of highly intelligent, previously successful people.

StanislavPetrov3y ago

>Why did all of them not stop this headcount increase if it's as easily reduced as "too much headcount bad, smaller headcount good"?

For the same reason that colleges and universities have seen their administrative bloat skyrocket at 10x the rate of student enrollment. Administrative bloat inevitably creeps into all large organizations. Many of the people in the trenches making hiring decisions weren't considering the overall financial performance of Twitter as a company. They were making hiring decisions based on what was happening in their own department, or how that decision would help advance their own agenda, or increase their budget, or increase manpower on a favored project. When you further consider that many at Twitter openly conceded (and in many cases, bragged about) that they viewed their role at Twitter as moral arbiters of society, crucial to policing the discourse of the public, it is not hard to see how enlisting as many true believers as possible to the cause would be seen as desirable, regardless of the larger financial implications.

xcambar3y ago

> There were multiple executives making $10m/yr+

> These are paid professionals who are supposedly wealthy, good at their jobs, smart, informed, etc.

Wealth is not a valid indicator of ability.

I'm not judging the execs and board members individually but rather questioning your assumption. I have read you mention "supposedly", yet it can be read as a rhetorical term.

strangeattractr3y ago

I've worked in multiple financial services companies where management is incentivised to be as ruthless as possible and they are always overstaffed in areas and understaffed in others. I've been in teams of 10 people that could be staffed by 2.

Hiring often isn't done because of current requirements. Senior execs come and go and with them so do strategic objectives. You accumulate people and they're often not laid off when the thing they work on becomes redundant. Large scale layoffs are awful for morale and usually only come after a 'crisis' occurs.

s1artibartfast3y ago

Has no company ever been mismanaged? Have they ever grossly misallocated funds? The answer is of course yes, that happens all the time. Corporate leaderships are not infallible.

ab83y ago

Tesla has 99k employees

sooyoo3y ago

Ever tried assembling a car in your home office?

rajamaka3y ago

Tesla's product can't be replicated with a 2000 line CRUD app

bmn__3y ago

> What exactly were these thousands of employees actually doing

https://nitter.lacontrevoie.fr/libsoftiktok/status/158539526...

ulkesh3y ago

All this does is point out that smart people worked at Twitter who may now no longer work there, whether of their own accord, or due to Elon’s bulldogging tactics.

Elon thinks he knows what he’s doing, but what he is going to be left with are people who are willing to work hard by his standards, but not necessarily smart.

The simple truth is Elon knows nothing about the actual work involved in tech. He knows words or elicits help from others on what to say that sounds like tech speak (RPCs!), but when it comes to being truly knowledgeable in this space, he is losing his most valuable assets because of his amazingly poor managerial and ownership style.

I know there are a lot of Elon fans on this site who will disagree with all of this; but his abilities have not at all been proven. Yes, he knows how to spend money to claim credit for technical advances, but until he actually has his hands dirty in the muck of the hard work of tech, he will always be a glorified self-promoter with no substance.

And Twitter will suffer for it.

memish3y ago

John Carmack, "Elon is definitely an engineer. He is deeply involved with technical decisions at spacex and Tesla. He doesn’t write code or do CAD today, but he is perfectly capable of doing so."

Kevin Watson, who developed the avionics for Falcon 9 and Dragon and previously managed the Advanced Computer Systems and Technologies Group within the Autonomous Systems Division at NASA's Jet Propulsion laboratory: "Elon is brilliant. He’s involved in just about everything. He understands everything. If he asks you a question, you learn very quickly not to go give him a gut reaction.

He wants answers that get down to the fundamental laws of physics. One thing he understands really well is the physics of the rockets. He understands that like nobody else. The stuff I have seen him do in his head is crazy.

He can get in discussions about flying a satellite and whether we can make the right orbit and deliver Dragon at the same time and solve all these equations in real time. It’s amazing to watch the amount of knowledge he has accumulated over the years."

karpathy3y ago

Elon also understands deep neural nets a lot more than I think people imagine. He starts with good intuitions and mental models, but also actively asks for technical deep dives, and has very good retention. E.g. I recall teaching him about our use of focal loss in contrast to binary cross-entropy for the object detection neural net (I said it had given us a 5% bump and he asked to know more) and he understood how it works about as quickly as you'd expect a PhD student to. The fact that he can do this across many technical disciplines is impressive and borderline superhuman. I don't think people understand or would believe how low-level and technical typical meetings with him are. Just saying because I get triggered reading way off inaccurate takes on this topic (original comment).

thefz3y ago

> "Anyone who actually writes software, please report to the 10th floor at 2 pm today. Before doing so, please email a bullet point summary of what your code commands have achieved in the past ~6 months, along with up to 10 screenshots of the most salient lines of code"

Actual quote. Anyone using the term "code commands" comes across as a little detached from programming reality, never mind the rest of this request, which is straight out of a Dilbert strip.

minhazm3y ago

Many people who have worked with Musk have shared similar sentiments in interviews. But it seems that people just refuse to believe any of it. People think that there's no way it's possible for someone to be that deeply technical and be a CEO of multiple companies at the same time. I've talked to people about it and they straight up refuse to believe it saying that it's impossible and that any evidence of him being technical in interviews is all set up and that he was trained on the materials and questions ahead of time.

hackfrednews3y ago

Channing Robertson, the face of Stanford's chemical engineering department and the associate dean of Stanford’s School of Engineering, who taught and mentored Elizabeth Holmes, had the following to say about her:

“She had somehow been able to take and synthesize these pieces of science and engineering and technology in ways that I had never thought of.”

“I never encountered a student like this before of the then thousands of students that I had talked”

“You start to realize you are looking in the eyes of another Bill Gates, or Steve Jobs.”

He also maintained that Holmes was a once-in-a-generation genius, comparing her to Newton, Einstein, Mozart, and Leonardo da Vinci.

Excerpt from: "Bad Blood: Secrets and Lies in a Silicon Valley Startup" by John Carreyrou.

alsodumb3y ago

Adding another datapoint from one of my previous comments:

In response to someone saying on Twitter that Elon doesn't understand the technical stuff of rocketry, Tom Mueller, former CTO of Propulsion at SpaceX and the designer of many of their engines, responded:

"I worked for Elon directly for 18 1/2 years, and I can assure you, you are wrong"

https://twitter.com/lrocket/status/1512919230689148929?s=20&...

lelanthran3y ago

You know, I think Musk is an ass, and would never work for him, but don't you think that someone who has managed to launch and then run many successful and complex technology projects might actually know a thing or two about launching and running simpler technology projects?

And if you're going to claim that his successes have been due to the people surrounding him who actually know what they are doing, then all that tells me is that you are acknowledging that he knows how to surround himself with people who know what they are doing.

We're not fans (I'm certainly not), but it takes a special kind of mind to look at Musk's track record of successes and conclude that his latest project is doomed.

WalterBright3y ago

America is such a great country that a random person can just fecklessly blunder into creating a revolutionary electric car company and cluelessly blunder into creating a rocket company that is the envy of the world.

edmundsauto3y ago

Automotive and aerospace are not that similar to social media. People buying into the vision of "get the planet off fossil fuels for transport" and "get this species to Mars" are probably willing to make sacrifices that people working on social media are not.

It's the Halo Effect fallacy to think competence in one field automatically translates to another. Especially when the founder in question has displayed increasingly erratic behavior in the meantime.

Is today's Elon capable of doing what Elon from 15 years ago did at Tesla? I don't think that is necessarily in evidence, much less in a very different industry.

trollerator233y ago

I _think_ you are joking. No?

nverno3y ago

Judging by some of the old patents he's filed [1], I'd guess he has at least a decent understanding of the tech involved. Probably less so, when it comes to the details of more modern distributed systems, but I also wouldn't be surprised if he's spent some effort towards all that as well - he's been working in/around pretty cutting edge tech for quite a while. Could he sit down and code it himself? probably not, but that's hardly required in his situation.

1. https://patents.justia.com/inventor/elon-musk

HyperSane3y ago

Musk was fired from PayPal because he wanted to replace all the Unix servers with Windows.

lupire3y ago

The CEO gets to put his name on the company's patents, yes.

acknorabotr3y ago

It's crazy to think a guy who builds reusable rockets thinks he can run a complex technology operation like Twitter.

nikau3y ago

s/builds/obtains funding to get other people to build

tiborsaas3y ago

Well, he's definitely not afraid of blowing up complex systems :)

emodendroket3y ago

You aren't wrong, but his playbook is familiar to anyone who's gone through acquisitions (especially leveraged ones) and many companies were in a strong enough position to start with that they do manage to limp through and get sold off despite all the abuse.

jillesvangurp3y ago

You underestimate Elon Musk. Many people have done that before and lost that bet. If anything, he repeatedly succeeded in building world class software and hardware teams for Tesla, SpaceX, and a few other companies. The notion that he won't be able to attract world class talent is ludicrous. Yes, he is a bit of a liability and his management style is obnoxious and unconventional. But he does get things right once in a while.

And he hates bloated inefficient teams. His decrees on meetings are infamous. Tripling the team at Twitter implies a lot of internal politics, fiefdoms, communication overhead, and generally a lot of headless chickens running around. There's no nice way to fix such a team. A sledge hammer is one way to fix it and obviously he likes getting results quickly.

So, the notion of laying off most of that team was a foregone conclusion. The notion that a lot of the better people would get upset about that and leave as well is also highly predictable. What's left is a team with some gaps but also a lot of breathing room. And he can always lure key people back in by throwing money at them.

Simple plan. It might actually work. At the cost of a bit of drama, temporary instability, and lots of free publicity. Exactly his style. Cringe worthy and effective. I can see the logic here.

headsoup3y ago

Nothing beats an internet random telling us how much a successful person is silly and also can't meet their own superior standards.

It's not about being a fan or not, it's that you're not actually providing any real insight other than signalling how smart you are.

theclansman3y ago

These people behave just like the irrational fanboys, except they just do the exact opposite. Being a sheep and being a contrarian sheep are the same thing.

triggercut3y ago

Years ago I had a massively downvoted comment when I criticised his AR for CAD vapour ware. As someone who was fully in that area at the time, what he was showing while looking fancy had no practical application in the area of design he was talking about.

Ever watch someone do CAD/CAM modelling? They need extreme precision of input that AR sausage fingers just aren't going to help with. You need a num-pad and a good mouse with a stepped click wheel.

I get it. We share some similar neurodivergence traits. He wants to be right in the detail. Constantly jumping from interest to interest, seeing the hidden patterns and connections that aren't apparent to others. But there are times when I know I just need to shut up and let someone more experienced talk, despite my brain wanting to lead every discussion right into solution mode, or providing additional context mode.

I've spent the last 6 years in management consulting (without formal business education), I agree with him when he says MBAs are useless. We know that the best solutions come from diverse teams with diverse backgrounds, skills and knowledge. Not 5 clones who know how to build value driver trees, not to say the tools they bring aren't useful, but they can be incredibly limiting.

For someone who hates MBAs he's sure going about this take-over like someone who barely passed one (i.e. knows more than enough to be dangerous). Sure, you're hemorrhaging money in operations. You need to cut costs and find new revenue streams.

What are your biggest costs?

Labor. Slash / Burn. The old McKinsey 7% FTE reduction will give you some extra operating cash from the year's remaining budget and you know it's not so much that people (in fear of their jobs) won't just pick up the slack to keep everything moving. Do it quick because you need to rip the band-aid off and get rid of all that accrued leave, restricted cash etc. off your books too.

Equipment. Redundancy? Sounds like unused resources we can fire sale.

Contracts. Renegotiate? The only two meaningful levers are price and quantity. Start cutting quantity now, renegotiate price later.

This is all dummies guide stuff and tends to go terribly in reality when implemented all at once all together.

For instance, research has shown companies that lay-off when under pressure end up underperforming against the ones who chose not to.

Now who's going to help build and operate those new revenue streams?

Quick fixes for a quick buck and a whole lot of extra risk.

philistine3y ago

And Twitter's problems are nowhere near technological. The site needed to make more money, not reengineer the whole thing while advertisers are fleeing because Trump is back on, on a whim!

taolegal3y ago

So should the platform be guided by advertisers? Especially one that’s apparently the de facto public square?

headsoup3y ago

Was Twitter 'good' before Musk purchased it? Without that event, would it still need substantial changes to be profitable and useful going forward?

NaturalPhallacy3y ago

>while advertisers are fleeing because Trump is back on

[citation needed]

CNN's ratings were never better than under Trump. He's fantastic for advertising. So is Musk. All controversial figures are. That, oddly, isn't controversial in advertising.

>on a whim!

He created a public poll, and when the vote for Trump to be allowed back won, he unbanned him, tweeting "vox populi, vox dei" ("the will of the people is the will of god"). Had he unbanned him despite the poll saying "no" you could argue it was a whim, but that isn't the reality we're in. He also refused to unban Alex Jones, citing exploitation of child deaths and a personal story. Not unbanning Alex Jones was more whimsical than unbanning Trump was, factually speaking. Why do people always misrepresent his actions? And why is it always upvoted and not flagged here?

spaceman_20203y ago

Man, it's not about tech. This is simply about leveraging the "Musk halo effect" on a highly visible, failing business.

You can screenshot this: Musk will cut down costs, make Twitter profitable, and take it public again when markets are better placed.

The markets will give the newly listed Twitter the same "Musk-boost" as Tesla and ramp up the valuation to $100B.

raldi3y ago

The most helpful thing to reflect on in these Twitter operational discussions is the difference between homeostasis and evolution.

You can get rid of 80% of the work force and the existing homeostasis systems will keep things running smoothly despite known day-to-day chaos.

Where you’re really going to run into trouble is inventing responses to novel chaos and gradually changing times.

emodendroket3y ago

I think this is kind of baked in though. Part of the thought process seems to be, at least for non-paying customers, it's not actually necessary to have five nines for Twitter, because people will just put up with it if it's less reliable.
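
To put rough numbers on that, here is a small sketch of how much downtime per year different availability targets allow. The helper name is hypothetical, and the 98% row is only the ballpark figure floated upthread, not a measured Twitter number:

```python
# How much downtime per year does a given availability target allow?
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year permitted at an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


# The targets below are illustrative, not anything Twitter has published.
for label, availability in [
    ("five nines", 0.99999),
    ("three nines", 0.999),
    ("98%", 0.98),
]:
    hours = downtime_hours_per_year(availability)
    print(f"{label}: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Five nines allows only about five minutes of downtime a year, while 98% allows about a week, so there is a lot of room between the two before users give up.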

kweingar3y ago

I don’t have personal experience in this, so obviously I can’t speak with any authority. But I have heard from colleagues that tons of little factors can dramatically affect user engagement. For example, even a couple dozen milliseconds of longer load times can push a noticeable number of users away from your app.

onion2k3y ago

Very few people are going to be converted to paying users if they start to see downtime or breakages. No one buys into a failing app.

8n4vidtmkvmk3y ago

true. if twitter, Facebook, reddit, and hackernews go down for a couple days it wouldn't affect me at all. if GitHub and npm went down I'd be mildly annoyed but could still work.

judge20203y ago

I'm sure we're going to see some sabotage accusations once this happens.

oska3y ago

We had an obituary for Fred Brooks on here just the other day. I'd suggest that his thesis in The Mythical Man-Month conflicts with your comment above (that reduction in staff count for a software project has a good correlation with the ability to maintain it / evolve it / innovate on top of it).

sangnoir3y ago

I've never heard of (or thought of) your interpretation of the corollary to Brooks's Law, but removing people from projects until they succeed and are on time seems like a bold strategy.

valachio3y ago

I think the opposite is true.

The bigger a ship is, the slower it is to turn.

IBM is a "tech" company that employs 282,000 employees, and when was the last time they invented something? I don't remember the last time I heard IBM in the news about something they made.

The bigger the company, the less innovation and the more administration & bureaucracy you often find.

Startups can survive because their small size makes them very flexible and adaptable to chaos and change, which gives them an edge over bigger companies.

btbuildem3y ago

Homeostasis is a good metaphor, but it implies a living, dynamic system. Something that resists entropy by itself being in a state of flow -- the matter constantly changing, while maintaining the form.

In modern software environments, the entropy is almost violent -- the changes in all the constituent dependencies are constant and relentless. Something frozen in time does not stand a chance, unless it's entirely stand-alone and dependency-free -- an unlikely scenario with a service of Twitter's size.

mtejoOP3y ago

Hey, sorry for the new account, i just like to try my best to keep my online identity separate. this for better or for worse has my real name on it. Hope this is interesting!

cloudking3y ago

Nice work! Does Twitter run its own DCs, or is it hosted somewhere else?

devoutsalsa3y ago

As of three years ago, Twitter had 2 to 3 data centers, and was moving some stuff to GCP. Not sure of the current state of things.

mtejoOP3y ago

Thanks! Yeah the other guy is right

leftcenterright3y ago

Kudos for nice work!

What did you make of Mudge's report regarding resiliency of data-centers?

> Insufficient data center redundancy, without a plan to cold-boot or recover from even minor overlapping data center failure, raising the risk of a brief outage to that of a catastrophic and existential risk for Twitter's survival.

- https://techpolicy.press/wp-content/uploads/2022/08/whistleb...

ilaksh3y ago

So you still work there? Or did you quit recently?

mtejoOP3y ago

I left over the summer

zekrioca3y ago

Wasn’t Twitter moving from Mesos to K8s?

mtejoOP3y ago

Yeah, there were a few teams trying to make it happen

robswc3y ago

It was as if I was in two realities the other day. I did experience a few bugs but Twitter never went "down" for me, despite everyone claiming that 1) it would and 2) it already did.

All 10 of the "top trends" for me were about how twitter was dead/dying and armchair experts on reddit said it was only a matter of hours at this point.

Maybe it was close (in reference to external factors not implied in your very insightful post!) but it's amazing how confident people are with opinions on events they have 0 insight into. Everyone knows how to solve the war in Ukraine or world hunger, but how rarely is the "consensus" (in terms of up-votes or popularity) right... just something that got me thinking. Thanks again for this article. I always love seeing details of tech ops!

ilaksh3y ago

That's what I have been trying to tell people. There are literally two alternate realities and the extreme polarization tries to suck you into either one or the other. Because what people don't realize is that worldviews are tied to group identification.

My theory is that the economic problems have been stressing the systems and that makes the "bones" more apparent. The core concept of this country is checks and balances of two opposing sides. I think it was a great improvement over authoritarianism, but you can see that it is definitely not the ideal final paradigm.

ransom15383y ago

As an SRE, one of two things happens when you are laid off:

1) The site goes down [you look like you can't do your job]

2) The site stays up [you can do your job] but the layoff looks like a good decision.

You are screwed either way.

jeffbee3y ago

A few parts clearly did go down. 2FA login was just serving error codes all day a few days ago. On Saturday people were posting entire feature films, 2 minutes per tweet, because apparently their copyright content matching wasn’t working.

thakoppno3y ago

> On Saturday people were posting entire feature films, 2 minutes per tweet, because apparently their copyright content matching wasn’t working.

That’s funny but at the same time it’s easy to find stuff like that on youtube and has been forever.

umanwizard3y ago

Those are hardly “Twitter going down”. They’re normal medium-severity outages that could have happened anywhere.

Sure, it’s plausible they were made more likely by so many people leaving, but they’re not exactly the meltdown people predicted.

ilaksh3y ago

I would use it as a case in point. You have one group claiming that Twitter is completely on fire and soon to be closed, and the other saying it's never been better and there was zero fallout from rapidly firing thousands of people including a lot of engineers.

Neither of those extreme viewpoints reflected reality accurately. There were significant problems, but they did not take the entire site down or anything. In fact, for millions who did not have 2FA or use it at that point, they did not see any issue. Whereas people that wanted to find an issue could go looking for the movies. So each side was able to find ways to reinforce their alternate realities.

8n4vidtmkvmk3y ago

that's surprising actually. I'd expect my sites to run for about a year if i suddenly died. maybe longer. certs will renew themselves, servers will restart themselves. everything is on autopay. the thing to kill it would probably be api breaks by 3rd parties. maybe disk space would kill me eventually too.. i don't have auto resizing for my sql database but it's got a long runway yet. or god forbid my site grows on its own and the qps kills it.

minhazm3y ago

It's hard to really say whether that was due to something Elon did or if it was just a normal outage from code deployments, hardware failures, etc. Anything that might seem like an outage will be magnified now because everyone will think it was Elon's doing and it will get tons of clicks. The reality is that these huge sites have a dozen small/medium outages almost every day, but almost no one notices.

astrange3y ago

The QRT system doesn't work (if a post says it has 200 quote tweets, you can always open it and never see any), but then again, it never did seem to work.

It's still entirely believable the site will go down and remain down for a week or two at any time.

dpedu3y ago

The extent of the 2FA issue I saw - which, by the way, was w.r.t. sms-based 2fa login, not other 2fa methods - was a single screenshot making this claim.

Was there more I missed?

jakevoytko3y ago

At least one high-follower account on my timeline posted that their newest notifications were from October

hairofadog3y ago

Regardless of what’s up with the tech stack, I find it difficult to want to be a part of whatever this is:

https://mobile.twitter.com/elonmusk/status/15945006557246095... (screenshot: https://raw.githubusercontent.com/aboxwithrocksinit/test-buc...)

If this is the new town square, you can forward my mail to a cabin in the woods.

Havoc3y ago

That is questionable even by Musk standards

Maybe the guy is really losing it

throwawaylinux3y ago

I think that's pretty hilarious. It has caught the ire of a lot of furious unhinged puritans too, which has actually been the funniest thing about it for me.

hairofadog3y ago

It doesn’t bother me because it’s lascivious, it bothers me because it’s the comedy equivalent of fart noises. It was one thing when he was just a user of Twitter, but now he’s in charge, and it seems like a tone is being set that’s not really the vibe I’m into. Whatever floats your boat, I guess.

HyperSane3y ago

It would be funny for Musk to share privately with friends. It is a deeply strange and inappropriate thing for the CEO of a billion dollar company to post publicly. Something is deeply wrong with Musk's decision making process.

memish3y ago

That post really did bring the church ladies out. Puritanism runs deep.

Operative01983y ago

You do know that you can block Elon's account.

reverius423y ago

If you have to block the account of the CEO of the service you are using, perhaps you should just stop using that service?

tux2bsd3y ago

but hairofadog needs a reason to be hysterical.

dpedu3y ago

What's objectionable about this?

RickTM3y ago

bold strategy on space karen’s part, marketing twitter as the next thing not to put your dick in

emodendroket3y ago

I think people's expectations became so exaggerated it was inevitable they wouldn't be lived up to. I'm sure Twitter will experience degradation from the drastic cost-cutting, but it was never going to happen overnight and I'm not sure why news outlets were saying that (except that their sources were employees with slightly inflated senses of their own importance, which we're all sometimes guilty of). And people became really invested in the idea that a site cannot possibly stay up without dedicated SREs, as though tons of tech sites (including big names like Amazon) don't just devolve this work to their on-call rotations.

hn_throwaway_993y ago

I did see a lot of politically-motivated salivating which I thought was woefully premature. That said, I see 2 outcomes now:

1. Generally, with large, complex systems like this, everything works, until it doesn't. All the big boys have major outages periodically. I just can't fathom how Twitter is going to handle the eventual certainty of a major outage when, as the author notes, in some cases there are teams that have 0 people left.

2. More than the technical issues, betting that Twitter will go bankrupt is the easiest bet one can make. Musk saddled Twitter with a shit ton of debt - even if things worked as they did before, he had to cut tons of people due to the debt burden.

The issue I see is that #2 directly works against #1. Musk has said it will be lots of intense work adding new features to try to raise revenue. But making a ton of changes, probably with lots of shortcuts to get them out the door quickly, especially when so much institutional knowledge has walked out the door, will make keeping the site stable even that much harder.

eru3y ago

Generally agreed. However keep in mind that bankruptcy is a purely financial event and doesn't necessarily have to have any impact on operations.

Airlines are notorious for going bankrupt regularly.

015a3y ago

I also want to add a #3: the crew that's left is probably on-call 24/7. My thoughts are with the poor souls on that rotation (if your team even still has a "rotation")

emodendroket3y ago

Yeah, I find that easy to envision myself, but it's not the kind of spectacular, instant failure that's been predicted at all.

mikeweiss3y ago

Wait Musk saddled Twitter with debt? Please elaborate...

fenomas3y ago

To me the analogy here is that the new boss rolled in and sold all the fire extinguishers. That by itself doesn't set the building on fire - it doesn't even increase the chances of a fire occurring on any given day. But when one does...

libraryofbabel3y ago

Every SRE knows that the leading cause of outages by far is someone making a change to the system. Twitter isn’t shipping many new features right now or even doing much maintenance. But eventually they will have to.

So the analogy becomes, the new boss sold all the fire extinguishers and also placed a short temporary ban on cooking in the building. But eventually people are going to start turning on stoves again… and then…

wardedVibe3y ago

Definitely stealing this as the right way to frame the issue. What would normally be a small kitchen mistake turns into no longer having an apartment complex.

jonas213y ago

Perhaps a closer analogy would be that the new boss rolled in and threw away 80% of the fire extinguishers.

Whether that will spell disaster when there's a fire depends on whether the building had too many fire extinguishers to begin with and whether the boss can buy new, better fire extinguishers to replace some of them before there's a fire.

cratermoon3y ago

He also sold all the smoke detectors.

inferiorhuman3y ago

He rolled in, sold most of the fire extinguishers, and made a big show of trying to make cherries jubilee while shoving his dick in one of the few remaining fire extinguishers. Let's be clear, it isn't just the erratic layoffs, it's Musk's incessant meddling that's going to be Twitter's downfall. He literally took down SMS-based 2FA because "microservices bad". He fired the payroll and tax departments (HR too?). He's scared off Twitter's main source of income while saddling it with significant debt.

As an SRE I would have been shocked if Twitter failed catastrophically (well, more so than broadly disabling authentication) in short order. However, failure is pretty much inevitable at this point given the damage that E-Lon is actively doing.

Whatever. Twitter and Musk deserve each other.

quickthrower23y ago

My analogy.

Someone purchased some land for $1. Built a house for say $100. And now spends $100,000 a year making it the perfect place to rent, receiving $100,000 a year in rent.

Someone comes along and borrows $1m to buy that house. They feel ripped off but eventually are forced to go ahead with the purchase. As a result they have to pay $100,000 a year in interest. They need this thing to be profitable!

To do this they need to cut back on the $100,000 a year spent. They decide to go in quickly and so email all the services saying "go hard or go home". So the plumbers, tradies, cleaners etc that don't like it leave.

As a side hustle they also charge visitors to the house $9 to be allowed to wear the bowtie they used to wear for free.

Some of the people do maintenance jobs and improvements. They keep the termites out, fix subsidence issues, and so on.

And the house didn't fall down within 3 weeks of it being purchased.

headsoup3y ago

Because politics and ideologues are so prevalent and everyone's attacking 'the other' (both sides) for their own ends.

And of course people in 'important' roles who've been laid off are going to say the company is doomed; these are probably the worst sources to go off. They're not going to say "oh yeah I didn't do much at all really, just bossed people around and spent my budget every year."

memish3y ago

It's actually scary how many people, even engineers, put their reputation on the line saying Twitter wouldn't survive the weekend. It wasn't just Twitter employees.

It's like a mass psychosis of some kind. It comes off as a kind of desperation, as though they need Elon to fail.

Why? What's driving that response?

emodendroket3y ago

I don't even doubt that they did a lot of work, but as this article suggests, that's part of the reason why we _wouldn't_ expect the site to just go down in a couple days. If all the servers needed constant manual massaging that would speak more poorly to their work than the other way around.

AustinDev3y ago

I love this comment. It's so salient. It's probably hard for a lot of people on this board to digest but, I feel like most know where you're coming from.

porknubbins3y ago

The default, naive assumption should always have been programs keep running indefinitely on their own. If that’s not the goal of software then I don’t know what is (might as well go back to switchboard operators). Real world experience tells us that, to the contrary, all software goes down and requires specialist intervention eventually. I think a lot of people just jumped to the second level based on political motivations rather than deep knowledge of system failures.

brookst3y ago

> all software goes down and requires specialist intervention eventually

Well, that’s it, isn’t it? How many software systems need to keep running for Twitter to remain more or less functional?

If there are 10 critical systems that are running at four 9’s, you’d expect 3.6 hours of downtime a year, or about 90 days of uptime at a stretch if I have my math right.

If there are 100 critical systems running at 3 9’s, you’d expect 2.5 hours of downtime per day.

So yeah, all software should keep running. But it doesn’t. And something like Twitter isn’t “a software”, it’s a very large assembly of lots of different software systems and the exponential math that dependencies create.
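
To sanity-check this kind of arithmetic, here is a minimal sketch. It assumes the systems fail independently and that the site counts as down whenever any one of them is down (a strict serial dependency). That is a simplification, not a description of Twitter's actual architecture, and the function names are made up for illustration.

```python
def combined_availability(per_system: float, n_systems: int) -> float:
    """Availability of n independent systems that must all be up at once."""
    return per_system ** n_systems


def expected_downtime_hours(availability: float, period_hours: float) -> float:
    """Expected downtime over a period, given the overall availability."""
    return (1.0 - availability) * period_hours


HOURS_PER_YEAR = 365 * 24

# 10 critical systems, each at four nines (99.99%)
ten_at_four_nines = combined_availability(0.9999, 10)
yearly = expected_downtime_hours(ten_at_four_nines, HOURS_PER_YEAR)

# 100 critical systems, each at three nines (99.9%)
hundred_at_three_nines = combined_availability(0.999, 100)
daily = expected_downtime_hours(hundred_at_three_nines, 24)

print(f"10 systems at 99.99%: {yearly:.2f} hours of downtime per year")
print(f"100 systems at 99.9%: {daily:.2f} hours of downtime per day")
```

Under those assumptions the first case works out to roughly 8.8 hours a year and the second to roughly 2.3 hours a day. Real systems have redundancy and correlated failures, so treat these as order-of-magnitude figures only.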

sroussey3y ago

I had MySQL running on some bare metal for many years without a restart.

I was terrified to update the kernel at that point, knowing that system disk had been running continuously for many years, and had no faith it would restart successfully.

Finally got two new servers to replace these (with these new SSD things!) and after migration, sure enough, one of the old servers failed to boot.

xyzwave3y ago

Reminds me of an Alan J. Perlis quote:

> Is it possible that software is not like anything else, that it is meant to be discarded: that the whole point is to see it as a soap bubble?

http://www.cs.yale.edu/homes/perlis-alan/quotes.html

sangnoir3y ago

> I think a lot of people just jumped to the second level based on political motivations rather than deep knowledge of system failures.

Anyone who has ever been oncall can intuit how often stuff breaks in big or little ways. Sometimes it's transient and goes away, sometimes it can be filed away to be fixed in the next year, but sometimes, it turns out to be an all-hands-on-deck crisis for a team, or 5.

Izkata3y ago

> The default, naive assumption should always have been programs keep running indefinitely on their own.

...for people who understand software to some extent. I get the feeling a lot of people see it more like a hamster wheel, where once the developers are gone it immediately starts noticeably slowing down as it stops (and are confused when that doesn't happen).

dcow3y ago

I have a piece of Rust software that has not gone down in its entire lifetime.

throwawaylinux3y ago

It wasn't only news outlets. A majority of my... politically-noisy tech friends on Facebook went through a recent phase of intensely posting about Twitter on fire and collapsing.

Now just because they're "in tech" doesn't mean they have any idea about Twitter, but they should at least know enough to know they don't know what's going to happen, but obviously they're not actually using their brains when posting comments like that. Point is a lot of people opposed to Musk have been participating in a spiraling echo chamber of fairy tales and wishful thinking, it's not just journalists (although clearly they're printing lies with ulterior motives too, as usual).

emodendroket3y ago

I'll put my cards on the table and say I would have been just as happy to see him fail as anyone else, but it did begin to take on the tenor of the constant, never-came-true stories about how any minute now they were going to spring the trap on Donald Trump and he'd get thrown out of office.

e: I did not mean for this to be an invitation for everyone to argue about the merits or demerits of Donald Trump.

viraptor3y ago

> as though tons of tech sites (including big names like Amazon) don't just devolve this work to their on-call rotations

Not everyone has the same knowledge and skills. Not everything is documented. Not everything in the documentation is current and correct. (especially after recent changes) It's not even that they won't stay up, but depending on who left the company, the oncall response may look radically different and have different time to resolution.

emodendroket3y ago

Yeah, but I think a judgment is being made that having the service be less reliable is tolerable.

lowbloodsugar3y ago

Riiight. But what happens when there's nobody on that oncall rotation? I've worked at FAANGs, and I'm just thinking what would happen if there was a problem with an upstream team and nobody there. Maybe it's ours now. Maybe there's a principal engineer who worked on it back in the day that's still on the payroll? Maybe we're just going to try to bring it back up with no idea what is going on? What if you have a bug that silently loses data?

DSMan1952763y ago

> but it was never going to happen overnight and I'm not sure why news outlets were saying that

I don't think it's that obvious. It's one thing to just leave everything running, but Elon was also talking about changes he was making (e.g. turning off a bunch of "microservices" because they don't do anything). If you turn off the wrong thing and don't have anybody left who knows how to properly turn it back on again, then you're in a pretty bad situation. It doesn't seem like that happened, but I also don't think we have enough information to say how close it was to happening.

edgyquant3y ago

This whole thing has been quite telling for me. I don’t use Twitter and am not a Musk fan, but seeing someone try something to move away from an ad-driven model seemed like a good possibility to me.

Then I saw how bitter and nasty a lot of the online communities I belonged to were. I mention that advertisers were of course mad, the stated goal is to reduce reliance on ad revenue. I was met with people attacking me as an idiot, with clickbait articles as “proof” that ad models were irrelevant to Twitter’s problems etc., as if I was doing anything other than quoting Musk’s official reasons for the purchase. Suddenly Twitter was this great beacon of graceful discussion and Musk had ruined it.

People just love to be mad. No one had anything nice to say about Twitter, and at the flick of a switch they’re holding completely opposite opinions. We’ve always been at war with Eurasia, it seems.

vba6163y ago

>I mention that advertisers were of course mad, the stated goal is to reduce reliance on ad revenue

In the computer industry, there are some famous historical examples of companies that announced their new product before it was ready, people stopped buying the old one, and they went out of business before the new version was done.

Not quite the same business, but a bit reminiscent.

danielodievich3y ago

Excellent article. However, software, like everything, is subject to laws of physics. Entropy always wins in the end. No matter how good the original engineering and planning, without maintenance it will all fall apart soon enough.

enneff3y ago

Right. It’s not like if all the engineers walk out it’s suddenly going to fall over. It’s that when an issue comes up there may not be the expertise on hand to fix it. So the remaining twitter engineers should expect a rough time over the coming months.

user_named3y ago

Small issues going unfixed, building up over time and cascading into a crisis.

emodendroket3y ago

There must be some plan for this, though, right? If it were me and I were trying to do more with less I'd plan to cut some features (for instance, are Spaces that critical to the operation? Because it seems like running them would be demanding) and try to migrate things to managed services to reduce the operational load.

userbinator3y ago

> software, like everything, is subject to laws of physics

I disagree; math would be a closer analogy. And indeed, arithmetic still works like it did a millennium ago. Closer to the present, I have binaries from the late 80s that still work today (and I use them semi-regularly.)

Indeed, much of the impetus of the software industry seems to be to propagate the illusion that software somehow needs constant "maintenance" and change. For the preservation of their own self-interests, of course; much like the company that makes physical objects too robust and runs out of customers, planned obsolescence and the desire to change things and justify it so they can be paid to do something are still there.

It's possible to make things which last. Unfortunately, much of the time, other economic considerations preclude that.

mongol3y ago

If software ran without side effects, perhaps. But it doesn't. Databases grow, files are uploaded, logs pile on, messages and events propagate and filesystems fill up. This is why entropy matters.
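
As a rough illustration of the "filesystems fill up" point, here is a minimal sketch that estimates how many days of runway a disk has left. The growth rate is an invented number, and the helper name `days_until_full` is hypothetical; `shutil.disk_usage` is the standard-library call that reports total, used, and free bytes.

```python
import shutil


def days_until_full(free_bytes: int, growth_bytes_per_day: float) -> float:
    """Days of runway left at a constant growth rate; inf if not growing."""
    if growth_bytes_per_day <= 0:
        return float("inf")
    return free_bytes / growth_bytes_per_day


if __name__ == "__main__":
    usage = shutil.disk_usage("/")   # named tuple: total, used, free (bytes)
    assumed_growth = 2 * 1024**3     # assumption: 2 GiB of new logs/data per day
    runway = days_until_full(usage.free, assumed_growth)
    print(f"about {runway:.0f} days until the root filesystem is full")
```

A constant growth rate is the crudest possible model; in practice you would probably fit the rate from historical usage and alert well before the runway reaches zero.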

freshestfish3y ago

> I disagree; math would be a closer analogy. And indeed, arithmetic still works like it did a millennium ago. Closer to the present, I have binaries from the late 80s that still work today (and I use them semi-regularly.)

Sure, those binaries might work the same when executed. The probability of that is never 100%, but as you pointed out, the rules of arithmetic aren’t expected to change any time soon. That’s correct. Unfortunately software does not exist in its own micro-verse; it’s subject to the laws of physics acting on the machines it’s running on. So while you might be able to write scripts that work decades later, it’s much harder to ensure those scripts consistently run for decades. RAM chips, CPUs, and everything in between are guaranteed to eventually fail if left running unsupervised in perpetuity. Entropy rises with complexity. At Twitter’s scale, to run a software service you need globally distributed cloud infrastructure. They likely have hundreds of services, deployed to many hardware instances distributed across the globe. Twitter isn’t 1 script running 1 time producing a single result. It’s hundreds if not thousands of systems interacting with one another across many physical machines. Layers of redundancy help, but ultimately cascading failures are a mathematical certainty. Many would argue the best strategy to reduce downtime on these systems is to actually optimize for low recovery time when you do fail.
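
One way to see why optimizing for recovery time is attractive is the textbook steady-state formula availability = MTBF / (MTBF + MTTR). The sketch below uses made-up numbers (500 hours between failures, 4 hours to recover) purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed baseline: a failure every 500 hours, 4 hours to recover.
baseline = availability(mtbf_hours=500, mttr_hours=4)

# Option A: make failures half as frequent (often very hard at scale).
fewer_failures = availability(mtbf_hours=1000, mttr_hours=4)

# Option B: recover twice as fast (runbooks, automation, practiced on-call).
faster_recovery = availability(mtbf_hours=500, mttr_hours=2)

print(f"baseline:        {baseline:.5f}")
print(f"fewer failures:  {fewer_failures:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
```

In this model, halving recovery time buys exactly the same availability as doubling the time between failures, and in a system with many moving parts the former is usually the more tractable lever.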

Software is also bound to the world in other ways. Similarly to how most business processes, products and even more generally, tools, change over time, so too do the requirements placed on software systems made to facilitate or automate these things.

Ultimately the only way to escape the maintenance cost of software is to stop running it. The longer you leave a software system running, the more likely it will eventually stop.

jdminhbg3y ago

Even if the entirety of Twitter.com were mathematically proven correct, it still would run on servers that are made of physical bits that are subject to entropy.

Renaud3y ago

It’s possible to make things that last if you are in total control of the whole stack, including hardware.

Embedded systems that still do their job after 30 years do exist but they live in isolation in a specific and controlled environment, and are built for a limited, unchanging task.

On the other hand, complex web software is built on layer upon layer that is not in Twitter’s complete control.

Hardware changes regularly, requiring changes at the lower levels of an OS, inducing potential changes in behaviour, performance, which require adaptation as a consequence.

And that’s before considering security, eternally moving goalposts. Not just at the OS or network level, but also at the business level.

Twitter et al. are not living in a locked down context, they live in the messy world of human interactions and that alone requires constant tweaking.

So yes, a binary is more like a mathematical construct and by itself it won’t rot, but if the world around that binary changes, you need to change the binary as well, and for that you need maintenance. The amount required depends on the complexity, brittleness and how well your stack is engineered, but implying it’s a con is a bit extreme.

ShamelessC3y ago

Computation is literally bound by entropy. Math has no such limitations unless you explicitly define them.

I thoroughly recommend researching entropy as it relates to e.g. information theory, systems engineering and even (perhaps especially) to machine learning.

Computation is ultimately about what we can compute _in this universe_ and the forward flow of time is an emergent property from the universe’s innate entropic guarantees.

Time is “pre-sorted” for us thanks to entropy, enabling us to define algorithmic complexities over the time domain in the first place.

polemic3y ago

Moot in this case. Musk has gutted the work-force and instigated a wide range of changes. All bets are off when new code is being introduced.

typon3y ago

Spot on. Absolutely hate this attitude that software sitting there just gathers wear and tear as if it's a mechanical device. Software is written with a particular target platform in mind: x86, ARM, Nvidia GPUs, FPGA soft-processor etc. If the hardware you are running on doesn't change, your software should still function. If the specs of that target platform don't change, your software should still function. If the specs of the target platform change but a hard-working compiler engineer has done the work to make sure your software gracefully uses the new features (for example, a compiler optimizing using AVX instructions), your software should still function.

The fact that most software doesn't continue to function even on the same platform, and on the same hardware, is a massive indictment of the software industry's standard practices.

rapind3y ago

Don't forget that users have been trained to be incredibly fault tolerant as a result of how flaky general software can be. Now that cars are having BSOD that tolerance may reach new levels or just evaporate.

duckmysick3y ago

> It's possible to make things which last.

Things last when you take good care of them.

CoffeeOnWrite3y ago

But patching

fomine33y ago

Twitter-2010 can't handle Twitter-2011 workload

mtejoOP3y ago

Thanks, I agree. I think small issues can build up over time, and it'll be interesting to see what happens. Wish I could see the future post mortem docs from the outages.

nemo44x3y ago

Turning it off and then back on again probably fixes the issue. It’s very unlikely there’s a grand ticking time bomb just waiting to bring it all down. Recycling servers will probably keep it running.

Things do still need to be fixed of course.

koolba3y ago

> Turning it off and then back on again probably fixes the issue.

Turning a large scale system entirely off and on is never simple. Invariably you’ll run into some kind of circular dependency that must be manually investigated. And even tracking those down becomes tricky.

Classic examples are things like DNS, service locators, or authentication systems. And large tech companies are notorious for NIH-syndrome for all of those.
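
The circular-dependency problem can be sketched with a small dependency graph. The service names and edges below are entirely hypothetical; the point is only that a topological sort either yields a cold-boot order or reports the loop that someone has to break by hand. This uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical cold-boot dependencies: service -> services it needs first.
dependencies = {
    "dns": {"config-store"},
    "config-store": {"auth"},
    "auth": {"service-discovery"},
    "service-discovery": {"dns"},      # closes the loop
    "timeline-api": {"auth", "dns"},
}


def boot_order(deps):
    """Return a valid start order, or raise CycleError if there is a loop."""
    return list(TopologicalSorter(deps).static_order())


try:
    print("boot order:", boot_order(dependencies))
except CycleError as err:
    # err.args[1] is the list of nodes forming the detected cycle
    print("cannot cold-boot, circular dependency:", " -> ".join(err.args[1]))
```

In a real fleet the hard part is that nobody has this graph written down, which is why tracking the loops down after the fact gets tricky.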

colesantiago3y ago

> Entropy always wins in the end. No matter how good the original engineering and planning, without maintenance it will all fall apart soon enough.

This seems to be more true of Mastodon than Twitter.

I can't imagine any self hosted Mastodon instance staying up longer than twitter.

acdha3y ago

First, Mastodon isn’t the only option - it’s not like both couldn’t fail just as both AIM and MySpace did.

Second, Mastodon runs on open protocols. That has good and bad points - for example, it won’t grow as quickly as a project with huge corporate backing - but it does mean that there’s a more direct link between the community and its longevity. Twitter isn’t just flailing because Musk is doing management by bong rip but also because he’s desperately trying to get out of a financial hole. Open source projects have different kinds of financial challenges but they’re never on the hook trying to fill a hole measured in billions of dollars, either. Given the number of communities older than Twitter I’d say it’s far from proven that Twitter will outlast anything.

Robotbeat3y ago

I found it really interesting that one of the main Iceland instances is being run on a Raspberry Pi out of some student's living room in Sweden. Over 500 users at the moment. https://types.pl/@tritlo/109383888427885539

throwaway0x7E63y ago

why? a single app running on a single server is several orders of magnitude more resilient than a spaghetti clusterfuck of services upon services. twitter could be brought down by a single expired certificate
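
For what it's worth, the expired-certificate failure is one of the easier ones to watch for. Here is a minimal sketch of an expiry check; the 30-day warning window and the helper names are assumptions, while `ssl.cert_time_to_seconds` is the standard-library function for parsing the `notAfter` string found in `getpeercert()` output.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a cert's notAfter time (negative if already expired).

    not_after uses the format returned in getpeercert()["notAfter"],
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the cert is inside the (assumed) warning window."""
    return days_until_expiry(not_after, now) < warn_days
```

Fetching the certificate itself (from a live socket or a file on disk) is left out here; the check only covers the date arithmetic.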

areoform3y ago

There's a pattern of diverging expectations here, one is the non-technical/naïve one,

    - Twitter is going to go down tomorrow and it's all over. RIP.

The second is,

    - Twitter is going to experience a failure cascade over time.

The third is,

    - It's all going to be fine.

I suspect that the real question is, how many individual wires can break before the cable holding the suspended platform snaps?

I am not that good of a developer, but watching Twitter I can't help but be reminded of Arecibo, except at a larger, more abstract scale. There was no single massive event that caused the failure, but rather a series of factors and events, tiny cables breaking, that eventually led to a failure cascade that then caused the suspended platform to crash.

From what I can tell, in the past week or so,

    - Twitter's copyright system failed

    - Two Factor Authentication broke down (it seems to be back up?)

    - (anecdata) Tweets have been loading sporadically for me and other people, sometimes we try to open a tweet and it says that it doesn't exist. Happens more frequently with new/recent tweets.

    - (unconfirmed) Twitter's managed account backend is behaving "strangely." E.g., "One of my campaign managers logged in last week and found all our paused creatives from the past 6 years had been reactivated." from https://www.teamblind.com/post/i-told-my-team-to-pause-our-750kmonth-twitter-ads-budget-last-week-4dnbo1Ft (friends have told me other similar stories)

Are these failures symptomatic of a larger problem, or are they well-isolated parts misbehaving? Can Twitter even experience a failure cascade like Arecibo? Can that be paused/stopped?

I am asking this question because I don't know. And I'd like to develop a better mental model to understand what happens next.
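
One toy mental model for the cable question is an equal-load-sharing bundle: every intact wire carries an equal share of the load, and when a wire breaks its share is spread over the survivors. The sketch below is not a model of Arecibo or of Twitter; the capacities and load are made-up numbers and `surviving_wires` is a hypothetical helper.

```python
def surviving_wires(capacities, total_load):
    """Equal-load-sharing cascade: return how many wires are left standing.

    Every intact wire carries total_load / n. Any wire whose capacity is
    below its share breaks, the load is redistributed, and we repeat until
    nothing else breaks or nothing is left.
    """
    intact = list(capacities)
    while intact:
        share = total_load / len(intact)
        still_ok = [c for c in intact if c >= share]
        if len(still_ok) == len(intact):
            break  # stable: no further breaks
        intact = still_ok
    return len(intact)


# Made-up numbers: 10 wires with capacities 10..19 holding a load of 90.
wires = list(range(10, 20))
for already_cut in range(6):
    left = surviving_wires(wires[already_cut:], 90)
    print(f"{already_cut} weakest wires cut -> {left} wires still holding")
```

With these numbers the bundle shrugs off losing its three weakest wires, and the fourth loss takes everything down. Whether Twitter's services share load in anything like this way is exactly the open question.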

WatchDog3y ago

Also anecdote, but twitter has historically been pretty unreliable for me.

Without a reliable twitter systems status history pre-acquisition, the reports of failures, like the issues with the 2FA system, don't mean a whole lot.

jakear3y ago

Definitely this. I regularly see people complaining about ancient bugs as if they were new after the acquisition. Nope, you just are engaged enough to notice them now.

hobofan3y ago

After the failwhale days, Twitter has been quite stable for me with the only exception being the live-refresh features on tabs of the website that have been open for days (which I don't think many websites would handle well).

There has been a serious degradation in the quality since the acquisition:

- Sporadically loading tweets - I could go on some tweets and refresh the page multiple times, with tweets fading in and out of existence showing "This tweet is unavailable"

- Tweets that quote tweets of accounts you have blocked behave weirdly in multiple ways. Sometimes it's just showing a "This tweet is unavailable" instead of "This tweet is from an account you have blocked", and a few times interacting with them crashed my timeline on mobile, forcing me to restart the app

- On a few occurrences, every third tweet on the timeline was an ad

- 1-2 notifications from crypto spam bots reaching me every day. The same thing previously was filtered out quite reliably (I assume), since back then it happened ~once every 6 months

--------

And those are only the things I've seen personally. Yeah, they are no deal-breakers, and mostly sporadic failures, but it very much feels like a service that is degrading by the day.

insightcheck3y ago

It looks like direct messages may also be buggy at present, according to a couple of anecdotes on r/Twitter at: https://old.reddit.com/r/Twitter/comments/z1b4zc/cant_sendre...

moistly3y ago

Perhaps the failure mode is like the Tacoma Narrows Bridge, with Elon as the wind, the undulations his whipsaw changes of mind, and the employees the hapless cars being flung into the river.

shagymoe3y ago

When I was learning programming, I "wrote" an internal app where a supervisor input some information in a form, it connected to an oracle database and then printed out an outbound manifest for a truck full of car parts.

I put "wrote" in quotes because what I had actually done was install Apache and PHP Nuke on a Windows NT server and then modified an existing form page to do what I wanted.

I wrote that application in 2002 and never had a problem with it. I never restarted Apache or the server or did any maintenance. I didn't even upgrade Windows NT if I'm being honest. Windows NT became unsupported in something like 2006 and I left the company in 2010.

I received a call in 2018 from the last person still working there who knew who I was. It had finally fallen over and they wanted to know if I could help. That was the first time it ever had an issue.

skc3y ago

Twitter pretty much passed its first real test with flying colors since Elon took over.

The world cup is on and the site didn't collapse.

That is a huge win and while I don't believe Twitter will ever be profitable, I think Elon will be feeling rather smug right now with his skeleton staff.

zxcvbn40383y ago

Now that Elon has cleaned house, when is Twitter hiring again? It might be a good opportunity to get in on the new ground floor.

pcthrowaway3y ago

Who sees the foundation of a high-rise starting to crack and thinks, "perfect time to get in on the ground floor"?

Ground floor is exactly where you don't want to be when the structure collapses under its own weight.

Rather than bring in additional specialists to reinforce it, Musk has partially evacuated the building at least.

piskerpan3y ago

Why would anyone want to work in a dumpster fire?

UberFly3y ago

The new HARDCORE ground floor.

inawarminister3y ago

I've sent Elon an email asking for internship, though sadly I can't move to Sanfran and live in the office floor for 12 weeks like @geohot can. Still...

justahuman743y ago

How would stock grants work, cause I doubt he'll IPO twitter? Or would it just be netflix-style cash offers

didip3y ago

I am mostly curious about the Mesos layer itself.

Mesos is dead. So you need in-house expertise to patch it without being able to leverage community knowledge.

Does Twitter retain enough people to manage Mesos?

thakoppno3y ago

The Mesos detail struck me as a risk for the reason you stated.

Also, the article seems to suggest Twitter only has two datacenters. That seems surprising for the global reach of the company. Perhaps there are other smaller datacenters that are not prepared to handle the entirety of the site’s traffic.

My current thinking is there’s time to figure out how to operate the current system before it runs into issues that would render it degraded for a prolonged period of time. I noticed TLS certs have already rotated, for instance. That was my best guess for a simple thing that could fail if managed poorly.
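A minimal sketch of the kind of expiry check that catches a cert before it lapses, assuming you already have the certificate's `notAfter` string (the format returned by Python's `ssl.getpeercert()`). The helper names and the 30-day warning window are made up for illustration and say nothing about how Twitter actually monitors its certs.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a cert's notAfter timestamp (negative once expired).

    `not_after` is in the format used by ssl.getpeercert()["notAfter"],
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    # Standard-library parser for the notAfter/notBefore string format.
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_rotation(not_after: str, now: Optional[float] = None,
                   warn_days: int = 30) -> bool:
    """True when the cert expires within `warn_days` (arbitrary threshold)."""
    return days_until_expiry(not_after, now) < warn_days
```

Run on a schedule against every cert in the fleet, a check like this turns "somebody has to remember" into an alert, which is the failure mode the comment is guessing at.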

mtejoOP3y ago

The company was moving off it. I wish I could find a Twitter eng blog about it. Interestingly, there are a bunch of sources not directly from Twitter about the decision, though.

wyclif3y ago

I was surprised to hear that a cutting edge social media app uses Mesos. Why did they choose that over other options?

rickette3y ago

Twitter didn't start yesterday. There was a time when Mesos was all the hotness and k8s was this new thing that looked promising but wasn't nearly production ready.

Apple is also a big Mesos user, but also moving to k8s.

__warlord__3y ago

IIRC Mesos was an internal tool (at Twitter) that got released under the Apache umbrella later on.

Then Mesosphere (a company) wanted to bring it to the enterprise market but at the time was competing with Kubernetes... and we all know who won.

ggm3y ago

The article is good and informative, but a little odd in one respect. It uses air-quotes to introduce concepts like "rack" and "shelf" in a DC, but uses O(log n) notation mid-flow.

If you don't know what a rack is, how are you meant to know what the order of a scaling function means? That's a highly computer-science-specific notation, and if you grok O(n) you know what a rack, a host, a DC is.

mtejoOP3y ago

Hah fair point, I was trying to write for a more general audience. Part of blogging for me is to improve my written communication skills. Appreciate this!

ggm3y ago

Something like "scales linearly" or "scales exponentially" probably covers it, for whatever case it was (I forget)

really good article btw. really enjoyed reading it.

might pay to flag when systems you talk about are twitter-internal or are open-source. People love that kind of thing. "wow: twitter uses the gnu C compiler" type thing.

housecarpenter3y ago

I get your point that both of these are specialized notations/terminologies, but it's entirely possible for somebody to understand algorithmic complexity without knowing what "a rack, a host, a DC" is, there's no inherent connection between the two. As somebody whose interests tend to be more theoretical than practical, myself prior to reading the article would be an example.

irrational3y ago

Very interesting article. But the constant misuse of “to” for “too” kept throwing me out of reading mode. I wonder why my brain does that instead of reading past it since I know what is actually meant.

blamazon3y ago

Interesting, I didn't notice despite reading the article start too finish. The brain is so mysterious!

stoniejohnson3y ago

Irony!

smegsicle3y ago

> "to many" - 5 matches

he did get it right one time though

mtejoOP3y ago

haha I'll go fix that. sorry :(

mradek3y ago

Tangent but I do hope that Musk's trim down results in orgs that have fewer “executives” and less of a layered cake of an org structure, and more autonomous small teams that execute on shared overarching initiatives.

I really don’t understand why so many tech companies have like 8 layers of engineering levels. If the argument is that you need more money so more levels, just have a bigger band. Don’t chase titles they don’t mean shit. Not to mention the management stacks that seem to just hang out in meetings and take pvt. I haven’t worked at a proper startup but I’ve been on projects where a dozen or so people rebuilt apps used by tens of millions of people in a few months, or launched completely new applications for bigger companies.

Now that I work in a tech company (not big tech, but still a multibillion dollar corp) I’ve noticed that since IPO we have added a ton of bureaucracy whereas back in the day we were small teams building completely new and at times complex features. Literally was in a meeting earlier this month where people were patting each other on the back because we added a single attribute to a table. I’m obviously reducing the entire initiative to a small thing but it kind of explains all we had to do. It’s soul crushing but with the economy the way it is I must deal with it. Hell I’m down to show up next Monday and work for Elon if he wants a go getter. At least I’ll get to DO stuff. Or any other startup in SF if they are hiring.

shitloadofbooks3y ago

A lot of those upper layers and the bureaucracy which is seen as "bloat" by the bottom layers comes about from compliance. The compliance burden grows as the business grows (e.g. PCI, insurance and government requirements kick in at certain thresholds).

Sure engineering adding a new widget to the site might increase profits by 12%, but all that bureaucracy can prevent the company losing its payment provider or breaching a government regulation which might cause the company to close overnight. So if the stakes are +12% vs -100%, who is actually doing the most important work?

I don't disagree with your desire to work somewhere lean and task-focused, but I think it's almost impossible for that to happen anywhere but small(er) workplaces.

Karrot_Kream3y ago

I work at a Big Tech adjacent (or Big Tech, depends on your definition) company and was there pre-IPO and it happens for the reason that varjag described in their sibling comment. What makes it even more idiotic is, as the pre-IPO/lean culture dilutes, more people will use their level to pull rank in meetings, as the level itself becomes more of a target than the work/goals. Then the politics around levels will become ultra-competitive. Google didn't get to their tortuous promo process in a vacuum after all.

chrismarlow93y ago

If 10 engineers can build 5 features in a week, then why can't 50 build 25? We easily understand why as engineers on HN, and even managers and investors understand why. But that won't stop them from trying to solve that problem. And some of them do! All of these big FANG companies are as big as they are because at some point they figured out how to solve this problem. But just like time and growth break architectures and software, they also break processes, and they eventually encounter a level where they can't solve it anymore. Then they stagnate, then they either die or solve the new problems. They have to grow to keep investors, and to grow they have to keep investors. That's why it doesn't stop at 12 dudes building an app. But sometimes it does just fine. It's really just a matter of who's investing and what returns they're looking for.

summerlight3y ago

> Tangent but I do hope that Musks trim down results in orgs that have less “executives” and a layered cake of a org structure, and more autonomous small teams that execute on shared overarching initiatives.

The core problem is that this approach doesn't really scale out, because communication overhead grows quadratically with the org's size if left untamed. The feasible options are:

  1. Let the complexity bring chaos across the org
  2. One decision maker rules them all
  3. Give some sort of management structure to the org

Your proposal is somewhere between options 1 and 2. Option 1 works pretty well for smaller orgs, and it might scale to quite a sizable business if the members are generally competent so that org-wide trust can be well established. But you'll hit a roadblock eventually, since people cannot spend all of their time on communication overhead. Option 2 just moves the burden of the entire complexity onto a single person, so it's not really a reproducible solution but more a matter of luck.

Hence option 3 is the only remaining option for regular orgs, and many smart people have tried to figure out the best structure (or at least best practices), but unfortunately we don't have a definitive answer yet. Google-style "tech levels" are one of the tools to reduce communication overhead by setting a common structure for expectations (e.g. "we have 1 L6 and 3 L5s to take that project" is generally more convincing than lengthy explanations of your team members). It's not ideal but it somehow works, so it's adopted.

You're likely right that you'll be much more productive if you can get rid of those bureaucracies, but getting other folks convinced is a completely different story. Trust takes time to propagate and people have a limited time to spend on it. This obviously could be drastically simplified if you can work with Elon (or similar style leaders) directly but his time is extremely limited so there will always be only a small number of people who can enjoy that privilege...
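The quadratic-growth point can be made concrete with a rough sketch. It assumes "communication overhead" is approximated by the number of distinct pairs of people who might need to talk, and it models a management structure as fixed-size teams where only one lead per team talks across teams. Both are simplifications chosen for illustration, and the function names are hypothetical.

```python
def flat_channels(people: int) -> int:
    """Distinct pairs in a flat org: n * (n - 1) / 2."""
    return people * (people - 1) // 2


def team_channels(people: int, team_size: int) -> int:
    """Pairs when people only talk within their team, plus leads across teams.

    Simplifying assumption: the org splits evenly into teams and exactly
    one lead per team handles all cross-team communication.
    """
    if team_size <= 0 or people % team_size != 0:
        raise ValueError("people must split evenly into teams of team_size")
    teams = people // team_size
    within_teams = teams * flat_channels(team_size)
    between_leads = flat_channels(teams)
    return within_teams + between_leads


if __name__ == "__main__":
    for size in (10, 50, 500):
        print(size, flat_channels(size), team_channels(size, 5))
```

Under these assumptions 50 people in a flat org have 1225 possible pairs, while ten teams of five have 145, which is the trade that levels and management layers are trying to make.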

varjag3y ago

> If the argument is that you need more money so more levels, just have a bigger band.

Someone earning a lot more than you at the same nominal position leads to a lot of resentment: the perception is that you are clearly wronged here. On the other hand a rank system makes this less objectionable and offers at least some roadmap to a similar income. "Oh, she's SWE L9000 naturally she'd earn that"

wnevets3y ago

I know it became a meme but did people actually think it would just crash overnight? Aren't a lot of people on hacker news in tech?

uluyol3y ago

I wonder how big Twitter's infra costs are (capex). Based on the article, they need to build capacity > 2x peak load to tolerate one of the DCs failing. But if they instead had 4 and still only tolerated 1 failure, they'd only need capacity > 1.33x peak load.

This doesn't account for the extra overhead associated with extra DCs, but it seems like there are opportunities for major efficiency wins.
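The 2x and 1.33x figures follow from a simple ratio, sketched below. It assumes equally sized datacenters, load that can be spread evenly across whichever ones survive, and none of the per-DC overhead mentioned above. The function name is made up for illustration.

```python
def required_capacity_factor(datacenters: int, tolerated_failures: int = 1) -> float:
    """Total built capacity as a multiple of peak load.

    After `tolerated_failures` datacenters are lost, the survivors must
    still carry 100% of peak, so each DC is sized at peak / survivors and
    the total build-out is datacenters * (peak / survivors).
    """
    survivors = datacenters - tolerated_failures
    if survivors <= 0:
        raise ValueError("at least one datacenter must survive")
    return datacenters / survivors


if __name__ == "__main__":
    for n in (2, 3, 4, 8):
        print(f"{n} DCs, 1 failure tolerated: {required_capacity_factor(n):.2f}x peak")
```

The factor approaches 1x as the number of datacenters grows, which is the efficiency win the comment points at. The fixed cost of each extra site is what pushes back against it.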

bigtones3y ago

Well Elon is planning on shutting down one of their three data centers, the Sacramento data center - so they must have a lot of extra capacity.

https://www.datacenterdynamics.com/en/news/report-elon-musk-...

midoridensha3y ago

Go Elon! Shutting down a datacenter will save a lot of energy. I applaud this move. (Though I suppose they could just sell it to some other company.)

Now he just needs to shut down the other 2.

hot_gril3y ago

You can usually do more with less if the remaining few are up to the challenge. It will come at some cost, like having to "move fast and break things" more. There can even be a positive side if the previous situation had a lot of friction due to "too many cooks," as I've experienced on several teams. What's more worrisome is the alleged 80hr work weeks. Yes, two SWEs on a team aren't as good as one SWE working double hours, since SWE-time is known not to scale that way. But burning people out, aside from being unethical, is risky when nobody is ready to fill the spot.

Musk has a point, but it feels more like he's doing this out of financial desperation.

I'll bet it starts going down a bit more often, nothing too severe. Bigger issue might be inability to roll out new features, especially if they dig a horrible tech debt hole. Maybe someone has more details, I haven't kept up, but it seems like the $8 verification rollout got botched because they cut corners on actually verifying the subscribers.

justahuman743y ago

> cut corners on actually verifying the subscribers

Did they actually verify anything other than the ability to pay $8? It seems wild to me that they thought it would work out just fine

ineedasername3y ago

I wasn't really expecting a large failure quite yet. I'm guessing there are enough people left with secondary or tertiary knowledge of systems to keep things ticking over for now. The question is how long that state of affairs can continue before some combination of worker burnout or deferred maintenance will stretch things too thin. Then, all-hands emergency work to get things running could contribute even more to burnout -> resignations etc.

They could stay ahead of things if they hire into areas related to infra maintenance that got hit the hardest, before issues reach the point of cascading failure. Or maybe put everything not critical to keeping the site up & running on hold as remaining devs get a bit of cross training using the limited people w/ institutional knowledge that remain as trainers.

For myself, I don't see much value in Twitter, in terms of net social value. Its format seems pretty much designed for only the most surface level discussions, part of what I believe leads to some of its toxicity: it's simply too hard to have conversations complex enough to invite enough discourse for that to tip the productive/toxic ratio a bit more positive.

So I've been rooting for Twitter to fail for years. I'm not rooting against Musk: I like his other businesses and at least a portion of their ongoing success is tied to his persona. (At least before the Twitter stuff: his actions there may impact his relationship with major banks that his other businesses will rely on. JP Morgan, a bank that wasn't included in the deal & therefore won't be hurt by it, is already on somewhat negative terms with Musk. The Twitter deal has added a few more to that list, and other banks have undoubtedly taken note. Some bank will always finance the regular sorts of things any large corporation needs for him, but they're all going to be pricing in some additional risk. That's not really a big deal, it's just that I think Musk's persona has previously been a net positive for his companies and now it's lost at least a little bit of that.)

yieldcrv3y ago

I don't think it's that crazy. Private Equity and Wall Street doesn't have pressure for resilience, it has pressure for churn. As long as revenue and spending... exists... things are fine. Ideally they are numbers that grow. This doesn't lend itself to clearing out a development backlog, or engineers doing the most important things, it lends itself to rapid iteration and justifications for the iterations.

So now that it is owned by someone that doesn't need churn and just needs to reduce cost, people can focus on discrete resilience factors, just like any small tightly held software operation. Many of Twitter's pet projects go away, the ad sales relationships have to get re-evaluated too, but the core consumer product that everyone sees can be made resilient and operate cheaper.

acdha3y ago

The problem is that it’s run by someone who overpaid badly and needs to come up with significantly more money to pay for the debt he saddled the company with, and then his actions seriously disrupted ad revenue. That puts him more in the PE playbook of cutting costs as deeply as possible even at the expense of long-term growth. Unlike his other companies there isn’t strong government support to drive business for Twitter.

yieldcrv3y ago

He's going to sell Tesla shares to pay off his debts, he's knocked off $4 billion already, he'll be able to keep doing that, make offers to buyout stakes of Twitter from people that really don't want to be involved in this shit show. Twitter will operate at lower costs, and also not be profitable for him given current information. That's financially fine.

diebeforei4853y ago

Musk has been in a similar position before when he took over Tesla. At that point it was struggling to make the Roadster, they were facing having a marginal cost of manufacturing a car higher than the sales price.

Musk took over and made cuts. I don't think anyone could argue that those cuts were at the cost of Tesla's long-term growth. They were needed because of an imminent liquidity problem. Tesla has hired great people since then.

danans3y ago

> Private Equity and Wall Street doesn't have pressure for resilience, it has pressure for churn. As long as revenue and spending... exists... things are fine. Ideally they are numbers that grow. This doesn't lend itself to clearing out a development backlog, or engineers doing the most important things, it lends itself to rapid iteration and justifications for the iterations.

Did we read the same article, because it seemed clear that all that resilience work was done when the company was public shareholder owned, not under the new private owner.

yieldcrv3y ago

I read the article, I wanted to use the space to point out that Elon's fortunate that it has been done, and also that the technology infrastructure things he needs to focus on aren't impossibly challenging to do, if people are expecting it to be.

So operating Twitter with 80% fewer engineers isn't the voyeuristic suicide that many of us are hoping to be amused by.

a-dub3y ago

i think i prefer a culture where one is rewarded for automating their job.

moves like musk's are bad for the industry. they produce more systems with "job security" built in.

TravHatesMe3y ago

> more systems with "job security" built in.

How common is this?

AtlasBarfed3y ago

Twitter is a lot of drama, but ultimately this is a common private equity play:

1) take private

2) fire/layoff until things break, patch up / rehire with cheaper labor. Repeat as necessary

3) spit shine/repackage what's left with a theoretically more appealing balance sheet.

4) resell to another sucker...uh, buyer, or take public again.

Fundamentally Musk paid 44 billion for 5 billion of annual revenue, and presumably 5 billion in costs. Unlikely to add revenue in Twitter's model, he's cutting costs.

Honestly, at this point, is there revolutionary technology that is needed to keep the lights on at twitter? Do you need graphQL gods and SREs that would ace any Amazon raising the bar and master of silicon valley interviewing?

Nah. He'll honestly probably outsource a ton of the upkeep.

It's ugly, but the Twitter board and shareholders took their money and ran, and abandoned the product and the workforce. They could have backed out and let Elon Musk off the hook of his dumb contract with them, but they just wanted the money, and sold to someone that just wants to get his money out of it too.

dehrmann3y ago

There are a few key differences. PE tries not to massively overpay, it'll put more effort into targeted layoffs, and it'll maintain enough of a "business as usual" posture to not scare away advertisers. Edit: they also put in a CEO whose full-time job is being CEO of that company.

But yes, this is very bad execution of the PE playbook.

midoridensha3y ago

Elon thinks he's brilliant at everything he tries his hand at, so he thought he'd jump into the private-equity arena and show everyone how brilliant he is.

usrusr3y ago

4) who'd that sucker be? How much hubris would it take to sell investors on the idea of salvaging a failing company that "the guy who killed the internal combustion engine" and "the guy who made rockets land on their tail" couldn't make profitable? If it comes to selling off a failure, Musk's reputation will be quite a dealbreaker. The days of whatever happens, if all else fails we can still sell to Yahoo are long gone. Perhaps China might be interested, for obvious reasons, but surely not after Twitter has gone down fighting, fading away into irrelevancy in the process.

kaba03y ago

I mean, they got an offer for roughly a 10x deal. At that point it was in the best interest of shareholders to sell. Especially that it was yet to turn a profit.

Like, I like my car, but would probably sell it for an order of magnitude more than its worth.

npn3y ago

Chinese developers are famous for making apps serving 1.4 billion people, with at most 1/3 salary of US, and the willingness to do 996 a whole year around.

I would be surprised if no company in the West took advantage of that.

redeyedtreefrog3y ago

SpaceX and Tesla made affordable, reusable rockets, and affordable, high performance electric cars with a long range. Those things have obvious value, but required solving big engineering challenges to make them a reality. Musk doesn't seem to have any clear goal with twitter. There are vague ideas about making it subscription-funded rather than advertising-funded, and about having less censorship, but neither are things the majority of people want.

I imagine the site will mostly continue to more-or-less work, despite all the layoffs. They still have thousands of staff. The network effect of Twitter is so big that people will continue to use it even if fail whales become more common. Others suggest that it will soon crash and burn, or that Musk will get bored and sell it for a few billion. Or that Musk is a genius who will make some sort of amazing Twitter 2.0 that does for social media what Tesla did for electric cars. But without any appealing long term vision, and with an owner who bought it to satisfy their ego rather than with any real plan, the reality may be more boring. I imagine it will just languish for many years, with occasional manufactured drama, and occasional downtime, but no real innovation. Maybe to be eventually supplanted by something else in 10 years or so.

mattbrewsbytes3y ago

> Or that Musk is a genius who will make some sort of amazing Twitter 2.0

Based on texts exposed via the lawsuit about the purchase, I don't think this is the case. I don't think he, or his advisors, understand that the product at twitter (and other social media) is content moderation. You can have a vision of whatever type of content or pricing scheme you want but without solid content moderation you will lose advertisers, gain lawsuits (people were posting movies the other day) and lose users because the "feed" becomes a muddied mess. Users are only really the product when you can moderate their content to have profit via advertisers.

bushbaba3y ago

If there’s no code changes, reliability is greatly increased. Probably 99% of all outages and incidents are from a bad code change.

Within most companies you don’t want to pause innovation.

PeterStuer3y ago

In every software company of over 50 employees I have seen up close, it would be fairly easy to identify 50% of people that had minimal to negative contribution.

That said, it takes some time to get to know the real workings of a company, knowledge required to select the right 50%.

That said, afaict the current mo is keep the lights on for the current offering while a new team builds a new product for that enormous user base that offers something much more profitable.

YokoZar3y ago

> drove some large cost savings projects

Having worked as SRE on efficiency projects for BigCo, it is not at all uncommon to recover more than your lifetime salary in company savings with only a few months work or even less. The scale of things is so immense that even a slightly better handling of things can lead to outsize returns.

Laying off someone like that, rather than putting them on full-time efficiency work, is an obvious waste.

andy_ppp3y ago

There seems to be a load of people in this thread making statements on both sides that a) the article doesn’t prove there were too many people at twitter b) the article implies there were too many people at Twitter c) Twitter is a simple app why so complicated etc.

I would say this; Twitter at this scale is extremely complicated. We don’t have enough information to know what all these roles were doing and how many were truly superfluous. The article gives an interesting insight into something extremely complicated- it might be a slice repeated in a lot of areas. I’m of the opinion that you can do a lot more with less but there’s no need to send emails to everyone in the organisation being an arsehole and essentially accusing them of being lazy, then rehiring people who never worked there just for the meme.

Anyway Musk will blow up some rockets, put enormous pressure on people to fix them, and come up smelling of roses again even if he treats a load of people like crap along the way. His fans will say he’s a genius and his enemies will say he’s a dickhead. And so it goes.

raspberry13373y ago

Musk makes leftists lose their shit and all rational functioning, even on Hacker News; that is my reflection.

socialismisok3y ago

Not just leftists. His fans are equally out of touch in the other direction.

tzm3y ago

Looks like 28% of Twitter employees took the offer for 3 months of severance.. not the rumored 75% or 80%.

yalogin3y ago

The reason why Twitter hasn’t crashed is it’s well written and well orchestrated. Once a bug comes in and crashes something that is when the chaos starts. It’s almost a guarantee that the current crop do not know how to fix the bug. It will be interesting to see how they handle that.

trollied3y ago

> It’s almost a guarantee that the current crop do not know how to fix the bug

Seeing lots of comments like this. Why do you feel it’s ok to say that, when you don’t actually work there or have intimate knowledge of who is currently working there? The pure speculation and nonsense surrounding Twitter at the moment is plain awful.

Marazan3y ago

One thing I've seen that I think is overlooked in the whole "is Twitter's tech stack inefficient" chat is the nature of the service.

A lot of the "Twitter could be run off my laptop" style comments seem to come from people who run, effectively, Read-Only services. They might serve data at thousands of queries per second but the data itself is _slow_. It is video files or music streams or other data that updates infrequently or, if there is dynamic data involved, it comes from a very small number of fixed sources.

Twitter deals with thousands of posts per second that are subject to huge fluctuations in the density _and_ those posts have to be disseminated to the millions of users. Twitter is a two sided problem.
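To illustrate why the write side matters, here is a toy fan-out-on-write timeline. It is a sketch of the general technique only, not a description of Twitter's actual architecture. The class name, the in-memory dictionaries, and the 800-entry timeline cap are all assumptions made for the example.

```python
from collections import defaultdict, deque
from itertools import islice
from typing import Deque, Dict, List, Set, Tuple

Post = Tuple[str, str]  # (author, text)


class FanOutTimeline:
    """Toy fan-out-on-write: each post is copied into every follower's timeline."""

    def __init__(self, max_timeline_len: int = 800) -> None:
        self.followers: Dict[str, Set[str]] = defaultdict(set)
        # Bounded deques drop the oldest entries once a timeline is full.
        self.timelines: Dict[str, Deque[Post]] = defaultdict(
            lambda: deque(maxlen=max_timeline_len)
        )

    def follow(self, follower: str, followee: str) -> None:
        self.followers[followee].add(follower)

    def post(self, author: str, text: str) -> int:
        """Deliver a post to all followers and return the number of writes."""
        recipients = self.followers[author]
        for user in recipients:
            self.timelines[user].appendleft((author, text))
        return len(recipients)

    def read(self, user: str, limit: int = 20) -> List[Post]:
        """Reads are cheap: the timeline is already materialized, newest first."""
        return list(islice(self.timelines[user], limit))
```

The return value of `post` is the point: one write from an account with a million followers becomes a million timeline writes, so bursts in posting multiply through the follower graph in a way a read-mostly service never sees.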

jiggawatts3y ago

I've heard this elsewhere, and when Twitter was originally founded in 2006 it wouldn't have been a particularly hard problem.

Twitter processes roughly 10K tweets per second. Even if you bloat out the text quite a lot to account for encoding overheads, metadata, etc, etc... and assume that each one is 10KB, then this is just 1 Gbps. A single NIC on an old server.

Okay, I get it, Twitter needs a lot of data too. Lists of users, etc...

Twitter has 450 million monthly active users, which sounds like a lot, but even if there's 1MB of profile data per user such as who they are following, that's just 450 TB.

That's... not that much these days. A large-but-not-enormous database cluster.
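The back-of-envelope numbers can be checked with a few lines. The inputs are the comment's own estimates (10K tweets per second, 10 KB per tweet, 450 million users, 1 MB of profile data each), decimal units are assumed, and the function names are made up for illustration.

```python
TWEETS_PER_SECOND = 10_000
BYTES_PER_TWEET = 10_000            # 10 KB, deliberately bloated estimate
MONTHLY_ACTIVE_USERS = 450_000_000
PROFILE_BYTES_PER_USER = 1_000_000  # 1 MB


def ingest_gbps(tweets_per_second: int, bytes_per_tweet: int) -> float:
    """Raw tweet ingest bandwidth in gigabits per second."""
    return tweets_per_second * bytes_per_tweet * 8 / 1e9


def profile_storage_tb(users: int, bytes_per_user: int) -> float:
    """Total profile storage in terabytes."""
    return users * bytes_per_user / 1e12


if __name__ == "__main__":
    print(ingest_gbps(TWEETS_PER_SECOND, BYTES_PER_TWEET))                  # 0.8
    print(profile_storage_tb(MONTHLY_ACTIVE_USERS, PROFILE_BYTES_PER_USER))  # 450.0
```

This only covers ingesting each tweet once. It says nothing about delivering tweets to followers, replication, or media, so it is a lower bound on the traffic rather than a sizing of the whole service.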

Sure, there's "historical" data, but that can be compressed and put on cheap cold storage, like S3 or whatever.

Give me a few million annually as an opex budget, a small team of decent developers, and I can guarantee you I could whip up a cloud-hosted service that can process tweets at Twitter's scale.

Obviously, what I can't replicate is the much larger set of tools and systems behind the scenes that are used for moderation, analytics, ad sales, etc...

csomar3y ago

Did the OP confirm Elon theory that most of the stuff is not needed?

> For four of those years I was the sole SRE for the Cache team. There was a few before me, and the whole team I worked with, where a bunch came and went. But for four years I was the one responsible for automation, reliability and operations in the team. I designed and implemented most of the tools that are keeping it running so I think I’m qualified to talk about it. (There might be only one or two other people)

If you only need one person for the caching department (which, as I understand it, is critical as it delivers most of the data), then maybe you need a few dozen other engineers and there you have a functional Twitter.

That or the OP is full of himself. Kinda like Musk?

vba6163y ago

I can believe that a dozen talented engineers could in principle suffice for Twitter.

But who believes those 12 engineers still work there? The author of this specific item is in fact not there any more.

And a lot of other people are needed to bring in revenue, don't you think? Nobody is paying for a beautiful caching system.

It's like if I doubled my weight in the last ten years. Half of me is bloat, and yet, there is no possibility bisection will improve my health.

mapme3y ago

One SRE, many SWE. Also have fun asking someone to be permanently oncall with one person on the team.

The cache cluster sizes are also described here, for anyone who wants a good technical read over speculation: https://www.usenix.org/system/files/osdi20-yang.pdf

csomar3y ago

The OP claims he did the implementation (so he was the software engineer too?):

> I designed and implemented most of the tools that are keeping it running so I think I’m qualified to talk about it.

okdood643y ago

SWEs can share the oncall rotation with SRE.

dsq3y ago

It's possible to teach a horse to eat less and less each day. They just have this unfortunate tendency to die just after you succeed in weaning them off food altogether.

A horse, of course, is not the same as a reduction in force.

Engineers don't eat hay, neither does a horse have a 401(k).

kar11813y ago

For all the hand-wringing about keeping the lights on, Twitter more than doubled its headcount in 3 and a bit years. I'm not sure anyone knows how much of a skeleton crew is required to keep things ticking along anymore given that sort of hiring binge.

rubiquity3y ago

Great post. Thanks for the insight on how mature Twitter’s Cache infrastructure tooling is. Having done similar work at a different place where you’re exposed to the realities of racks and racks of machines I could appreciate the evolution of the stack and knocking off problems one by one.

I suspected that the hopes and fears of Twitter’s quick demise were overstated and shared[0] some of those thoughts last week, on Twitter of course.

Also hi from another infrastructure person in SD!

0 - https://twitter.com/bitsandhops/status/1593637241527578625

mkl953y ago

I have seen this happen twice at a smaller scale. Turnover is too high, quantitatively and qualitatively, to survive by hiring new people, so Musk's "solution" will probably be to outsource a lot of stuff to some consultancy firm. Twitter will almost inevitably rot within 5 years. But don't expect anything noticeable overnight.

Edit: also, the reason Musk wants to shut down all those microservices has nothing to do with money; it's that he doesn't have enough manpower to maintain them. If Twitter is like most non-FAANG companies out there, those "microservices" are coupled with each other, so good luck.

dboreham3y ago

...because it's only been a week and it takes longer for things to rot badly?

jdminhbg3y ago

I think this post is aimed at non-technical "RIP Twitter" posters from last week who are surprised it didn't fall apart in 24 hours.

npn3y ago

There was a long Twitter thread from a self-proclaimed SRE with 10+ years of experience listing a lot of things that could bring the servers down. A lot of them are pretty far-fetched and outdated.

Too bad I did not bother to save the link.

randyrand3y ago

servers don’t really rot if they don’t rely on 3rd party cloud software.

Linux doesn’t upgrade itself.

the web is very standards based.

Hacker News has pretty much been running for decades on the same software/hardware stack

kaba03y ago

I’m fairly sure HN had outages as well.

treme3y ago

turns out a CEO with an exceptional track record & success in 2 different industries is somewhat competent.

I'm pretty surprised to see the occasional redditesque "Elon is an idiot" rhetoric on hn tho.

bearmode3y ago

In spite of him, not because of him.

superdug3y ago

I will just wait for the bug that breaks the camel's back here.

There will be a necessary breaking change that there will be no support for that will cascade into the downtime the media is demanding at the moment.

textbookrental3y ago

I guess I am just a classic HN reader, so I wonder why Twitter needed so many people.

Also, I find it pretty interesting how this is playing out so far.

I'm starting to think that Mr. Enron Musk is going to pull this off.

Kye3y ago

>> "How the cache’s keep running"

It's a little thing, but it's the little things that cascade into big things that require teams of people to fix. Here, we see this person who's quite proud of his automations making the simple, but harmless error of using a possessive instead of a plural. Humans are good at reading through this sort of thing.

Computers are not. Somewhere deep in all his scripts is a misplaced ', `, or ’ waiting to break something.
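To make the point concrete, here is a small sketch using Python's standard `shlex` module, which tokenizes strings with POSIX-shell-style quoting rules. It is purely an illustration and says nothing about the author's actual scripts; the helper name `try_split` is made up. A plain ASCII apostrophe opens a quoted string that never closes, while the typographic ’ is treated as an ordinary character.

```python
import shlex


def try_split(command: str):
    """Tokenize like a POSIX shell; return the tokens or the error text."""
    try:
        return shlex.split(command)
    except ValueError as exc:
        return f"error: {exc}"


# ASCII apostrophe: starts a single-quoted string that is never closed.
ascii_apostrophe = try_split("echo cache's keep running")

# Typographic apostrophe: not a quote character, so it is just text.
curly_apostrophe = try_split("echo cache’s keep running")

print(ascii_apostrophe)
print(curly_apostrophe)
```

The two inputs look nearly identical to a human reader, but only one of them parses.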

ellen3643y ago

Linear extrapolation seems to be a common theme in the comments. People take a product they know and then essentially say “Twitter only has X times as many users, surely they only need X times as many engineers. The other employees were excess!”

I haven’t worked on Twitter-scale products. But I do wonder if it’s safe to draw a straight line between services that have middling and large scales. Things don’t necessarily work the same at the extremes.

SMAAART3y ago

> if you have a cluster of servers where serving 1000 requests a second might cost $1000, you can instead use the cache to store the responses and serve it from that cache server instead. Then you would have a small cluster for $100 and a cheap and large cache cluster of servers maybe for another $100. The numbers are just examples to illustrate the point.

Reddit needs to implement this if they want to get to the next level.
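The idea in the quoted passage, answering repeated requests from a cache instead of the expensive backend, can be sketched roughly as follows. This is a minimal in-process illustration, not how Twitter's cache clusters are built: the class name, the `render_timeline` backend, and the 60-second TTL are all assumptions for the example, and a plain dict stands in for a separate cache server.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Serve repeated reads from memory; call the backend only on a miss."""

    def __init__(self, backend: Callable[[str], str], ttl_seconds: float = 60.0):
        self._backend = backend
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)
        self.backend_calls = 0

    def get(self, key: str) -> str:
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no backend work
        # Cache miss or expired entry: do the expensive work once and store it.
        self.backend_calls += 1
        value = self._backend(key)
        self._store[key] = (now + self._ttl, value)
        return value


def render_timeline(user: str) -> str:
    """Hypothetical expensive backend call."""
    return f"timeline for {user}"


cache = ReadThroughCache(render_timeline, ttl_seconds=60.0)
responses = [cache.get("alice") for _ in range(1000)]
print(f"requests: {len(responses)}, backend calls: {cache.backend_calls}")
```

With 1000 identical requests inside the TTL, the backend is hit once and the rest are served from memory, which is why the backend cluster in the quoted example can be much smaller.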

pie314isi3y ago

So, this? https://www.youtube.com/watch?v=3ztFaYvgDAg

rvba3y ago

> When I joined the team the first project I had was to swap old machines that were being retired for new machines. There were no tools or automation to do this, I was given a spreadsheet with server names.

Writing a program to store a list of servers to be swapped instead of keeping them in a spreadsheet sounds a bit like buying a brewery when you want to drink 1 beer. Program used by a team of one sounds like over-engineering.

ojosilva3y ago

This spreadsheet was, from the OPs narrative, the master-list used by the whole operations team. An operations team that is now reduced to "ashes" - therefore not a bad idea at all to have your server inventory centralized.

AtNightWeCode3y ago

Anybody working in the biz knows about technical debt. At Twitter that will skyrocket now. My firm belief and experience tell me that keeping everything somewhat up to date over time always pays off.

The problem is not that things will break. Things break ALL the time. The problem is more about avoiding cascading effects and the time it takes to fix stuff.

bawolff3y ago

I think it would be more surprising if it did go down immediately.

Presumably in the turmoil nobody is deploying new code. Sure shit happens, but i imagine that twitter is mature enough that they aren't having weekly critical incidents, especially during times where nothing is changing.

I still think twitter is going to have some disaster, but give it a few weeks.

clayhacks3y ago

He says at the end his automation allowed him to automate more. A great positive feedback loop. I see the inverse is also true: having no automation makes automating new things harder. Without enough engineers, I think they will hit some issue and just not have the bandwidth to fix it in a reasonable amount of time.

dchia3y ago

Why do people assume that things would break so easily? I'm pretty sure Twitter (ex)team did tons of scalability testing with loads that are 10x or even 100x. Plus random 100x to 1000x spikes to account for breaking news. And all of these are done automatically, constantly, and consistently over the years.

aryehof3y ago

There will always be those who think that Twitter can be written by a single programmer in a day or so. How hard can it be for such a simple idea, after all, given that there are hundreds of tutorials producing a version in just about every language just a search away?

Keep it running? One part-timer should surely do it?

csmpltn3y ago

Why Twitter didn’t go down, you're asking. Short answer? H1Bs: https://twitter.com/elonmusk/status/1593899029531803649/phot...

new2yc3y ago

Someone please post this as a win for open source, nsidb. I don't know how

https://lichess.org/blog/Y3u1mRAAACIApBVn/settlement-reached...

Existenceblinks3y ago

Well, debugging a company in prod mode like this is adventurous. Just like we remove code to see if things break, and put it back piece by piece until it's not. That's how some of us get rid of bloat. (Yes, I'm aware of orange and apple comparison)

mesozoic3y ago

As an SRE, I know that when people aren't working, things rarely break (not never, but much more rarely).

karol3y ago

I don't understand why a mature platform would just fall down without its engineering team. There are surely a lot of small fires, but the 80/20 rule with regards to triaging should keep the lights on when properly applied.

Speaking as an experienced single tech founder.

Nokinside3y ago

More than 50% of large distributed system downtime is caused by configuration errors, failed software updates, or network attacks.

With configuration error, or failed software update everything can go down at once, or there are cascades of restarts that kill the system.

notyourday3y ago

News at 11: people who waste 90% of their work time on HN while collecting a-la FAANG salaries because they are used to bullshitting their management and getting away with it are terrified of the management that can look at the code.

tiffanyh3y ago

Fail Whale.

I think people just miss seeing the fail-whale graphic.

https://www.techopedia.com/definition/1987/fail-whale

bdcravens3y ago

I wonder if the Twitter situation may be the event that causes the industry to push back on large technical organization sizes.

I wonder how many reading this are starting to feel nervous about their own roles.

newbieuser3y ago

Considering the maintenance cost of a Twitter-scale site, it is inevitable that even a very small problem they experience will snowball into a much larger one.

b99andla3y ago

Why would twitter go down due to staff layoffs? The stuff I build would keep running for years without my involvement. Do you mean that other systems are hand cranked?

squegles3y ago

Not sure when OP left, but Twitter has 3 datacenters now.

mtejoOP3y ago

Over the summer. Yeah, there are three, but I didn't think the 3rd was really running yet. It was kind of just there with some small things dripping in.

dekhn3y ago

The article didn't address what happens when Mesos goes down. How much of the team running the underlying cluster infrastructure remains?

nokeya3y ago

So, from 7500 people only 1 (ONE) person was responsible for cache, one of the most crucial parts of the service?

octodog3y ago

Thanks for the write-up OP! Appreciate it.

totorovirus3y ago

If Twitter survives and people stop complaining after a few months, we'll see who was right.

EagnaIonat3y ago

It's already breaking down. HW/SW aside, a lot of the services had teams monitoring them, which don't exist anymore.

Some of the side effects of that:

- Trends are not working correctly.

- Copyright reporting is not working.

- Appealing flagged tweets isn't being responded to.

It's only a matter of time before these get abused with no one to fix them.

NaturalPhallacy3y ago

>- Trends are not working correctly.

"Trends" were subject to Twitter's "trends blacklist" before; something that leaked a few years ago. Maybe they're working correctly now that they're unencumbered. Can you describe how they're not working now?

>- Appealing flagged tweets isn't being responded to.

Mine was responded to in a handful of hours. Much faster than I was expecting.

>It's only a matter of time before these get abused with no one to fix them.

"the walls are closing in" - https://www.youtube.com/watch?v=rLEchPZm318

ReptileMan3y ago

And then the other big tech companies will take note. I personally know of highly compensated teams in VMware that do almost nothing.

eshork3y ago

I read ~60% of your post, and noticed it repeats, so I stopped reading. Not sure if you are aware

Giorgi3y ago

It did not go down because huge companies don't usually go down due to reddit woke culture.

npn3y ago

Let's not kid ourselves here: even if Twitter crashes and needs a few weeks or even a few months to recover, it doesn't matter. People who left will come back again, and after a year everything will be like nothing ever happened.

And even if Twitter lost 50% of its users and revenue, it is still more profitable than having to pay the salaries of 7000 people.

kaba03y ago

More profitable than what? It was and is losing money, plus now it also has some serious debt on top. With staff that actually know their shit (and a sane CEO), it may be possible to turn it profitable, but I don’t think it will happen as things currently stand.

npn3y ago

7000 people x $300k = $2.1B. And that's salary alone. The operating cost to support all those people is a lot more.

Twitter revenue last year was around $5B.
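Spelled out, the back-of-envelope numbers above look like this. The headcount, the $300k average salary, and the $5B revenue figure are the commenter's own estimates rather than verified data; the variable names are just for illustration. Payroll works out to $2.1B, or 42% of the stated revenue.

```python
# All three inputs are the estimates given in the comment above.
headcount = 7_000
avg_salary_usd = 300_000
revenue_usd = 5_000_000_000

payroll_usd = headcount * avg_salary_usd
payroll_share = payroll_usd / revenue_usd

print(f"payroll: ${payroll_usd / 1e9:.1f}B, {payroll_share:.0%} of revenue")
```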

uwagar3y ago

automation is indeed key. IRC networks run with no obvious employees or salaries and have been running fine since the dawn of the internet. i see Twitter as a centralised, walled IRC. the # was pinched from IRC if u remember.

kaba03y ago

An IRC network server can serve what, a few hundred thousand concurrent users’ text messages on a single PC.

Twitter has around 200 million daily users, peaking up to a billion, running on their own server farm. They are not remotely comparable, especially if we note that read-write dbs are notoriously hard to scale.

reacharavindh3y ago

Has anybody suggested the idea of rewriting twitter in Rust yet to Elon? Memory safe, needing fewer servers(hence less $), and the story sells itself to someone like Elon who jumps on such impulses...

^ is snark. I'm not suggesting that they should.

nigamanth3y ago

Twitter won't go down that easily. Besides, Elon Musk hasn't bought it for no reason. I'm sure that he has his plans with Twitter to implement new features and whatnot to restore it.

Till now, he's fired a lot of engineers probably because:

a) he doesn't need so many. b) he thinks he can do the job himself.

randomsearch3y ago

Is it essential for Twitter to run their own datacentres?

lolsoftware3y ago

I'm curious how/if the ostensible decline in active users is affecting things, and to what degree.

user_3y ago

> Twitter added 1.6M daily active users this past week, another all-time high

https://twitter.com/elonmusk/status/1594865247323660290

datalopers3y ago

Interesting how this summer all we heard from Musk was how Twitter was overflowing with bots and spam, but now that he runs the place he sure spends a lot of time bragging about all-time high "user" counts.

acdha3y ago

Want to bet that’s not unconnected to sacking the people who tried to block bots?

kaba03y ago

Catastrophe tourism. Also, contrary to Elon’s shitty graph with cut off y-axis (can we stop doing that?), that’s like a 1% change around a major American event.

dehrmann3y ago

Louis Rossmann said something like "there's a difference between people showing up to laugh at you vs. laugh with you."

throwaway53713y ago

is twitter going to be profitable if it can run with 10% of the workforce?

eBombzor3y ago

Was this a thing people actually believed? Other than the typical room temperature iq Twitter user.

NaturalPhallacy3y ago

All the people saying twitter is now dying remind me of "the walls closing in" on Trump, who is running for president in 2024: https://www.youtube.com/watch?v=rLEchPZm318

fragmede3y ago

The "why hasn’t the site gone down" stuff is the wrong conversation. The engineers built a train engine, and this robber baron held up the train and forced the engineers off. Inertia will keep the train going for a good long while because they did such a good job of building the train.

At some point the engine’s fluids will run out and need topping off. There's a warning system though so someone'll get notified before the engine blows up, so it's entirely possible this hardcore crew of H1Bs pulls it off. (Great job replacing the SSL cert!) At some point the train’s boiler will need retrofitting. A new team could successfully replace the boiler, but doing it while the engine keeps running just isn't an easy job.

The handful of devs and SREs at Twitter will do their damnedest to keep the train, err, site running. They might even succeed. That doesn’t prove the site was over-engineered, it means they built a damn fine train engine that successfully pulled the whole train up and over the mountain pass.

Twitter wasn't run like a lean start-up type business. Because it wasn’t one. Society needs better than that. The vulture capitalism mindset is what's wrong with this country. What's the barest minimum I can pay people to work for me? Not what can I pay them to let them live prosperous lives, but what's the bare, subsistence minimum.

America needs a middle class and these people were part of it. You don’t get a middle class by injudiciously paring things down to a not-even-skeleton crew that’s going to be worked till they burn out, only to get fired.

Twitter was only $14 million from profitability in Q2, and any idiot can randomly fire people to find that much in that large an organization, but cutting costs like that is to misunderstand the situation entirely. Just like starving yourself isn't healthy and won’t improve your self image, randomly cutting costs like that saves money, but doesn’t result in a healthy business.

fatneckbeardz3y ago

ever notice that the truly competent people can explain things in really simple language?

yuppie_scum3y ago

I hope they consider moving to Kubernetes from Mesos.

rubiquity3y ago

Why

wankle3y ago

I can't help but feel it might not be legal to post all those details.

mk_stjames3y ago

Out of all the comments here, I can't believe this is the only one that shares my sentiment. I'm not a software engineer, but in my area of engineering, if I ever revealed anything publicly as detailed as this kind of architecture so soon after departing a company (for instance, the failure modes and effects analysis framework used at a certain aerospace company, or details of the simulation models involved in the crash analysis at a certain car company), I'd expect a letter from a lawyer within a week. In most places I've worked, I don't think I could even post anything on the internet invoking the name of the company unless it was a repost of some public information the company itself had previously released. Do the vast majority of people here on HN/in Silicon Valley not have 5-year NDAs? Or do they, and they just don't care?

asutekku3y ago

Well twitter (allegedly) does not have a legal team anymore so ¯\_(ツ)_/¯

HeyLaughingBoy3y ago

What law do you think is being broken?

encryptluks23y ago

Breach of contract possibly. Usually companies don't want you revealing their internal infrastructure publicly unless you get permission. Who knows though.

randyrand3y ago

given that it’s showing Twitter in good light, it would be super silly to pursue anything IMO, even if it were illegal.

What would be a realistic fine in terms of damages (in $$)?

You could argue it's actually negative damages.

HeavyStorm3y ago

Objectively, the article proves that there was bloat since all things are automated.

tyiz3y ago

She mentioned you: https://youtu.be/g-voQsFY6SE

She was assistant to the gender health support board.

zapdrive3y ago

I know most of my comment is going to be redundant, but I need to say it in order to point to it a year or two from now as an "I told you so!"

Twitter is going to be here for a long long time. It's not going to suddenly shut down, it's not going to slowly decline, there is not going to be mass abandonment.

Elon might be cocky, but at the end of the day he is a successful businessman. He didn't just send out that loyalty pledge email out of cockiness. He really wanted to kick out everyone who doesn't believe in him and his beliefs. And he sent that email well after understanding how Twitter runs and putting his loyal people from Tesla and SpaceX in charge of operations.

The woke crowd needs to get out of denial and start coping with the fact that Twitter is not the bastion of unopposed woke ideology anymore.

kaba03y ago

He doesn’t understand shit, you don’t fire that many people before even having as crude of an understanding as was seen on that whiteboard. Also, people working on embedded systems for rockets know jack-shit about a system serving petabytes of data.

faizmokhtar3y ago

What kind of woke ideology exactly? Sorry, I'm not familiar with the current climate in the USA.

acknorabotr3y ago

It's a form of progressive ideology that is largely based on symbolic actions and relies on mechanisms like shame and shunning, struggle sessions, etc.

So instead of actually doing any sort of real tangible action to solve problems, you mostly spend your time denouncing your enemies, including the ones in your own ranks that are not pure enough.

It's very much associated with the left-wing parties in many Western countries, who have for the most part traded their historical defense of working classes for this new brand of performative progressive politics.

linuxhansl3y ago

Definition of woke: alert to injustice in society, especially racism

Somehow, some folks are now saying this is bad.

fractalb3y ago

Twitter will soon experience the negative effect of the mass layoffs and die slowly. If it survives, please don't ascribe it to Elon's management skills. Don't forget the employees who are stretching themselves there to rescue Twitter.
