The biggest thing that brings down a site is changes. Typically code changes, but also schema/data changes, infra/network/config changes, etc. As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine. The trick is to design it to be as immutable and simple as possible.
There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.
The last thing off the top of my head that will absolutely bring a site down over time is expired certs. If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), it will expire, and it will be a very fun day at the office. Likewise, over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load the site.
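To make the cert point concrete, here is a minimal sketch of the kind of check that catches a forgotten renewal before the fun day arrives. It assumes you already have the certificate's `notAfter` string in the format Python's `ssl` module reports; the function names and the 30-day warning window are made up for illustration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a cert expires.

    `not_after` is a timestamp string such as "Jan  5 09:34:43 2018 GMT",
    the format used in the dict returned by SSLSocket.getpeercert().
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since epoch, UTC
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True once the cert is inside the (hypothetical) warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run something like this from monitoring for every cert you own, including the internal ones nobody remembers, and alert long before the number reaches zero.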
You're totally right; if you don't make changes to the software, it's unlikely to spontaneously stop working, especially after that first 6-12 months of "hardening" where bugs are found and patched.
Many people working in tech have never been exposed to a piece of software which isn't being constantly changed in small increments and forced upon end users. People are assuming that software is inherently unstable simply because they never use anything that isn't a "cloud service".
This probably comes off as "old man yells at cloud" but I'm not trying to bash cloud here. The cloud/SaaS approach has a ton of advantages for both consumers and businesses. But the average tech person in their 20s vastly underestimates how stable software can be when you aren't constantly pushing new features.
For a social media / user-generated content application, the macro storage concerns are a lot more important than the micro ones. By this I mean, care more about overall fleet-wide capacity for product DBs and media storage, instead of caring about a single server filling up its disk with logs.
With UGC applications, product data just grows and grows, forever, never shrinking. Even if the app becomes less popular over time, the data set will still keep growing -- just more slowly than before.
Even if your database infrastructure has fully automated sharding, with bare metal hosting you still need to keep doing capacity planning and acquiring new database hardware. If no one is doing this, it's game over, there's simply nowhere to store new tweets (or new photos, or whichever infra tier runs out of hardware first...)
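A rough sketch of the arithmetic behind that kind of capacity planning, assuming growth is roughly linear over the planning window (real growth rarely is). All the numbers and names here are hypothetical.

```python
def days_until_full(capacity_tb: float, used_tb: float,
                    growth_tb_per_day: float) -> float:
    """Days of headroom left at the current growth rate."""
    if growth_tb_per_day <= 0:
        return float("inf")  # not growing, so never fills under this model
    return max(capacity_tb - used_tb, 0.0) / growth_tb_per_day


def must_order_now(capacity_tb: float, used_tb: float,
                   growth_tb_per_day: float, lead_time_days: float) -> bool:
    """True if hardware ordered today would arrive at or after the full date."""
    return days_until_full(capacity_tb, used_tb, growth_tb_per_day) <= lead_time_days


# Hypothetical tier: 1000 TB total, 700 TB used, growing 2 TB per day.
print(days_until_full(1000, 700, 2))                      # 150.0
print(must_order_now(1000, 700, 2, lead_time_days=180))   # True
```

The point is the lead time: with bare metal, someone has to notice the order deadline months before the disks actually fill.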
Staffing problems in other eng areas can exacerbate this. For example, if automated bot detection becomes inadequate, bot posting volume goes way up and takes up an increasing amount of storage space.
From Casey Newton's newsletter today:
In early December, a number of Twitter’s security certificates are set to expire — particularly those that power various back-end functions of the site. (“Certs,” as they are usually called, serve to reassure users that the website they are visiting is authentic. Without proper certs, a modern web browser will refuse to establish the connection or warn users not to visit the site). Failure to renew these certs could make Twitter inaccessible for most users for some period of time.
We’re told by some members of Twitter’s engineering team that the people responsible for renewing these certs have largely resigned — raising concerns that Twitter’s site could go down without the people on hand to bring it back. Others have told us that the renewal process is largely automated, and such a failure is highly unlikely. But the issue keeps coming up in conversations we have with current and former employees.
In my experience as a data engineer, unusual states are one of the leading causes of issues, at least after something is built for the first time. You can spend half a year running into weird corner cases like "this thing we assumed had to always be a number apparently can arbitrarily get filled in with a string, now everything is broken."
Also, conditions changing causing code changes is the norm, not the exception, definitely in the beginning but also often later. Most services aren't written and done - they evolve as user needs evolve and the world evolves.
Aren't these changes inevitable, though? There is no such thing as bug free code.
Another thing that forces consistent code changes is compliance: any time a 0-day is discovered or some library we're using comes out with a critical fix, we would have to go update things that sometimes hadn't been touched in years.
At my last job, I spent a significant amount of time just re-learning how to update and deploy services that somebody who left the company years ago wrote, usually with little-to-no documentation. And yes, things broke when we would deploy the service anew, but we were beholden to government agencies to make the changes or else lose our certifications to do business with them.
Eventually, Twitter will have to push code changes, if only to patch security vulnerabilities. Just waiting for another Heartbleed to come around...
Simple example: you have a DB table with an auto-incrementing primary key. You chose a small integer type for the key, and for years this worked fine. Then you finally saturate that integer type and can no longer insert rows into the table. Now imagine this has cascading effects in other systems that depend on this database indirectly, and you end up with an "outage".
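A small sketch of how you might watch for this ahead of time. It assumes signed 16/32/64-bit key columns, which is what `smallint`, `integer`, and `bigint` mean in PostgreSQL; the function names are made up, and the current max id would come from your own query.

```python
# Largest value each signed integer column type can hold (assumed widths).
SIGNED_MAX = {
    "smallint": 2**15 - 1,
    "integer": 2**31 - 1,
    "bigint": 2**63 - 1,
}


def remaining_ids(current_max_id: int, column_type: str) -> int:
    """How many more rows can be inserted before the key type saturates."""
    return SIGNED_MAX[column_type] - current_max_id


def fraction_used(current_max_id: int, column_type: str) -> float:
    """Share of the key space already consumed, from 0.0 to 1.0."""
    return current_max_id / SIGNED_MAX[column_type]


# Hypothetical table whose 32-bit key has reached two billion.
print(remaining_ids(2_000_000_000, "integer"))  # 147483647
```

Alerting when `fraction_used` crosses some threshold gives you time to migrate the column before inserts start failing.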
Absolutely agreed. In that vein, there is such a thing as too much automation. Sometimes, build chains are set up to always pull in the newest and the freshest -- and given the staggering number of dependencies software generally has, this might mean, small changes all the time. Even when your code does not change, it can eventually break.
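One way to see how exposed a build is to this: list the dependencies that are not pinned to an exact version. This is a simplified sketch assuming a pip-style requirements file where an exact pin uses `==`; it ignores option lines, extras, and hashes, and the function name is made up.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            loose.append(line)
    return loose


example = """\
requests==2.31.0
flask>=2.0
# build tooling
numpy
"""
print(unpinned_requirements(example))  # ['flask>=2.0', 'numpy']
```

Anything on that list can change underneath you on the next build, even when your own code has not changed.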
It's been my experience that a notable part of software development (in the cloud age, anyway) is about keeping up with all the small incremental changes. It takes bodies to keep up with this churn, bodies which twitter now does not have.
It'll be interesting to keep observing this. So far it's been a testament to the teams that built it and set up the infra -- it keeps running, despite a monkey loose in a server room. It's very impressive.
* Power outages and general acts of God
* Resource utilization
How do your databases perform when their CPUs are near capacity? Or disks? Or I/O? I've seen Postgres do some "weird s%$#": where query times don't go exponential but they go hockey stick.
* Fan-out and fan-in
These can peg CPU, RAM, I/O. Peg any one of these and you're in trouble. Even run close to capacity for any one of these and you're liable to experience heisenbugs. Troublesome fan-out and fan-in can sometimes be a result of...
* Unintended consequences
The engineering decision made months or years ago may have been perfectly legitimate for the knowledge available at the time. However, we live in a volatile, uncertain, complex, and ambiguous (VUCA) world; conditions change. If your inputs deviate qualitatively or quantitatively significantly, you risk resource utilization issues or, possibly, good ol' fashioned application errors.
"No battle plan survives contact with the enemy." -- commonly attributed to Helmuth von Moltke the Elder
Same with software systems. They're living entities that can only maintain homeostasis so long as their environment remains predictable within norms. Deviate far enough from that and boom.
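On the resource utilization point, the "hockey stick" shape is what basic queueing theory predicts. A minimal sketch using the textbook M/M/1 model, where mean response time is service time divided by (1 - utilization). This is an illustration under that model's assumptions, not a description of how Postgres behaves internally, and the 10 ms service time is made up.

```python
def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


# Same 10 ms query at increasing load: flat for a long time, then a wall.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{rho:.0%} busy -> {mm1_response_time(10, rho):.0f} ms")
```

Going from 50% to 80% busy costs tens of milliseconds; going from 95% to 99% costs hundreds. That is why running any one resource close to capacity is where the strange behavior starts.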
You can't do this without a lot of people. Sure you could pare it down, maybe improve some architecture, but without a ton of people involved who understand the systems and how they connect, when things might go south they may never return.
So yeah, I totally agree with you. No code changes = long life.
I agree with your assessment, but I do want to highlight that this condition is not rare for Twitter. Load is very spiky, sometimes during predictable periods (e.g., the World Cup, New Year's Eve) and sometimes during unpredictable periods (e.g., Queen Elizabeth II's death, the January 6th US Capitol attack). It isn't going to cause a total site failure (anymore), but it can degrade user experience in subtle or not-so-subtle ways.
An aside on the "anymore", there was a time when the entire site did go down due to high-traffic events. A lot of the complication in the infrastructure was built to add resiliency and scalability to the backend services to allow Twitter to handle these events more gracefully. That resiliency is going to help keep the services up even if maintenance is understaffed and behind a learning curve.
At least for expired certs, most people have learned the hard way just how bad that is, and either implemented automated renewal (thank heavens for cert-manager, LetsEncrypt, AWS ACM and friends) or, where that doesn't work (MS AD...), monitoring.
Now, those change freezes even extended to preventative maintenance: one of the dual PSUs in a core switch went bad and we couldn't get an exception to replace it... for 6 months. We got an exception when the second one went down and we had to move a few connections to its still-alive mate.
Well, Elon is talking about a massive amount of changes coming down the pipe, so I guess we'll see how that goes!
I was partly expecting the rest of the article to explain to me why exactly it wasn't just bloat. But it goes on talking about this 1~3-person cache SRE team that built solid infra automation that's really resilient to both hardware and software failures. If anything, the article might actually persuade me that it was all bloat.
First of all, how does it persuade you of that? The article touches a really small (though incredibly important for up-time) subject.
Secondly, in any large company, the majority is 'bloat'. It's security engineers, code reviews, data architecture, HR, internal audit teams, content moderators, scrum masters, and I can keep going. In a start-up many of these roles can be ignored, because growth > stability. In a large organization, part of the bloat helps ensure a certain amount of stability that's necessary to keep an organization alive.
If a product is mature enough, like Twitter seems to be, removing engineers won't instantly crash the product. It'll happen slowly. Bugs will creep in, because less time is spent on review and overall architecture. Security issues will creep in for much the same reasons, plus less oversight. Then, once this causes enough issues for the product to actually crash, the right people to fix it quickly might not be there anymore. That's when fixing the issues suddenly takes a lot more time.
If the current state of affairs at Twitter keeps up, it'll probably be a slow descent into chaos. Especially with Elon pushing for new features to be implemented quickly, inevitably by people who cannot fully understand the implications of said features, because 80% of knowledge is missing.
It takes _effort_ to make it work this smoothly now, _and in the future_.
SRE is about _preventing_ issues. Not mopping up after them.
To me, the article read like every successful sysadmin story: there are no fires, so sysadmin must be bloat.
1) HR 2) Legal 3) Sales 4) Marketing 5) Payroll 6) Admin staff 7) Most of Engineering, other than the bare minimum of L1/L2/3 support.
As someone paraphrased, a car without brakes and a steering wheel works just fine until you hit the first bend.
If Twitter survives this without any major harm it will have profound consequences for the whole software industry.
All just to put caching in front of services that actually do anything.
Same here. I guess his header was on point in why Twitter is still up; but I was also interested in hearing about why Twitter actually needs all those people. If it can be run with 50-80% of the staff gone, that does sound like some bloat at least.
The article makes the point that the reason Twitter is running OK on 20% of its personnel at this moment is exactly because it was built to be resilient, not because the personnel was bloated. A large part of this so-called bloat, the 80%, is the reason Twitter is running right now. Calling this bloat implies it is actually not important for Twitter to be available all the time (or at all).
Not for me
This is almost exactly like the new manager coming in, noticing that the floors and surfaces are all clean, all the systems work, the trash is emptied, etc., and so deciding that the entire maintenance staff is unnecessary and firing them.
The place doesn't become a decrepit pigsty the next morning; it slowly degrades.
Same for these systems. They were designed, built, tuned, and maintained over the course of years to go from requiring constant manual intervention to running largely unattended and with a good buffer of ready hardware and automatic failover for failures. That "largely" in "largely unattended" is doing some very heavy lifting.
The system WILL require human intervention to keep running, and more than just a skeleton crew. The only question is whether it will happen before the new crew gets up to speed to handle the inevitable degradation.
This does NOT mean that the SREs were bloat - it means that they were doing an excellent job and could safely take a break. We're now in just the two-week vacation zone - same as if the entire SRE team went on a holiday. We'd expect it to work. Now let's see what happens in two months.
The engineer was doing stability planning 6 months out for the purpose of cost optimization. I guess we can assume that the cost of infrastructure is about to go up and reliability is about to go down in the coming months.
It's become an adult daycare, https://twitter.com/DavidSacks/status/1561096423243800576
Twitter's layoffs followed by that 1AM photo of hackers at work is terrifying to lifestyle employees.
It's the Return of the Nerds.
Google, Meta, Netflix, Microsoft are all watching.
... for the Cache component. There are many others.
Big tech maintains talent so that they won't use their knowledge of the system to produce an identical competitor without the technical debt or investor liability of the original.
The only concern I'd have is that by having so many people, your design probably comes to rely on them, whereas a smaller team would be forced to make the system easier to maintain.
Personally, if I were Elon, I’d build an entirely new backend and point the clients to that rather than trying to incrementally improve what they have.
Get 50-100 10x engineers that are loyal to Elon, with big equity stakes, and crush it
That he’s not going to realise these totally obvious first order consequences people are raising seems unlikely.
I'm not a friend of Elon's, but outside of the flashiness of the whole thing, I don't think his firing spree was wholly unwarranted.
The other day I saw a video of a bunch of people leaving Twitter who had been there for a decade or so. I mean, holy crap, this reminds me of old German industry, where people retire in the place they started.
Also new owner: "What even is Mesos? Why are we running something called Aurora? Obviously pure bloat. Fire the lot of them."
In terms of changes to the platform, ditto. These changes are not so difficult that a team of hundreds of devs, even ones who are not 100% aware of what's already there, can't figure them out. I've taken over systems that I knew very little about and that were pretty big (not as big as Twitter), and I managed for years to make changes without drastically breaking stuff. In any event, if they do break stuff they will be able to fix what they have broken.
No, the real failure here is the massive debt burden and the fact that there is no way that twitter can ever service that amount of debt. Note that before EM took over it was ticking along with a relatively small loss. If they had cut headcount by maybe 10% they would have been break even easily. There is no way that's possible with $4million of interest per day. They have to radically change the way they monetise the platform to get to that level. I don't think they will ever get there and Musk will sell off at a bargain basement price at some point in the future to pay back the debt.
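For scale, here is the daily interest figure from the comment above turned into an annual number. The only input is the $4 million per day quoted there; nothing else is assumed.

```python
DAILY_INTEREST_USD = 4_000_000  # figure quoted in the comment above
DAYS_PER_YEAR = 365


def annual_interest(daily_usd: int, days: int = DAYS_PER_YEAR) -> int:
    """Yearly interest cost implied by a flat daily figure."""
    return daily_usd * days


print(f"${annual_interest(DAILY_INTEREST_USD):,} per year")  # $1,460,000,000 per year
```

That is roughly $1.46 billion a year that has to come out of operations before the company breaks even.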
I’m not an expert in math but it seems pretty possible.
Plus they’ll have a lot more impressions to sell now that people are allowed to speak. Ad rates might drop but then someone will write an article about how they are getting great CPC on Twitter and everything will be back to normal after the blue checks have their sob fest.
Sure, you can run the platform with 1/10 headcount with significantly degraded user experiences (say ~98%). This is not a problem for startups, but people usually have higher expectations for established companies. As always, the last 2% is a hard problem, and businesses don't really want to deal with such an unreliable platform. You wanna onboard big advertisers that potentially spend $100M ARR? Then you need to assign a dedicated account manager to handle all customer escalations. PMs then triage and plan their feature requests, and later engineers implement them. It all adds up.
And they also use your competitors' products, like Google, FB, TikTok, etc. Twitter is a severe underdog here, so you need to support at least a minimal, essential subset of the features in those products to convince them to spend their money on Twitter. That alone takes hundreds of engineers, data scientists, and PMs, thanks to modern ad-serving stacks with massive complexity.
Yeah, it ultimately boils down to a simple fact: it's really hard to take other folks' money. You need to earn their trust first. They want to see that your product can keep up with the modern standard of digital ad serving, now and for the foreseeable future. Twitter has spent a lot of time earning that trust, and the original post is one piece of evidence of such efforts. And this usually needs more manpower. You might be able to do that in a more efficient manner, but I don't think it's as simple as firing 75% of your entire headcount.
Now the systems are stable, but human workers will eventually get sick, leave, or die.
Raising pay has diminishing returns. You can't prevent workers from losing interest and leaving, getting sick, or dying by throwing more money at them.
The article describes achieving stability through a distributed system, so that the unexpected death of one rack doesn't affect service availability. The same can be done for human workers who unexpectedly stop working. Having multiple workers doing the same things improves stability.
Sure, it's inefficient in terms of money. But the alternative is that one important employee catches COVID-19 and dies, and the knowledge of the system is lost. Documentation doesn't solve it, because you want manual operation available right now rather than a few months later, once the replacement workers have learned from the documents.
It's largely focussed on the event stream behind the core service and data analytics. There's maybe one entry on the main data store and one on search over the last few years.
They had wine on tap.
https://www.tiktok.com/@realpankhilpatel/video/7159187292631...
Normal people don't have vacations like that.
There were multiple executives making $10m/yr+
There were board members
There were shareholders
Why did all of them not stop this headcount increase if it's as easily reduced as "too much headcount bad, smaller headcount good"? These are paid professionals who are supposedly wealthy, good at their jobs, smart, informed, etc.
How can we commenters on Hacker News sit in our armchairs and say "ah, goofballs should've just not let headcount get so high!"
These qualified people thought at the time it was a good idea to get up to 7.5k people. How were they all wrong?
https://nitter.lacontrevoie.fr/libsoftiktok/status/158539526...
Elon thinks he knows what he’s doing, but what he is going to be left with are people who are willing to work hard by his standards, but not necessarily smart.
The simple truth is Elon knows nothing about the actual work involved in tech. He knows words or elicits help from others on what to say that sounds like tech speak (RPCs!), but when it comes to being truly knowledgeable in this space, he is losing his most valuable assets because of his amazingly poor managerial and ownership style.
I know there are a lot of Elon fans on this site, and will disagree with all of this; but his abilities have not at all been proven. Yes, he knows how to spend money to claim credit for technical advances, but until he actually has his hands dirty in the muck of the hard work of tech, he will always be a glorified self-promoter with no substance.
And Twitter will suffer for it.
Kevin Watson, who developed the avionics for Falcon 9 and Dragon and previously managed the Advanced Computer Systems and Technologies Group within the Autonomous Systems Division at NASA's Jet Propulsion Laboratory: "Elon is brilliant. He’s involved in just about everything. He understands everything. If he asks you a question, you learn very quickly not to go give him a gut reaction.
He wants answers that get down to the fundamental laws of physics. One thing he understands really well is the physics of the rockets. He understands that like nobody else. The stuff I have seen him do in his head is crazy.
He can get in discussions about flying a satellite and whether we can make the right orbit and deliver Dragon at the same time and solve all these equations in real time. It’s amazing to watch the amount of knowledge he has accumulated over the years."
And if you're going to claim that his successes have been due to the people surrounding him who actually know what they are doing, then all that tells me is that you are acknowledging that he knows how to surround himself with people who know what they are doing.
We're not fans (I'm certainly not), but it takes a special kind of mind to look at Musk's track record of successes and conclude that his latest project is doomed.
And he hates bloated inefficient teams. His decrees on meetings are infamous. Tripling the team at Twitter implies a lot of internal politics, fiefdoms, communication overhead, and generally a lot of headless chickens running around. There's no nice way to fix such a team. A sledge hammer is one way to fix it and obviously he likes getting results quickly.
So, the notion of laying off most of that team was a foregone conclusion. The notion that a lot of the better people would get upset about that and leave as well is also highly predictable. What's left is a team with some gaps but also a lot of breathing room. And he can always lure key people back in by throwing money at them.
Simple plan. It might actually work. At the cost of a bit of drama, temporary instability, and lots of free publicity. Exactly his style. Cringe worthy and effective. I can see the logic here.
It's not about being a fan or not, it's that you're not actually providing any real insight other than signalling how smart you are.
Ever watch someone do CAD/CAM modelling? They need extreme precision of input that AR sausage fingers just aren't going to help with. You need a num-pad and a good mouse with a stepped click wheel.
I get it. We share some similar neurodivergence traits. He wants to be right in the detail. Constantly jumping from interest to interest, seeing the hidden patterns and connections that aren't apparent to others. But there are times when I know I just need to shut up and let someone more experienced talk, despite my brain wanting to lead every discussion right into solution mode, or providing-additional-context mode.
I've spent the last 6 years in management consulting (without formal business education), I agree with him when he says MBAs are useless. We know that the best solutions come from diverse teams with diverse backgrounds, skills and knowledge. Not 5 clones who know how to build value driver trees, not to say the tools they bring aren't useful, but they can be incredibly limiting.
For someone who hates MBAs he's sure going about this take-over like someone who barely passed one (i.e. knows more than enough to be dangerous). Sure, you're hemorrhaging money in operations. You need to cut costs and find new revenue streams.
What are your biggest costs?
Labor. Slash / Burn. The old McKinsey 7% FTE reduction will give you some extra operating cash from the year's remaining budget, and you know it's not so much that people (in fear for their jobs) won't just pick up the slack to keep everything moving. Do it quick, because you need to rip the band-aid off and get rid of all that accrued leave, restricted cash, etc. off your books too.
Equipment. Redundancy? Sounds like unused resources we can fire sale.
Contracts. Renegotiate? The only two meaningful levers are price and quantity. Start cutting quantity now, renegotiate price later.
This is all dummies'-guide stuff, and it tends to go terribly in reality when implemented all at once.
For instance, research has shown that companies that lay off staff when under pressure end up underperforming the ones that chose not to.
Now who's going to help build and operate those new revenue streams?
Quick fixes for a quick buck and a whole lot of extra risk.
You can screenshot this: Musk will cut down costs, make Twitter profitable, and take it public again when markets are better placed.
The markets will give the newly listed Twitter the same "Musk-boost" as Tesla and ramp up the valuation to $100B.
You can get rid of 80% of the work force and the existing homeostasis systems will keep things running smoothly despite known day-to-day chaos.
Where you’re really going to run into trouble is inventing responses to novel chaos and gradually changing times.
The bigger a ship is, the slower it is to turn.
IBM is a "tech" company that employs 282,000 employees, and when was the last time they invented something? I don't remember the last time I heard IBM in the news about something they made.
The bigger the company, the less innovation and the more administration and bureaucracy you often find.
The reason startups can survive is their small size, which makes them very flexible and adaptable to chaos and change. That gives them the edge over bigger companies.
In modern software environments, the entropy is almost violent -- the changes in all the constituent dependencies are constant and relentless. Something frozen in time does not stand a chance, unless it's entirely stand-alone and dependency-free -- an unlikely scenario with a service of Twitter's size.
What did you make of Mudge's report regarding resiliency of data-centers?
> Insufficient data center redundancy, 59 without a plan to cold-boot or recover from even minor overlapping data center failure, raising the risk of a brief outage to that of a catastrophic and existential risk for Twitter's survival.
- https://techpolicy.press/wp-content/uploads/2022/08/whistleb...
All 10 of the "top trends" for me were about how twitter was dead/dying and armchair experts on reddit said it was only a matter of hours at this point.
Maybe it was close (in reference to external factors not implied in your very insightful post!) but it's amazing how confident people are with opinions on events they have 0 insight into. Everyone knows how to solve the war in Ukraine or world hunger, but how rarely is the "consensus" (in terms of up-votes or popularity) right... just something that got me thinking. Thanks again for this article. I always love seeing details of tech ops!
My theory is that the economic problems have been stressing the systems and that makes the "bones" more apparent. The core concept of this country is checks and balances of two opposing sides. I think it was a great improvement over authoritarianism, but you can see that it is definitely not the ideal final paradigm.
1) The site goes down [you look like you can't do your job]
2) The site stays up [you can do your job] but the lay off looks like a good decision.
You are screwed either way.
https://mobile.twitter.com/elonmusk/status/15945006557246095... (screenshot: https://raw.githubusercontent.com/aboxwithrocksinit/test-buc...)
If this is the new town square, you can forward my mail to a cabin in the woods.
Maybe the guy is really losing it
1. Generally, with large, complex systems like this, everything works, until it doesn't. All the big boys have major outages periodically. I just can't fathom how Twitter is going to handle the eventual certainty of a major outage when, as the author notes, in some cases there are teams that have 0 people left.
2. More than the technical issues, betting that Twitter will go bankrupt is the easiest bet one can make. Musk saddled Twitter with a shit ton of debt - even if things worked as they did before he had to cut tons of people due to the debt burden.
The issue I see is that #2 directly works against #1. Musk has said it will be lots of intense work adding new features to try to raise revenue. But making a ton of changes, probably with lots of shortcuts to get them out the door quickly, especially when so much institutional knowledge has walked out the door, will make keeping the site stable even that much harder.
And of course people in 'important' roles who've been laid off are going to say the company is doomed; they are probably the worst source to go off. They're not going to say "oh yeah I didn't do much at all really, just bossed people around and spent my budget every year."
Now just because they're "in tech" doesn't mean they have any idea about Twitter, but they should at least know enough to know they don't know what's going to happen. Obviously they're not actually using their brains when posting comments like that. Point is, a lot of people opposed to Musk have been participating in a spiraling echo chamber of fairy tales and wishful thinking. It's not just journalists (although clearly they're printing lies with ulterior motives too, as usual).
Not everyone has the same knowledge and skills. Not everything is documented. Not everything in the documentation is current and correct. (especially after recent changes) It's not even that they won't stay up, but depending on who left the company, the oncall response may look radically different and have different time to resolution.
I don't think it's that obvious. It's one thing to just leave everything running, but Elon was also talking about changes he was making (e.g. turning off a bunch of "microservices" because they don't do anything). If you turn off the wrong thing and don't have anybody left who knows how to properly turn it back on again, then you're in a pretty bad situation. It doesn't seem like that happened, but I also don't think we have enough information to say how close it was to happening.
Then I saw how bitter and nasty a lot of the online communities I belonged to were. I mentioned that advertisers were of course mad, since the stated goal is to reduce reliance on ad revenue. I was met with people attacking me as an idiot, with clickbait articles as “proof” that ad models were irrelevant to Twitter's problems etc., as if I was doing anything other than quoting Musk's official reasons for the purchase. Suddenly Twitter was this great beacon of graceful discussion and Musk has ruined it.
People just love to be mad, no one had anything nice to say about Twitter and in a flick of a switch they’re holding completely opposite opinions. We’ve always been at war with Eurasia, it seems
I disagree; math would be a closer analogy. And indeed, arithmetic still works like it did a millennium ago. Closer to the present, I have binaries from the late 80s that still work today (and I use them semi-regularly.)
Indeed, much of the impetus of the software industry seems to be to propagate the illusion that software somehow needs constant "maintenance" and change. For the preservation of their own self-interests, of course; much like the company that makes physical objects too robust and runs out of customers, planned obsolescence and the desire to change things and justify it so they can be paid to do something are still there.
It's possible to make things which last. Unfortunately, much of the time, other economic considerations preclude that.
Things do still need to be fixed of course.
This seems to be more true of Mastodon than Twitter.
I can't imagine any self hosted Mastodon instance staying up longer than twitter.
- Twitter is going to go down tomorrow and it's all over. RIP.
The second is, - Twitter is going to experience a failure cascade over time.
The third is, - It's all going to be fine.
I suspect that the real question is, how many individual wires can break before the cable holding the suspended platform snaps?
I am not that good of a developer, but watching Twitter I can't help but be reminded of Arecibo, except at a larger, more abstract scale. There was no single massive event that caused the failure, but rather a series of factors and events: tiny cables breaking that eventually led to a failure cascade, which then caused the suspended platform to crash.
From what I can tell, in the past week or so,
- Twitter's copyright system failed
- Two Factor Authentication broke down (it seems to be back up?)
- (anecdata) Tweets have been loading sporadically for me and other people, sometimes we try to open a tweet and it says that it doesn't exist. Happens more frequently with new/recent tweets.
- (unconfirmed) Twitter's managed account backend is behaving "strangely." E.g., "One of my campaign managers logged in last week and found all our paused creatives from the past 6 years had been reactivated." from https://www.teamblind.com/post/i-told-my-team-to-pause-our-750kmonth-twitter-ads-budget-last-week-4dnbo1Ft (friends have told me other similar stories)
Are these failures symptomatic of a larger problem, or are they well-isolated parts misbehaving? Can Twitter even experience a failure cascade like Arecibo? Can that be paused/stopped?
I am asking this question because I don't know. And I'd like to develop a better mental model to understand what happens next.
Without a reliable twitter systems status history pre-acquisition, the reports of failures, like the issues with the 2FA system, don't mean a whole lot.
I put "wrote" in quotes because what I had actually done was install Apache and PHP Nuke on a Windows NT server and then modified an existing form page to do what I wanted.
I wrote that application in 2002 and never had a problem with it. I never restarted Apache or the server or did any maintenance. I didn't even upgrade Windows NT if I'm being honest. Windows NT became unsupported in something like 2006 and I left the company in 2010.
I received a call in 2018 from the last person still working there who knew who I was. It had finally fallen over and they wanted to know if I could help. That was the first time it ever had an issue.
The world cup is on and the site didn't collapse.
That is a huge win and while I don't believe Twitter will ever be profitable, I think Elon will be feeling rather smug right now with his skeleton staff.
Ground floor is exactly where you don't want to be when the structure collapses under its own weight.
Rather than bring in additional specialists to reinforce it, Musk has partially evacuated the building at least.
Mesos is dead. So you need in-house expertise to patch it without being able to leverage community knowledge.
Does Twitter retain enough people to manage Mesos?
Also, the article seems to suggest Twitter only has two datacenters. That seems surprising for the global reach of the company. Perhaps there are other smaller datacenters that are not prepared to handle the entirety of the site’s traffic.
My current thinking is there’s time to figure out how to operate the current system before it runs into issues that would render it degraded for a prolonged period of time. I noticed TLS certs have already rotated, for instance. That was my best guess for a simple thing that could fail if managed poorly.
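For anyone who wants to watch cert rotation from the outside, here is a minimal sketch using only the Python standard library. The function names `fetch_not_after` and `days_until_expiry` are made up for illustration, and this says nothing about how Twitter monitors its own certificates.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' field of the certificate a server presents.

    Hypothetical helper: opens a verified TLS connection and reads the
    peer certificate. With the default context the handshake itself fails
    if the certificate has already expired.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a 'notAfter' string into days remaining from `now`.

    `now` is seconds since the epoch; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400
```

Something like `days_until_expiry(fetch_not_after("example.com"))` run from a daily cron job, alerting below some threshold, is the kind of boring check that catches the "somebody forgot to regen it" case.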
If you don't know what a rack is, how are you meant to know what the order-of-scaling function means? That's a highly computer-science-specific notation, and if you grok O(n) you know what a rack, a host, and a DC are.
I really don’t understand why so many tech companies have like 8 layers of engineering levels. If the argument is that you need more money so more levels, just have a bigger band. Don’t chase titles they don’t mean shit. Not to mention the management stacks that seem to just hang out in meetings and take pvt. I haven’t worked at a proper startup but I’ve been on projects where a dozen or so people rebuilt apps used by tens of millions of people in a few months, or launched completely new applications for bigger companies.
Now that I work in a tech company (not big tech, but still a multibillion dollar corp) I’ve noticed that since IPO we have added a ton of bureaucracy whereas back in the day we were small teams building completely new and at times complex features. Literally was in a meeting earlier this month where people were patting each other on the back because we added a single attribute to a table. I’m obviously reducing the entire initiative to a small thing but it kind of explains all we had to do. It’s soul crushing but with the economy the way it is I must deal with it. Hell I’m down to show up next Monday and work for Elon if he wants a go getter. At least I’ll get to DO stuff. Or any other startup in SF if they are hiring.
Sure engineering adding a new widget to the site might increase profits by 12%, but all that bureaucracy can prevent the company losing its payment provider or breaching a government regulation which might cause the company to close overnight. So if the stakes are +12% vs -100%, who is actually doing the most important work?
I don't disagree with your desire to work somewhere lean and task-focused, but I think it's almost impossible for that to happen anywhere but small(er) workplaces.
The core problem is that this approach doesn't really scale out, because communication overhead grows quadratically with the org's size if it's left untamed. The feasible options are:
1. Let the complexity bring chaos across the org
2. One decision maker rules them all
3. Give some sort of management structure to the org
Your proposal is somewhere between options 1 and 2. Option 1 works pretty well for smaller orgs, and it might scale to quite a sizable business if the members are generally competent, so org-wide trust can be well established. But you'll hit a roadblock eventually, since people cannot spend all of their time on communication overhead. Option 2 just moves the burden of the entire complexity onto a single person, so it's not really a reproducible solution but more a matter of luck.
Hence option 3 is the only remaining option for regular orgs. Many smart people have tried to figure out the best structure (or at least best practices), but unfortunately we don't have a definitive answer yet. Google-style "tech levels" are one of the tools for reducing communication overhead by setting a common structure for expectations (e.g. "we have 1 L6 and 3 L5s to take that project" is generally more convincing than lengthy explanations of your team members). It's not ideal, but it somewhat works, so it's adopted.
You're likely right that you'll be much more productive if you can get rid of those bureaucracies, but getting other folks convinced is a completely different story. Trust takes time to propagate and people have a limited time to spend on it. This obviously could be drastically simplified if you can work with Elon (or similar style leaders) directly but his time is extremely limited so there will always be only a small number of people who can enjoy that privilege...
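To put rough numbers on the "quadratic growth" point above, here is a small sketch. It assumes the simplest possible model, where every pair of people is one potential communication channel (n(n-1)/2); real orgs are messier, so treat this as an upper-bound illustration only.

```python
def pairwise_channels(people: int) -> int:
    """Number of distinct person-to-person channels among `people` members.

    Assumption: every pair can talk directly, giving n * (n - 1) / 2.
    """
    return people * (people - 1) // 2


for size in (5, 50, 500):
    # Headcount grows 10x per step; channels grow roughly 100x per step.
    print(f"{size:>4} people -> {pairwise_channels(size):>7} channels")
```

A dozen-person team has 66 possible pairs, which is why option 1 feels fine at that size and stops feeling fine somewhere in the hundreds.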
Someone earning a lot more than you at the same nominal position leads to a lot of resentment: the perception is that you are clearly wronged here. On the other hand a rank system makes this less objectionable and offers at least some roadmap to a similar income. "Oh, she's SWE L9000 naturally she'd earn that"
This doesn't account for the extra overhead associated with extra DCs, but it seems like there are opportunities for major efficiency wins.
https://www.datacenterdynamics.com/en/news/report-elon-musk-...
Musk has a point, but it feels more like he's doing this out of financial desperation.
I'll bet it starts going down a bit more often, nothing too severe. Bigger issue might be inability to roll out new features, especially if they dig a horrible tech debt hole. Maybe someone has more details, I haven't kept up, but it seems like the $8 verification rollout got botched because they cut corners on actually verifying the subscribers.
Did they actually verify anything other than the ability to pay $8? It seems wild to me that they thought it would work out just fine
They could stay ahead of things if they hire into areas related to infra maintenance that got hit the hardest, before issues reach the point of cascading failure. Or maybe put everything not critical to keeping the site up & running on hold as remaining devs get a bit of cross training using the limited people w/ institutional knowledge that remain as trainers.
For myself, I don't see much value in Twitter, in terms of net social value. Its format seems pretty much designed for only the most surface-level discussions, which I believe is part of what leads to some of its toxicity: it's simply too hard to have conversations complex enough to invite the kind of discourse that would tip the productive/toxic ratio a bit more positive.
So I've been rooting for Twitter to fail for years. I'm not rooting against Musk: I like his other businesses, and at least a portion of their ongoing success is tied to his persona. (At least before the Twitter stuff: his actions there may impact his relationship with the major banks that his other businesses will rely on. JP Morgan, a bank that wasn't included in the deal and therefore won't be hurt by it, is already on somewhat negative terms with Musk. The Twitter deal has added a few more to that list, and other banks have undoubtedly taken note. Some bank will always finance the regular sorts of things any large corporation needs for him, but they're all going to be pricing in some additional risk. That's not really a big deal, it's just that I think Musk's persona has previously been a net positive for his companies and now it's lost at least a little bit of that.)
So now that it is owned by someone who doesn't need churn and just needs to reduce cost, people can focus on discrete resilience factors, just like any small, tightly held software operation. Many of Twitter's pet projects go away, and the ad sales relationships have to get re-evaluated too, but the core consumer product that everyone sees can be made resilient and cheaper to operate.
Did we read the same article? It seemed clear that all that resilience work was done when the company was owned by public shareholders, not under the new private owner.
moves like musk's are bad for the industry. they produce more systems with "job security" built in.
How common is this?
1) take private
2) fire/layoff until things break, patch up / rehire with cheaper labor. Repeat as necessary
3) spit shine/repackage what's left with a theoretically more appealing balance sheet.
4) resell to another sucker...uh, buyer, or take public again.
Fundamentally Musk paid 44 billion for 5 billion of annual revenue, and presumably 5 billion in costs. Since he's unlikely to add revenue under Twitter's model, he's cutting costs.
Honestly, at this point, is there revolutionary technology that is needed to keep the lights on at Twitter? Do you need GraphQL gods and SREs who would ace any Amazon bar-raiser interview and have mastered Silicon Valley interviewing?
Nah. He'll honestly probably outsource a ton of the upkeep.
It's ugly, but the Twitter board and shareholders took their money and ran, and abandoned the product and the workforce. They could have backed out and let Elon Musk off the hook of his dumb contract with them, but they just wanted the money, and sold to someone that just wants to get his money out of it too.
But yes, this is very bad execution of the PE playbook.
Like, I like my car, but would probably sell it for an order of magnitude more than it's worth.
I would be surprised if no company in the West took advantage of that.
I imagine the site will mostly continue to more-or-less work, despite all the layoffs. They still have thousands of staff. The network effect of Twitter is so big that people will continue to use it even if fail whales become more common. Others suggest that it will soon crash and burn, or that Musk will get bored and sell it for a few billion. Or that Musk is a genius who will make some sort of amazing Twitter 2.0 that does for social media what Tesla did for electric cars. But without any appealing long term vision, and with an owner who bought it to satisfy their ego rather than with any real plan, the reality may be more boring. I imagine it will just languish for many years, with occasional manufactured drama, and occasional downtime, but no real innovation. Maybe to be eventually supplanted by something else in 10 years or so.
Based on texts exposed via the lawsuit about the purchase, I don't think this is the case. I don't think he, or his advisors, understand that the product at twitter (and other social media) is content moderation. You can have a vision of whatever type of content or pricing scheme you want but without solid content moderation you will lose advertisers, gain lawsuits (people were posting movies the other day) and lose users because the "feed" becomes a muddied mess. Users are only really the product when you can moderate their content to have profit via advertisers.
Within most companies you don’t want to pause innovation.
That said, it takes some time to get to know the real workings of a company, knowledge required to select the right 50%.
That said, afaict the current MO is to keep the lights on for the current offering while a new team builds a new, much more profitable product for that enormous user base.
Having worked as an SRE on efficiency projects for BigCo, it is not at all uncommon to recover more than your lifetime salary in company savings with only a few months' work or even less. The scale of things is so immense that even slightly better handling of things can lead to outsize returns.
Laying off someone like that, rather than putting them on full-time efficiency work, is an obvious waste.
I would say this: Twitter at this scale is extremely complicated. We don’t have enough information to know what all these roles were doing and how many were truly superfluous. The article gives an interesting insight into something extremely complicated; it might be a slice repeated in a lot of areas. I’m of the opinion that you can do a lot more with less, but there’s no need to send emails to everyone in the organisation being an arsehole and essentially accusing them of being lazy, then rehiring people who never worked there just for the meme.
Anyway Musk will blow up some rockets, put enormous pressure on people to fix them, and come up smelling of roses again even if he treats a load of people like crap along the way. His fans will say he’s a genius and his enemies will say he’s a dickhead. And so it goes.
Seeing lots of comments like this. Why do you feel it’s ok to say that, when you don’t actually work there or have intimate knowledge of who is currently working there? The pure speculation and nonsense surrounding Twitter at the moment is plain awful.
A lot of the "Twitter could be run off my laptop" style comments seem to come from people who run, effectively, Read-Only services. They might serve data at thousands of queries per second but the data itself is _slow_. It is video files or music streams or other data that updates infrequently or, if there is dynamic data involved, it comes from a very small number of fixed sources.
Twitter deals with thousands of posts per second that are subject to huge fluctuations in the density _and_ those posts have to be disseminated to the millions of users. Twitter is a two sided problem.
Twitter processes roughly 10K tweets per second. Even if you bloat out the text quite a lot to account for encoding overheads, metadata, etc, etc... and assume that each one is 10KB, then this is just 1 Gbps. A single NIC on an old server.
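For anyone who wants to check the arithmetic, here is a quick sketch using only the figures from the comment above (10K tweets per second, a deliberately bloated 10 KB each, decimal units assumed). It covers ingest only, not the fan-out to followers that the earlier "two sided problem" comment is about.

```python
TWEETS_PER_SECOND = 10_000   # rough figure quoted above
BYTES_PER_TWEET = 10 * 1000  # padded 10 KB per tweet, decimal KB assumed


def ingest_gbps(tweets_per_second: int, bytes_per_tweet: int) -> float:
    """Sustained write bandwidth in gigabits per second."""
    bits_per_second = tweets_per_second * bytes_per_tweet * 8
    return bits_per_second / 1e9


gbps = ingest_gbps(TWEETS_PER_SECOND, BYTES_PER_TWEET)
print(f"{gbps:.1f} Gbps")  # 100 MB/s, i.e. just under 1 Gbps
```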
Okay, I get it, Twitter needs a lot of data too. Lists of users, etc...
Twitter has 450 million monthly active users, which sounds like a lot, but even if there's 1MB of profile data per user, such as who they are following, that's just 450 TB.
That's... not that much these days. A large-but-not-enormous database cluster.
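The same kind of sanity check for the storage estimate, again a sketch using only the numbers quoted above (450 million users, 1 MB each, decimal units assumed). It comes out at about 450 TB, the same order of magnitude as the figure in the comment.

```python
MONTHLY_ACTIVE_USERS = 450_000_000  # figure quoted above
PROFILE_BYTES_PER_USER = 1_000_000  # generous 1 MB per user, decimal MB assumed


def profile_storage_tb(users: int, bytes_per_user: int) -> float:
    """Total profile storage in terabytes (1 TB = 1e12 bytes)."""
    return users * bytes_per_user / 1e12


total_tb = profile_storage_tb(MONTHLY_ACTIVE_USERS, PROFILE_BYTES_PER_USER)
print(f"{total_tb:.0f} TB")
```

This is raw data only; replication, indexes and backups would multiply it by some factor the comment doesn't estimate.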
Sure, there's "historical" data, but that can be compressed and put on cheap cold storage, like S3 or whatever.
Give me a few million annually as an opex budget, a small team of decent developers, and I can guarantee you I could whip up a cloud-hosted service that can process tweets at Twitter's scale.
Obviously, what I can't replicate is the much larger set of tools and systems behind the scenes that are used for moderation, analytics, ad sales, etc...
> For four of those years I was the sole SRE for the Cache team. There was a few before me, and the whole team I worked with, where a bunch came and went. But for four years I was the one responsible for automation, reliability and operations in the team. I designed and implemented most of the tools that are keeping it running so I think I’m qualified to talk about it. (There might be only one or two other people)
If you only need one person for the caching department (which, as I understand it, is critical as it delivers most of the data), then maybe you need a handful of dozen other engineers and there you have a functional Twitter.
That or the OP is full of himself. Kinda like Musk?
But who believes those 12 engineers still work there? The author of this specific item is in fact not there any more.
And a lot of other people are needed to bring in revenue, don't you think? Nobody is paying for a beautiful caching system.
It's like if I doubled my weight in the last ten years. Half of me is bloat, and yet, there is no possibility bisection will improve my health.
The cache cluster sizes are also described here for anyone who wants a good technical read over speculation. https://www.usenix.org/system/files/osdi20-yang.pdf
A horse, of course, is not the same as a reduction in force.
Engineers don't eat hay, neither does a horse have a 401(k).
I suspected that the hopes and fears of Twitter’s quick demise were overstated and shared[0] some of those thoughts last week, on Twitter of course.
Also hi from another infrastructure person in SD!
0 - https://twitter.com/bitsandhops/status/1593637241527578625
Edit: also the reason Musk wants to shut down all those microservices has nothing to do with money; it's that he doesn't have enough manpower to maintain them. If Twitter is like most non-FAANG companies out there, those "microservices" are coupled with each other, so good luck.
Linux doesn’t upgrade itself.
the web is very standards based.
Hacker News has pretty much been running for decades on the same software/hardware stack.
I'm pretty surprised to see the occasional redditesque "Elon is an idiot" rhetoric on hn tho.
There will be a necessary breaking change that there will be no support for that will cascade into the downtime the media is demanding at the moment.
Also, I find it pretty interesting how this is playing out so far.
I'm starting to think that Mr. Enron Musk is going to pull this off.
It's a little thing, but it's the little things that cascade into big things that require teams of people to fix. Here, we see this person who's quite proud of his automations making the simple, but harmless error of using a possessive instead of a plural. Humans are good at reading through this sort of thing.
Computers are not. Somewhere deep in all his scripts is a misplaced ', `, or ’ waiting to break something.
I haven’t worked on Twitter-scale products. But I do wonder if it’s safe to draw a straight line between services that have middling and large scales. Things don’t necessarily work the same at the extremes.
Reddit needs to implement this if they want to get to the next level.
Writing a program to store a list of servers to be swapped instead of keeping them in a spreadsheet sounds a bit like buying a brewery when you want to drink 1 beer. Program used by a team of one sounds like over-engineering.
The problem is not that things will break. Things break ALL the time. The problem is more about avoiding cascading effects and the time it takes to fix stuff.
Presumably in the turmoil nobody is deploying new code. Sure shit happens, but i imagine that twitter is mature enough that they aren't having weekly critical incidents, especially during times where nothing is changing.
I still think twitter is going to have some disaster, but give it a few weeks.
Keep it running? One part-time should surely do it?
https://lichess.org/blog/Y3u1mRAAACIApBVn/settlement-reached...
Speaking as an experienced single tech founder.
With configuration error, or failed software update everything can go down at once, or there are cascades of restarts that kill the system.
I think people just miss seeing the fail-whale graphic.
I wonder how many reading this are starting to feel nervous about their own roles.
And even if Twitter lost 50% of its users and revenue, it would still be more profitable than having to pay the salaries of 7000 people.
^ is a snark. I'm not suggesting that they should..
Till now, he's fired a lot of engineers probably because:
a) he doesn't need so many. b) he thinks he can do the job himself.
At some point the engine’s fluids will run out and need topping off. There's a warning system though, so someone'll get notified before the engine blows up, so it's entirely possible this hardcore crew of H1Bs pulls it off. (Great job replacing the SSL cert!) At some point the train’s boiler will need retrofitting. A new team could successfully replace the boiler, but doing it while the engine keeps running just isn't an easy job.
The handful of devs and SREs at Twitter will do their damnedest to keep the train, err, site running. They might even succeed. That doesn’t prove the site was over-engineered; it means they built a damn fine train engine that successfully pulled the whole train up and over the mountain pass.
Twitter wasn't run like a lean start-up type business. Because it wasn’t one. Society needs better than that. The vulture capitalism mindset is what's wrong with this country. What's the barest minimum I can pay people to work for me? Not "what can I pay them to let them live prosperous lives" but "what's the bare, subsistence minimum".
America needs a middle class and these people were part of it. You don’t get a middle class by injudiciously paring things down to a not-even-skeleton crew that’s going to be worked till they burn out, only to get fired.
Twitter was only $14 million from profitability in Q2, and any idiot can randomly fire people to find that much in that large an organization, but cutting costs like that is to misunderstand the situation entirely. Just like starving yourself isn't healthy and won’t improve your self image, randomly cutting costs like that saves money, but doesn’t result in a healthy business.
She was assistant to the gender health support board.
Twitter is going to be here for a long long time. It's not going to suddenly shut down, it's not going to slowly decline, there is not going to be mass abandonment.
Elon might be cocky, but at the end of the day he is a successful businessman. He didn't just send out that loyalty pledge email out of cockiness. He really wanted to kick out everyone who doesn't believe in him and his beliefs. And he sent that email well after understanding how Twitter runs and putting his loyal people from Tesla and SpaceX in charge of operations.
The woke crowd needs to get out of denial and start coping with the fact that Twitter is not the bastion of unopposed woke ideology anymore.