1) those directly involved with the incident, or employees of the same company. They have too much to lose by circumventing the PR machine.
2) people at similar companies who operate similar systems with similar scale and risks. Those people know how hard this is and aren’t likely to publicly flog someone doing their same job based on uninformed speculation. They know their own systems are Byzantine and don’t look like what random onlookers think it would look like.
So that leaves the rest, who offer insights based on how stuff works at a small scale, or better yet, pronouncements rooted in “first principles.”
Especially in a time where the gates have come crashing down to pronouncements of, "now anybody can learn to code by just using LLMs," there is a shocking tendency to overly simplify and then pontificate upon what are actually bewilderingly complicated systems wrapped up in interfaces, packages, and layers of abstraction that hide away that underlying complexity.
It reminds me of those quantum woo people, or movies like What the Bleep Do We Know!? where a bunch of quacks with no actual background in quantum physics or science reason forth from drastically oversimplified, mathematics-free models of those theories and into utterly absurd conclusions.
This happens when your terms are underspecified: someone says “Netflix’s servers are struggling under load.” People in similar efforts know that’s basically just equivalent to “something is wrong,” and the whole conversation is esoteric to everyone outside a few specialized teams. But these other people jump to conclusions and start having conversations based on their own experience with what is (to them) related, and usually fashionable, because that is how most smaller players figure out how to do things.
Whenever an HN thread covers subjects where I have direct professional experience I have to bite my tongue while people who have no clue can be as assertive and confidently incorrect as their ego allows them to be.
As one of those who can't help themselves: the way you phrase it feels a bit too cynical. I've always interpreted it as people wanting to help, but not wanting to offer something that's wrong, which is basically how falsifiable science works. It's so much easier to refute the assertion that birds generate lift with tiny backpacks with turboprops attached than it is to explain the finer details of avian flight mechanics. I couldn't describe above a superficial level how flapping works, but I can confidently refute the idea of a turboprop backpack. (Everyone knows birds gave up the turboprop design during the great kerosene shortage of 1128.)
This comes from first-hand experience of talking to several of their directors when consulted on how to make certain systems of theirs better.
It's not just a matter of guarantees, it's a matter of complexity.
Like right now Google search is dying and there's nothing that they can do to fix it because they have given up control.
The same thing happened with Netflix where they wanted to push too hard to be a tech company and have their tech blogs filled with interesting things.
On the back end they went too deep on the microservices complexity. And on the front end for a long time they suffered with their whole RxJS problem.
So it's not an objective matter of what's better. It's more a cultural problem at Netflix, plus the fact that they want to be associated with "FAANG" even though their product is not really technology based.
"Microservices" have nothing to do with it.
Netflix regularly puts out blog articles proudly proclaiming that they process exabytes of logs per microsecond or whatever it is that their microservices Rube Goldberg machine spits out these days, patting themselves on the back for a heroic job well done.
Meanwhile, I've been able to go on the same rant year after year that they're still unable to publish more than five subtitle languages per region. These are 40 KB files! They had an employee argue with me about this in another forum, saying that the distribution of these files is "harder than I thought".
It's not hard!
They're solving the wrong problems. The problems they're solving are fun for engineers, but pointless for the business or their customers.
From a customer perspective Netflix is either treading water or noticeably getting worse. Their catalog is smaller than it was. They've lost licensing deals for movies and series that I want to watch. The series they're producing themselves are not things I want to watch any more. They removed content ratings, so I can't even pick something that is good without using my phone to look up each title manually!
Microservices solve none of these issues (or make them worse), yet this is all we hear about when Netflix comes up in technology discussions. I've only ever read one article that is actually relevant to their core business of streaming video, which was a blog post about using kTLS in BSD to stream directly from the SSD to the NIC, bypassing the CPU. Even that is questionable! They do this to enable HTTPS... which they don't need! They could have just used a cryptographic signature on their static content, which the clients can verify with the same level of assurance as HTTPS. Many other large content distribution networks do this.
It's 100% certain that someone could pretend to be Elon, fire 200-500 staff from the various Netflix microservices teams and then hire just one junior tech to figure out how to distribute subtitles... and that would materially improve customer retention while cutting costs with no downside whatsoever.
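The signed-static-content idea mentioned above could be sketched roughly like this. Everything here is illustrative, not Netflix's actual scheme: stdlib SHA-256 digests stand in for the per-chunk hashes, and in a real system the manifest itself would carry a public-key signature (e.g. Ed25519) verified against a key baked into the client.

```python
import hashlib

# Hypothetical trusted manifest of per-chunk digests. In practice the
# manifest would be signed with a public key shipped in the client, so
# the chunks themselves could travel over plain HTTP.
TRUSTED_MANIFEST = {
    "movie-chunk-0001": hashlib.sha256(b"static video bytes...").hexdigest(),
}

def verify_chunk(name: str, payload: bytes) -> bool:
    """Accept a chunk only if its digest matches the trusted manifest."""
    expected = TRUSTED_MANIFEST.get(name)
    if expected is None:
        return False
    return hashlib.sha256(payload).hexdigest() == expected

# A tampered chunk fails verification even without TLS on the wire.
assert verify_chunk("movie-chunk-0001", b"static video bytes...")
assert not verify_chunk("movie-chunk-0001", b"tampered bytes")
```

The point being made is that integrity (no tampering in transit) is the property static video delivery actually needs, and a signature over content provides it without per-connection TLS; confidentiality of which bytes you fetched is a separate question.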
Every tech company massively inflated their headcount during the leadup to the Twitter acquisition because money was free.
I interviewed at Meta in 2021 and asked an EM what he would do if given a magic wand to fix one problem at the company. His response: "I would instantly hire 10,000 more engineers."
Elon famously did the opposite and now revenue is down 80%.
Subtitles are also complicated because you have to deal with different media player frameworks on the 40+ different players you support. Getting those players, which you may not own, to recognise multiple sub tracks can be a PITA.
Things look simple to a junior developer, but those experienced in building streaming platforms at scale know there are dragons once you get into the implementation. Sometimes developers and architects do over-complicate things, but smart leaders avoid writing code, so it's an assumption to say things are being made over-complicated.
You lost me. Netflix built a massive CDN, a recommendation engine, did dynamic transcoding of video, and a bunch of other things, at scale, quite some years before everyone else. They may have enshittified in the last five years, but I don't see any reason why they don't have a genuinely legitimate claim to being a founding member of the FAANG club.
I have a much harder time believing that companies with AI in their name or domain are doing any kind of AI, by contrast.
Can you explain where this is relevant to buffering issues?
Also, you are very wrong regarding failure modes. The larger the service, the more failure modes it has. Moreover, in monoliths if a failure mode can take down/degrade the whole service, all other features are taken down/degraded. Is having a single failure mode that brings down the whole service what you call fewer points of failure?
That service would technically be a “microservice” even if it is a large service.
I’m genuinely curious about the reasoning behind that statement. It’s very possible that you are using a different set of assumptions or definitions than I am.
Absolutely. I think a great filter for developers is determining how well they understand this. Over-simplification of problems and certainty about one’s ability to build reliable services at scale is a massive red flag to me.
I have to say some of the hardest challenges I’ve encountered were in e-commerce, too.
It’s a lot harder and more interesting than I think many people realize. I learned so much working on those projects.
In one case, the system relied on SQLite and god damn did things go sideways as the company grew its customer base. That was the fastest database migration project I’ve ever been on, haha.
I often think it could have worked today. SQLite has made huge leaps in the areas we were struggling. I’m not sure it would have been a forever solution (the company is massive now), but it would have bought us some much-needed time. It’s funny how that stuff changes. A lot of my takeaways about SQLite 10 years ago don’t apply quite the same anymore. I use it for things now that I never would have back then.
And for limit checking: how often do you write array limit handlers? And what if the BE contract doesn't specify one? On top of that, it will need a regression unit test, because who knows when the next developer will remove that limit check.
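A minimal sketch of the kind of defensive limit check being described, with the regression test that keeps it from being quietly removed. The cap value and function names are made up for illustration:

```python
MAX_ITEMS = 100  # hypothetical cap; the backend contract specifies none

def take_items(items):
    """Defensively clamp an unbounded backend response before use."""
    if items is None:
        return []
    return items[:MAX_ITEMS]

# Regression test: guards against a future developer removing the clamp.
def test_take_items_clamps_oversized_response():
    assert len(take_items(list(range(10_000)))) == MAX_ITEMS
    assert take_items(None) == []
    assert take_items([1, 2, 3]) == [1, 2, 3]

test_take_items_clamps_oversized_response()
```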
An effective operational culture has methods for removing those people from the conversations that matter. Unfortunately that earns you a reputation for being “cutthroat” or “lacking empathy.”
Both of those are real things, but it’s the C players who claim they are being unfairly treated, when in fact their limelight-seeking behavior is the problem.
If all that sounds harsh, like the kitchen on The Bear, well…that’s kinda how it is sometimes. Not everyone thrives in that environment, and arguably the ones who do are a little “off.”
In one case I was doing an upgrade on an IPTV distribution network for a cable provider (15+ years ago at this point). This particular segment of subscribers totalled more than 100k accounts. I did validation of the hardware and software revs installed on the routers in question prior to my trip to the data center (a 2+ hour drive). I informed management that the version currently running on the router wasn't compatible with the hardware rev of the card I was upgrading to. I was told that it would in fact work, that we had that same combination of hw/sw running elsewhere. I couldn't find it when I went to go look at other sites. I mentioned it in email prior to leaving; I was told to go anyway.
Long story short, the card didn't work and I had to back it out. The HA failover didn't work on the downgrade and took down all of those subscribers, as the total outage caused a cascading issue with some other gear in the facility. All in all it was during an off-peak time of day, but it was a waste of time and a hit to customer sat.
this is where you get up and leave
That’s a bold claim given that people with inside knowledge could post here without disclosing they are insiders.
Is that some kind of No True Scotsman?
For every thread like this, there are likely people who are readers but cannot be writers, even though they know a lot. That means the active posters exclude that group, by definition.
These threads often have interesting and insightful comments, so that’s cool.
GP clearly meant some people not everybody. You are the one making bold claims.
It’s a very different problem from distributing video on demand, which is Netflix’s core business.
We (yep) don't know the exact details, but we do get sent snapshots of full configs and deployments to debug things... we might not see exact load patterns, but it's enough to know. And of course we can't tell due to NDAs.
now take this realization and apply it to any news article or forum post you read and think about how uninformed they actually are.
Reputational damage is going to fall far more on Netflix than the NFL if they totally flub it.
That and this fight is going to likely be an order of magnitude more viewers than the Christmas NFL games if the media estimates on viewership were remotely accurate. You’re talking Super Bowl type numbers vs a regular season NFL game. The problems start happening at the margin of capacity most of the time.
Most people are consumers, and at the end of the day their ability to consume a (boring) match was disrupted. If this was PPV (I don't think it is), they paid extra and didn't get the quality of product they expected. I'm not surprised they dominate the conversation.
I'm also not going to criticise my peers because they could recognise me and I might want to work with them one day.
Stuff goes wrong, random internet people jump on the opportunity to speculate and say wildly off-the-mark comments, and the engineers trying to keep the ship from sinking have to sit quietly for fear of making the PR backlash worse.
Another person was observing the interview, for training purposes, and afterwards said to me: “Do you have kids? You have so much patience!”
And looking through the comments, this is just wrong.
If you code it to utilize high-bandwidth users' upload capacity, the service becomes more available as more users are watching, not less available.
It becomes less expensive with scale, more available, more stable.
To be more specific: if you encode the video in blocks, with each new block's hash being broadcast across the network, then beyond managing the overhead of block ordering it should be pretty easy to stream video at boundless scale using a DHT.
Could even give high-bandwidth users a credit based upon how much bandwidth they share.
With a network like what Netflix already has, the seed-boxes would guarantee stability. There would be very little delay for realtime streams, I'd imagine 5 seconds tops. This sort of architecture would handle planet-scale streams for breakfast on top of the already existing mechanism.
But then again, I don't get paid $500k+ at a large corp to serve planet scale content, so what do I know.
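The block-broadcast idea above can be reduced to a toy model of its central claim, that availability grows with viewer count because viewers who hold a block can serve it onward. This is a deliberately oversimplified sketch (one block, perfect peers, names invented here), not anything resembling a production design:

```python
import hashlib

class ToySwarm:
    """Toy model: viewers fetch the newest block from a peer when one
    already holds it, falling back to the origin server otherwise."""

    def __init__(self):
        self.origin_fetches = 0
        self.peer_fetches = 0
        self.holders = set()  # viewer ids that already hold the block

    def announce_block(self, data: bytes) -> str:
        # The origin broadcasts only the block hash; clients fetch bytes.
        self.holders.clear()
        return hashlib.sha256(data).hexdigest()

    def fetch(self, viewer_id: int) -> None:
        if self.holders:
            self.peer_fetches += 1    # served by another viewer's upload
        else:
            self.origin_fetches += 1  # first viewer must hit the origin
        self.holders.add(viewer_id)

swarm = ToySwarm()
swarm.announce_block(b"block-0001")
for viewer in range(1_000):
    swarm.fetch(viewer)

# In this idealized model, only the first fetch touches the origin.
assert swarm.origin_fetches == 1
assert swarm.peer_fetches == 999
```

Of course, the idealization is doing a lot of work here: real peers have asymmetric upload, churn, and NAT problems, which is exactly where the replies below push back.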
The problems with using it as part of a distributed service have more to do with asymmetric connections: using all of the limited upload bandwidth causes downloads to slow. Along with firewalls.
But the biggest issue: privacy. If I'm part of the swarm, maybe that means I'm watching it?
[1]: Chainsaw: P2P streaming without trees, https://link.springer.com/chapter/10.1007/11558989_12
The torrent is an example of the system I am describing, not the same system. Torrents cannot work for live streams because the entire content is not hashable yet, so already you have to rethink how it's done. I am talking about adding a p2p layer on top of the existing streaming protocol.
The current streaming model would prioritize broadcasting to high-bandwidth users first. There should be millions of those in a world-scale stream.
Even a fraction of these millions would be enough to reduce Netflix's streaming costs by an order of magnitude. But maybe Netflix isn't interested in saving billions?
With more viewers, the availability of content increases, which reduces load on the centralized servers. This is the property of the system I am talking about, so think backwards from that.
With a livestream, you want the youngest block to take priority. You would use the DHT to manage clients and to manage stale blocks for users catching up.
The youngest block would be broadcast on the p2p network and anyone who is "live" would be prioritizing access to that block.
Torrent clients as they are now handle this case in reverse; they can prioritize blocks closer to the current timestamp to create an uninterrupted stream.
The system I am talking about would likely function at any scale, which is an improvement from Netflix's system, which we know will fail -- because it did.
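The prioritization rule described above (newest block first for live viewers, nearest block for viewers catching up) can be written as a tiny selection function. Names and the block-index representation are my own invention for the sketch:

```python
def next_block(available, playhead=None):
    """Pick which block index to request next.

    Live viewers (playhead=None) grab the newest block; viewers catching
    up grab the block nearest their current position -- the reverse of a
    torrent client's usual rarest-first ordering.
    """
    if playhead is None:
        return max(available)
    return min(available, key=lambda b: abs(b - playhead))

blocks = {95, 96, 97, 98, 99, 100}
assert next_block(blocks) == 100              # live: newest first
assert next_block(blocks, playhead=96) == 96  # catching up: nearest
```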
1. Everyone only cares about the most recent "block". By the time a "user" has fully downloaded a block from Netflix's seedbox, the block is stale, so why would any other user choose to download from a peer rather than from Netflix directly?
2. If all the users would prefer to download from netflix directly rather than a p2p user, then you already have a somewhat centralized solution, and you gain nothing from torrents.
1. I exclusively download from a peer and my stream is measurably behind
2. I switch to a peer when Netflix is at capacity and then I have to wait for the peer to download from Netflix, and then for me to download from the peer. This will cause the same buffering issue that Netflix is currently being lambasted for.
This solution doesn’t solve the problem Netflix has.
But it does seem the capacity of a hybrid system of Netflix servers plus P2P would be strictly greater than either alone? It's not an XOR.
And note that in this case of "live" streaming, it still has a few seconds of buffer, which gives a bandwidth-delay product of a few MB. That's plenty to have non-stale blocks and do torrent-style sharing.
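The bandwidth-delay arithmetic above is easy to check. Assuming an illustrative 5 Mbit/s stream, a 5-second client buffer, and 64 KB blocks (all numbers are mine, not Netflix's), the in-flight window is indeed a few MB:

```python
# Back-of-envelope bandwidth-delay product for a live stream buffer.
bitrate_bps = 5_000_000   # assumed stream bitrate, 5 Mbit/s
buffer_s = 5              # assumed client-side buffer, seconds
block_bytes = 64 * 1024   # assumed block size

buffer_bytes = bitrate_bps * buffer_s // 8   # 3,125,000 bytes (~3 MB)
blocks_in_flight = buffer_bytes // block_bytes  # 47 blocks of headroom

print(buffer_bytes, blocks_in_flight)
```

So even a modest buffer holds dozens of blocks that are not yet stale, which is the window in which torrent-style sharing could operate.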
If the solution to users complaining about buffering is to build a system with more inherent buffering then you are back at square one.
I think it might be helpful to look at Netflix's current system as already being a distributed video delivery system in which they control the best seeds. Adding more seeds may help, but if Netflix is underprovisioned from the start, you will have users who cannot access the streams.
Hell, in the US, this setup might actually be illegal because of the VPPA[0]. The only reason why it's not illegal for the MAFIAA to catch you torrenting is because of a fun legal principle where criminals are not allowed to avail themselves of the law to protect their crimes. (i.e. you can't sue over a drug deal gone wrong)
[0] Video Privacy Protection Act, a privacy law passed which makes it illegal to ask video providers for a list of who watched what, specifically because a reporter went on a fishing expedition with video data.
[1] Music and Film Industry Association of America, a hypothetical merger of the MPAA and RIAA from a 2000s era satire article