By having a large centralized and monolithic system, aren't they guaranteeing that mistakes cause huge splash damage and don't separate concerns?
You are effectively asking why there is a single routing table for the internet.
To put it in simple terms, having a single routing table is what makes it the internet we can share; otherwise it would just be a bunch of independent networks.
It certainly does not. If I peer with you, neither of us (generally) announces that route to our other peers, but we often announce it to our customers. There are many routes that are not visible to everyone, and there is no single routing table for the internet. Each BGP speaker ends up with their own routing table, although there are a lot of similarities.
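To make the peer/customer point concrete, here is a minimal sketch of the usual export convention (often called "valley-free" routing). It is a common business practice, not something BGP itself enforces, and the names `Rel`, `should_export` and `visible_routes` are made up for illustration; the prefixes are documentation ranges.

```python
from enum import Enum


class Rel(Enum):
    """Business relationship with a BGP neighbor (illustrative only)."""
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def should_export(learned_from: Rel, send_to: Rel) -> bool:
    """Common export convention; real policies vary per network."""
    # Routes learned from customers are announced to everyone.
    if learned_from is Rel.CUSTOMER:
        return True
    # Routes learned from peers or providers are announced only to customers.
    return send_to is Rel.CUSTOMER


def visible_routes(routes, neighbor_rel: Rel):
    """Return the prefixes a neighbor of the given relationship would see."""
    return [prefix for prefix, learned_from in routes
            if should_export(learned_from, neighbor_rel)]


# Hypothetical routes: (prefix, relationship it was learned from).
routes = [
    ("203.0.113.0/24", Rel.PEER),
    ("198.51.100.0/24", Rel.CUSTOMER),
]

print("to another peer:", visible_routes(routes, Rel.PEER))
print("to a customer:  ", visible_routes(routes, Rel.CUSTOMER))
```

The other peer never sees the peer-learned prefix, which is one reason every BGP speaker ends up with a somewhat different table.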
No, they're asking why a single set of routers is in charge of announcing BGP routes for all of facebook. If you have multiple ASes, with independent configuration sources and independent routers broadcasting them, it's a lot harder to break everything at once.
This is partially due to design limitations of BGP, and partially due to it being nearly impossible to eliminate all sources of large scale failures in any highly complex system, and increasing the uptime of a system that already has a few nines costs an additional order of magnitude for each new nine. At some point you set your risk tolerance and have catastrophic failures now and then.
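For a sense of what each extra nine buys, a quick back-of-the-envelope sketch: it only computes the yearly downtime budget for N nines of availability (the cost side is the claim above, not something this calculates). The function name is made up and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (e.g. 3 -> 99.9%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes/year")
```

Each additional nine cuts the budget by a factor of ten: three nines allows roughly 8.8 hours a year, five nines only about five minutes.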
For all its flaws, BGP is the piece of the Internet that truly makes it decentralized. Without it, there would be a centralized routing table of some sort.
I recall major ISPs screwing up their routing tables in the past but never globally on this level.
https://www.bleepingcomputer.com/news/technology/ibm-cloud-g... (this one isn't clear, maybe BGP hijacking, and if so, not sure who the responsible party was)
https://www.catchpoint.com/blog/vodafone-idea-bgp-leak (not sure how major this one was)
You can practically search ISP bgp outage and get news about the last couple times they screwed up BGP and caused a big problem. Or service BGP and get a 50/50 chance of the service screwing up BGP or an ISP/country hijacking their routes and causing a big problem.
BGP is one of the best ways to break things at scale.
The anti-anti-pattern would be to have country-specific ASes, kind of like a franchised restaurant pattern, where each country's version of fb owns their own AS that loosely forms back into the facebook mothership org, but from an infrastructure point of view points to their own set of AS netspace.
It's not a full root cause analysis, to be sure, and leaves many open questions, but I definitely wouldn't describe it as painfully vague.
DNS is related to BGP only in that, without the right BGP routes in the routers, no packets can get to the Facebook networks and thus the Facebook DNS servers.
That their DNS servers were taken out was a side effect of the root issue - they withdrew all the routes to their networks from the rest of the Internet.
Not picking on you - but there has been a lot of confusion around DNS that is mostly a red herring and people should just drop it from the conversation. Everything on facebook networks disappeared, not just DNS. The main issue is they effectively took a pair of scissors to every one of their internet connections - d'oh!
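A toy illustration of why DNS is a symptom here rather than the cause: a rough sketch of a longest-prefix-match routing table where withdrawing one prefix makes everything inside it unreachable, DNS server and web server alike. The `RoutingTable` class, the next-hop label and the addresses (documentation range 192.0.2.0/24) are all hypothetical, and real routers are far more involved.

```python
import ipaddress


class RoutingTable:
    """Tiny longest-prefix-match table with no default route (illustrative)."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def announce(self, prefix: str, next_hop: str) -> None:
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str):
        """Return the next hop for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: packets to this address are dropped
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "service-edge")  # hypothetical service network

dns_server, web_server = "192.0.2.53", "192.0.2.80"
print("before:", table.lookup(dns_server), table.lookup(web_server))

table.withdraw("192.0.2.0/24")
print("after: ", table.lookup(dns_server), table.lookup(web_server))
```

After the withdrawal both lookups come back empty, so the DNS servers vanish along with everything else behind the same routes.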
Why? Why couldn't you just post that the RCA is still ongoing and that proper updates will follow? Otherwise all you get is meaningless fluff.
It's much better for some random committee in Congress to debate antitrust forever, instead of bigger committees and agencies debating national security threats.
- Facebook was hacked
- They did it on purpose to bury the whistleblower story
- No one could access Facebook offices
- They had to cut open servers with angle grinders
- Disgruntled employees changed DNS records
- Lots of made up numbers for how much money Facebook/the rest of the economy was losing (or gaining)
They probably rushed out this blog post just to dispel some of these rumors.
It's amazing how wrong people can be and how confident they are about being right.
Even sometimes fighting _me_ about things _I_ designed and built.
It’s quite sobering; taught me not to believe all the speculation I read.
DR downtime was about an hour, but the bank fired him anyway.
Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.
Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.
Prostrate, he came before the COO expecting to be canned with much malice. The COO just asked if he learned his lesson and said all is forgiven.
Unless there was some kind of nefarious intent, it's very unlikely anyone will be 'punished'. The likely ramifications will be around changes to processes, tests, automations, and fallbacks to 1) prevent the root sequence of events from happening again and 2) make it easier to recover from similar classes of problems in the future.
Organizational failures require organizational solutions. That seems pretty obvious.
> Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.
I bring this sort of thing up all the time in disaster planning. There are scales of disaster so big that business continuity is simply not going to be a priority.
The US, not even once.
The guy should have had "reload in 10", an outage window and config review. There must be more to this story than it being a firable offence for causing a P2 outage for an hour.
But yeah, likely not.
I wasn't aware that Stanley Kubrick was now in NetOps. /s
Whoops! Never attribute to malice that which can more easily be explained by stupidity and all that.
It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).
He does: https://twitter.com/finkd
It was amazing. I’m sad he removed it.
hint: "some people".
The badge system should be local to the building. There are few actual reasons (sure, besides "efficiency") why badge control should be centralized. Even fewer reasons for it to be a subdomain of fb. Another option would be to keep the system but make it failsafe (but it seems the newer generation doesn't know what that means). If the network goes down, keep it at the last config. Badge validation should be offline first, and added/removed badges should be broadcast periodically.
This is the same issue with smartlocks times the number of employees. Do you really want to add another point of failure between yourself and your home?
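The "offline first, keep the last config" idea could look roughly like the sketch below. It is only an illustration: `BadgeReader` and the `fetch_allowed_badges` callable are invented names, and nothing here reflects how any real badge system works.

```python
class BadgeReader:
    """Door reader that validates against a locally cached badge list."""

    def __init__(self, fetch_allowed_badges):
        # fetch_allowed_badges: hypothetical callable returning an iterable of
        # badge ids from the central system; assumed to raise OSError
        # (e.g. ConnectionError) when the network is unavailable.
        self._fetch = fetch_allowed_badges
        self._allowed = frozenset()

    def sync(self) -> bool:
        """Try to refresh the cache; on network failure keep the last good list."""
        try:
            fresh = frozenset(self._fetch())
        except OSError:
            return False  # network down: last known config stays in effect
        self._allowed = fresh
        return True

    def is_allowed(self, badge_id: str) -> bool:
        """Validation never touches the network, only the local cache."""
        return badge_id in self._allowed


def central_system_up():
    return ["alice-001", "bob-002"]


def central_system_down():
    raise ConnectionError("central badge service unreachable")


reader = BadgeReader(central_system_up)
reader.sync()
reader._fetch = central_system_down  # simulate the outage
reader.sync()                        # fails, cache is kept
print(reader.is_allowed("alice-001"))
```

Note this only covers the "network is down" case; a bad list that syncs successfully would still be applied.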
Also, it's likely not on an fb subdomain, but something like office.security.fb-infra.com (example). It just happens to be that fb-infra.com is using the Facebook DNS server.
It's just more expensive and another thing to maintain, and still doesn't account for _all_ failure modes (what if you sync really frequently and a bad change was made deleting all accounts?)
You might need a break-glass account/badge somewhere. Sure, the angle-grinder works, but probably cost you 2h maybe?
> it's likely not on an fb subdomain, but something like office.security.fb-infra.com
Thanks, yeah, makes sense
This could be anything, potentially.
I'm not very knowledgeable in computer networking, but this could be as trivial as an incorrect update to a DNS record, right?
It happened to also kill the announcements for anycast DNS.
I had a question: this is only what we can perceive through internet routing table entries, right?
Internal to FB, we don't know what had caused issues that led to the BGP UPDATE.
That's kind of what has been confusing me - there's a lot of speculation around FB's data center design and what actually happened, but we actually don't know for sure until they post an RCA - please correct me if I'm wrong here.
Which was and is a lame assumption. Stuff happens. SMTP wouldn't even be fazed by this; it would just pick up where it left off.
I've seen far too many applications fail in bizarre ways because people make unrealistic assumptions like "X will ALWAYS be there". Sure, X going away is highly unlikely, but when multiple things make the same dumb assumption, then on the inevitable day when X is suddenly gone you start to get cascading effects: Y relied on something that relied on X, so Y fails, then something that depended on Y in the same way unexpectedly fails, and so on.
One should never assume that anything will "always" be available. That's an incredibly unrealistic assumption; and the more interconnected things become, the more the chances of these really nasty dependency chains/cascade failures skyrocket - leading to far worse outages and longer recovery times.
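The dependency-chain effect described above can be sketched in a few lines: given who depends on whom, find everything that goes down when one thing disappears. The service names and the `cascade` function are invented for illustration, and it pessimistically assumes every dependency is a hard one with no fallback.

```python
from collections import deque


def cascade(depends_on, failed):
    """Return the set of services that end up down when `failed` goes away.

    depends_on maps each service to the services it needs. Every dependency
    is treated as hard (no retries, caches or fallbacks).
    """
    # Invert the graph: for each service, who depends on it directly?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


# Hypothetical dependency map.
depends_on = {
    "Y": ["X"],
    "Z": ["Y"],
    "login": ["Z", "cache"],
    "cache": [],
    "status-page": [],
}

print(sorted(cascade(depends_on, "X")))
```

Anything that tolerates the loss of X (queues and retries, like SMTP does) simply would not appear as a hard edge in this graph.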
Will a real postmortem follow? Or is this the best we are gonna get?
Although it only covers their API and business apps, not the site itself.
Still, a bit unexpected behaviour though.
I can completely picture a world in which many people bought some ads yesterday morning (say, to promote an event that occurred yesterday evening), the ads were never displayed to anyone, and FB will keep the money, thank you.
I found a paper that describes the process in detail. See page 10-11:
https://web.archive.org/web/20211005034928/https://research....
Phase Specification
P1 Small number of RSWs in a random DC
P2 Small number of RSWs (> P1) in another random DC
P3 Small fraction of switches in all tiers in DC serving web traffic
P4 10% of switches across DCs (to account for site differences)
P5 20% of switches across DCs
P6 Global push to all switches
We classify upgrades in two classes: disruptive and non-disruptive, depending on if the upgrade affects existing forwarding state on the switch. Most upgrades in the data center are non-disruptive (performance optimizations, integration with other systems, etc.). To minimize routing instabilities during non-disruptive upgrades, we use BGP graceful restart (GR) [8]. When a switch is being upgraded, GR ensures that its peers do not delete existing routes for a period of time during which the switch’s BGP agent/config is upgraded. The switch then comes up, re-establishes the sessions with its peers and re-advertises routes. Since the upgrade is non-disruptive, the peers’ forwarding state are unchanged.
Without GR, the peers would think the switch is down, and withdraw routes through that switch, only to re-advertise them when the switch comes back up after the upgrade. Disruptive upgrades (e.g., changes in policy affecting existing switch forwarding state) would trigger new advertisements/withdrawals to switches, and BGP re-convergence would occur subsequently. During this period, production traffic could be dropped or take longer paths causing increased latencies. Thus, if the binary or configuration change is disruptive, we drain (§3) and upgrade the device without impacting production traffic. Draining a device entails moving production traffic away from the device and reducing effective capacity in the network. Thus, we pool disruptive changes and upgrade the drained device at once instead of draining the device for each individual upgrade.
Push Phases. Our push plan comprises six phases P1-P6 performed sequentially to apply the upgrades to agent/config in production gradually.
We describe the specification of the 6 phases in Table 4. In each phase, the push engine randomly selects a certain number of switches based on the phase’s specification. After selection, the push engine upgrades these switches and restarts BGP on these switches. Our 6 push phases are to progressively increase scope of deployment with the last phase being the global push to all switches. P1-P5 can be construed as extensive testing phases: P1 and P2 modify a small number of rack switches to start the push. P3 is our first major deployment phase to all tiers in the topology.
We choose a single data center which serves web traffic because our web applications have provisions such as load balancing to mitigate failures. Thus, failures in P3 have less impact to our services. To assess if our upgrade is safe in more diverse settings, P4 and P5 upgrade a significant fraction of our switches across different data center regions which serve different kinds of traffic workloads. Even if catastrophic outages occur during P4 or P5, we would still be able to achieve high performance connectivity due to the in-built redundancy in the network topology and our backup path policies—switches running the stable BGP agent/config would re-converge quickly to reduce impact of the outage. Finally, in P6, we upgrade the rest of the switches in all data centers.
Figure 7 shows the timeline of push releases over a 12 month period. We achieved 9 successful pushes of our BGP agent to production. On average, each push takes 2-3 weeks
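To get a feel for the push plan quoted above, here is a simplified sketch of a phased rollout that widens its scope each phase and stops at the first unhealthy batch. It is not the paper's push engine: the fractions for P1-P3 are guesses (the table only says "small number"), only the 10%, 20% and "all" figures come from the table, selection ignores data centers and tiers, and `phased_push`, `upgrade` and `healthy` are made-up names.

```python
import random

# (phase, fraction of all switches). P1-P3 fractions are assumptions;
# P4-P6 follow the 10% / 20% / global figures in the quoted table.
PHASES = [
    ("P1", 0.01),
    ("P2", 0.02),
    ("P3", 0.05),
    ("P4", 0.10),
    ("P5", 0.20),
    ("P6", 1.00),
]


def phased_push(switches, upgrade, healthy, rng=None):
    """Upgrade switches phase by phase; stop after the first unhealthy batch.

    Returns (failed_phase_or_None, list_of_upgraded_switches).
    """
    rng = rng or random.Random()
    switches = list(switches)
    remaining = list(switches)
    upgraded = []
    for name, fraction in PHASES:
        if fraction >= 1.0:
            count = len(remaining)  # final phase: everything that is left
        else:
            count = min(len(remaining), max(1, int(len(switches) * fraction)))
        batch = rng.sample(remaining, count)  # random selection, as in the quote
        for switch in batch:
            upgrade(switch)
        upgraded.extend(batch)
        chosen = set(batch)
        remaining = [s for s in remaining if s not in chosen]
        if not all(healthy(s) for s in batch):
            return name, upgraded  # halt the rollout; later phases never run
    return None, upgraded


fleet = [f"sw{i}" for i in range(100)]
failed_phase, done = phased_push(fleet, upgrade=lambda s: None,
                                 healthy=lambda s: True,
                                 rng=random.Random(0))
print(failed_phase, len(done))
```

The point of the early phases is that a bad change should trip the health check while only a handful of switches are affected, long before the global push.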
Hey what is our internal BGP called again? AS32934?
"Yeah"
"OOK."
Networks have grown so large and complex that the only reasonable way of managing them is through SDN, and a small mistake in configuration might result in a cascading effect on the whole infrastructure.
That's also true for the entire (western) internet. We've ended up with a centralized market where a few key players, e.g. cloud providers/CDNs/DNS (Amazon/Google/Microsoft/Akamai/Fastly/Cloudflare) can easily break large parts of the internet. See Akamai outage in July.
I am not sure why they had to mention this specifically. This makes it sound like an external attack.
https://mobile.twitter.com/mikeisaac/status/1445196576956162...
https://engineering.fb.com/2021/08/09/connectivity/backbone-...
No, that just happens during uptime.
NO CARRIER
We YOLO'd our BGP experiment to prod. It failed.
https://web.archive.org/web/20210626191032/https://engineeri...
After all the scandals, leaks, whistleblowers etc it would take more than a DNS record wipe to take down the Facebook mafia.