Update about the October 4th outage

(engineering.fb.com)

274 points · ve55 · 4y ago · 224 comments

139 comments · 39 top-level

0xy4y ago· 16 in thread

Knowing almost nothing about networking, isn't the way Facebook handles networking somewhat of a monolithic anti-pattern? Why is a single update responsible for taking out multiple services, and why wouldn't each product, or even each region within each product, have its own routes for resiliency, which could then be used to roll out changes more slowly?

By having a large centralized and monolithic system, aren't they guaranteeing that mistakes cause huge splash damage, and failing to separate concerns?

manquer4y ago

BGP has to converge to a single routing table.

You are effectively asking why there is a single routing table for the internet.

To put it in simple terms, having a single routing table is what makes it the internet we can share; otherwise it would just be a bunch of independent networks.

toast04y ago

> BGP has to converge to a single routing table.

It certainly does not. If I peer with you, neither of us (generally) announce that route to our other peers, but often announce to our customers. There are many routes that are not visible to everyone, and there is no single routing table for the internet. Each BGP speaker ends up with their own routing table, although there are a lot of similarities.
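
A minimal sketch of the export behaviour described above, assuming the common customer/peer/provider convention (routes learned from a customer are passed to everyone; routes learned from a peer or provider are passed only to customers). Real policies vary per network. The `Rel` enum, `should_export`, `visible_routes`, and the prefixes (taken from the documentation ranges) are hypothetical names for illustration only:

```python
from enum import Enum


class Rel(Enum):
    CUSTOMER = "customer"
    PEER = "peer"
    PROVIDER = "provider"


def should_export(learned_from: Rel, neighbor: Rel) -> bool:
    """Common (not universal) export rule, assumed for this sketch."""
    if learned_from is Rel.CUSTOMER:
        return True  # customer routes are announced to every neighbor
    return neighbor is Rel.CUSTOMER  # peer/provider routes go to customers only


def visible_routes(routes: dict, neighbor: Rel) -> set:
    """Prefixes this AS would announce to a neighbor of the given kind."""
    return {prefix for prefix, src in routes.items() if should_export(src, neighbor)}


# Hypothetical routes held by one AS, keyed by documentation prefixes.
routes = {
    "192.0.2.0/24": Rel.PEER,
    "198.51.100.0/24": Rel.CUSTOMER,
    "203.0.113.0/24": Rel.PROVIDER,
}
to_peer = visible_routes(routes, Rel.PEER)          # only the customer-learned prefix
to_customer = visible_routes(routes, Rel.CUSTOMER)  # all three prefixes
```

Because each neighbor receives a different subset, two BGP speakers need not end up with the same table.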

Dylan168074y ago

> You are effectively asking why there is a single routing table for the internet.

No, they're asking why a single set of routers is in charge of announcing BGP routes for all of facebook. If you have multiple ASes, with independent configuration sources and independent routers broadcasting them, it's a lot harder to break everything at once.

colechristensen4y ago

Without in-depth technical details of what exactly happened and how, it’s hard to say. With what we have heard and seen, the answer is probably that there are some opportunities to build in more resiliency, but ultimately the kind of failure that causes domino effects is impossible to eliminate completely.

This is partially due to design limitations of BGP, and partially due to it being nearly impossible to eliminate all sources of large scale failures in any highly complex system, and increasing the uptime of a system that already has a few nines costs an additional order of magnitude for each new nine. At some point you set your risk tolerance and have catastrophic failures now and then.
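
The "each new nine" arithmetic can be made concrete. The sketch below only computes the downtime budget, not the cost (the order-of-magnitude cost claim is the commenter's); `allowed_downtime_minutes` is a hypothetical helper name:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignores leap years)


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget for an availability of `nines` nines, e.g. 3 -> 99.9%."""
    unavailability = 10 ** (-nines)
    return period_minutes * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes/year")
```

Each extra nine shrinks the budget tenfold: roughly 3.65 days per year at two nines, 8.8 hours at three, 53 minutes at four, and a little over 5 minutes at five. A roughly six-hour outage alone would use up most of a three-nines budget for the year.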

andrewxdiamond4y ago

You are correct, but this problem isn’t Facebook’s doing. This is just how BGP works. Even big players like Verizon can screw it up and break the Internet.

For all its flaws, BGP is the piece of the Internet that truly makes it decentralized. Without it, there would be a centralized routing table of some sort.

tw044y ago

Quite the opposite. Back in the day you would've had to log in to dozens if not hundreds of routers individually to push the change, and it likely would've been caught after screwing up the first one. This is the result of SDN (software defined networking) and being able to push a change globally from one command.

I recall major ISPs screwing up their routing tables in the past but never globally on this level.

kanbara4y ago

https://www.techjuice.pk/country-blocked-youtube-globally/

https://www.bleepingcomputer.com/news/security/major-bgp-lea...

toast04y ago

https://www.itproportal.com/news/misconfigured-centurylink-d...

https://www.bleepingcomputer.com/news/technology/ibm-cloud-g... (this one isn't clear, maybe BGP hijacking, and if so, not sure who the responsible party was)

https://www.catchpoint.com/blog/vodafone-idea-bgp-leak (not sure how major this one was)

You can practically search ISP bgp outage and get news about the last couple times they screwed up BGP and caused a big problem. Or service BGP and get a 50/50 chance of the service screwing up BGP or an ISP/country hijacking their routes and causing a big problem.

BGP is one of the best ways to break things at scale.

mftb4y ago

Ultimately I think you are right, but I don't think this is the right way to think about the question. It's not as though they thought they were creating a monolith on purpose by following a pattern for creating monoliths. They thought they were following best practices and building a distributed system, without single-points of failure. I think a more productive way to think about it might be what's the psychology and practices that led them to be wrong? My best guess is that at some points in the stack DNS is opaque, and at other points, particularly for trained network people, it becomes transparent (i.e. invisible) and disappears (like a mirage), so then they make both the network and physical locks dependent on it (and BGP) and... lock themselves out when it fails.

barkingcat4y ago

If you want 1 facebook.com entity (that in turn controls instagram, whatsapp, etc.), then I get why a single AS change would take everything out globally.

The anti-anti-pattern would be to have country specific AS's, kind of like a franchised restaurant pattern, where each country's version of fb owns their own AS that loosely forms back into the facebook mothership org, but from infrastructure point of view points to their own set of AS netspace.

aaronax4y ago

Don't AS "things" only affect the IP addresses? Surely you could have DNS records for facebook.com pointing to IP addresses in multiple ASes?

ptd4y ago

One might argue that the fact that you care about things like this is why you don’t run Facebook.

silisili4y ago

Not an antipattern. Route announcements/withdrawals are well, serious things and should likely flow through a single source. The idea of a 'microservice' based route update makes any network person uneasy.

spoonjim4y ago

I mean if you own 3 independent businesses, each of which would be worth over $100 billion, and you break all of them simultaneously for an entire day, including your internal email and your badge entry systems, yes, that is definitionally an anti-pattern.

tshaddox4y ago

Surely “anti-pattern” doesn’t just mean “anything with negative outcomes.” Couldn’t this just be a really big mistake that isn’t indicative of an anti-pattern?

wmf4y ago

One of the usual justifications for acquisitions is to save money using common infrastructure. Instagram and WhatsApp haven't been independent for a while.

geerlingguy4y ago· 12 in thread

Gotta love how painfully vague this is. Sounds like a PR piece for investors, not an engineering blog piece.

johnduhart4y ago

I think you need to re-adjust your expectations, it's not reasonable to have a fully fleshed out RCA blog post available within hours of incident resolution. Most other cloud providers take a few days for theirs.

padolsey4y ago

I mean, not an RCA per se, but info more akin to cloudflare's blog post would be v welcome IMHO: https://blog.cloudflare.com/october-2021-facebook-outage/

tinus_hn4y ago

It’s not reasonable to demand any details at all, it’s nice of them to notify people of what went wrong but it really is none of our business.

ajkjk4y ago

Disagree -- it's here to establish something that a lot of people have been speculating about, which is whether it's hacking-related. It doesn't say much because its purpose is to deliver a single bit of information: { hacking: boolean }

yuliyp4y ago

It's less vague than you realize. It points out that the problem was within Facebook's network between its datacenters. This not only suggests it's related to express backbone, but also suggests that the DNS BGP withdrawal which Cloudflare observed was not the primary issue.

It's not a full root cause analysis, to be sure, and leaves many open questions, but I definitely wouldn't describe it as painfully vague.

EricE4y ago

A point of distinction - there is no "DNS BGP withdrawal".

DNS is related to BGP only in that without the right BGP routes in the routers, no packets can get to the facebook networks and thus the facebook DNS servers.

That their DNS servers were taken out was a side effect of the root issue - they withdrew all the routes to their networks from the rest of the Internet.

Not picking on you - but there has been a lot of confusion around DNS that is mostly a red herring and people should just drop it from the conversation. Everything on facebook networks disappeared, not just DNS. The main issue is they effectively took a pair of scissors to every one of their internet connections - d'oh!
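
A minimal sketch of that dependency, assuming (as this comment does) that every announced prefix was withdrawn. It models a routing table as a set of prefixes with longest-prefix matching; the addresses come from the documentation ranges and are not Facebook's real ones, and `lookup` and `reachable` are hypothetical helper names:

```python
import ipaddress


def lookup(table, addr):
    """Return the most specific prefix in `table` that covers `addr`, or None."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in table if ip in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


def reachable(table, addr):
    """A packet can be forwarded only if some route covers the destination."""
    return lookup(table, addr) is not None


# Hypothetical addresses: one authoritative DNS server, one web server.
NAMESERVER = "192.0.2.53"
WEB_SERVER = "198.51.100.10"

table = {
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
}

before = (reachable(table, NAMESERVER), reachable(table, WEB_SERVER))
table.clear()  # all prefixes withdrawn
after = (reachable(table, NAMESERVER), reachable(table, WEB_SERVER))
```

With the routes gone, the name server and the web server are equally unreachable; failed DNS lookups are just the first symptom a client notices.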

paxys4y ago

RCAs take time. It's best to issue vague statements during and right after an incident rather than make guesses.

KronisLV4y ago

> It's best to issue vague statements during and right after an incident rather than make guesses.

Why? Why couldn't you just post that the RCA is still ongoing and that proper updates will follow? Otherwise all you get is meaningless fluff.

rplnt4y ago

Not worth clicking really, everything is in the url.

vmception4y ago

It's important for many stakeholders to understand it wasn’t a hack/exploit or malicious third party or malicious insider

It's much better for some random committee in Congress to debate antitrust forever, instead of bigger committees and agencies debating national security threats

bawolff4y ago

It clearly is a PR piece for investors and customers. And that's ok, not everything is an eng blog.

retSava4y ago

Pointing out that this is published under _engineering_.fb.com.

paxys4y ago· 12 in thread

It was quite ironic that while every Facebook property was offline there was an immense amount of misinformation about the incident perpetuated across the internet (including right here on HN) which everyone just believed as fact.

colechristensen4y ago

I didn’t see any disinformation, just initial reports that it was DNS which were later explained to be caused by BGP.

paxys4y ago

- It was government intervention

- Facebook was hacked

- They did it on purpose to bury the whistleblower story

- No one could access Facebook offices

- They had to cut open servers with angle grinders

- Disgruntled employees changed DNS records

- Lots of made up numbers for how much money Facebook/the rest of the economy was losing (or gaining)

They probably rushed out this blog post just to dispel some of these rumors.

jensensbutton4y ago

Seriously. I also saw lots of posts about how quiet it would be with Facebook down, but I don't think I've ever been exposed to so many stories and so much chatter about Facebook in a single day.

megablast4y ago

Do you think people on Facebook just talk about facebook??

tayo424y ago

I work in a different social media company that has had some visible outages. It's always hilarious to see how wrong people are with their confident speculation. It's a good reminder that people online are often full of shit.

dijit4y ago

I work in video games.

It's amazing how wrong people can be and how confident they are about being right.

Even sometimes fighting _me_ about things _I_ designed and built.

It’s quite sobering; taught me not to believe all the speculation I read.

oversighzed4y ago

Lol I had the exact same lesson too. Saw people spewing falsehoods as facts and being misled by others. I’ve started to take everything I see online with a pinch of salt.

DSingularity4y ago

Dang I missed the misinformation.

heisenbit4y ago

The problem is not misinformation per se but the largest social media data processor running algorithms boosting such information for profit.

tomrod4y ago

Like what?

bawolff4y ago

Lots of people blaming dns

runawaybottle4y ago

We almost went down the ‘this is a subterfuge to delete whistleblower evidence’ rabbit hole.

go_prodev4y ago· 7 in thread

I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it.

DR downtime was about an hour, but the bank fired him anyway.

Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.

Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.

xchaotic4y ago

"I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it. DR downtime was about an hour, but the bank fired him anyway." So prod wasn't down, he fixed it in an hour, and they fired the guy who knew how to fix such things so quickly. Idiot manager at the bank.

datavirtue4y ago

Had a DBA once who was playing around with database projects in visual studio and he managed to hose the production database in the course of it. This caused our entire system to go down.

Prostrate, he came before the COO expecting to be canned with much malice. The COO just asked if he learned his lesson and said all is forgiven.

go_prodev4y ago

I agree it was very heavy handed, but I suspect there was more at play (not the first mistake, and some regulatory reporting that may have looked bad for higher ups)

quartz4y ago

Facebook has a very healthy approach to incident response (one of the reasons it's so rare for the site to go down at all despite the enormous traffic and daily code pushes).

Unless there was some kind of nefarious intent, it's very unlikely anyone will be 'punished'. The likely ramifications will be around changes to processes, tests, automations, and fallbacks to 1) prevent the root sequence of events from happening again and 2) make it easier to recover from similar classes of problems in the future.

ethbr04y ago

I've never understood companies that fire individuals when policies were followed and an incident happened. Or, when no policies existed. Or, when policies are routinely bypassed.

Organizational failures require organizational solutions. That seems pretty obvious.

AnIdiotOnTheNet4y ago

Harsh. Unless there is more to the story being fired for a mistake like that is ridiculous. Everyone fucks up occasionally, and on the scale of fuck ups I've certainly done worse than a 1hr DR site outage, as I'm sure pretty much anyone who's ever run infrastructure has. A consistent pattern of fucking up is grounds for termination, but not any one off instance unless there was an extreme level of negligence on display.

> Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.

I bring this sort of thing up all the time in disaster planning. There are scales of disaster so big that business continuity is simply not going to be a priority.

drcross4y ago

>DR downtime was about an hour, but the bank fired him anyway

The US, not even once.

The guy should have had "reload in 10", an outage window and config review. There must be more to this story than it being a firable offence for causing a P2 outage for an hour.

sydthrowaway4y ago· 7 in thread

Any FB throwaway know if someone got fired for this?

phreeza4y ago

I think it would be extremely unusual and counterproductive for someone in the trenches to get fired about this, as it is clearly a failure of procedure that this was even possible and so hard to recover from. Large companies I am aware of have a no-blame postmortem culture around this stuff. There may be people suffering consequences at a higher level in the SRE division, though I doubt this will happen in a timeframe of hours after the outage.

yalok4y ago

The Bootcamp training at FB explicitly mentions that such things are not a fire-able offense - the attitude is around learning - if you managed to bring everything down, let’s learn together how you managed to do this… :)

sydthrowaway4y ago

I don't think this is the case. Wasn't TechLead fired for SEV events?

ridaj4y ago

If it's the intern, talk about a learning moment

albert_e4y ago

An intern bringing Facebook.com properties down would be a learning moment for the company and for the world

bpodgursky4y ago

I heard rumors it was triggered by a PR auto-merge bot.

reilly30004y ago

That bot is definitely getting fired.

lionkor4y ago· 6 in thread

So this is pure conspiracy theory, but to me this could be a security issue. What if something deep in the core of your infrastructure is compromised? Everything at risk? I'd ask my best engineer, he'd suggest to shut it down, and the best way to do that is to literally pull the plug on what makes you public. Tell everyone we accidentally messed up a BGP and that's it.

But yeah, likely not.

colordrops4y ago

Speaking of conspiracies, one that is floating around is that this was done to cover up spread of information around the Pandora Leak.

fragmede4y ago

BGP is public routing information and multiple external sources are able to confirm that aspect of the story. It makes for a good conspiracy theory but the BGP withdrawal is as real as the Moon landing.

throw0101a4y ago

> It makes for a good conspiracy theory but the BGP withdrawal is as real as the Moon landing.

I wasn't aware that Stanley Kubrick was now in NetOps. /s

laurent924y ago

If Facebook had been under actual attack, and defended by taking itself off the internet… that would be the most hands-on approach to security.

can16358p4y ago

Even though it's probably not that, I must admit the fact that I absolutely love reading theories like this.

EricE4y ago

Many have pointed out that a couple of weeks ago Facebook had a paper out on how they had implemented a fancy new automated system to manage their BGP routes.

Whoops! Never attribute to malice that which can more easily be explained by stupidity and all that.

imgabe4y ago· 5 in thread

It just occurred to me to wonder if Facebook has a Twitter account and if they used it to update people about the outage. It turns out they do, and they did, which makes sense. Boy, it must have been galling to have to use a competing communication network to tell people that your network is down.

It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).

melvinmt4y ago

> It looks like Zuckerberg doesn't have a personal Twitter though

He does: https://twitter.com/finkd

nostromo4y ago

His LinkedIn photo used to be this really awkward laptop camera photo of roughly this face: (-_-)

It was amazing. I’m sad he removed it.

imgabe4y ago

Ah, I saw that one, but it wasn't verified so I figured it was an imposter. It has only a handful of tweets from 2009 and 1 from 2012, but it could really be him, I suppose.

jell4y ago

They have an official account. https://twitter.com/Facebook/status/1445061804636479493

hint: "some people".

e94y ago

I’m not sure they are competing though. They serve different purposes and co-exist pretty well together.

raverbashing4y ago· 4 in thread

The badge story only shows how people are looking for "efficiency" where it doesn't matter, with predictable results.

The badge system should be local to the building. There are few actual reasons (sure, besides "efficiency") why badge control should be centralized. Even fewer reasons for it to be a subdomain of fb. Another option would be to keep the system but make it failsafe (but it seems the newer generation doesn't know what that means). If the network goes down keep it at the last config. Badge validation should be offline first and added/removed ones should be broadcast periodically.

This is the same issue with smartlocks times the number of employees. Do you really want to add another point of failure between yourself and your home?

Sebb7674y ago

Having the badge system work from a single point has a lot of advantages for a company like FB: HR can update info from everywhere (they might not be in the same office), you can immediately deny or block a card everywhere, you have an audit log etc. They're not having this for fun.

Also, it's likely not on an fb subdomain, but something like office.security.fb-infra.com (example). It just happens to be that fb-infra.com is using the Facebook DNS server.

robinson-wall4y ago

It's still possible to design a badge system on an independent network (think just switched within a building) which syncs a local copy of the authoritative ldap from the corp domain, so your badge readers stay working if the link to the corp domain goes away.

It's just more expensive and another thing to maintain, and still doesn't account for _all_ failure modes (what if you sync really frequently and a bad change was made deleting all accounts?)
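
A minimal sketch of that idea, assuming a door controller that keeps the last good copy of the badge list and refuses a sync that would delete a suspiciously large share of it. `LocalBadgeCache`, `max_shrink`, and the 50% threshold are hypothetical choices for illustration, not a description of any real badge system:

```python
class LocalBadgeCache:
    """Building-local copy of the allowed badge IDs (hypothetical design)."""

    def __init__(self, allowed, max_shrink=0.5):
        self.allowed = set(allowed)
        self.max_shrink = max_shrink  # reject syncs removing more than this fraction

    def sync(self, snapshot):
        """Apply a fresh snapshot; keep the last good copy if it is missing or suspicious."""
        if snapshot is None:  # central directory unreachable
            return False
        snapshot = set(snapshot)
        removed = len(self.allowed - snapshot)
        if self.allowed and removed / len(self.allowed) > self.max_shrink:
            return False  # looks like a bad mass deletion; needs a human to confirm
        self.allowed = snapshot
        return True

    def is_allowed(self, badge_id):
        """Decide locally, without any network round trip."""
        return badge_id in self.allowed


cache = LocalBadgeCache({"alice", "bob", "carol", "dave"})
cache.sync(None)                       # link down: old list stays in force
cache.sync(set())                      # suspicious wipe: rejected
cache.sync({"alice", "bob", "carol"})  # one revocation: accepted
```

The trade-off raised in the thread still applies: a revocation only takes effect at the next successful sync, and a threshold like this also blocks legitimate bulk changes until someone overrides it.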

raverbashing4y ago

Sure, this fits under the "few actual reasons" but think for a moment: does it make sense that access to a building is controlled only (keyword here) through a centralized location somewhere? Some DB who knows where? With no fallback?

You might need a break-glass account/badge somewhere. Sure, the angle-grinder works, but probably cost you 2h maybe?

> it's likely not on an fb subdomain, but something like office.security.fb-infra.com

Thanks, yeah, makes sense

bryan_w4y ago

The issue I would imagine is that during this outage some people needed badge access that had previously been revoked due to covid. All of the caching doesn't help if your source of truth is offline.

shahsyed4y ago· 4 in thread

> configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues

This could be anything, potentially.

I'm not very knowledgeable in computer networking, but this could be as trivial as an incorrect update to a DNS record, right?

toast04y ago

Backbone routers don't usually deal with hostnames or DNS. This is pretty much saying they done broke BGP. And it sounds like they're saying that they broke it in a way that prevented accessing their data centers from the PoPs, and we know from the long downtime that it prevented accessing the BGP configuration system from darn near anywhere.

It happened to also kill the announcements for anycast DNS.

dodobirdlord4y ago

There seems to be nothing uncertain about the immediate cause of the issue - Facebook revoked all of their BGP routes, and all of their IP addresses couldn't receive packets until they were restored.

shahsyed4y ago

Understood.

I had a question: This is what we can only perceive through internet/routing table entries, right?

Internal to FB, we don't know what had caused issues that led to the BGP UPDATE.

That's kind of what has been confusing me - there's a lot of speculation around FB's data center design and what actually happened, but we actually don't know for sure until they post an RCA - please correct me if I'm wrong here.

toast04y ago

They didn't revoke all their routes, FWIW, just a lot of them (including the anycast DNS routes)

stormdennis4y ago· 4 in thread

The mobile whatsapp app should notify that the whatsapp servers are down and not allow you to just send messages that won't arrive for six hours

dTal4y ago

The app is designed under the assumption that Facebook servers are never down. If you can't reach the servers, the problem is assumed to be client-side, in which case they have decided the best UI is to keep retrying (not unreasonably in a mobile context). The only way to disambiguate "no internet service" (extremely common) from "Facebook dropped off the internet" (black-swan rare) is to ping some other, third party service. Unless that third party service has infrastructure as good as Facebook's, it will drown in pings the moment Facebook genuinely goes offline. I can see why Facebook wouldn't want to open that can of worms, if they even envisaged this failure mode (unlikely considering the chaos it caused).
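
The disambiguation step can be sketched as a pure decision function. How the probes are actually made (and to whom) is left out on purpose; `Diagnosis` and `diagnose` are hypothetical names, and nothing here reflects how the WhatsApp client really behaves:

```python
from enum import Enum


class Diagnosis(Enum):
    OK = "service reachable"
    CLIENT_OFFLINE = "no internet service on this device"
    SERVICE_DOWN = "service unreachable, internet otherwise fine"


def diagnose(service_ok: bool, third_party_ok: list) -> Diagnosis:
    """Classify a failure from one probe of the service and probes of other sites."""
    if service_ok:
        return Diagnosis.OK
    if any(third_party_ok):
        # At least one unrelated site answered, so the local connection works.
        return Diagnosis.SERVICE_DOWN
    return Diagnosis.CLIENT_OFFLINE
```

For example, `diagnose(False, [True, False])` returns `Diagnosis.SERVICE_DOWN`, while `diagnose(False, [False, False])` returns `Diagnosis.CLIENT_OFFLINE`. Spreading the probes over several third parties would soften, but not remove, the "drown in pings" concern.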

EricE4y ago

>The app is designed under the assumption that Facebook servers are never down.

Which was and is a lame assumption. Stuff happens. SMTP wouldn't even be fazed by this; it would just pick up where it left off.

I've seen far too many applications fail in bizarre ways because people make unrealistic assumptions like "X will ALWAYS be there". Sure, it's highly unlikely that X disappears, but when multiple things make the same dumb assumption, then on the inevitable day when X is suddenly no longer there you start to get cascading effects: Y, which relied on something that relied on X, fails because X was assumed to always be there; then something that depends on Y in the same way unexpectedly fails, and so on.

One should never assume that anything will "always" be available. That's an incredibly unrealistic assumption, and the more interconnected things become, the more the chances of these really nasty dependency chains/cascade failures skyrocket - leading to far worse outages and longer recovery times.
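
The dependency-chain effect can be sketched as a graph walk: given what each service depends on, find everything that goes down when one thing fails. The service names are hypothetical, loosely modelled on examples from this thread, and `affected_by` is a made-up helper:

```python
from collections import deque


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each dependency, which services rely on it?
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, ()):
            if svc not in seen:
                seen.add(svc)
                queue.append(svc)
    return seen


# Hypothetical dependency map: service -> things it needs.
depends_on = {
    "dns": {"bgp"},
    "internal_tools": {"dns"},
    "badge_readers": {"dns"},
    "remote_config_access": {"internal_tools"},
}
down = affected_by("bgp", depends_on)
```

Here a single failure at the bottom takes out all four services above it, including the one needed to fix the original problem.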

stormdennis4y ago

Good points. Then it should just tell the user that they appear to have no internet service.

herald674y ago

yup, they should. I couldn't send the messages as well and thought my mobile had some issues and tried rebooting it

andrewxdiamond4y ago· 3 in thread

This more or less confirms what we’ve heard, and I appreciate the speed, but it’s incredibly lame from a details point of view.

Will a real postmortem follow? Or is this the best we are gonna get?

badtux4y ago

Having been on the team that issued postmortems before, I can tell you that we said as little as possible in as vague a way as possible while meeting our minimum legal requirements. Actual Facebook customers (i.e. those who pay money to Facebook) will get a slightly more detailed release. But the whole goal is to give as little information as possible while appearing to be open. As an engineer that makes me growl, but that's how it is in this litigious world -- don't want to give someone a reason to sue.

Dylan168074y ago

Sue for what, that they couldn't do with zero information? I don't buy that excuse. (Not that I blame you for the excuse.)

laegooose4y ago

How would you explain that AWS, GCE, Cloudflare, GitLab publish very detailed post-mortems?

advpetc4y ago· 3 in thread

Just out of curiosity, does Facebook have a status page? Like http://status.twitter.com?

paxys4y ago

https://status.fb.com/

Although it only covers their API and business apps, not the site itself.

jcims4y ago

It was also down during the outage.

advpetc4y ago

Looks like it! I remember trying to access this site during the down time and it turns out it’s not accessible at the moment

cheesecake_luvr4y ago· 2 in thread

On a side note: when I browse to that page in Firefox (92.0.1) from HN I can't go back to HN - the back arrow is disabled. What gives?

DevoidSimo4y ago

Do you have the facebook container extension? That closes the current tab, opens a new tab with a container, then goes to the facebook link. Reopening the last closed tab works for me, although I haven't noticed this before since I always open links in a new tab.

cheesecake_luvr4y ago

Tried Edge, it works as expected. Tried turning off Facebook Container, it also works as expected. So you are right good Sir!

Still, a bit unexpected behaviour though.

niko0014y ago· 2 in thread

It would be interesting to estimate what dollar value can be ascribed to an x-hour FB outage, both in terms of lost ad revenue for FB itself as well as missed conversions/revenue for businesses running ads on FB/IG.

phtrivier4y ago

Does anyone know if FB's advertisement contracts even have an SLA?

I can completely picture a world in which many people bought some ads yesterday morning (say, to promote an event that occurred yesterday evening), the ads were never displayed to anyone, and FB will keep the money, thank you.

wepple4y ago

Don’t forget WhatsApp users switching to Signal and possibly never returning

Jugurtha4y ago· 2 in thread

The first thing people here thought of was that it was the government denying access to these websites as it usually does for a number of reasons.

judge20204y ago

It was pretty quickly deemed a global phenomenon, so no comments on posts about it said that. Also, enough people here on HN know how to investigate dns and bgp to have found the problem within the first 30 minutes, first with DNS then the revelation that every BGP route associated with them was withdrawn.

Jugurtha4y ago

I have been unclear. By "here", I meant the country I am in.

reilly30004y ago· 2 in thread

So their actual deployment process is quite rigorous and should have a tight blast radius. After lots of emulated and canary testing, their deployments are phased out over weeks. I don't see how a bad push could have done what happened yesterday.

I found a paper that describes the process in detail. See page 10-11:

https://web.archive.org/web/20211005034928/https://research....

Phase Specification

P1 Small number of RSWs in a random DC

P2 Small number of RSWs (> P1) in another random DC

P3 Small fraction of switches in all tiers in DC serving web traffic

P4 10% of switches across DCs (to account for site differences)

P5 20% of switches across DCs

P6 Global push to all switches

We classify upgrades in two classes: disruptive and non-disruptive, depending on if the upgrade affects existing forwarding state on the switch. Most upgrades in the data center are non-disruptive (performance optimizations, integration with other systems, etc.). To minimize routing instabilities during non-disruptive upgrades, we use BGP graceful restart (GR) [8]. When a switch is being upgraded, GR ensures that its peers do not delete existing routes for a period of time during which the switch’s BGP agent/config is upgraded. The switch then comes up, re-establishes the sessions with its peers and re-advertises routes. Since the upgrade is non-disruptive, the peers’ forwarding state are unchanged.

Without GR, the peers would think the switch is down, and withdraw routes through that switch, only to re-advertise them when the switch comes back up after the upgrade. Disruptive upgrades (e.g., changes in policy affecting existing switch forwarding state) would trigger new advertisements/withdrawals to switches, and BGP re-convergence would occur subsequently. During this period, production traffic could be dropped or take longer paths causing increased latencies. Thus, if the binary or configuration change is disruptive, we drain (§3) and upgrade the device without impacting production traffic. Draining a device entails moving production traffic away from the device and reducing effective capacity in the network. Thus, we pool disruptive changes and upgrade the drained device at once instead of draining the device for each individual upgrade.

Push Phases. Our push plan comprises six phases P1-P6 performed sequentially to apply the upgrades to agent/config in production gradually.

We describe the specification of the 6 phases in Table 4. In each phase, the push engine randomly selects a certain number of switches based on the phase’s specification. After selection, the push engine upgrades these switches and restarts BGP on these switches. Our 6 push phases are to progressively increase scope of deployment with the last phase being the global push to all switches. P1-P5 can be construed as extensive testing phases: P1 and P2 modify a small number of rack switches to start the push. P3 is our first major deployment phase to all tiers in the topology.

We choose a single data center which serves web traffic because our web applications have provisions such as load balancing to mitigate failures. Thus, failures in P3 have less impact to our services. To assess if our upgrade is safe in more diverse settings, P4 and P5 upgrade a significant fraction of our switches across different data center regions which serve different kinds of traffic workloads. Even if catastrophic outages occur during P4 or P5, we would still be able to achieve high performance connectivity due to the in-built redundancy in the network topology and our backup path policies—switches running the stable BGP agent/config would re-converge quickly to reduce impact of the outage. Finally, in P6, we upgrade the rest of the switches in all data centers.

Figure 7 shows the timeline of push releases over a 12 month period. We achieved 9 successful pushes of our BGP agent to production. On average, each push takes 2-3 weeks
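The phased selection described in the quoted passage can be illustrated with a rough sketch. This is not Facebook's push engine: the `Switch` type, the `plan_push` function, the tier names other than RSW, and the concrete numbers for "small number" and "small fraction" are all assumptions made for illustration. Only the 10% and 20% figures and the overall P1-P6 ordering come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    dc: str    # data center the switch lives in
    tier: str  # "RSW" = rack switch; other tier names are placeholders


# Assumed values: the paper only says "small number" / "small fraction".
P1_COUNT, P2_COUNT, P3_PERCENT = 2, 4, 5
# These two percentages are the ones given for P4 and P5.
P4_PERCENT, P5_PERCENT = 10, 20


def plan_push(switches, web_dc, seed=0):
    """Return {phase: [switches]} so that every switch is picked exactly once."""
    rng = random.Random(seed)
    done = set()
    plan = {}

    def take(phase, pool, k):
        # Randomly pick up to k switches from the pool and mark them upgraded.
        chosen = rng.sample(pool, min(k, len(pool)))
        done.update(s.name for s in chosen)
        plan[phase] = chosen

    def pending():
        return [s for s in switches if s.name not in done]

    # P1: a few rack switches in one random data center.
    p1_dc = rng.choice(sorted({s.dc for s in switches}))
    take("P1", [s for s in pending() if s.dc == p1_dc and s.tier == "RSW"], P1_COUNT)

    # P2: slightly more rack switches in a different random data center
    # (falls back to the same one if there is only a single data center).
    others = sorted({s.dc for s in switches} - {p1_dc}) or [p1_dc]
    p2_dc = rng.choice(others)
    take("P2", [s for s in pending() if s.dc == p2_dc and s.tier == "RSW"], P2_COUNT)

    # P3: a small fraction of the switches (any tier) in the web-serving data center.
    web_pool = [s for s in pending() if s.dc == web_dc]
    take("P3", web_pool, max(1, len(web_pool) * P3_PERCENT // 100))

    # P4 / P5: 10% and then 20% of the whole fleet, across data centers.
    take("P4", pending(), len(switches) * P4_PERCENT // 100)
    take("P5", pending(), len(switches) * P5_PERCENT // 100)

    # P6: global push to everything that is left.
    take("P6", pending(), len(switches))
    return plan
```

A real rollout would presumably run health checks between phases and stop on failure; the sketch only covers the selection step.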

marcosfelt4y ago

If they have such a rigorous release process, what could have caused all of the dns records to get wiped?

1970-01-014y ago

"Are you sure you want to remove ALL routes to AS32934? Type YES to confirm."

Hey what is our internal BGP called again? AS32934?

"Yeah"

"OOK."

herald674y ago· 2 in thread

Do you think DLT/blockchain can minimize the chance of this happening again in the future?

jackric4y ago

This is uh, no offense, but.. you are a robot, aren't you?

herald674y ago

nope

dave3334y ago· 2 in thread

I thought DARPA designed the internet to survive nuclear war - no single point of failure - clearly Facebook's network breaks that rule. They need a DNS of last resort that doesn't update fast.

paxys4y ago

Far easier to make a system resilient to bombing than a bad configuration update

Telluur4y ago

On your single point of failure: that might have been true once, but it certainly isn't these days.

Networks have grown so large and complex that the only reasonable way of managing them is through SDN, and a small mistake in configuration might result in a cascading effect on the whole infrastructure.

That's also true for the entire (western) internet. We've ended up with a centralized market where a few key players, e.g. cloud providers/CDNs/DNS (Amazon/Google/Microsoft/Akamai/Fastly/Cloudflare) can easily break large parts of the internet. See Akamai outage in July.

vishesh924y ago· 2 in thread

> We also have no evidence that user data was compromised as a result of this downtime.

I am not sure why they had to mention this specifically. This makes it sound like an external attack.

KZerda4y ago

There were rumors early in the downtime that it was the responsibility of various outside groups. Saying, "no, your data was not impacted" is pretty standard in light of those rumors, even if they weren't the main ones spreading around after the initial reports.

tsimionescu4y ago

It doesn't make it sound like an attack; it's standard boilerplate to dispel any worries. It's natural for anyone to wonder "downtime -> data loss?", so it's natural to reassure people that it wasn't the case.

gannon-4y ago· 1 in thread

This is a funny post to have suggested at the bottom of the article: https://engineering.fb.com/2021/08/09/connectivity/backbone-...

itronitron4y ago

Looks like the 'Failure Generator' was brought online.

eyelidlessness4y ago· 1 in thread

One of the things they restored was annoying sounds in the app every time I tap anything. Who knew that was DNS related!

eyelidlessness4y ago

I’ll take my downvotes but I’d be happy for anyone to explain why.

coliveira4y ago· 1 in thread

The best course of action is to split FB into separate companies. It is already neatly divided between Instagram, WhatsApp and legacy Facebook. That would be the best way for the government to avoid disruptions.

colechristensen4y ago

Of all the reasons to break up big companies, protecting consumers from Instagram downtime is not one of them.

stephenhuey4y ago

Even though the angle grinder story wasn’t accurate, it’d still be interesting to know what percentage of the time to fix the outage was spent on regaining physical access:

https://mobile.twitter.com/mikeisaac/status/1445196576956162...

runawaybottle4y ago

It was interesting to visit the subreddits of random countries (e.g. /r/Mongolia) and see the top posts all asking if fb/Insta/WhatsApp being down was local or global. I got the impression this morning that it was only affecting NA and Europe, but it looks like it was totally global. The number of people trying to log in must be staggering.

supermatt4y ago

Sounds like they could do with some updates to their risk-driven backbone management strategy!

https://engineering.fb.com/2021/08/09/connectivity/backbone-...

crtasm4y ago

"To all the people and businesses around the world who depend on us, " ... yesterday was another example of why you shouldn't depend on us to such an extent.

metissec984y ago

Well that doesn't say a whole lot... I know it is early but they could use a little more detail. Even if it is just a timeline.

dev_tty014y ago

>We also have no evidence that user data was compromised as a result of this downtime.

No, that just happens during uptime.

dugo4y ago

Around the turn of the century, in a network the size of Europe, we had OOB comms to the core routers via ISDN/POTS. We experimented with mobile phones in the racks as well, much to the chagrin of the old telco guys running the PoPs.

dr_hooo4y ago

Why is this non-post on the front page? It's PR only.

wyldfire4y ago

Move fast and

NO CARRIER

r00tanon4y ago

"Post hoc ergo propter hoc"

r00tanon4y ago

Remember, remember, the 4th of October.

r00tanon4y ago

Yes. It is true. If you enter Facebook into Facebook. It will break the internet.

Elyes-ghorbel4y ago

Could you please be more clear about ''no evidence that user data was compromised''

trthomps4y ago

Reading this statement all I can think of is this scene https://www.youtube.com/watch?v=15HTd4Um1m4

1970-01-014y ago

TL;DR

We YOLO'd our BGP experiment to prod. It failed.

https://web.archive.org/web/20210626191032/https://engineeri...

andy-x4y ago

Such BS. FB imagining that they are their own Internet but failing in the most miserable way because they need the actual Internet to communicate.

rvz4y ago

It has been painfully admitted by the Facebook mafia that they know that they are the internet and farming the data of an entire civilisation; further evidence that this deep integration of their services needs to be broken up.

After all the scandals, leaks, whistleblowers etc it would take more than a DNS record wipe to take down the Facebook mafia.



andrewxdiamond4y ago

You are correct, but this problem isn’t Facebook’s doing. This is just how BGP works. Even big players like Verizon can screw it up and break the Internet.

For all its flaws, BGP is the piece of the Internet that truly makes it decentralized. Without it, there would be a centralized routing table of some sort.

tw044y ago

I recall major ISPs screwing up their routing tables in the past but never globally on this level.

kanbara4y ago

https://www.techjuice.pk/country-blocked-youtube-globally/

https://www.bleepingcomputer.com/news/security/major-bgp-lea...

toast04y ago

https://www.itproportal.com/news/misconfigured-centurylink-d...

https://www.bleepingcomputer.com/news/technology/ibm-cloud-g... (this one isn't clear, maybe BGP hijacking, and if so, not sure who the responsible party was)

https://www.catchpoint.com/blog/vodafone-idea-bgp-leak (not sure how major this one was)

BGP is one of the best ways to break things at scale.

mftb4y ago

barkingcat4y ago

If you want 1 facebook.com entity (that in turn controls Instagram, WhatsApp, etc.), then I get why a single AS change would take everything out globally.

aaronax4y ago

Don't AS "things" only affect the IP addresses? Surely you could have DNS records for facebook.com pointing to IP addresses in multiple ASes?

ptd4y ago

One might argue that the fact that you care about things like this is why you don’t run Facebook.

silisili4y ago

spoonjim4y ago

tshaddox4y ago

Surely “anti-pattern” doesn’t just mean “anything with negative outcomes.” Couldn’t this just be a really big mistake that isn’t indicative of an anti-pattern?

1 more reply

wmf4y ago

One of the usual justifications for acquisitions is to save money using common infrastructure. Instagram and WhatsApp haven't been independent for a while.

geerlingguy4y ago· 12 in thread

Gotta love how painfully vague this is. Sounds like a PR piece for investors, not an engineering blog piece.

johnduhart4y ago

padolsey4y ago

I mean, not an RCA per se, but info more akin to cloudflare's blog post would be v welcome IMHO: https://blog.cloudflare.com/october-2021-facebook-outage/

1 more reply

tinus_hn4y ago

It’s not reasonable to demand any details at all, it’s nice of them to notify people of what went wrong but it really is none of our business.

1 more reply

ajkjk4y ago

yuliyp4y ago

It's not a full root cause analysis, to be sure, and leaves many open questions, but I definitely wouldn't describe it as painfully vague.

EricE4y ago

A point of distinction - there is no "DNS BGP withdrawal".

DNS is related to BGP only in that without the right BGP routes in the routers, no packets can get to the Facebook networks and thus the Facebook DNS servers.

That their DNS servers were taken out was a side effect of the root issue - they withdrew all the routes to their networks from the rest of the Internet.
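A toy model may make the distinction concrete. The sketch below is an assumption-laden illustration, not how any real router or resolver is implemented: `RoutingTable`, `resolve`, the documentation-range prefixes and the `example.test` record are all made up. It only shows that once the covering route is withdrawn, the DNS server becomes unreachable even though its records are untouched.

```python
import ipaddress


class RoutingTable:
    """Toy routing table with longest-prefix-match lookup."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: packets to this address are dropped
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


# Hypothetical authoritative DNS server and its (unchanged) records.
DNS_SERVER = "192.0.2.53"
DNS_RECORDS = {"www.example.test": "198.51.100.10"}


def resolve(table, name):
    """Answer a query only if the DNS server itself is reachable."""
    if table.lookup(DNS_SERVER) is None:
        return None  # the query never arrives, so the records are irrelevant
    return DNS_RECORDS.get(name)
```

With `192.0.2.0/24` announced, `resolve(table, "www.example.test")` returns the address; after `table.withdraw("192.0.2.0/24")` it returns `None`, although `DNS_RECORDS` never changed.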

2 more replies

paxys4y ago

RCAs take time. It's best to issue vague statements during and right after an incident rather than make guesses.

KronisLV4y ago

> It's best to issue vague statements during and right after an incident rather than make guesses.

Why? Why couldn't you just post that the RCA is still ongoing and that proper updates will follow? Otherwise all you get is meaningless fluff.

1 more reply

rplnt4y ago

Not worth clicking really, everything is in the url.

vmception4y ago

It's important for many stakeholders to understand it wasn’t a hack/exploit or malicious third party or malicious insider.

It's much better for some random committee in Congress to debate antitrust forever, instead of bigger committees and agencies debating national security threats.

bawolff4y ago

It clearly is a PR piece for investors and customers. And that's ok, not everything is an eng blog.

retSava4y ago

Pointing out that this is published under _engineering_.fb.com.

paxys4y ago· 12 in thread

colechristensen4y ago

I didn’t see any disinformation, just initial reports that it was DNS which were later explained to be caused by BGP.

paxys4y ago

- It was government intervention

- Facebook was hacked

- They did it on purpose to bury the whistleblower story

- No one could access Facebook offices

- They had to cut open servers with angle grinders

- Disgruntled employees changed DNS records

- Lots of made up numbers for how much money Facebook/the rest of the economy was losing (or gaining)

They probably rushed out this blog post just to dispel some of these rumors.

10 more replies

jensensbutton4y ago

Seriously. I also saw lots of posts about how quiet it would be with Facebook down, but I don't think I've ever been exposed to so many stories and so much chatter about Facebook in a single day.

megablast4y ago

Do you think people on Facebook just talk about facebook??

tayo424y ago

dijit4y ago

I work in video games.

It's amazing how wrong people can be and how confident they are about being right.

Even sometimes fighting _me_ about things _I_ designed and built.

It’s quite sobering; taught me not to believe all the speculation I read.

2 more replies

oversighzed4y ago

Lol I had the exact same lesson too. Saw people spewing falsehoods as facts and being misled by others. I’ve started to take everything I see online with a pinch of salt.

DSingularity4y ago

Dang I missed the misinformation.

heisenbit4y ago

The problem is not misinformation per se but the largest social media data processor running algorithms boosting such information for profit.

tomrod4y ago

Like what?

bawolff4y ago

Lots of people blaming dns

2 more replies

runawaybottle4y ago

We almost went down the ‘this is a subterfuge to delete whistleblower evidence’ rabbit hole.

3 more replies

go_prodev4y ago· 7 in thread

I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it.

DR downtime was about an hour, but the bank fired him anyway.

Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.

xchaotic4y ago

datavirtue4y ago

Had a DBA once who was playing around with database projects in visual studio and he managed to hose the production database in the course of it. This caused our entire system to go down.

Prostrate, he came before the COO expecting to be canned with much malice. The COO just asked if he learned his lesson and said all is forgiven.

go_prodev4y ago

I agree it was very heavy handed, but I suspect there was more at play (not the first mistake, and some regulatory reporting that may have looked bad for higher ups)

quartz4y ago

Facebook has a very healthy approach to incident response (one of the reasons it's so rare for the site to go down at all despite the enormous traffic and daily code pushes).

ethbr04y ago

I've never understood companies that fire individuals when policies were followed and an incident happened. Or, when no policies existed. Or, when policies are routinely bypassed.

Organizational failures require organizational solutions. That seems pretty obvious.

1 more reply

AnIdiotOnTheNet4y ago

I bring this sort of thing up all the time in disaster planning. There are scales of disaster so big that business continuity is simply not going to be a priority.

drcross4y ago

>DR downtime was about an hour, but the bank fired him anyway

The US, not even once.

The guy should have had "reload in 10", an outage window and config review. There must be more to this story than it being a firable offence for causing a P2 outage for an hour.

sydthrowaway4y ago· 7 in thread

Any FB throwaway know if someone got fired for this?

phreeza4y ago

1 more reply

yalok4y ago

sydthrowaway4y ago

I don't think this is the case. Wasn't TechLead fired for SEV events?

1 more reply

ridaj4y ago

If it's the intern, talk about a learning moment

albert_e4y ago

An intern bringing Facebook.com properties down would be a learning moment for the company and for the world

bpodgursky4y ago

I heard rumors it was triggered by a PR auto-merge bot.

reilly30004y ago

That bot is definitely getting fired.

lionkor4y ago· 6 in thread

But yeah, likely not.

colordrops4y ago

Speaking of conspiracies, one that is floating around is that this was done to cover up spread of information around the Pandora Leak.

fragmede4y ago

throw0101a4y ago

> It makes for a good conspiracy theory but the BGP withdrawal is as real as the Moon landing.

I wasn't aware that Stanley Kubrick was now in NetOps. /s

laurent924y ago

If Facebook had been under actual attack, and defended by taking itself off the internet… that would be the most hands-on approach to security.

can16358p4y ago

Even though it's probably not that, I must admit the fact that I absolutely love reading theories like this.

EricE4y ago

Many have pointed out that a couple of weeks ago Facebook had a paper out on how they had implemented a fancy new automated system to manage their BGP routes.

Whoops! Never attribute to malice that which can more easily be explained by stupidity and all that.

imgabe4y ago· 5 in thread

It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).

melvinmt4y ago

> It looks like Zuckerberg doesn't have a personal Twitter though

He does: https://twitter.com/finkd

nostromo4y ago

His LinkedIn photo used to be this really awkward laptop camera photo of roughly this face: (-_-)

It was amazing. I’m sad he removed it.

imgabe4y ago

Ah, I saw that one, but it wasn't verified so I figured it was an imposter. It has only a handful of tweets from 2009 and 1 from 2012, but it could really be him, I suppose.

1 more reply

jell4y ago

They have an official account. https://twitter.com/Facebook/status/1445061804636479493

hint: "some people".

e94y ago

I’m not sure they are competing though. They serve different purposes and co-exist pretty well together.

raverbashing4y ago· 4 in thread

The badge story only shows how people are looking for "efficiency" where it doesn't matter, with predictable results.

This is the same issue with smartlocks times the number of employees. Do you really want to add another point of failure between yourself and your home?

Sebb7674y ago

Also, it's likely not on an fb subdomain, but something like office.security.fb-infra.com (example). It just happens to be that fb-infra.com is using the Facebook DNS server.

robinson-wall4y ago

It's just more expensive and another thing to maintain, and still doesn't account for _all_ failure modes (what if you sync really frequently and a bad change was made deleting all accounts?)

raverbashing4y ago

You might need a break-glass account/badge somewhere. Sure, the angle grinder works, but it probably costs you 2h, maybe?

> it's likely not on an fb subdomain, but something like office.security.fb-infra.com

Thanks, yeah, makes sense

bryan_w4y ago

The issue I would imagine is that during this outage some people needed badge access that had previously been revoked due to covid. All of the caching doesn't help if your source of truth is offline.

shahsyed4y ago· 4 in thread

> configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues

This could be anything, potentially.

I'm not very knowledgeable in computer networking, but this could be as trivial as an incorrect update to a DNS record, right?

toast04y ago

It happened to also kill the announcements for anycast DNS.

dodobirdlord4y ago

There seems to be nothing uncertain about the immediate cause of the issue - Facebook revoked all of their BGP routes, and all of their IP addresses couldn't receive packets until they were restored.

shahsyed4y ago

Understood.

I have a question: this is only what we can perceive through internet routing table entries, right?

Internal to FB, we don't know what had caused issues that led to the BGP UPDATE.

1 more reply

toast04y ago

They didn't revoke all their routes, FWIW, just a lot of them (including the anycast DNS routes).

1 more reply

stormdennis4y ago· 4 in thread

The mobile whatsapp app should notify that the whatsapp servers are down and not allow you to just send messages that won't arrive for six hours

dTal4y ago

EricE4y ago

>The app is designed under the assumption that Facebook servers are never down.

Which was and is a lame assumption. Stuff happens. SMTP wouldn't even be fazed by this; it would just pick up where it left off.

1 more reply

stormdennis4y ago

Good points. Then it should just tell the user that they appear to have no internet service.

herald674y ago

yup, they should. I couldn't send messages either and thought my phone had some issue, so I tried rebooting it

andrewxdiamond4y ago· 3 in thread

This more or less confirms what we’ve heard, and I appreciate the speed, but it’s incredibly lame from a details point of view.

Will a real postmortem follow? Or is this the best we are gonna get?

badtux4y ago

Dylan168074y ago

Sue for what, that they couldn't do with zero information? I don't buy that excuse. (Not that I blame you for the excuse.)

laegooose4y ago

How would you explain that AWS, GCE, Cloudflare, GitLab publish very detailed post-mortems?

4 more replies

advpetc4y ago· 3 in thread

Just out of curiosity, does Facebook have a status page? Like http://status.twitter.com?

paxys4y ago

https://status.fb.com/

Although it only covers their API and business apps, not the site itself.

jcims4y ago

It was also down during the outage.

1 more reply

advpetc4y ago

Looks like it! I remember trying to access this site during the downtime, and it turned out it wasn't accessible at the time.

cheesecake_luvr4y ago· 2 in thread

On a side note: when I browse to that page in Firefox (92.0.1) from HN I can't go back to HN - the back arrow is disabled. What gives?

DevoidSimo4y ago

cheesecake_luvr4y ago

Tried Edge, it works as expected. Tried turning off Facebook Container, it also works as expected. So you are right good Sir!

Still, a bit unexpected behaviour though.

niko0014y ago· 2 in thread

phtrivier4y ago

Does anyone know if FB's advertisement contracts even have an SLA ?

wepple4y ago

Don’t forget WhatsApp users switching to Signal and possibly never returning

Jugurtha4y ago· 2 in thread

The first thing people here thought of was that it was the government denying access to these websites, as it usually does for a number of reasons.

judge20204y ago

Jugurtha4y ago

I have been unclear. By "here", I meant the country I am in.

1 more reply

reilly30004y ago· 2 in thread

I found a paper that describes the process in detail. See pages 10-11:

https://web.archive.org/web/20211005034928/https://research....

Phase: Specification
P1: Small number of RSWs in a random DC
P2: Small number of RSWs (> P1) in another random DC
P3: Small fraction of switches in all tiers in DC serving web traffic
P4: 10% of switches across DCs (to account for site differences)
P5: 20% of switches across DCs
P6: Global push to all switches
