Surely ownership of an AS (and of the prefixes it announces) should be cryptographically anchored, so that an update has to be signed. Updates should be infrequent, so the cost is felt on the control plane, not on the data plane.
I'm sure there's a BGPsec or whatever, like all the other ${oldTech}Sec efforts, but I don't know if there is a realistic solution here or if it's IPv6-style tech.
0: I looked it up before posting and it's 3000 leakers with 12 million leaks per quarter https://blog.qrator.net/en/q3-2022-ddos-attacks-and-bgp-inci...
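For what it's worth, the closest thing actually deployed today is RPKI route-origin validation: prefix holders publish signed ROAs saying which AS may originate a prefix, and routers check announcements against them. A minimal sketch of the validation logic (RFC 6811), using made-up ROA data:

    from dataclasses import dataclass
    from ipaddress import ip_network

    @dataclass
    class Roa:
        prefix: str      # signed authorization: this prefix...
        max_length: int  # ...down to this prefix length...
        asn: int         # ...may be originated by this AS

    def validate(announced: str, origin_asn: int, roas: list[Roa]) -> str:
        net = ip_network(announced)
        covering = [r for r in roas if net.subnet_of(ip_network(r.prefix))]
        if not covering:
            return "not-found"  # no ROA covers this prefix at all
        for r in covering:
            if r.asn == origin_asn and net.prefixlen <= r.max_length:
                return "valid"
        return "invalid"        # covered, but wrong origin or too specific

    roas = [Roa("203.0.113.0/24", 24, 13335)]       # hypothetical example data
    print(validate("203.0.113.0/24", 13335, roas))  # valid
    print(validate("203.0.113.0/24", 64512, roas))  # invalid: wrong origin AS

Note this only validates the origin AS, so it would not have caught this leak, where legitimately-originated routes were re-advertised along the wrong path; that gap is what BGPsec (RFC 8205) and ASPA are aimed at.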
Locally, BGP is peer-to-peer (literally!): no particular peer is forced to check everything, and nobody is even trying to build a single global routing table, so local agreements can override anything decided at a higher level.
A wholesale protocol replacement is unlikely, but definitely more doable than replacing something like IP.
I'll bet JGC can write his own ticket by now, but unretiring would be really bad optics. He's on the board, though, and still keeping a watchful eye. But a couple more of these and CF's reputation will be in the gutter.
And instead of focusing on maintaining those, they decided to go for more money: first adding new features to their existing products (at the risk of breaking them), and then adding new products altogether in a move to become an actual cloud provider.
Priorities shifted from quality products to pushing features daily, and the people who built and maintained the good products have probably left or been assigned to shinier ones, leaving the base to decay.
As a daily user, it's quite frustrating to have a console that is getting far worse than AWS's or Azure's, and features that are more a POC than production-ready.
Your legacy is one of showing how to apply good engineering principles to complex problems at scale, and I think CF is risking that reputation right now.
I feel like a route leak is not something that should happen to Cloudflare. I'm surprised they set their systems up in a way that allows this kind of human error.
One thing to their credit, though: BGP is full of complexity, and this definitely isn't the first time something like this has gone wrong; it's just that at CF's scale the impact is massive, so there is no room for fuckups. But doing this sort of thing right 100% of the time is a really hard problem, and I'm happy I'm not in any way responsible for systems this important.
Whoever is responsible learned a lot of valuable lessons today (you hope).
For quite a few years now, the focus has been on new features and moving fast rather than on reliability.
They don't, because at the end of the day it's not their problem, the money rolls in regardless.
It's sad, but it's how it is. If they cared, these things wouldn't happen. They have a lot of responsibility, but show none whatsoever.
In this case, the timeline states "IMPACT STOP" was at 20:50 UTC and the first post to their status page was 12 minutes later at 21:02 UTC:
"Cloudflare experienced a Network Route leak, impacting performance for some networks beginning 20:25 UTC. We are working to mitigate impact."
Is there any way to test these changes against a simulation of real-world routes? Including ensuring that traffic that shouldn't hit Cloudflare servers continues to resolve to routes that don't hit Cloudflare?
I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?
If there’s a meta-rule I think of when these incidents occur, it’s that configuration rules need change management, and change management is only as good as the level of automated testing. Just because code hasn’t changed doesn’t mean you shouldn’t test the baseline system behavior. And here, that means testing that the Internet works.
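Even a crude version of "test that the Internet works" catches a lot: after a change, probe endpoints that should route through you and endpoints that shouldn't. A minimal sketch, where the canary hosts are placeholders you'd pick yourself:

    import socket

    def reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
        """TCP-connect probe; crude, but catches 'we blackholed the route'."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Placeholder canaries: some behind your network, some that must NOT be.
    CANARIES = ["example.com", "example.net"]
    bad = [h for h in CANARIES if not reachable(h)]
    assert not bad, f"baseline connectivity broken for: {bad}"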
You can get access to views of routes from different parts of the network, but you do not have access to those routers' policies, so no.
> I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?
Just simulating your peers, and maybe the layer after them, is most likely good enough. And you can probably do it with a bunch of cgroups and some actual routing software. There are also network sims like GNS3 that can even run router images.
Set up a simulation router with the same state but the new config, and compute the routing table and the routes that would be advertised to peers.
Confirm that the diff in the routing table and advertised routes is reasonable.
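A minimal sketch of that confirmation step, assuming you can dump advertised routes from both the live and simulated routers as "PREFIX AS_PATH" lines (the dump format and threshold here are made up):

    def route_set(dump_lines):
        """Parse 'PREFIX AS_PATH' lines into a set of (prefix, as_path) pairs."""
        return {tuple(l.split(maxsplit=1)) for l in dump_lines if l.strip()}

    def check_diff(before_dump, after_dump, max_withdrawn=50):
        before, after = route_set(before_dump), route_set(after_dump)
        added, withdrawn = after - before, before - after
        # A cleanup that only removes prefixes should never ADD advertisements;
        # anything newly advertised (like the Miami leak) is an instant red flag.
        ok = not added and len(withdrawn) <= max_withdrawn
        return ok, sorted(added), sorted(withdrawn)

    before = ["198.51.100.0/24 13335"]      # e.g. a Bogota prefix via Miami
    after  = ["192.0.2.0/24 13335 64500"]   # oops: a leaked route appears
    print(check_diff(before, after))        # (False, [('192.0.2.0/24', ...)], ...)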
This change seemed to mostly be about a single location. Other BGP config changes leading to problems are often global changes, but you can check diffs and apply the config change one host at a time. You can't really make a simultaneous change anyway. Maybe one host changing is ok, but the Nth one causes a problem... CF has a lot of BGP routers, so maybe checking every diff is too much, but at least check a few.
Is that something out of the box on routers? I don't know, people with BGP routers never let me play with them. But given the BGP haiku, I'd want something like that before I messed around with things. For the price you pay for these fancy routers, you should be able to buy an extra few to run sandboxed config testing on. You could also simulate with open-source BGP software, but the proprietary BGP daemon on the router might not behave like the open-source one does.
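And the "one host at a time, check the diff each time" loop upthread is simple enough to automate. A sketch, where snapshot and apply_config are hypothetical hooks into whatever automation you already have, not a real router API:

    def staged_rollout(routers, new_config, snapshot, apply_config, acceptable):
        """Stop the release train at the first router whose routing diff looks wrong."""
        for r in routers:
            before = snapshot(r)             # e.g. set of advertised routes
            apply_config(r, new_config)
            delta = snapshot(r) ^ before     # symmetric difference of route sets
            if not acceptable(delta):
                apply_config(r, "previous")  # roll this one back, halt the rest
                raise RuntimeError(f"unexpected routing diff on {r}: {sorted(delta)}")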
This is decentralization in action. You have to take the good with the bad.
(disclaimer: shitpost. my shitpost.)
Flapping is bad in the networking world.
Flapping BGP routes, specifically, is bad because it can stress all BGP routers involved to the point where they can “go crazy”. Routes are explicitly advertised, so if you keep changing the routes, you are tasking the router CPU to process new stuff, discard it and process new stuff. In fact, BGP route flaps are specifically the focus of an entire RFC: https://datatracker.ietf.org/doc/html/rfc2439
More in general, a flapping link (on/off/on/off) can really mess with TCP.
Flapping in the networking world is not something you want to do intentionally.
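For intuition, RFC 2439-style damping boils down to: each flap adds a penalty that decays exponentially, and the route is suppressed while the penalty sits above a threshold. A minimal sketch (the constants are common vendor defaults, not mandated by the RFC):

    PENALTY_PER_FLAP = 1000
    SUPPRESS_AT = 2000   # start ignoring the route above this
    REUSE_AT = 750       # accept it again once decay brings it below this
    HALF_LIFE_S = 900    # 15 minutes

    def penalty(flap_times, now):
        """Total decayed penalty at `now`, given timestamps of past flaps."""
        return sum(PENALTY_PER_FLAP * 0.5 ** ((now - t) / HALF_LIFE_S)
                   for t in flap_times)

    p = penalty([0, 60, 120], now=130)   # three flaps in two minutes
    print(p, "suppressed" if p >= SUPPRESS_AT else "ok")
    # After suppression, the route becomes usable again only once the
    # penalty decays below REUSE_AT, which can take many minutes.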
And obviously you don’t do this on every individual route change - you batch them so it’s a release train.
If you think there are better techniques than “don’t break things”, I’m all for it.
Your "1-minute flap" can propagate and trigger load on every single DFZ BGP router on the planet. That's not cheap.
And 1 minute is too short to even propagate across carriers. There are all kinds of timers working to limit the load from the previous point; your update can still be propagating half an hour later. It can also change state for when you do it for real. And worst of all, BGP routes can get stuck. It's rare, but a real problem.
And stuck routes are a problem but not one this would make worse since those routes would get stuck from normal changes anyway.
The propagation problem isn’t real, because most of the route advertisements that carry most of the traffic clearly do happen quickly. You shouldn’t care about the long tail: you want to minimize the risk of your new route. The old route being present isn’t a problem, and the new route disappearing back to the old also shouldn’t be a problem UNLESS the new route was buggy, in which case you wanted to roll back anyway.
TLDR: these don’t feel like risks unique to advertising a route and then undoing it, given that route publishing already has to be handled anyway AND Cloudflare is a major network that handles a good chunk of the entire internet’s traffic. This isn’t about a strategy for some random tier 2/3 ISP.
Basically, my understanding (simplified) is:
- they originally had a Miami router advertise Bogota prefixes (=subnets) to Cloudflare's peers. Essentially, Miami was handling Bogota's subnets. This is not an issue.
- because you don't normally advertise arbitrary prefixes via BGP, policies were used. These policies are essentially if/then statements, carrying out certain actions (advertise or not, add some tags or remove them,...) if some conditions are matched. This is completely normal.
- Juniper router configuration for this kind of policy is (simplifying):
set <BGP POLICY NAME> from <CONDITION1>
set <BGP POLICY NAME> from <CONDITION2>
set <BGP POLICY NAME> then <ACTION1>
set <BGP POLICY NAME> then <ACTION2>
...
- prior to the incident, CF changed its network so that Miami didn't have to handle Bogota subnets (maybe Bogota does it on its own, maybe there's another router somewhere else)
- the change aimed at removing the configurations on Miami which were advertising Bogota subnets
- the change implementation essentially removed all lines from all policies containing "from IP in the list of Bogota prefixes". This is somewhat reasonable, because you could have the same policy handling both Bogota and, say, Quito prefixes, so you just want to remove the Bogota part.
HOWEVER, there was at least one policy like this:
(Before)
set <BGP POLICY NAME> from is_internal(prefix) == True
set <BGP POLICY NAME> from prefix in bogota_prefix_list
set <BGP POLICY NAME> then advertise
(After)
set <BGP POLICY NAME> from is_internal(prefix) == True
set <BGP POLICY NAME> then advertise
Which basically means: if a prefix is internal, advertise it.
- an "internal prefix" is any prefix that was not received from another BGP entity (autonomous system)
- BGP routers in Cloudflare exchange routes to one another. This is again pretty normal.
- As a result of this change, all routes received by Miami from some other Cloudflare router were re-advertised by Miami
- the result is CF telling the Internet (more accurately, its peers) "hey, you know that subnet? Go ask my Miami router!"
- obviously, this increases bandwidth utilization and latency for traffic crossing the Miami router.
This didn’t catch the fact that removing that line essentially removed the only restrictive condition, allowing routes received from other Cloudflare routers to be re-advertised by the Miami router.
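The failure mode generalizes: a policy term is an AND of its conditions, so deleting a condition can only widen what the term matches. A toy model of the walkthrough above (names are illustrative, not Cloudflare's tooling):

    def term_matches(route, conditions):
        # Fewer conditions == a BROADER match; an empty list matches everything.
        return all(cond(route) for cond in conditions)

    is_internal = lambda r: r["learned_from"] == "ibgp"
    in_bogota_list = lambda r: r["prefix"] in {"198.51.100.0/24"}

    before = [is_internal, in_bogota_list]
    after = [c for c in before if c is not in_bogota_list]  # the "cleanup"

    # A route learned over iBGP from some other Cloudflare PoP:
    leaked = {"prefix": "192.0.2.0/24", "learned_from": "ibgp"}
    print(term_matches(leaked, before))  # False: not a Bogota prefix
    print(term_matches(leaked, after))   # True: now re-advertised by Miami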
Communities are useful in this case, but this kind of thing could have happened with any kind of configuration.
Example:
(Before)
set firewall family inet filter <FILTER NAME> term TERM1 from source-address 10.10.10.1
set firewall family inet filter <FILTER NAME> term TERM1 from destination-port ssh
set firewall family inet filter <FILTER NAME> term TERM1 then discard
What happens when you remove references to 10.10.10.1, maybe because that IP is not blacklisted anymore?
(After)
set firewall family inet filter <FILTER NAME> term TERM1 from destination-port ssh
set firewall family inet filter <FILTER NAME> term TERM1 then discard
You’ve simply removed one condition, leaving all ssh traffic to be discarded. That’s essentially what happened with the BGP outage, only here you have no BGP communities to save you.
That’s why I re-read the RCA, because this kind of incident is way more general than BGP-specific misconfigurations.
Why even bother to write an article about it then haha