Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction will become "let's wait and see if it fixes itself".
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
If this were a much smaller service than 1.1.1.1, taking longer than a minute to alarm probably wouldn't surprise me, but this is 1.1.1.1; they're dealing with vast amounts of probably fairly consistent traffic.
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems wait a moment before redeploying in this case, on the assumption that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you often can differentiate no data and real drops, but the overall point, that "if you page people constantly, people will quit", I think is the important one. If people keep getting paged for overly tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
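In practice that loosening is usually encoded directly in the alert definition. A sketch in Prometheus alerting-rule syntax (the metric name, expression, and labels are invented for illustration): the `for:` clause suppresses pages for anything that resolves itself within five minutes.

```yaml
groups:
  - name: metrics-pipeline
    rules:
      - alert: MetricsIngestStalled
        # hypothetical metric: datapoints accepted by the aggregation service
        expr: sum(rate(metrics_ingested_total[1m])) == 0
        for: 5m   # don't page for drops that auto-resolve within 5 minutes
        labels:
          severity: page
```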
When you are building systems like 1.1.1.1, an alert rollup of five minutes is not acceptable, as it will hide legitimate downtime that lasts between 0 and 5 minutes.
You need to design systems which do not rely on orchestration to remediate short transient errors.
Disclosure: I work on a core SRE team for a company with over 500 million users.
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
They have a rather significant vested interest in it being reliable.
Don't you normally have two DNS servers listed on any device? So was the second also down? If not, why didn't it fail over to that?
Unfortunately, the configuration mistake that caused this outage disabled Cloudflare's BGP advertisements of both 1.1.1.0/24 and 1.0.0.0/24 prefixes to its peers.
Btw, I really don't understand why it does not accept an IP (1.1.1.1), so you have to give a hostname (one.one.one.one). It would be more sensible to configure a DNS server from an IP than from a name that itself has to be resolved by a DNS server :/
Normal DNS can normally be changed in your connection settings for a given connection on most flavours of Android.
But I understand why Cloudflare can’t just say “use 8.8.8.8 as your backup”.
If your device doesn't support proper failover use a local DNS forwarder on your router or an external one.
In Switzerland I would use Init7 (an ISP that doesn't filter) -> Quad9 (unfiltered version) -> eu dns0 (unfiltered version)
I get that in theory blah blah, but we now have choices in who gets to see all of our requests and the ISP will always lose out to the other losers in the list
Note how root "." just works and has done for decades - that's proper engineering and actually way more complicated than running 1.1.1.1. What 1.1.1.1 suffers from is anycast and not DNS.
Cloudflare (and Google and co) insist on using one or more "vanity" IP addresses - that framing is a bit unfair of me, but it is what it is - and to make that work, they have to use anycast.
The real issue is fixing anycast and not DNS.
Anyway, select two+ providers and set them.
Unless you do something fancy with a local caching dns proxy with more than one upstream.
I would count not configuring at least two as 'user error'. Many systems require you to enter a primary and alternate server in order to save a configuration.
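For instance, the classic glibc stub resolver reads both servers from /etc/resolv.conf and falls back to the second after a timeout; a minimal sketch, with two independent providers chosen purely for illustration:

```text
# /etc/resolv.conf -- two independent providers, not two IPs of the same one
nameserver 1.1.1.1
nameserver 8.8.8.8
options timeout:2 attempts:2   # try the next server after ~2s instead of the 5s default
```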
Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they have these 'simple', 'deprecated', 'legacy' issues occurring.
8.8.8.8+8.8.4.4 hasn't had a global[1] second of downtime for almost a decade.
[1]: localized issues did exist, but that's really the fault of the internet, and they did remain running when Google itself suffered severe downtime in various different services.
European users might prefer one of the alternatives listed at https://european-alternatives.eu/category/public-dns over US corporations subject to the CLOUD act.
Cloudflare has a reasonable culture around incident response, but it doesn't incentivize proactive prevention.
From the longer term graphs it looks like volume returned to normal https://imgur.com/a/8a1H8eL
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
It’s corporate newspeak. “legacy” isn’t a clear term, it’s used to abstract and obfuscate.
> Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.
I know what this means, but there’s absolutely no reason for it to be written in this inscrutable corporatese.
I will not say whether or not it’s acceptable for a company of their size and maturity, but it’s definitely not hidden in corporate lingo.
I do believe they could have elaborated more on the follow-up steps they will take to prevent this from happening again. I don't think staggered rollouts are the only answer to this; they're just a safety net.
Note that this introduces one extra query of overhead per DNS request whenever the cached entry for the resolver's hostname has expired. For this reason, I've been using https://1.1.1.1/dns-query instead.
In theory, this should eliminate that overhead. Your operating system can validate the IP address of the DNS response by using the Subject Alternative Name (SAN) field within the certificate presented by the DoH server: https://g.co/gemini/share/40af4514cb6e
TLDR; DoH was working
I guess now we should start using a completely different provider as a DNS backup. Maybe 8.8.8.8 or 9.9.9.9.
That said, it's a good idea to specifically pick multiple resolvers in different regions, on different backbones, using different providers, and not use an Anycast address, because Anycast can get a little weird. However, this can lead to hard-to-troubleshoot issues, because DNS doesn't always behave the way you expect.
all-servers
server=8.8.8.8
server=9.9.9.9
server=1.1.1.1

If you were using systemd-resolved however, it retries all servers in the order they were specified, so it's important to interleave upstreams.
Using the servers in the above example, and assuming IPv4 + IPv6:
1.1.1.1
2001:4860:4860::8888
9.9.9.9
2606:4700:4700::1111
8.8.8.8
2620:fe::fe
1.0.0.1
2001:4860:4860::8844
149.112.112.112
2606:4700:4700::1001
8.8.4.4
2620:fe::9
will fail over faster and more successfully on systemd-resolved than if you specify all Cloudflare IPs together, then all Google IPs, etc. Also note that Quad9 is filtering by default on this IP while the other two are not, so you could get intermittent differences in resolution behavior. If this is a problem, don't mix filtered and unfiltered resolvers. You definitely shouldn't mix DNSSEC-validating and non-DNSSEC-validating resolvers if you care about that (all of the above are DNSSEC-validating).
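That interleaved ordering can be generated mechanically. A Python sketch (the list names are mine; the addresses are the resolver IPs listed above, with each provider's own list alternating IPv4/IPv6 so address families interleave too):

```python
from itertools import zip_longest

# Each provider's list alternates IPv4/IPv6, offset so that the merged
# list alternates address families as well as providers.
cloudflare = ["1.1.1.1", "2606:4700:4700::1111", "1.0.0.1", "2606:4700:4700::1001"]
google = ["2001:4860:4860::8888", "8.8.8.8", "2001:4860:4860::8844", "8.8.4.4"]
quad9 = ["9.9.9.9", "2620:fe::fe", "149.112.112.112", "2620:fe::9"]

def interleave(*providers):
    """Round-robin across providers so consecutive entries never share an upstream."""
    return [ip for group in zip_longest(*providers) for ip in group if ip is not None]

servers = interleave(cloudflare, google, quad9)
# servers[:4] == ["1.1.1.1", "2001:4860:4860::8888", "9.9.9.9", "2606:4700:4700::1111"]
```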
I was handling an incident due to this outage. I ended up adding Google DNS resolvers using systemd-resolved, but I didn't think to interleave them!
dnsmasq with a list of smaller trusted DNS providers sounds perfect, as long as it is not considered bad etiquette to spam multiple DNS providers for every resolution?
But where to find a trusted list of privacy focused DNS resolvers. The couple I tried from random internet advice seemed unstable.
I believe that they follow their published policies and have reasonable security teams. They're also both popular services, which mitigates many of the other types of DNS tracking possible.
https://developers.google.com/speed/public-dns/privacy https://developers.cloudflare.com/1.1.1.1/privacy/public-dns...
> OpenNIC (also referred to as the OpenNIC Project) is a user owned and controlled top-level Network Information Center offering a non-national alternative to traditional Top-Level Domain (TLD) registries; such as ICANN.
I need to do a write-up one day
If you want to eschew centralized DNS altogether, if you run a Tor daemon, it has an option to expose a DNS resolver to your network. Multiple resolvers if you want them.
I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)
Anecdotally, I figured out their DNS was broken before it hit their status page and switched my upstream DNS over to Google. Haven't gotten around to switching back yet.
Clients cache DNS resolutions to avoid having to do that request each time they send a request. It's plausible that some clients held on to their cache for a significant period.
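That caching behaviour is easy to make concrete. A toy Python sketch (not a real resolver; the class and names are invented) of TTL-based answer caching:

```python
import time

class DnsCache:
    """Toy illustration: keep answers until their TTL expires."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (address, expiry timestamp)

    def put(self, name, address, ttl):
        # Record the answer along with when it stops being valid.
        self._entries[name] = (address, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expiry = entry
        if self._clock() >= expiry:
            # Stale: drop it so the caller performs a fresh upstream query.
            del self._entries[name]
            return None
        return address
```

A client that last refreshed an entry just before the outage would keep returning the cached address for the whole TTL, which is why some clients sailed through unaffected.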
It would be interesting to see the service level objective (SLO) that cloudflare internally has for this service.
I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services, so this outage would put July in the "< 99.9% but >= 99.0%" bucket, so you'd get a 10% refund for the month if you paid for it.
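A quick back-of-the-envelope check of that bucket, assuming the reported 21:52-22:54 UTC window (62 minutes) was July's only downtime:

```python
# Rough SLA-bucket check under the assumption above.
outage_minutes = 62
minutes_in_july = 31 * 24 * 60  # 44640
uptime = 1 - outage_minutes / minutes_in_july
print(f"{uptime:.4%}")  # 99.8611%: below 99.9% but at least 99.0%
```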
Cloudflare's 1.1.1.1 Resolver service became unavailable to the Internet starting at 21:52 UTC and ending at 22:54 UTC
Weird. According to my own telemetry from multiple networks they were unavailable for a lot longer than that.

I find it somewhat surprising that none of the multiple engineers who reviewed the original change in June noticed that they had added 1.1.1.0/24 to the list of prefixes that should be rerouted. I wonder what sort of human mistake or malice led to that original error.
Perhaps it would be wise to add some hard-coded special-case mitigations to DLS such that it would not allow 1.1.1.1/32 or 1.0.0.1/32 to be reassigned to a single location.
But, yes, a second mitigation here would be defense in depth - in an ideal world, all your systems use the same ops/deploy/etc stack, in this one, you probably want an extra couple steps in the way of potentially taking a large public service offline.
EDIT: Appears I was wrong, it is failover not round-robin between the primary and secondary DNS servers. Thus, using 1.1.1.1 and 8.8.8.8 makes sense.
If you have a more advanced local resolver of some sort (systemd for example) you can configure whatever behaviour you want.
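With systemd-resolved, for example, that configuration lives in /etc/systemd/resolved.conf; a sketch (providers chosen for illustration) that interleaves upstreams so consecutive retries never hit the same provider:

```ini
# /etc/systemd/resolved.conf
[Resolve]
# systemd-resolved retries servers in listed order, so interleave providers
DNS=1.1.1.1 8.8.8.8 9.9.9.9 1.0.0.1 8.8.4.4 149.112.112.112
```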
Maybe there is a noticeable difference?
I have seen more outage incident reports of cloudflare than of google, but this is just personal anecdote.
Last 30 days, 8.8.8.8 has 99.99% uptime vs 1.1.1.1 has 99.09%
For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are 15.0ms, and 9.9.9.9 is 13.8ms.
All of those servers return over 3-nines of uptime when quantised in the "worst result in a given 1 minute bucket" from my monitoring points, which seem fine to have in your mix of upstream providers. Personally I'd never rely on a single provider. Google gets 4 nines, but that's only over 90 days so I wouldn't draw any long term conclusions.
This writing is just brilliant. Clear to technical and non-technical readers. Makes the in-progress migration sound way more exciting than it probably is!
> We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.
This is about as good as you can get it from a company as serious and important as Cloudflare. Bravo to the writers and vetters for not watering this down.
Say what now? A test triggered a global production change?
> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.
You have a process that allows some other service to just hoover up address routes already in use in production by a different service?
I use their DNS over HTTPS and if I hadn't seen the issue being reported here, I wouldn't have caught it at all. However, this—along with a chain of past incidents (including a recent cascading service failure caused by a third-party outage)—led me to reduce my dependencies. I no longer use Cloudflare Tunnels or Cloudflare Access, replacing them with WireGuard and mTLS certificates. I still use their compute and storage, but for personal projects only.
The theory is CF had the capacity to soak up the junk traffic without negatively impacting their network.
It is designed to be used in conjunction with 1.0.0.1. DNS has fault tolerance built in.
Did 1.0.0.1 go down too? If so, why were they on the same infrastructure?
This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole point is that it can go down at any time and everything keeps working.
Shouldn’t the fix be to ensure that these are served out of completely independent silos and update all docs to make sure anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
If I ran a service like this I would regularly do blackouts or brownouts on the primary to make sure that people’s resolvers are configured correctly. Nobody should be using a single IP as a point of failure for their internet access/browsing.
Yes.
> Shouldn’t the fix be to ensure that these are served out of completely independent silos [...]?
Yes.
> If so, why were they on the same infrastructure?
Apparently, they weren't independent enough: the same system in CF handled the announcements for both addresses, so a single change took both out.
The solution for the end user is, of course, to use 1.1.1.1 and 8.8.8.8 (or any other combination of two different resolvers).
If there were some way to view torrenting traffic, no doubt there'd be a 20 minute slump.
I use Cloudflare at work. Cloudflare has many bugs, and some technical decisions are absurd, such as the Workers cache.delete method, which only clears the cache contents in the data center where the Worker was invoked!!! https://developers.cloudflare.com/workers/runtime-apis/cache...
In my experience, Cloudflare support is not helpful at all, trying to pass the problem onto the user, like "Just avoid holding it in that way. ".
At work, I needed to use Cloudflare. The next job I get, I'll put a limit on my responsibilities: I don't work with Cloudflare.
I will never use Cloudflare at home and I don't recommend it to anyone.
Next week: A new post about how Cloudflare saved the web from a massive DDOS attack.
The Cache API is a standard taken from browsers. In the browser, cache.delete obviously only deletes that browser's cache, not all other browsers in the world. You could certainly argue that a global purge would be more useful in Workers, but it would be inconsistent with the standard API behavior, and also would be extraordinarily expensive. Code designed to use the standard cache API would end up being much more expensive than expected.
With all that said, we (Workers team) do generally feel in retrospect that the Cache API was not a good fit for our platform. We really wanted to follow standards, but this standard in this case is too specific to browsers and as a result does not work well for typical use cases in Cloudflare Workers. We'd like to replace it with something better.
To me, it only makes sense if the put method creates a cache only in the datacenter where the Worker was invoked. Put and delete need to be related, in my opinion.
Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
My criticisms aren't about functionality per se or about the developers. I don't doubt the developers' competence, but I feel like there's something wrong with the company culture.
That said, I don't use workers and don't plan to. I personally try to stay away from non cross-platform stuff because I've been burned too heavily with vendor/platform lock-in in the past.
If we changed an API in Workers in a way that broke any Worker in production, we consider that an incident and we will roll it back ASAP. We really try to avoid this but sometimes it's hard for us to tell. Please feel free to contact us if this happens in the future (e.g. file a support ticket or file a bug on workerd on GitHub or complain in our Discord or email kenton@cloudflare.com).
As mentioned in other comments, run it yourself if you are not happy with the stability. Or just pay someone to provide it - like your ISP.
And TBH I trust my local ISP more than Google or CF. Not in availability, but it's covered by my local legislation. That's a huge difference - in a positive way.
which might not be a good thing in some jurisdictions - see the porn block in the UK (it's done via DNS iirc, and trivially bypassed with a third-party DNS like Cloudflare's).
I don't think this is fair when discussing infrastructure. It's reasonable to complain about potholes, undrinkable tap water, long lines at the DMV, cracked (or nonexistent) sidewalks, etc. The internet is infrastructure and DNS resolution is a critical part of it. That it hasn't been nationalized doesn't change the fact that it's infrastructure (and access absolutely should be free) and therefore everyone should feel free to complain about it not working correctly.
"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
I did this for a while, but ~300ms hangs on every DNS resolution sure do get old fast.
When a DNS resolver is down, it affects everything; 100% uptime is a fair expectation, hence redundancy. Looks like both 1.0.0.1 and 1.1.1.1 were down for more than 1h, pretty bad TBH, especially when you advise global usage.
RCA is not detailed and feels like a marketing stunt we are now getting every other week.
Secondary DNS is supposed to be on an independent network to avoid precisely this.
But I do appreciate these types of detailed public incident reports and RCAs.
Not sure what the "advantage" of stub resolvers is in 2025 for anything.
Very frustrating.
I know.