Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction will be "let's wait and see if it fixes itself".
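The three-strikes behavior described above can be sketched as a small state machine. This is a hedged sketch of the general soft/hard-state idea, not Nagios's actual implementation; the class and method names are made up:

```python
class CheckDebouncer:
    """Open an alert only after N consecutive failed checks
    (Nagios-style soft/hard states); any success resets the counter."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.failures = 0

    def observe(self, check_ok: bool) -> bool:
        """Return True when the failure becomes 'hard' and should alert."""
        if check_ok:
            self.failures = 0  # spurious blip: forget it
            return False
        self.failures += 1
        return self.failures >= self.max_attempts
```

With `max_attempts=3`, two spurious failures followed by a success never page anyone; only three failures in a row do.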
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
Were it a service with much less traffic than 1.1.1.1 itself, taking longer than a minute to alarm probably wouldn't surprise me, but this is 1.1.1.1; they're dealing with vast amounts of probably fairly consistent traffic.
Thing is, it's probably still some engineering effort, and most orgs only really improve their monitoring after it turned out to be sub-optimal.
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems wait a bit before redeploying in this case, on the assumption that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you can often differentiate no data from real drops, but the overall point, that "if you page people constantly, people will quit", is I think the important one. If people keep getting paged by overly tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
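The distinction drawn above, between "no data because the aggregator crashed" and "a real drop to zero", plus a hold window before paging, can be sketched like this (a hedged sketch of a "fire only if it persists" rule, not any specific monitoring product; names are made up):

```python
from typing import Optional

def should_page(samples: list[Optional[float]],
                threshold: float,
                hold_samples: int) -> bool:
    """Page only if the metric has been below `threshold` for the last
    `hold_samples` samples in a row.  `None` means 'no data' (e.g. the
    aggregation service crashed and metrics are delayed) and is
    deliberately NOT treated as a drop to zero."""
    if len(samples) < hold_samples:
        return False
    recent = samples[-hold_samples:]
    return all(s is not None and s < threshold for s in recent)
```

A gap of `None` samples during a redeploy never pages; a genuine sustained drop below the threshold does, but only after the hold window elapses.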
When you are building systems like 1.1.1.1, having an alert rollup of five minutes is not acceptable, as it will hide legitimate downtime that lasts between 0 and 5 minutes.
You need to design systems which do not rely on orchestration to remediate short transient errors.
Disclosure: I work on a core SRE team for a company with over 500 million users.
I've seen multiple big companies get rid of their NOC and replace it with on-call rotations in multiple focused teams. Instead of 12 people sitting 24/7 in groups of 4 and doing some basic analysis and steps before calling others, you page the correct people within 3-5 minutes, with an exact and specific alert.
Incident resolution times went down dramatically (2-10x, depending on the company), people don't have to sit up overnight and sleep through most of it, and no stupid actions like a service restart get taken that slow down incident resolution.
And I don't like that some platforms hire 1,500 people for a job that could be done with 50-100, but in terms of incident response, if you already have teams with separated responsibilities, then a NOC is "legacy".
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
Operating at the scale of Cloudflare? A lot.
* traffic appears to be down 90% but we're only getting metrics from the regions of the world that are asleep because of some pipeline error
* traffic appears to be down 90% but someone put in a firewall rule causing the metrics to be dropped
* traffic appears to be down 90% but actually the counter rolled over and prometheus handled it wrong
* traffic appears to be down 90% but the timing of the new release just caused polling to show weird numbers
* traffic appears to be down 90% but actually there was a metrics reporting spike and there was pipeline lag
* traffic appears to be down 90% but it turns out that the team that handles transit links forgot to put the right acls around snmp so we're just not collecting metrics for 90% of our traffic
* I keep getting alerts for traffic down 90%... thousands and thousands of them, but it turns out that really it's just that this rarely used alert had some bitrot and doesn't use the aggregate metrics but the per-system ones.
* traffic is actually down 90% because there's an internet routing issue (not the DNS team's problem)
* traffic is actually down 90% at one datacenter because of a fiber cut somewhere
* traffic is actually down 90% because the normal usage pattern is that trough traffic volume is 10% of peak traffic volume
* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.
And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.
They have a rather significant vested interest in it being reliable.
Don't you normally have 2 DNS servers listed on any device? So was the second also down? If not, why didn't it fail over to that?
Unfortunately, the configuration mistake that caused this outage disabled Cloudflare's BGP advertisements of both 1.1.1.0/24 and 1.0.0.0/24 prefixes to its peers.
Btw, I really don't understand why it does not accept an IP (1.1.1.1), so you have to give a hostname (one.one.one.one). It would be more sensible to configure a DNS server from an IP rather than from a hostname that has to be resolved by a DNS server :/
Normal DNS can normally be changed in your connection settings for a given connection on most flavours of Android.
Yes, sorry, I did not mention it.
So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.
But I understand why Cloudflare can’t just say “use 8.8.8.8 as your backup”.
If your device doesn't support proper failover use a local DNS forwarder on your router or an external one.
In Switzerland I would use Init7 (an ISP that doesn't filter) -> Quad9 (unfiltered version) -> dns0.eu (unfiltered version)
I get that in theory blah blah, but we now have choices in who gets to see all of our requests and the ISP will always lose out to the other losers in the list
If you choose a resolver that is very far away, 100ms longer page loads do add up quickly...
Note how root "." just works and has done for decades - that's proper engineering and actually way more complicated than running 1.1.1.1. What 1.1.1.1 suffers from is anycast and not DNS.
Cloudflare (and Google and co) insist on using one or more "vanity" IP addresses - that is very unfair of me but that is what it is - and to make it work, they have to use anycast.
The real issue is fixing anycast and not DNS.
Anyway, select two+ providers and set them.
Unless you do something fancy with a local caching DNS proxy with more than one upstream.
I would count not configuring at least two as 'user error'. Many systems require you to enter a primary and alternate server in order to save a configuration.
Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they have these 'simple', 'deprecated', 'legacy' issues occurring.
8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for almost a decade.
1: localized issues did exist, but that's really the fault of the internet and they did remain running when google itself suffered severe downtime in various different services.
European users might prefer one of the alternatives listed at https://european-alternatives.eu/category/public-dns over US corporations subject to the CLOUD act.
I have musknet, though, so I can't edit the DNS providers on the router without buying another router, so cellphones aren't automatically on this plan, nor are VMs and the like.
Cloudflare has a reasonable culture around incident response, but it doesn't incentivize proactive prevention.
From the longer term graphs it looks like volume returned to normal https://imgur.com/a/8a1H8eL
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
It’s corporate newspeak. “legacy” isn’t a clear term, it’s used to abstract and obfuscate.
> Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.
I know what this means, but there’s absolutely no reason for it to be written in this inscrutable corporatese.
I will not say whether or not it’s acceptable for a company of their size and maturity, but it’s definitely not hidden in corporate lingo.
I do believe they could have elaborate more on the follow up steps they will take to prevent this from happening again, I don’t think staggered roll outs are the only answer to this, they’re just a safety net.
It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".
It is, in my opinion, closer to "intentionally misleading corporatese".
Note that this introduces one query overhead per DNS request if the previous cache has expired. For this reason, I've been using https://1.1.1.1/dns-query instead.
In theory, this should eliminate that overhead. Your operating system can validate the IP address of the DNS response by using the Subject Alternative Name (SAN) field within the TLS certificate presented by the DoH server: https://g.co/gemini/share/40af4514cb6e
TLDR; DoH was working
I guess now we should start using a completely different provider as a DNS backup. Maybe 8.8.8.8 or 9.9.9.9.
[0] https://man7.org/linux/man-pages/man3/inet_aton.3.html#DESCR...
1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere. Indeed right now 1.1.1.1 from my laptop goes via 141.101.71.63 and 1.0.0.1 via 141.101.71.121, which are both hosts on the same LINX/LON1 peer but presumably from different routers, so there is some resilience there.
Given DNS is about the easiest thing to avoid a single point of failure on I'm not sure why you would put all your eggs in a single company, but that seems to be the modern internet - centralisation over resilience because resilience is somehow deemed to be hard.
That said, it's a good idea to specifically pick multiple resolvers in different regions, on different backbones, using different providers, and not use an Anycast address, because Anycast can get a little weird. However, this can lead to hard-to-troubleshoot issues, because DNS doesn't always behave the way you expect.
all-servers
server=8.8.8.8
server=9.9.9.9
server=1.1.1.1

If you were using systemd-resolved however, it retries all servers in the order they were specified, so it's important to interleave upstreams.
Using the servers in the above example, and assuming IPv4 + IPv6:
1.1.1.1
2001:4860:4860::8888
9.9.9.9
2606:4700:4700::1111
8.8.8.8
2620:fe::fe
1.0.0.1
2001:4860:4860::8844
149.112.112.112
2606:4700:4700::1001
8.8.4.4
2620:fe::9
will fail over faster and more successfully on systemd-resolved than if you specify all Cloudflare IPs together, then all Google IPs, etc.

Also note that Quad9 is filtering by default on this IP while the other two are not, so you could get intermittent differences in resolution behavior. If this is a problem, don't mix filtered and unfiltered resolvers. You definitely shouldn't mix DNSSEC-validating and non-DNSSEC-validating resolvers if you care about that (all of the above are DNSSEC-validating).
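As a concrete sketch, that interleaved ordering could be expressed in /etc/systemd/resolved.conf roughly like this (same addresses as above; systemd-resolved tries DNS= servers in the listed order when one fails):

```ini
[Resolve]
# Interleaved so consecutive failover attempts hit different operators
DNS=1.1.1.1 2001:4860:4860::8888 9.9.9.9 2606:4700:4700::1111 8.8.8.8 2620:fe::fe 1.0.0.1 2001:4860:4860::8844 149.112.112.112 2606:4700:4700::1001 8.8.4.4 2620:fe::9
```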
I was handling an incident due to this outage. I ended up adding Google DNS resolvers using systemd-resolved, but I didn't think to interleave them!
dnsmasq with a list of smaller trusted DNS providers sounds perfect, as long as it is not considered bad etiquette to spam multiple DNS providers for every resolution?
But where to find a trusted list of privacy focused DNS resolvers. The couple I tried from random internet advice seemed unstable.
I believe that they follow their published policies and have reasonable security teams. They're also both popular services, which mitigates many of the other types of DNS tracking possible.
https://developers.google.com/speed/public-dns/privacy https://developers.cloudflare.com/1.1.1.1/privacy/public-dns...
> OpenNIC (also referred to as the OpenNIC Project) is a user owned and controlled top-level Network Information Center offering a non-national alternative to traditional Top-Level Domain (TLD) registries; such as ICANN.
I need to do a write-up one day
server:
logfile: ""
log-queries: no
# adjust as necessary
interface: 127.0.0.1@53
access-control: 127.0.0.0/8 allow
infra-keep-probing: yes
tls-system-cert: yes
forward-zone:
name: "."
forward-tls-upstream: yes
forward-addr: 9.9.9.9@853#dns.quad9.net
forward-addr: 193.110.81.9@853#zero.dns0.eu
forward-addr: 149.112.112.112@853#dns.quad9.net
forward-addr: 185.253.5.9@853#zero.dns0.eu

If you want to eschew centralized DNS altogether: if you run a Tor daemon, it has an option to expose a DNS resolver to your network. Multiple resolvers if you want them.
I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)
Anecdotally, I figured out their DNS was broken before it hit their status page and switched my upstream DNS over to Google. Haven't gotten around to switching back yet.
Clients cache DNS resolutions to avoid having to do that request each time they send a request. It's plausible that some clients held on to their cache for a significant period.
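That cache-holding behavior can be sketched with a toy TTL cache (purely illustrative; real stub resolvers vary in how strictly they honor TTLs):

```python
import time

class StubCache:
    """Toy TTL cache: a client keeps serving a cached answer until the
    record's TTL expires, so a short upstream outage can go unnoticed."""

    def __init__(self):
        self._entries = {}  # name -> (address, expiry_timestamp)

    def put(self, name: str, address: str, ttl: float, now: float = None):
        now = time.time() if now is None else now
        self._entries[name] = (address, now + ttl)

    def get(self, name: str, now: float = None):
        now = time.time() if now is None else now
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            return None  # miss or expired: must ask the resolver again
        return entry[0]
```

A client that resolved a name just before the outage keeps working until the TTL runs out, which is one reason outages look smaller in traffic graphs than they are.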
It would be interesting to see the service level objective (SLO) that cloudflare internally has for this service.
I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services, so this outage would put July in the "< 99.9% but >= 99.0%" bucket, and you'd get a 10% refund for the month if you paid for it.
Cloudflare's 1.1.1.1 Resolver service became unavailable to the Internet starting at 21:52 UTC and ending at 22:54 UTC
Weird. According to my own telemetry from multiple networks they were unavailable for a lot longer than that.

I find it somewhat surprising that none of the multiple engineers who reviewed the original change in June noticed that they had added 1.1.1.0/24 to the list of prefixes that should be rerouted. I wonder what sort of human mistake or malice led to that original error.
Perhaps it would be wise to add some hard-coded special-case mitigations to DLS such that it would not allow 1.1.1.1/32 or 1.0.0.1/32 to be reassigned to a single location.
But, yes, a second mitigation here would be defense in depth - in an ideal world, all your systems use the same ops/deploy/etc stack, in this one, you probably want an extra couple steps in the way of potentially taking a large public service offline.
EDIT: Appears I was wrong, it is failover not round-robin between the primary and secondary DNS servers. Thus, using 1.1.1.1 and 8.8.8.8 makes sense.
If you have a more advanced local resolver of some sort (systemd for example) you can configure whatever behaviour you want.
Maybe there is noticeable difference?
I have seen more outage incident reports of cloudflare than of google, but this is just personal anecdote.
Last 30 days, 8.8.8.8 has 99.99% uptime vs 1.1.1.1 has 99.09%
For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are 15.0ms, and 9.9.9.9 is 13.8ms.
All of those servers return over 3-nines of uptime when quantised in the "worst result in a given 1 minute bucket" from my monitoring points, which seem fine to have in your mix of upstream providers. Personally I'd never rely on a single provider. Google gets 4 nines, but that's only over 90 days so I wouldn't draw any long term conclusions.
This writing is just brilliant. Clear to technical and non-technical readers. Makes the in-progress migration sound way more exciting than it probably is!
> We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.
This is about as good as you can get it from a company as serious and important as Cloudflare. Bravo to the writers and vetters for not watering this down.
Say what now? A test triggered a global production change?
> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.
You have a process that allows some other service to just hoover up address routes already in use in production by a different service?
I use their DNS over HTTPS and if I hadn't seen the issue being reported here, I wouldn't have caught it at all. However, this—along with a chain of past incidents (including a recent cascading service failure caused by a third-party outage)—led me to reduce my dependencies. I no longer use Cloudflare Tunnels or Cloudflare Access, replacing them with WireGuard and mTLS certificates. I still use their compute and storage, but for personal projects only.
The theory is CF had the capacity to soak up the junk traffic without negatively impacting their network.
It is designed to be used in conjunction with 1.0.0.1. DNS has fault tolerance built in.
Did 1.0.0.1 go down too? If so, why were they on the same infrastructure?
This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole point is that it can go down at any time and everything keeps working.
Shouldn’t the fix be to ensure that these are served out of completely independent silos and update all docs to make sure anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
If I ran a service like this I would regularly do blackouts or brownouts on the primary to make sure that people’s resolvers are configured correctly. Nobody should be using a single IP as a point of failure for their internet access/browsing.
Yes.
> Shouldn’t the fix be to ensure that these are served out of completely independent silos [...]?
Yes.
> If so, why were they on the same infrastructure?
Apparently, they weren’t independent enough: something in CF has announced both addresses and that got out.
The solution for the end user is, of course, to use 1.1.1.1 and 8.8.8.8 (or any other combination of two different resolvers).
If there were some way to view torrenting traffic, no doubt there'd be a 20 minute slump.
I use Cloudflare at work. Cloudflare has many bugs, and some technical decisions are absurd, such as the worker's cache.delete method, which only clears the cache contents in the data center where the Worker was invoked!!! https://developers.cloudflare.com/workers/runtime-apis/cache...
In my experience, Cloudflare support is not helpful at all, trying to pass the problem onto the user, like "Just avoid holding it in that way. ".
At work, I needed to use Cloudflare. The next job I get, I'll put a limit on my responsibilities: I don't work with Cloudflare.
I will never use Cloudflare at home and I don't recommend it to anyone.
Next week: A new post about how Cloudflare saved the web from a massive DDOS attack.
The Cache API is a standard taken from browsers. In the browser, cache.delete obviously only deletes that browser's cache, not all other browsers in the world. You could certainly argue that a global purge would be more useful in Workers, but it would be inconsistent with the standard API behavior, and also would be extraordinarily expensive. Code designed to use the standard cache API would end up being much more expensive than expected.
With all that said, we (Workers team) do generally feel in retrospect that the Cache API was not a good fit for our platform. We really wanted to follow standards, but this standard in this case is too specific to browsers and as a result does not work well for typical use cases in Cloudflare Workers. We'd like to replace it with something better.
To me, it only makes sense if the put method creates a cache only in the datacenter where the Worker was invoked. Put and delete need to be related, in my opinion.
Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
My criticisms aren't about functionality per se, or about the developers. I don't doubt the developers' competence, but I feel like there's something wrong with the company culture.
That is, in fact, how it works. cache.put() only writes to the local datacenter's cache. If delete() were global, it would be inconsistent with put().
> Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
Say you read the cache entry but you find, based on its content, that it is no longer valid. You would then want to delete it, to save the cost of reading it again later.
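The put/delete symmetry described in this subthread can be modeled with a toy per-datacenter cache (this is an illustrative model, not Workers code; datacenter names are made up):

```python
class DatacenterCache:
    """Toy model of a per-datacenter cache: put() and delete() both act
    only on the datacenter where the Worker happened to run, so the two
    operations stay symmetric."""

    def __init__(self, datacenters):
        self._caches = {dc: {} for dc in datacenters}

    def put(self, dc: str, key: str, value: str):
        self._caches[dc][key] = value  # local write only

    def match(self, dc: str, key: str):
        return self._caches[dc].get(key)

    def delete(self, dc: str, key: str) -> bool:
        # Local delete: other datacenters keep their (possibly stale) copy
        return self._caches[dc].pop(key, None) is not None
```

Deleting an entry in one datacenter leaves the copy in every other datacenter untouched, which is exactly the behavior being debated above.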
That said, I don't use workers and don't plan to. I personally try to stay away from non cross-platform stuff because I've been burned too heavily with vendor/platform lock-in in the past.
If we changed an API in Workers in a way that broke any Worker in production, we consider that an incident and we will roll it back ASAP. We really try to avoid this but sometimes it's hard for us to tell. Please feel free to contact us if this happens in the future (e.g. file a support ticket or file a bug on workerd on GitHub or complain in our Discord or email kenton@cloudflare.com).
If we start using workers though I'll definitely let you know if any API changes!
As mentioned in other comments, run it yourself if you are not happy with the stability. Or just pay someone to provide it - like your ISP.
And TBH I trust my local ISP more than Google or CF. Not in availability, but it's covered by my local legislation. That's a huge difference - in a positive way.
which might not be a good thing in some jurisdictions - see the porn block in the UK (it's done via DNS iirc, and trivially bypassed with a third-party DNS like Cloudflare's).
So far I'm lucky and the only ban I'm aware of is on gambling. Which is fine for me personally.
But in a UK case I'd using a non local one as well.
I don't think this is fair when discussing infrastructure. It's reasonable to complain about potholes, undrinkable tap water, long lines at the DMV, cracked (or nonexistent) sidewalks, etc. The internet is infrastructure and DNS resolution is a critical part of it. That it hasn't been nationalized doesn't change the fact that it's infrastructure (and access absolutely should be free) and therefore everyone should feel free to complain about it not working correctly.
"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
DNS is infrastructure. But "Cloudflare Public Free DNS Resolver" is not, it's just a convenience and a product to collect data.
>"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
You don't want DNS to be nationalized. Even the US would have half the internet banned by now.
But opposite to tap water there are a lot of different free DNS resolvers that can be used.
And I don't see how my taxes funded CFs DNS service. But my ISP fee covers their DNS resolving setup. That's the reason why I wrote
> a service that's free of charge
Which CF is.
I did this for a while, but ~300ms hangs on every DNS resolution sure do get old fast.
With something like a N100- or N150-based single board computer (perhaps around $200) running any number of open source DNS resolvers, I would expect you can average around 30 ms for cold lookups and <1 ms for cache hits.
When a DNS resolver is down, it affects everything; 100% uptime is a fair expectation, hence redundancy. Looks like both 1.0.0.1 and 1.1.1.1 were down for more than 1h, pretty bad TBH, especially when you advise global usage.
RCA is not detailed and feels like a marketing stunt we are now getting every other week.
Secondary DNS is supposed to be in an independent network to avoid precisely this.
But I do appreciate these types of detailed public incident reports and RCAs.
Not sure what the "advantage" of stub resolvers is in 2025 for anything.
Very frustrating.
What caused this specific behavior is the dilemma of backwards compatibility when it comes to BGP security. We are a long way off from all routes being covered by RPKI (just 56% of v4 routes according to https://rpki-monitor.antd.nist.gov/ROV), so invalid routes tend to be treated as less preferred, not rejected, by BGP speakers that support RPKI.
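In router-config terms, the difference is whether an RPKI-invalid route is rejected outright or merely depreferenced. A hedged BIRD-style sketch (table and filter names are made up):

```
# BIRD 2.x style sketch -- names are illustrative
roa4 table r4;

filter rov_import {
  # Strict operators drop invalids entirely:
  if (roa_check(r4, net, bgp_path.last) = ROA_INVALID) then reject;

  # ...while a backwards-compatible policy only lowers preference,
  # so an invalid (possibly hijacked) route can still win when no
  # valid alternative exists:
  # if (roa_check(r4, net, bgp_path.last) = ROA_INVALID)
  #   then bgp_local_pref = 50;
  accept;
}
```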
I know.