Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction will be "let's wait and see if it fixes itself".
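The three-strikes behavior described above can be sketched as a small state machine. This is a hedged sketch of the general soft/hard-state idea, not Nagios's actual implementation; the class and method names are made up:

```python
class CheckDebouncer:
    """Open an alert only after N consecutive failed checks
    (Nagios-style soft/hard states); any success resets the counter."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.failures = 0

    def observe(self, check_ok: bool) -> bool:
        """Return True when the failure becomes 'hard' and should alert."""
        if check_ok:
            self.failures = 0  # spurious blip: forget it
            return False
        self.failures += 1
        return self.failures >= self.max_attempts
```

With `max_attempts=3`, two spurious failures followed by a success never page anyone; only three failures in a row do.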
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
Were it a service with much less traffic than 1.1.1.1 itself, taking longer than a minute to alarm probably wouldn't surprise me, but this is 1.1.1.1; they're dealing with vast amounts of probably fairly consistent traffic.
Thing is, it's probably still some engineering effort, and most orgs only really improve their monitoring after it turned out to be sub-optimal.
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems wait a bit before redeploying in this case, on the assumption that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you can often differentiate no data from real drops, but the overall point, that "if you page people constantly, people will quit", is I think the important one. If people keep getting paged by overly tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
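The distinction drawn above, between "no data because the aggregator crashed" and "a real drop to zero", plus a hold window before paging, can be sketched like this (a hedged sketch of a "fire only if it persists" rule, not any specific monitoring product; names are made up):

```python
from typing import Optional

def should_page(samples: list[Optional[float]],
                threshold: float,
                hold_samples: int) -> bool:
    """Page only if the metric has been below `threshold` for the last
    `hold_samples` samples in a row.  `None` means 'no data' (e.g. the
    aggregation service crashed and metrics are delayed) and is
    deliberately NOT treated as a drop to zero."""
    if len(samples) < hold_samples:
        return False
    recent = samples[-hold_samples:]
    return all(s is not None and s < threshold for s in recent)
```

A gap of `None` samples during a redeploy never pages; a genuine sustained drop below the threshold does, but only after the hold window elapses.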
When you are building systems like 1.1.1.1, having an alert rollup of five minutes is not acceptable, as it will hide legitimate downtime that lasts between 0 and 5 minutes.
You need to design systems which do not rely on orchestration to remediate short transient errors.
Disclosure: I work on a core SRE team for a company with over 500 million users.
I've seen multiple big companies get rid of their NOC and replace it with on-call rotations in multiple focused teams. Instead of 12 people sitting 24/7 in groups of 4 and doing some basic analysis and steps before calling others, you page the correct people within 3-5 minutes, with an exact and specific alert.
Incident resolution times went down dramatically (2-10x, depending on the company), people don't have to sit up overnight and sleep through most of it, and no stupid actions like a service restart get taken that slow down incident resolution.
And I don't like that some platforms hire 1,500 people for a job that could be done with 50-100, but in terms of incident response, if you already have teams with separated responsibilities, then a NOC is "legacy".
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
Operating at the scale of Cloudflare? A lot.
* traffic appears to be down 90% but we're only getting metrics from the regions of the world that are asleep because of some pipeline error
* traffic appears to be down 90% but someone put in a firewall rule causing the metrics to be dropped
* traffic appears to be down 90% but actually the counter rolled over and prometheus handled it wrong
* traffic appears to be down 90% but the timing of the new release just caused polling to show weird numbers
* traffic appears to be down 90% but actually there was a metrics reporting spike and there was pipeline lag
* traffic appears to be down 90% but it turns out that the team that handles transit links forgot to put the right acls around snmp so we're just not collecting metrics for 90% of our traffic
* I keep getting alerts for traffic down 90%... thousands and thousands of them, but it turns out that really it's just that this rarely used alert had some bitrot and doesn't use the aggregate metrics but the per-system ones.
* traffic is actually down 90% because there's an internet routing issue (not the DNS team's problem)
* traffic is actually down 90% at one datacenter because of a fiber cut somewhere
* traffic is actually down 90% because the normal usage pattern is that trough traffic volume is 10% of peak traffic volume
* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.
And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.
They have a rather significant vested interest in it being reliable.
Don't you normally have 2 DNS servers listed on any device? So was the second also down? If not, why didn't it fail over to that?
Unfortunately, the configuration mistake that caused this outage disabled Cloudflare's BGP advertisements of both 1.1.1.0/24 and 1.0.0.0/24 prefixes to its peers.
Btw, I really don't understand why it does not accept an IP (1.1.1.1), so you have to give a hostname (one.one.one.one). It would be more sensible to configure a DNS server from an IP rather than from a hostname that has to be resolved by a DNS server :/
Normal DNS can normally be changed in your connection settings for a given connection on most flavours of Android.
Yes, sorry, I did not mention it.
So if you want to use DNS over HTTPS on Android, it is not possible to provide a fallback.
But I understand why Cloudflare can’t just say “use 8.8.8.8 as your backup”.
If your device doesn't support proper failover use a local DNS forwarder on your router or an external one.
In Switzerland I would use Init7 (an ISP that doesn't filter) -> Quad9 (unfiltered version) -> dns0.eu (unfiltered version)
I get that in theory blah blah, but we now have choices in who gets to see all of our requests and the ISP will always lose out to the other losers in the list
If you choose a resolver that is very far away, 100ms longer page loads do add up quickly...
Note how root "." just works and has done for decades - that's proper engineering and actually way more complicated than running 1.1.1.1. What 1.1.1.1 suffers from is anycast and not DNS.
Cloudflare (and Google and co) insist on using one or more "vanity" IP addresses - that is very unfair of me but that is what it is - and to make it work, they have to use anycast.
The real issue is fixing anycast and not DNS.
Anyway, select two+ providers and set them.
Unless you do something fancy with a local caching DNS proxy with more than one upstream.
I would count not configuring at least two as 'user error'. Many systems require you to enter a primary and alternate server in order to save a configuration.
Not sure how Cloudflare keeps struggling with issues like these; this isn't the first (and probably won't be the last) time they have these 'simple', 'deprecated', 'legacy' issues occurring.
8.8.8.8+8.8.4.4 hasn't had a global(1) second of downtime for almost a decade.
1: localized issues did exist, but that's really the fault of the internet and they did remain running when google itself suffered severe downtime in various different services.
European users might prefer one of the alternatives listed at https://european-alternatives.eu/category/public-dns over US corporations subject to the CLOUD act.
I have musknet, though, so I can't edit the DNS providers on the router without buying another router, so cellphones aren't automatically on this plan, nor are VMs and the like.
Cloudflare has a reasonable culture around incident response, but it doesn't incentivize proactive prevention.
From the longer term graphs it looks like volume returned to normal https://imgur.com/a/8a1H8eL
> It’s worth noting that DoH (DNS-over-HTTPS) traffic remained relatively stable as most DoH users use the domain cloudflare-dns.com, configured manually or through their browser, to access the public DNS resolver, rather than by IP address.
Interesting, I was affected by this yesterday. My router (supposedly) had Cloudflare DoH enabled but nothing would resolve. Changing the DNS server to 8.8.8.8 fixed the issues.
It’s corporate newspeak. “legacy” isn’t a clear term, it’s used to abstract and obfuscate.
> Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.
I know what this means, but there’s absolutely no reason for it to be written in this inscrutable corporatese.
I will not say whether or not it’s acceptable for a company of their size and maturity, but it’s definitely not hidden in corporate lingo.
I do believe they could have elaborate more on the follow up steps they will take to prevent this from happening again, I don’t think staggered roll outs are the only answer to this, they’re just a safety net.
It's carefully written so my boss's boss thinks he understands it, and that we cannot possibly have that problem because we obviously don't have any "legacy components" because we are "modern and progressive".
It is, in my opinion, closer to "intentionally misleading corporatese".
Note that this introduces one query overhead per DNS request if the previous cache has expired. For this reason, I've been using https://1.1.1.1/dns-query instead.
In theory, this should eliminate that overhead. Your operating system can validate the IP address of the DNS response by using the Subject Alternative Name (SAN) field within the TLS certificate presented by the DoH server: https://g.co/gemini/share/40af4514cb6e
TLDR; DoH was working
I guess now we should start using a completely different provider as a DNS backup. Maybe 8.8.8.8 or 9.9.9.9.
[0] https://man7.org/linux/man-pages/man3/inet_aton.3.html#DESCR...
1.0.0.0/24 is a different network than 1.1.1.0/24 too, so can be hosted elsewhere. Indeed right now 1.1.1.1 from my laptop goes via 141.101.71.63 and 1.0.0.1 via 141.101.71.121, which are both hosts on the same LINX/LON1 peer but presumably from different routers, so there is some resilience there.
Given DNS is about the easiest thing to avoid a single point of failure on I'm not sure why you would put all your eggs in a single company, but that seems to be the modern internet - centralisation over resilience because resilience is somehow deemed to be hard.
That said, it's a good idea to specifically pick multiple resolvers in different regions, on different backbones, using different providers, and not use an Anycast address, because Anycast can get a little weird. However, this can lead to hard-to-troubleshoot issues, because DNS doesn't always behave the way you expect.
all-servers
server=8.8.8.8
server=9.9.9.9
server=1.1.1.1

If you were using systemd-resolved however, it retries all servers in the order they were specified, so it's important to interleave upstreams.
Using the servers in the above example, and assuming IPv4 + IPv6:
1.1.1.1
2001:4860:4860::8888
9.9.9.9
2606:4700:4700::1111
8.8.8.8
2620:fe::fe
1.0.0.1
2001:4860:4860::8844
149.112.112.112
2606:4700:4700::1001
8.8.4.4
2620:fe::9
will fail over faster and more successfully on systemd-resolved than if you specify all Cloudflare IPs together, then all Google IPs, etc.

Also note that Quad9 is filtering by default on this IP while the other two are not, so you could get intermittent differences in resolution behavior. If this is a problem, don't mix filtered and unfiltered resolvers. You definitely shouldn't mix DNSSEC-validating and non-DNSSEC-validating resolvers if you care about that (all of the above are DNSSEC-validating).
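As a concrete sketch, that interleaved ordering could be expressed in /etc/systemd/resolved.conf roughly like this (same addresses as above; systemd-resolved tries DNS= servers in the listed order when one fails):

```ini
[Resolve]
# Interleaved so consecutive failover attempts hit different operators
DNS=1.1.1.1 2001:4860:4860::8888 9.9.9.9 2606:4700:4700::1111 8.8.8.8 2620:fe::fe 1.0.0.1 2001:4860:4860::8844 149.112.112.112 2606:4700:4700::1001 8.8.4.4 2620:fe::9
```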
I was handling an incident due to this outage. I ended up adding Google DNS resolvers using systemd-resolved, but I didn't think to interleave them!
dnsmasq with a list of smaller trusted DNS providers sounds perfect, as long as it is not considered bad etiquette to spam multiple DNS providers for every resolution?
But where to find a trusted list of privacy focused DNS resolvers. The couple I tried from random internet advice seemed unstable.
I believe that they follow their published policies and have reasonable security teams. They're also both popular services, which mitigates many of the other types of DNS tracking possible.
https://developers.google.com/speed/public-dns/privacy https://developers.cloudflare.com/1.1.1.1/privacy/public-dns...
> OpenNIC (also referred to as the OpenNIC Project) is a user owned and controlled top-level Network Information Center offering a non-national alternative to traditional Top-Level Domain (TLD) registries; such as ICANN.
I need to do a write-up one day
server:
logfile: ""
log-queries: no
# adjust as necessary
interface: 127.0.0.1@53
access-control: 127.0.0.0/8 allow
infra-keep-probing: yes
tls-system-cert: yes
forward-zone:
name: "."
forward-tls-upstream: yes
forward-addr: 9.9.9.9@853#dns.quad9.net
forward-addr: 193.110.81.9@853#zero.dns0.eu
forward-addr: 149.112.112.112@853#dns.quad9.net
forward-addr: 185.253.5.9@853#zero.dns0.eu

If you want to eschew centralized DNS altogether: if you run a Tor daemon, it has an option to expose a DNS resolver to your network. Multiple resolvers if you want them.
I recently started using the "luci-app-https-dns-proxy" package on OpenWrt, which is preconfigured to use both Cloudflare and Google DNS, and since DoH was mostly unaffected, I didn't notice an outage. (Though if DoH had been affected, it presumably would have failed over to Google DNS anyway.)
Anecdotally, I figured out their DNS was broken before it hit their status page and switched my upstream DNS over to Google. Haven't gotten around to switching back yet.
Clients cache DNS resolutions to avoid having to do that request each time they send a request. It's plausible that some clients held on to their cache for a significant period.
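That cache-holding behavior can be sketched with a toy TTL cache (purely illustrative; real stub resolvers vary in how strictly they honor TTLs):

```python
import time

class StubCache:
    """Toy TTL cache: a client keeps serving a cached answer until the
    record's TTL expires, so a short upstream outage can go unnoticed."""

    def __init__(self):
        self._entries = {}  # name -> (address, expiry_timestamp)

    def put(self, name: str, address: str, ttl: float, now: float = None):
        now = time.time() if now is None else now
        self._entries[name] = (address, now + ttl)

    def get(self, name: str, now: float = None):
        now = time.time() if now is None else now
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            return None  # miss or expired: must ask the resolver again
        return entry[0]
```

A client that resolved a name just before the outage keeps working until the TTL runs out, which is one reason outages look smaller in traffic graphs than they are.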
It would be interesting to see the service level objective (SLO) that cloudflare internally has for this service.
I've found https://www.cloudflare.com/r2-service-level-agreement/ but this seems to be for paid services, so this outage would put July in the "< 99.9% but >= 99.0%" bucket, and you'd get a 10% refund for the month if you paid for it.
Cloudflare's 1.1.1.1 Resolver service became unavailable to the Internet starting at 21:52 UTC and ending at 22:54 UTC
Weird. According to my own telemetry from multiple networks they were unavailable for a lot longer than that.

I find it somewhat surprising that none of the multiple engineers who reviewed the original change in June noticed that they had added 1.1.1.0/24 to the list of prefixes that should be rerouted. I wonder what sort of human mistake or malice led to that original error.
Perhaps it would be wise to add some hard-coded special-case mitigations to DLS such that it would not allow 1.1.1.1/32 or 1.0.0.1/32 to be reassigned to a single location.
But, yes, a second mitigation here would be defense in depth - in an ideal world, all your systems use the same ops/deploy/etc stack, in this one, you probably want an extra couple steps in the way of potentially taking a large public service offline.
EDIT: Appears I was wrong, it is failover not round-robin between the primary and secondary DNS servers. Thus, using 1.1.1.1 and 8.8.8.8 makes sense.
If you have a more advanced local resolver of some sort (systemd for example) you can configure whatever behaviour you want.
Maybe there is noticeable difference?
I have seen more outage incident reports of cloudflare than of google, but this is just personal anecdote.
Last 30 days, 8.8.8.8 has 99.99% uptime vs 1.1.1.1 has 99.09%
For me cloudflare 1.1.1.1 and 1.0.0.1 have a mean response time of 15.5ms over the last 3 months, 8.8.8.8 and 8.8.4.4 are 15.0ms, and 9.9.9.9 is 13.8ms.
All of those servers return over 3-nines of uptime when quantised in the "worst result in a given 1 minute bucket" from my monitoring points, which seem fine to have in your mix of upstream providers. Personally I'd never rely on a single provider. Google gets 4 nines, but that's only over 90 days so I wouldn't draw any long term conclusions.
This writing is just brilliant. Clear to technical and non-technical readers. Makes the in-progress migration sound way more exciting than it probably is!
> We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.
This is about as good as you can get it from a company as serious and important as Cloudflare. Bravo to the writers and vetters for not watering this down.
Say what now? A test triggered a global production change?
> Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.
You have a process that allows some other service to just hoover up address routes already in use in production by a different service?
I use their DNS over HTTPS and if I hadn't seen the issue being reported here, I wouldn't have caught it at all. However, this—along with a chain of past incidents (including a recent cascading service failure caused by a third-party outage)—led me to reduce my dependencies. I no longer use Cloudflare Tunnels or Cloudflare Access, replacing them with WireGuard and mTLS certificates. I still use their compute and storage, but for personal projects only.
The theory is CF had the capacity to soak up the junk traffic without negatively impacting their network.
It is designed to be used in conjunction with 1.0.0.1. DNS has fault tolerance built in.
Did 1.0.0.1 go down too? If so, why were they on the same infrastructure?
This makes no sense to me. 8.8.8.8 also has 8.8.4.4. The whole point is that it can go down at any time and everything keeps working.
Shouldn’t the fix be to ensure that these are served out of completely independent silos and update all docs to make sure anyone using 1.1.1.1 also has 1.0.0.1 configured as a backup?
If I ran a service like this I would regularly do blackouts or brownouts on the primary to make sure that people’s resolvers are configured correctly. Nobody should be using a single IP as a point of failure for their internet access/browsing.
Yes.
> Shouldn’t the fix be to ensure that these are served out of completely independent silos [...]?
Yes.
> If so, why were they on the same infrastructure?
Apparently, they weren’t independent enough: something in CF has announced both addresses and that got out.
The solution for the end user is, of course, to use 1.1.1.1 and 8.8.8.8 (or any other combination of two different resolvers).
If there were some way to view torrenting traffic, no doubt there'd be a 20 minute slump.
I use Cloudflare at work. Cloudflare has many bugs, and some technical decisions are absurd, such as the worker's cache.delete method, which only clears the cache contents in the data center where the Worker was invoked!!! https://developers.cloudflare.com/workers/runtime-apis/cache...
In my experience, Cloudflare support is not helpful at all, trying to pass the problem onto the user, like "Just avoid holding it in that way. ".
At work, I needed to use Cloudflare. The next job I get, I'll put a limit on my responsibilities: I don't work with Cloudflare.
I will never use Cloudflare at home and I don't recommend it to anyone.
Next week: A new post about how Cloudflare saved the web from a massive DDOS attack.
The Cache API is a standard taken from browsers. In the browser, cache.delete obviously only deletes that browser's cache, not all other browsers in the world. You could certainly argue that a global purge would be more useful in Workers, but it would be inconsistent with the standard API behavior, and also would be extraordinarily expensive. Code designed to use the standard cache API would end up being much more expensive than expected.
With all that said, we (Workers team) do generally feel in retrospect that the Cache API was not a good fit for our platform. We really wanted to follow standards, but this standard in this case is too specific to browsers and as a result does not work well for typical use cases in Cloudflare Workers. We'd like to replace it with something better.
To me, it only makes sense if the put method creates a cache only in the datacenter where the Worker was invoked. Put and delete need to be related, in my opinion.
Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
My criticisms aren't about functionality per se, or about the developers. I don't doubt the developers' competence, but I feel like there's something wrong with the company culture.
That is, in fact, how it works. cache.put() only writes to the local datacenter's cache. If delete() were global, it would be inconsistent with put().
> Now I'm curious: what's the point of clearing the cache contents in the datacenter where the Worker was invoked? I can't think of any use for this method.
Say you read the cache entry but you find, based on its content, that it is no longer valid. You would then want to delete it, to save the cost of reading it again later.
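The put/delete symmetry described in this subthread can be modeled with a toy per-datacenter cache (this is an illustrative model, not Workers code; datacenter names are made up):

```python
class DatacenterCache:
    """Toy model of a per-datacenter cache: put() and delete() both act
    only on the datacenter where the Worker happened to run, so the two
    operations stay symmetric."""

    def __init__(self, datacenters):
        self._caches = {dc: {} for dc in datacenters}

    def put(self, dc: str, key: str, value: str):
        self._caches[dc][key] = value  # local write only

    def match(self, dc: str, key: str):
        return self._caches[dc].get(key)

    def delete(self, dc: str, key: str) -> bool:
        # Local delete: other datacenters keep their (possibly stale) copy
        return self._caches[dc].pop(key, None) is not None
```

Deleting an entry in one datacenter leaves the copy in every other datacenter untouched, which is exactly the behavior being debated above.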
That said, I don't use workers and don't plan to. I personally try to stay away from non cross-platform stuff because I've been burned too heavily with vendor/platform lock-in in the past.
If we changed an API in Workers in a way that broke any Worker in production, we consider that an incident and we will roll it back ASAP. We really try to avoid this but sometimes it's hard for us to tell. Please feel free to contact us if this happens in the future (e.g. file a support ticket or file a bug on workerd on GitHub or complain in our Discord or email kenton@cloudflare.com).
If we start using workers though I'll definitely let you know if any API changes!
As mentioned in other comments, run it yourself if you are not happy with the stability. Or just pay someone to provide it - like your ISP.
And TBH I trust my local ISP more than Google or CF. Not in availability, but it's covered by my local legislation. That's a huge difference - in a positive way.
which might not be a good thing in some jurisdictions - see the porn block in the UK (it's done via DNS iirc, and trivially bypassed with a third-party DNS like Cloudflare's).
So far I'm lucky and the only ban I'm aware of is on gambling. Which is fine for me personally.
But in a UK case I'd using a non local one as well.
I don't think this is fair when discussing infrastructure. It's reasonable to complain about potholes, undrinkable tap water, long lines at the DMV, cracked (or nonexistent) sidewalks, etc. The internet is infrastructure and DNS resolution is a critical part of it. That it hasn't been nationalized doesn't change the fact that it's infrastructure (and access absolutely should be free) and therefore everyone should feel free to complain about it not working correctly.
"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
DNS is infrastructure. But "Cloudflare Public Free DNS Resolver" is not, it's just a convenience and a product to collect data.
>"But you pay taxes for drinkable tap water," yes, and we paid taxes to make the internet work too. For some reason, some governments like the USA feel it to be a good idea to add a middle man to spend that tax money on, but, fine, we'll complain about the middle man then as well.
You don't want DNS to be nationalized. Even the US would have half the internet banned by now.
But opposite to tap water there are a lot of different free DNS resolvers that can be used.
And I don't see how my taxes funded CFs DNS service. But my ISP fee covers their DNS resolving setup. That's the reason why I wrote
> a service that's free of charge
Which CF is.
I did this for a while, but ~300ms hangs on every DNS resolution sure do get old fast.
With something like a N100- or N150-based single board computer (perhaps around $200) running any number of open source DNS resolvers, I would expect you can average around 30 ms for cold lookups and <1 ms for cache hits.
When a DNS resolver is down, it affects everything; 100% uptime is a fair expectation, hence redundancy. Looks like both 1.0.0.1 and 1.1.1.1 were down for more than 1h, pretty bad TBH, especially when you advise global usage.
RCA is not detailed and feels like a marketing stunt we are now getting every other week.
Secondary DNS is supposed to be in an independent network to avoid precisely this.
But I do appreciate these types of detailed public incident reports and RCAs.
Not sure what the "advantage" of stub resolvers is in 2025 for anything.
Very frustrating.
What caused this specific behavior is the dilemma of backwards compatibility when it comes to BGP security. We are a long way off from all routes being covered by RPKI (just 56% of v4 routes according to https://rpki-monitor.antd.nist.gov/ROV), so invalid routes tend to be treated as less preferred, not rejected, by BGP speakers that support RPKI.
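In router-config terms, the difference is whether an RPKI-invalid route is rejected outright or merely depreferenced. A hedged BIRD-style sketch (table and filter names are made up):

```
# BIRD 2.x style sketch -- names are illustrative
roa4 table r4;

filter rov_import {
  # Strict operators drop invalids entirely:
  if (roa_check(r4, net, bgp_path.last) = ROA_INVALID) then reject;

  # ...while a backwards-compatible policy only lowers preference,
  # so an invalid (possibly hijacked) route can still win when no
  # valid alternative exists:
  # if (roa_check(r4, net, bgp_path.last) = ROA_INVALID)
  #   then bgp_local_pref = 50;
  accept;
}
```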
I know.