As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.
> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions
Also appreciate the honesty here.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]
> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.
Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)
Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.
The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):
> example.com needs to review the security of your connection before proceeding.
It bothers me how this bald-faced lie of a wording has persisted.
(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)
At almost any other large-ish company, there would be layers of "stakeholders" slowing this process down. They would almost never allow code to be published.
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.
2. Despite the elevated risk of system-wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
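To make that concrete, here's a minimal sketch. The lint configuration and the function are illustrative, not from the post; `unwrap_used` is a real clippy lint, and denying it crate-wide forces every remaining unwrap to carry an explicit opt-out:

```rust
// With clippy's unwrap_used lint denied crate-wide (e.g. in Cargo.toml:
// [lints.clippy] unwrap_used = "deny"), any surviving unwrap must be
// annotated, which is a natural place to attach the justification.

// INFALLIBILITY: the literal below is a valid usize, so parse() cannot fail.
#[allow(clippy::unwrap_used)]
fn default_limit() -> usize {
    "200".parse::<usize>().unwrap()
}
```

The point is not that the annotation makes the code safer by itself, but that it forces the author to write down *why* the value cannot be an error, which a reviewer can then check.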
That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
The root cause here was that a file was mildly corrupt (with duplicate entries, I guess). And there was a validation check elsewhere that said "THIS FILE IS TOO BIG".
But if that's a validation failure, well, failing is correct? What wasn't correct was that the failure reached production. What should have happened is that the validation should have been a unified thing and whatever generated the file should have flagged it before it entered production.
And that's not an issue with function return value API management. The software that should have bailed was somewhere else entirely, and even there an unwrap explosion (in a smoke test or pre-release pass or whatever) would have been fine.
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
Maybe the validation code should've handled the larger size, but also the db query produced something invalid. That shouldn't have ever happened in the first place.
> This wasn't an attack, but a classic chain reaction triggered by “hidden assumptions + configuration chains” — permission changes exposed underlying tables, doubling the number of lines in the generated feature file. This exceeded FL2's memory preset, ultimately pushing the core proxy into panic.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
> Technical details: even if the unwrap were handled correctly, an OOM would still occur. The primary issue was the lack of contract validation in feature ingest. The configuration system needs "bad → reject, keep last-known-good" logic.
> Why did it persist so long? The global kill switch was inadequate, preventing rapid circuit-breaking. Early suspicion of an attack also caused delays.
> Why not roll back software versions or restart?
> Rollback isn't feasible because this isn't a code issue—it's a continuously propagating bad configuration. Without version control or a kill switch, restarting would only cause all nodes to load the bad config faster and accelerate crashes.
> Why not roll back the configuration?
> Configuration lacks versioning and functions more like a continuously updated feed. As long as the ClickHouse pipeline remains active, manually rolling back would result in new corrupted files being regenerated within minutes, overwriting any fixes.
I don't like to use implicit unwrap. Even things that are guaranteed to be there, I treat as explicit (For example, (self.view?.isEnabled ?? false), in a view controller, instead of self.view.isEnabled).
I always redefine @IBOutlets from:
@IBOutlet weak var someView: UIView!
to: @IBOutlet weak var someView: UIView?
I'm kind of a "belt & suspenders" type of guy.

How so? “Parse, don’t validate” implies converting input into typed values that prevent representation of invalid state. But the parsing still needs to be done correctly. An unchecked unwrap really has nothing to do with this.
While there are certainly many things to admire about Rust, this is why I prefer Golang's "noisy" error handling. In golang that would be either:
feature_values, err := features.append_with_names(...)
And the compiler would have complained that this value of `err` was unused; or you'd write: feature_values, _ := features.append_with_names(...)
And it would be far more obvious that an error message is being ignored.

(Renaming `unwrap` to `unwrapOrPanic` would probably help too.)
I have been saying for years that Rust botched error handling in unfixable ways. I will go to the grave believing Rust fumbled.
The design of the Rust language encourages people to use unwrap() to turn foreseeable runtime problems into fatal errors. It's the path of least resistance, so people will take it.
Rust encourages developers to consider only the happy path. No wonder it's popular among people who've never had to deal with failure.
All of the concomitant complexity (Result, `?`, the test thing, anyhow, the inability of the stdlib to report allocation failure) is downstream of a fashion statement against exceptions that Rust cargo-culted from Go.
The funniest part is that Rust does have exceptions. It just calls them panics. So Rust code has to deal with the ergonomic footgun of Result but pays anyway for the possibility of exceptions. (Sure, you can compile with panic=abort. You can't count on it.)
I could not be more certain that Rust should have been a language with exceptions, not Result, and that error objects are a gross antipattern we'll regret for decades.
Them calling unwrap on a limit check is the real issue, IMO. Everything that takes in external input should assume it is bad input and should be fuzz tested.
In the end, what is the point of having a limit check if you just unwrap on it?
The way they wrote the code means that having more than 200 features is a hard non-transient error - even if they recovered from it, it meant they'd have had the same error when the code got to the same place.
I'm sure when the process crashed, k8s restarted the pod or something - then it reran the same piece of code and crashed in the same place.
While I don't necessarily agree with crashing as a business strategy, I don't see what the code could have done other than either drop the extra rules or allocate more memory - neither of which the original code was built to do (probably by design).
The code made the local hard assumption that there won't ever be more than 200 rules and its okay to crash if that count is exceeded.
If you design your code around an invariant never being violated (which is fine), you have to surface it clearly at a higher level when it is.
This isn't a Rust problem (though Rust does make it easy to do the wrong thing here imo)
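As a sketch of what the softer alternative could look like - keep the first N entries and make the overflow visible instead of fatal. The 200-entry cap mirrors the post, but the names and the return shape here are entirely hypothetical:

```rust
const FEATURE_LIMIT: usize = 200; // hypothetical cap, mirroring the post

/// Instead of treating "more than FEATURE_LIMIT entries" as fatal, keep the
/// first FEATURE_LIMIT and report how many were dropped, so an oversized
/// file degrades scoring quality rather than crashing the proxy.
fn load_features(mut features: Vec<String>) -> (Vec<String>, usize) {
    let dropped = features.len().saturating_sub(FEATURE_LIMIT);
    features.truncate(FEATURE_LIMIT);
    (features, dropped)
}
```

Whether silently degrading is actually acceptable for a bot-scoring system is the real design question; the point is only that the choice should be explicit, not a side effect of an unwrap.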
It may be that forcing handling at every call tends to make code verbose, and devs desensitized to bad practice. And the diagnostic Rust provided seems pretty garbage.
There is bad practice here too -- config failure manifesting as request failure, lack of failing to safe, unsafe rollout, lack of observability.
Back to language design & error handling. My informed view is that robustness is best when only major reliability boundaries need to be coded.
This the "throw, don't catch" principle with the addition of catches on key reliability boundaries -- typically high-level interactions where you can meaningfully answer a failure.
For example, this system could have a total of three catch clauses "Error Loading Config" which fails to safe, "Error Handling Request" which answers 5xx, and "Socket Error" which closes the HTTP connection.
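A minimal sketch of the middle catch clause ("Error Handling Request"), using `std::panic::catch_unwind`; the handler and responses are made up for illustration, and note this relies on the default panic=unwind setting:

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical request path that can panic somewhere deep inside.
fn handle(path: &str) -> String {
    assert!(!path.contains("boom"), "logic error while scoring request");
    format!("200 OK for {path}")
}

// Reliability boundary: a panic anywhere below this point is answered
// with a 5xx for that one request instead of crashing the process.
fn serve(path: &str) -> String {
    panic::catch_unwind(AssertUnwindSafe(|| handle(path)))
        .unwrap_or_else(|_| "500 Internal Server Error".to_string())
}
```

The "Error Loading Config" boundary would look similar but would fall back to the last-known-good config rather than answering a 5xx.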
This .unwrap() sounds too easy for what it does, certainly much easier than having an entire try..catch block with an explicit panic. Full disclosure: I don't actually know Rust.
The config bug reaching prod without this being caught and pinpointed immediately is the strange part.
Also wonder with a sharded system why are they not slow rolling out changes and monitoring?
Average Go code has far fewer panics than Rust has unwraps, which are functionally equivalent.
it'd be kinda hard to amend the clippy lints to ignore coroutine unwraps but still pipe up on system ones. i guess.
edit: i think they'd have to be "solely-task-color-flavored" so definitely probably not trivial to infer
First multi-million dollar .unwrap() story.
Also, exception handling is hard and lame. We don't need exceptions, just add a "match" block after every line in your program.
Rust compiler is a god of sorts, or at least a law of nature haha
Way to comment and go instantly off topic
The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.
The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.
The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.
So basically bad config should be explicitly processed and handled by rolling back to known working config.
But the architectural assumption that the bot file build logic can safely obtain this operationally critical list of features from derivative database metadata vs. a SSOT seems like a bigger problem to me.
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
It has somewhat regularly saved us from disaster in the past.
Sometimes you have smart people in the room who dig deeper and fish it out, but you cannot always rely on that.
My best guess is too many alerts firing without a clear hierarchy and no way to separate cause from effect. It's a typical challenge, but I wish they would shed some light on that. And it's a bit concerning that improving observability is not part of their follow-up steps.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:
1. Push out files A/B to ensure the old file is not removed.
2. Handle the failure of loading the file (for whatever reason) by automatically reloading the old file instead and logging the error.
This seems like pretty basic SRE stuff.
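A sketch of point 2 - reject a file that fails validation and fall back to the previous known-good one. The line-per-feature format, the 200-row cap, and all names here are assumptions for illustration:

```rust
/// Hypothetical parser for a pushed feature file: one feature per line,
/// with an illustrative 200-row cap.
fn parse_feature_file(contents: &str) -> Result<Vec<String>, String> {
    let rows: Vec<String> = contents.lines().map(str::to_owned).collect();
    if rows.len() > 200 {
        return Err(format!("feature file too large: {} rows", rows.len()));
    }
    Ok(rows)
}

/// Try the newly pushed file first; if it fails validation, log the error
/// and keep serving from the last file that was accepted.
/// Returns the rows in effect and whether the new file was accepted.
fn load_with_fallback(new_file: &str, last_good: &str) -> (Vec<String>, bool) {
    match parse_feature_file(new_file) {
        Ok(rows) => (rows, true),
        Err(e) => {
            eprintln!("rejecting new feature file, keeping last-known-good: {e}");
            // last_good was validated when it was accepted, so this should
            // not fail; a real system would still handle the Err arm.
            (parse_feature_file(last_good).unwrap_or_default(), false)
        }
    }
}
```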
Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)
IDK Atlassian Statuspage clientele, but it's possible Cloudflare is much larger than usual.
> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
> Enabling more global kill switches for features
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
> Reviewing failure modes for error conditions across all core proxy modules
Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming Cloudflare has such boundaries at all. How are they going to contain the blast radius in the future?
This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.
Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.
I wonder why ClickHouse is used to store the feature flags here, as it has its own duplication footguns[0] which could also easily have led to a query blowing up 2-3x in size. OLTP/SQLite seems more suited, but I'm sure they have their reasons.
[0] https://clickhouse.com/docs/guides/developer/deduplication
In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.
Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.
In reference to fault isolation boundaries: I'm not familiar with their CI/CD; in theory the error could have been caught/prevented there, but that comes with a lot of "it depends", or it's tricky. It looks like they didn't go the extra mile on safety-sensitive areas. So, euphemistically speaking, they are now recalibrating their balance of safety measures.
Assuming something similar to Sentry would be in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well-defined clean crashes should in theory also stand out against all the random errors that start occurring all over the system as it begins to go down, precisely because it's always failing at the exact same point.
The issue here is about the system as a whole not any line of code.
At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").
The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.
If anyone at cloudflare is here please let me in that codebase :)
Unwrap gives you a stack trace, while a returned Err doesn't, so simply using a Result for that line of code could have made it even harder to diagnose.
`unwrap_or_default()` or other ways of silently eating the error would be less catastrophic immediately, but could still end up breaking the system down the line, and likely make it harder to trace the problem to the root cause.
The problem is deeper than an unwrap(), related to handling rollouts of invalid configurations, but that's not a 1-line change.
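One middle ground is returning an `Err` that names the broken invariant, which keeps the diagnosability without the crash. A sketch, with all names and signatures hypothetical (this is not Cloudflare's actual API):

```rust
/// Append feature names, enforcing a caller-supplied limit. On overflow,
/// return an error that spells out the broken invariant instead of
/// panicking, so the caller can log it and fall back.
fn append_with_names(
    values: &mut Vec<String>,
    names: &[String],
    limit: usize,
) -> Result<(), String> {
    if values.len() + names.len() > limit {
        return Err(format!(
            "feature count {} exceeds limit {} while appending {} names",
            values.len() + names.len(),
            limit,
            names.len()
        ));
    }
    values.extend_from_slice(names);
    Ok(())
}
```

The log line then points at the exact invariant that broke, which addresses the "Err is harder to diagnose" concern, though it still leaves open the question of what the caller should do next.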
This has to sting a bit after that post.
I don't use Rust, but a lot of Rust people say if it compiles it runs.
Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.
end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.
there's no bad language - just occasional hiccups from us users who use those tools.
Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.
Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)
This is the classic example of how, when something fails, we over-index on the failure cause, while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.
I postulate that whatever this Cloudflare outage cost, be it millions or hundreds of millions of dollars, it has been paid for many times over by the savings from safe memory access.
Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.
No language will prevent you from writing broken code if you're determined to do so, and no language is impossible to write correct code in if you make a superhuman effort. But most of life happens in the middle, and tools like Rust make a huge difference to how often a small mistake snowballs into a big one.
This is not a Rust problem. Someone consciously chose to NOT handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.
Anecdotally I can write code for several hours, deploy it to a test sandbox without review or running tests and it will run well enough to use it, without silly errors like null pointer exceptions, type mismatches, OOBs etc. That doesn't mean it's bug-free. But it doesn't immediately crash and burn either. Recently I even introduced a bug that I didn't immediately notice because careful error handling in another place recovered from it.
Do you grok what the issue was with the unwrap, though...?
Idiomatic Rust code does not use that. The fact that it's allowed in a codebase says more about the engineering practices of that particular project/module/whatever. Whoever put the `unwrap` call there had to contend with the notion that it could panic and they still chose to do it.
It's a programmer error, but Rust at least forces you to recognize "okay, I'm going to be an idiot here". There is real value in that.
could have been tight deadline, managerial pressure or just the occasional slip up.
Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).
A CrowdStrike-style config update that affects all nodes but obviously isn't tested in any QA or staged-rollout strategy beforehand (the application panicking straight away with this new file basically proves this).
Finally an error with bot management config files should probably disable bot management vs crash the core proxy.
I'm interested here why they even decided to name Clickhouse as this error could have been caused by any other database. I can see though the replicas updating causing flip / flopping of results would have been really frustrating for incident responders.
The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.
(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)
What would you propose to fix it? The fixed cost of being DDoS-proof is in the hundreds of millions of dollars.
I can imagine that this could easily lead to less visibility into issues.
I haven’t worked in Rust codebases, but I have never worked in a Go codebase where a `panic` in such a location would make it through code review.
Is this normal in Rust?
I get it, don’t pick languages just because they are trendy, but if any company’s use case is a perfect fit for Rust it’s cloudflare.
Of course, some users were still blocked, because the Turnstile JS failed to load in their browser but the subsequent siteverify check succeeded on the backend. But overall the fail-open implementation lessened impact to our customers nonetheless.
Fail-open with Turnstile works for us because we have other bot mitigations that are sufficient to fall back on in the event of a Cloudflare outage.
For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.
This simply means the exception-handling quality of your new FL2 is non-existent, and is not on par with, or logically equivalent to, FL.
I hope it was not because of AI driven efficiency gains.
There are plenty of resources, yet it's somehow never enough. You do tons of pretty amazing things with pretty amazing tools that also have notable shortcomings.
You're surrounded by smart people who do lots of great work, but you also end up in incident reviews where you find facepalm-y stuff. Sometimes you even find out it was a known corner case that was deemed too unlikely to prioritize.
The last incident for my team that I remember dealing with there ended up with my coworker and I realizing the staging environment we'd taken down hours earlier was actually the source of data for a production dashboard, so we'd lost some visibility and monitoring for a bit.
I've also worked at Facebook (pre-Meta days) and at Datadog, and I'd say it was about the same. Most things are done quite well, but so much stuff is happening that you still end up with occasional incidents that feel like they shouldn't have happened.
They are escape hatches. Without those your language would never take off.
But here's the thing. Escape hatches are like emergency exits. They are not to be used by your team to go to lunch in a nearby restaurant.
---
Cloudflare should likely invest in better linting and CI/CD alerts. Not to mention isolated testing i.e. deploy this change only to a small subset and monitor, and only then do a wider deployment.
Hindsight is 20/20 and we can all be smartasses after the fact of course. But I am really surprised because lately I am only using Rust for hobby projects and even I know I should not use `unwrap` and `expect` beyond the first iteration phases.
---
I have advocated for this before but IMO Rust at this point will benefit greatly from disallowing those unsafe APIs by default in release mode. Though I understand why they don't want to do it -- likely millions of CI/CD pipelines will break overnight. But in the interim, maybe a rustc flag we can put in our `Cargo.toml` that enables such a stricter mode? Or have that flag just remove all the panicky API _at compile time_ though I believe this might be a Gargantuan effort and is likely never happening (sadly).
In any case, I would expect many other failures from Cloudflare but not _this_ one in particular.
Bubbling up the error or None does not make the program correct. Panicking may be the only reasonable thing to do.
If panicking is guaranteed because of some input mistake to the system, your failure is in testing.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures
> As of 17:06 all systems at Cloudflare were functioning as normal
6 hours / 5 years gives ~99.98% uptime.

And here is the query they used** (OK, so it's not exactly):

SELECT * FROM feature JOIN permissions ON feature.feature_type_id = permissions.feature_type_id

someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using, much less the queries (but I do have an idea).

more edits: OK, apparently it's described later in the post as a query against ClickHouse's table metadata table, and because users were granted access to an additional database that was actually the backing store of the one they normally worked with, some row-level-security type of thing doubled up the rows. Not sure why querying system.columns is part of a production-level query, though; seems overly dynamic.
I don’t think the infrastructure has been as fully recovered as they think yet…
> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail
A configuration error can cause internet-scale outages. What an era we live in
Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?
I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!
I think it's quite rare for any company to have the same scale and size of storage in stage as in prod.
I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.
Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.
In theory there's a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected to not cause functional changes in behaviour.
Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain and error prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
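A minimal sketch of just the comparison step; representing a result set as a vector of string rows is an assumption for illustration:

```rust
/// Normalise a query result so that row order doesn't matter.
fn normalise(mut rows: Vec<Vec<String>>) -> Vec<Vec<String>> {
    rows.sort();
    rows
}

/// Golden-output check: equal up to row order. A permission change that
/// starts duplicating rows (as in this incident) fails the comparison
/// even though each individual row still looks valid.
fn matches_golden(actual: Vec<Vec<String>>, golden: Vec<Vec<String>>) -> bool {
    normalise(actual) == normalise(golden)
}
```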
I'd agree that the use of `unwrap` could possibly make sense in a place where you do want the system to fail hard. There's lot of good reasons to make the system fail hard. I'd lean towards an `expect` here, but whatever.
That said, the function already returns a `Result` and we don't know what the calling code looks like. Maybe it does do an `unwrap` there too, or maybe there is a safe way for this to log and continue that we're not aware of because we don't have enough info.
Should a system as critical as the CF proxy fail hard? I don't know. I'd say yes if it was the kind of situation that could revert itself (like an incremental rollout), but this is such an interesting situation since it's a config being rolled out. Hindsight is 20:20 obviously, but it feels like there should've been better logging, deployment, rollback, and parsing/validation capabilities, no matter what the `unwrap`/`Result` option is.
Also, it seems like the initial ClickHouse changes could've been tested much better, but I'm sure the CF team realizes that.
On the bright side, this is a very solid write up so quickly after the outage. Much better than those times we get it two weeks later.
Companies seem to place a lot of trust in configs being pushed automatically without human review into running systems. Considering how important these configs are, shouldn't they perhaps first be deployed to a staging/isolated network for a monitoring window before being pushed to production systems?
Not trying to pontificate here, these systems are more complicated than anything I have maintained. Just trying to think of best practices perhaps everyone can adopt.
I'm a little confused about how this was initially mistaken for an attack, though.
Is there no internal visibility into where 5xxs are being thrown? I'm surprised there isn't some kind of "this request terminated at the <bot checking logic>" error mapping that could have initially pointed you guys towards that rather than towards an attack.
Also a bit taken aback that `.unwrap()`s are ever allowed in such an important context.
would appreciate some insight!
2. Attacks that make it through the usual defences make servers run at rates beyond their breaking point, causing all kinds of novel and unexpected errors.
Additionally, attackers try to hit endpoints/features that amplify the severity of their attack by being computationally expensive, holding a lock, or triggering an error path that restarts a service — like this one.
1) Lack of validation of the configuration file.
Rolling out a config file across the global network every 5 minutes is extremely high risk. Even without hindsight, surely one would see the need for very careful validation of this file before taking on that risk?
There were several things "obviously" wrong with the file that validation should have caught:
- It was much bigger than expected.
- It had duplicate entries.
- Most importantly, when loaded into the FL2 proxy, the proxy would panic on every request. At the very least, part of the validation should involve loading the file into the proxy and serving a request?
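The first two checks can be sketched as a pre-deploy validation pass. This assumes the feature file is one entry per line; the limits and file format are assumptions for illustration, not Cloudflare's actual ones:

```rust
use std::collections::HashSet;

const MAX_FILE_BYTES: usize = 4_096; // stand-in for "much bigger than expected"
const MAX_ENTRIES: usize = 200;

fn validate_feature_file(contents: &str) -> Result<(), String> {
    // Reject a file far larger than expected before parsing anything.
    if contents.len() > MAX_FILE_BYTES {
        return Err(format!(
            "file is {} bytes, limit is {MAX_FILE_BYTES}",
            contents.len()
        ));
    }
    // Reject duplicate entries and enforce the entry-count limit.
    let mut seen = HashSet::new();
    for line in contents.lines() {
        if !seen.insert(line) {
            return Err(format!("duplicate entry: {line}"));
        }
    }
    if seen.len() > MAX_ENTRIES {
        return Err(format!("{} entries exceeds limit {MAX_ENTRIES}", seen.len()));
    }
    Ok(())
}

fn main() {
    assert!(validate_feature_file("feature_a\nfeature_b").is_ok());
    assert!(validate_feature_file("feature_a\nfeature_a").is_err());
    println!("validation checks passed");
}
```

The third point (actually loading the file into a proxy instance and serving a canary request) is the stronger test, since it exercises the real consumer rather than a parallel reimplementation of its limits.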
2) Very long time to identify and then fix such a critical issue.
I can't understand the complete lack of monitoring or reporting. A panic in Rust code, especially from an unwrap, is the application screaming that there's a logic error! I don't understand how that can be conflated with a DDoS attack. How were your logs not filled with backtraces pointing to the exact `unwrap` in question?
Then, once identified, why was it so hard to revert to a known good version of the configuration file? How did no one foresee the need to roll back this file when designing a feature that deploys a new one globally every 5 minutes?
So they basically hardcoded something, didn't bother to cover the overflow case with unit tests, didn't have basic error handling that would fall back and send logs/alerts to their internal monitoring system, and this is why half of the internet went down?
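The "fall back" part could look something like the following sketch: a bad config is rejected and the last known good one stays live, instead of every request panicking. All names and the toy limit are illustrative, not Cloudflare's code:

```rust
struct Proxy {
    features: Vec<String>,
}

fn parse_config(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw.lines().map(String::from).collect();
    if features.len() > 3 {
        // toy limit standing in for the preallocated feature budget
        return Err(format!("{} features exceeds limit of 3", features.len()));
    }
    Ok(features)
}

impl Proxy {
    /// Apply a new config only if it parses; otherwise log and keep
    /// serving with the last known good one (and alert, in real life).
    fn apply_config(&mut self, raw: &str) {
        match parse_config(raw) {
            Ok(features) => self.features = features,
            Err(e) => eprintln!("bad config rejected, keeping previous: {e}"),
        }
    }
}

fn main() {
    let mut proxy = Proxy { features: vec!["a".to_string()] };
    proxy.apply_config("a\nb");          // good config: applied
    proxy.apply_config("a\nb\nc\nd\ne"); // bad config: rejected, old kept
    assert_eq!(proxy.features, vec!["a".to_string(), "b".to_string()]);
    println!("proxy kept last known good config");
}
```

The trade-off, as other comments note, is that failing soft can silently mask a config pipeline that is broken, so the `eprintln!` would need to be a paging alert rather than a log line.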
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
but this is not mentioned at all in the timeline above. My best guess would be that the process got stuck in a tight restart loop and filled the available disk space with logs, but I'm happy to hear other guesses from people more familiar with Rust.
> As well as returning HTTP 5xx errors, we observed significant increases in latency of responses from our CDN during the impact period. This was due to large amounts of CPU being consumed by our debugging and observability systems, which automatically enhance uncaught errors with additional debugging information.
(Just above https://blog.cloudflare.com/18-november-2025-outage/#how-clo...)
The report actually seems to confirm this - it was indeed a crash on ingesting the bad config. However, I'm actually surprised that the long duration didn't come from "it takes a long time to restart the fleet manually" or "tooling to restart the fleet was bad".
The problem mostly seems to have been "we didn't know what was going on". A look into the proxy logs would hopefully have shown the stacktrace/unwrap, and metrics about the incoming requests would hopefully have shown that there was no abnormal amount of traffic coming in.
They just sell proxies, to whoever.
Why are they the only company doing ddos protection?
I just don't get it.
I wrote a book on feature stores for O'Reilly. The bad query they wrote in ClickHouse could also have been triggered by a different error: duplicate rows in the materialized feature data. For example, Hopsworks prevents duplicate rows by building on primary key uniqueness enforcement in Apache Hudi. In contrast, Delta Lake and Iceberg do not enforce primary key constraints, and neither does ClickHouse. So they could hit the same bug again due to a bug in feature ingestion, and given that they hacked together their feature store, it is not beyond the bounds of possibility.
Reference: https://www.oreilly.com/library/view/building-machine-learni...
What were the teams doing between 11:00 and 13:00? There is no explanation of what investigations were going on or why the root cause couldn't be identified sooner.
What I'm trying to say is that things would be much better if everyone took a chill pill and accepted the possibility that in rare instances, the internet doesn't work and that's fine. You don't need to keep scrolling TikTok 24/7.
> but my use case is especially important
Take a chill pill. Probably it isn't.
How about:
1. The permissions change project is paused or rolled back until
2. All impacted database interactions (SQL queries) are evaluated for improper assumptions, or better,
3. Their design that depends on database meta info and schema is replaced with one that uses specific tables and rows instead of treating the meta info as part of the application.
4. All hard-coded limits are centralized in a single global module, referenced from their users, and back-propagated to any separate generator processes that validate against the limit before pushing generated changes.
Cloudflare's incident report is written clearly and explicitly, so based on my own understanding, I’m going to try reproducing this outage. Already completed:
- CK cluster
- Permission change triggering data doubling
- Cache propagation
- Unaffected proxy services
- Proxy services with bot score errors
TODO:
- `unwrap` panic during pre-allocation of cache
- Full demonstration of the entire outage process
This is all gone. The internet is a centralised system in the hands of just a few companies. If AWS goes down, half the internet does. If Azure, Google Cloud, Oracle Cloud, Tencent Cloud or Alibaba Cloud goes down, a large part of the internet does.
Yesterday with Cloudflare down half the sites I tried gave me nothing but errors.
The internet is dead.
Shouldn't the architecture be set up in such a way that subcomponents can fail without impacting the critical function of the component?
However, I have a question from a release/deployment process perspective. Why was this issue not detected during internal testing? I didn't find the RCA covering this aspect. Doesn't Cloudflare have an internal test stage as part of its CI/CD pipeline? Looking at the description of the issue, it should have been immediately detected in an internal staging environment.
There never was an unbounded "select all rows from some table" without a "fetch first N rows only" or "limit N"?
If you knew that this design is rigid, why not leverage the query to actually enforce it?
What am I missing?
Anyway, regardless of which language you use to construct a SQL query, you're not obligated to put in a max row count.
All that said, to have an outage reported turned around practically the same day, that is this detailed, is quite impressive. Here's to hoping they make their changes from this learning, and we don't see this exact failure mode again.
I think this is happening way too frequently.
Meanwhile, VPSes and dedicated servers hum along without any issues.
I don't want to use Kubernetes, but if we have to build mission-critical systems, it doesn't seem like building on Cloudflare is going to cut it.
Gonna use that one at $WORK.
What could have prevented this failure?
Cloudflare's software could have included a check that refused to generate the feature file if its size exceeded the limit.
A testcase could have caught this.
It sounds like the change could've been rolled out more slowly, halted when the incident started and perhaps rolled back just in case.
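The first suggestion can be sketched as a guard in the file generator itself: refuse to publish a feature file that would blow past the consumers' limit, rather than shipping it and letting every proxy panic. The limit and the line-per-feature format are assumptions for illustration:

```rust
const MAX_OUTPUT_BYTES: usize = 1_024; // illustrative publish limit

/// Build the feature file, refusing to publish one that exceeds the
/// limit the consuming proxies were built around.
fn generate_feature_file(features: &[&str]) -> Result<String, String> {
    let out = features.join("\n");
    if out.len() > MAX_OUTPUT_BYTES {
        return Err(format!(
            "refusing to publish: {} bytes exceeds limit {MAX_OUTPUT_BYTES}",
            out.len()
        ));
    }
    Ok(out)
}

fn main() {
    assert!(generate_feature_file(&["feature_a", "feature_b"]).is_ok());

    // A doubled feature set, as in the outage, trips the guard here
    // instead of crashing the fleet downstream.
    let doubled: Vec<&str> = std::iter::repeat("feature_with_a_long_name")
        .take(100)
        .collect();
    assert!(generate_feature_file(&doubled).is_err());
    println!("generator refuses oversized files");
}
```

Checking at generation time and again at load time (belt and suspenders) means the two limits must be kept in sync, which is exactly the "centralize hard-coded limits" point another comment raises.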
If that's true, is there a way to tell (easily) whether a site is using cloudflare or not?
Just ping the host and see if the IP belongs to CF.
I'm impressed they were able to corral people this quickly.
The last thing we need here is for more of the internet to sign up for Cloudflare.
Reputationally this is extremely embarrassing for Cloudflare, but IMO they seem to be getting back on their feet. I was surprised to see not just one but two apologies to the internet. This just underscores how professional and dedicated the Cloudflare team is to ensuring a stable, resilient internet, and how embarrassed they must have been.
A reputational hit for sure, but the outcome is lessons learned and, hopefully, stronger resilience.
> Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare.
also cloudflare:
> The Cloudflare Dashboard was also impacted due to both Workers KV being used internally and Cloudflare Turnstile being deployed as part of our login flow.
Sounds like the ops team had one hell of a day.
Why have we built / permitted the building of / Subscribed to such a Failure-intolerant "Network"?
Even worse - the small botnet that controls everything.
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
Go to jeffblearning on LinkedIn. I took it down with 253 copies of a text file delivered through a vulnerability in Novo’s systems.
I’ve documented all of it.
It’s not done yet…
Even a simple key-value map per feature should have allowed for insertions as simple as a put/replace of the value and not appending to the file. That was not the case here, where Cloudflare kept appending to the file for any feature to be added. And I am assuming the features are bot attack patterns as features. Anyway, there is something fundamental here that Cloudflare should rethink. If someone can educate me on the design, I can continue reading the next few lines.
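The append-only failure mode described here goes away if the file is keyed per feature: a second write for the same feature replaces the value instead of adding a duplicate row. A minimal sketch with a made-up feature name:

```rust
use std::collections::HashMap;

fn main() {
    let mut features: HashMap<&str, f64> = HashMap::new();

    features.insert("bot_score_weight", 0.5);
    features.insert("bot_score_weight", 0.7); // replaces, does not append

    // Duplicate source rows can no longer grow the feature set.
    assert_eq!(features.len(), 1);
    assert_eq!(features["bot_score_weight"], 0.7);
    println!("duplicate insert replaced the value; no growth");
}
```

With a keyed structure, the duplicated ClickHouse rows would have collapsed back to one entry per feature instead of doubling the file past its limit.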
Didn't the services that were crashing due to OOM raise any alerts?
This is shitty at so many levels.
Big tech is a fucking joke.
...
(I'd pick Haskell, cause I'm having fun with it recently :P)
Excuse me, what did you just say? Who decided on “Cloudflare's importance in the Internet ecosystem”? Some see it differently, you know; there's no need for that self-assured arrogance of an inseminating alpha male.
No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.
Free fire
My dude, everything is a footgun if you hold it wrong enough
These folks weren't operating for charity. They were highly paid so-called professionals.
Who will be held accountable for this?
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
This is the first significant outage that has involved Rust code, and as you can see, `.unwrap()` is known to carry the risk of a panic and should never be used in production code.
> The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:
SELECT
name,
type
FROM system.columns
WHERE
table = 'http_requests_features'
order by name;
> Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.

- Their database permissions changed unexpectedly (??)
- This caused a 'feature file' to be changed in an unusual way (?!)
- Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
- Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
- They hit an internal application memory limit and that just... crashed the app
- The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
- After fixing it, they were vulnerable to a thundering herd problem
- Customers who were not using bot rules were not affected; Cloudflare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots
In terms of preventing this from a software engineering perspective: they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits, with neither a test for whether an input would hit a limit nor some kind of alarm to notify the engineers of the source of the problem. From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
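The circuit-breaker idea can be sketched as a simple gate on a staged rollout: propagate a change further only while the canary hosts' error rate stays below a threshold. The threshold and signal here are illustrative, not anyone's real deployment system:

```rust
const MAX_ERROR_RATE: f64 = 0.01; // illustrative 1% canary threshold

/// Decide whether a staged rollout may widen, based on canary health.
fn should_continue_rollout(canary_errors: u64, canary_requests: u64) -> bool {
    if canary_requests == 0 {
        return false; // no signal yet: don't widen the blast radius
    }
    (canary_errors as f64 / canary_requests as f64) < MAX_ERROR_RATE
}

fn main() {
    // 0.3% errors on the canaries: keep rolling out.
    assert!(should_continue_rollout(3, 1_000));
    // 50% errors, as every request panics: halt and roll back.
    assert!(!should_continue_rollout(500, 1_000));
    println!("circuit breaker checks passed");
}
```

A real implementation would also compare against the pre-rollout baseline and trigger an automatic revert, but even this crude gate would have stopped a config that made every request 5xx from reaching the whole fleet.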