As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.
> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions
Also appreciate the honesty here.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]
> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.
Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)
Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.
The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):
> example.com needs to review the security of your connection before proceeding.
It bothers me how this bald-faced lie of a wording has persisted.
(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)
At almost any other large-ish company, there would be layers of "stakeholders" slowing this process down. They would almost never allow code to be published.
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.
2. Despite the elevated risk of system-wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
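To make that concrete, here's a minimal sketch. The lint configuration and the function are illustrative, not from the post; `unwrap_used` is a real clippy lint, and denying it crate-wide forces every remaining unwrap to carry an explicit opt-out:

```rust
// With clippy's unwrap_used lint denied crate-wide (e.g. in Cargo.toml:
// [lints.clippy] unwrap_used = "deny"), any surviving unwrap must be
// annotated, which is a natural place to attach the justification.

// INFALLIBILITY: the literal below is a valid usize, so parse() cannot fail.
#[allow(clippy::unwrap_used)]
fn default_limit() -> usize {
    "200".parse::<usize>().unwrap()
}
```

The point is not that the annotation makes the code safer by itself, but that it forces the author to write down *why* the value cannot be an error, which a reviewer can then check.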
That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
The root cause here was that a file was mildly corrupt (with duplicate entries, I guess). And there was a validation check elsewhere that said "THIS FILE IS TOO BIG".
But if that's a validation failure, well, failing is correct? What wasn't correct was that the failure reached production. What should have happened is that the validation should have been a unified thing and whatever generated the file should have flagged it before it entered production.
And that's not an issue with function return value API management. The software that should have bailed was somewhere else entirely, and even there an unwrap explosion (in a smoke test or pre-release pass or whatever) would have been fine.
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
Maybe the validation code should've handled the larger size, but also the db query produced something invalid. That shouldn't have ever happened in the first place.
> This wasn't an attack, but a classic chain reaction triggered by “hidden assumptions + configuration chains” — permission changes exposed underlying tables, doubling the number of lines in the generated feature file. This exceeded FL2's memory preset, ultimately pushing the core proxy into panic.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
> Technical details: even if the unwrap were handled correctly, an OOM would still occur. The primary issue was the lack of contract validation in feature ingest. The configuration system needs "bad → reject, keep last-known-good" logic.
> Why did it persist so long? The global kill switch was inadequate, preventing rapid circuit-breaking. Early suspicion of an attack also caused delays.
> Why not roll back software versions or restart?
> Rollback isn't feasible because this isn't a code issue—it's a continuously propagating bad configuration. Without version control or a kill switch, restarting would only cause all nodes to load the bad config faster and accelerate crashes.
> Why not roll back the configuration?
> Configuration lacks versioning and functions more like a continuously updated feed. As long as the ClickHouse pipeline remains active, manually rolling back would result in new corrupted files being regenerated within minutes, overwriting any fixes.
I don't like to use implicit unwrap. Even things that are guaranteed to be there, I treat as explicit (For example, (self.view?.isEnabled ?? false), in a view controller, instead of self.view.isEnabled).
I always redefine @IBOutlets from:
@IBOutlet weak var someView: UIView!
to: @IBOutlet weak var someView: UIView?
I'm kind of a "belt & suspenders" type of guy.

How so? “Parse, don’t validate” implies converting input into typed values that prevent representation of invalid state. But the parsing still needs to be done correctly. An unchecked unwrap really has nothing to do with this.
While there are certainly many things to admire about Rust, this is why I prefer Golang's "noisy" error handling. In golang that would be either:
feature_values, err := features.append_with_names(...)
And the compiler would have complained that this value of `err` was unused; or you'd write: feature_values, _ := features.append_with_names(...)
And it would be far more obvious that an error message is being ignored.

(Renaming `unwrap` to `unwrapOrPanic` would probably help too.)
I have been saying for years that Rust botched error handling in unfixable ways. I will go to the grave believing Rust fumbled.
The design of the Rust language encourages people to use unwrap() to turn foreseeable runtime problems into fatal errors. It's the path of least resistance, so people will take it.
Rust encourages developers to consider only the happy path. No wonder it's popular among people who've never had to deal with failure.
All of the concomitant complexity (Result, `?`, the test thing, anyhow, the inability of the stdlib to report allocation failure) is downstream of a fashion statement against exceptions that Rust cargo-culted from Go.
The funniest part is that Rust does have exceptions. It just calls them panics. So Rust code has to deal with the ergonomic footgun of Result but pays anyway for the possibility of exceptions. (Sure, you can compile with panic=abort. You can't count on it.)
I could not be more certain that Rust should have been a language with exceptions, not Result, and that error objects are a gross antipattern we'll regret for decades.
Them calling unwrap on a limit check is the real issue, IMO. Everything that takes in external input should assume it is bad input and should be fuzz tested.
In the end, what is the point of having a limit check if you just unwrap on it?
The way they wrote the code means that having more than 200 features is a hard non-transient error - even if they recovered from it, it meant they'd have had the same error when the code got to the same place.
I'm sure when the process crashed, k8s restarted the pod or something - then it reran the same piece of code and crashed in the same place.
While I don't necessarily agree with crashing as a business strategy, I don't see what the code could have done other than either drop the extra rules or allocate more memory - neither of which the original code was built to do (probably by design).
The code made the local hard assumption that there won't ever be more than 200 rules and its okay to crash if that count is exceeded.
If you design your code around an invariant never being violated (which is fine), you have to surface it clearly at a higher level when it is.
This isn't a Rust problem (though Rust does make it easy to do the wrong thing here imo)
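As a sketch of what the softer alternative could look like - keep the first N entries and make the overflow visible instead of fatal. The 200-entry cap mirrors the post, but the names and the return shape here are entirely hypothetical:

```rust
const FEATURE_LIMIT: usize = 200; // hypothetical cap, mirroring the post

/// Instead of treating "more than FEATURE_LIMIT entries" as fatal, keep the
/// first FEATURE_LIMIT and report how many were dropped, so an oversized
/// file degrades scoring quality rather than crashing the proxy.
fn load_features(mut features: Vec<String>) -> (Vec<String>, usize) {
    let dropped = features.len().saturating_sub(FEATURE_LIMIT);
    features.truncate(FEATURE_LIMIT);
    (features, dropped)
}
```

Whether silently degrading is actually acceptable for a bot-scoring system is the real design question; the point is only that the choice should be explicit, not a side effect of an unwrap.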
It may be that forcing handling at every call tends to make code verbose, and devs desensitized to bad practice. And the diagnostic Rust provided seems pretty garbage.
There is bad practice here too -- config failure manifesting as request failure, lack of failing to safe, unsafe rollout, lack of observability.
Back to language design & error handling. My informed view is that robustness is best when only major reliability boundaries need to be coded.
This the "throw, don't catch" principle with the addition of catches on key reliability boundaries -- typically high-level interactions where you can meaningfully answer a failure.
For example, this system could have a total of three catch clauses "Error Loading Config" which fails to safe, "Error Handling Request" which answers 5xx, and "Socket Error" which closes the HTTP connection.
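A minimal sketch of the middle catch clause ("Error Handling Request"), using `std::panic::catch_unwind`; the handler and responses are made up for illustration, and note this relies on the default panic=unwind setting:

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical request path that can panic somewhere deep inside.
fn handle(path: &str) -> String {
    assert!(!path.contains("boom"), "logic error while scoring request");
    format!("200 OK for {path}")
}

// Reliability boundary: a panic anywhere below this point is answered
// with a 5xx for that one request instead of crashing the process.
fn serve(path: &str) -> String {
    panic::catch_unwind(AssertUnwindSafe(|| handle(path)))
        .unwrap_or_else(|_| "500 Internal Server Error".to_string())
}
```

The "Error Loading Config" boundary would look similar but would fall back to the last-known-good config rather than answering a 5xx.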
This .unwrap() sounds too easy for what it does, certainly much easier than having an entire try..catch block with an explicit panic. Full disclosure: I don't actually know Rust.
The config bug reaching prod without this being caught and pinpointed immediately is the strange part.
Also wonder with a sharded system why are they not slow rolling out changes and monitoring?
Average Go code has far fewer panics than Rust has unwraps, which are functionally equivalent.
it'd be kinda hard to amend the clippy lints to ignore coroutine unwraps but still pipe up on system ones. i guess.
edit: i think they'd have to be "solely-task-color-flavored" so definitely probably not trivial to infer
First multi-million dollar .unwrap() story.
Also, exception handling is hard and lame. We don't need exceptions, just add a "match" block after every line in your program.
Rust compiler is a god of sorts, or at least a law of nature haha
Way to comment and go instantly off topic
The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.
The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.
The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.
So basically bad config should be explicitly processed and handled by rolling back to known working config.
But the architectural assumption that the bot file build logic can safely obtain this operationally critical list of features from derivative database metadata vs. a SSOT seems like a bigger problem to me.
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
It has somewhat regularly saved us from disaster in the past.
Sometimes you have smart people in the room who dig deeper and fish it out, but you cannot always rely on that.
My best guess is too many alerts firing without a clear hierarchy and no way to separate cause from effect. It's a typical challenge, but I wish they would shed some light on that. And it's a bit concerning that improving observability is not part of their follow-up steps.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:
1. Push out files A/B to ensure the old file is not removed.
2. Handle the failure of loading the file (for whatever reason) by automatically reloading the old file instead and logging the error.
This seems like pretty basic SRE stuff.
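A sketch of point 2 - reject a file that fails validation and fall back to the previous known-good one. The line-per-feature format, the 200-row cap, and all names here are assumptions for illustration:

```rust
/// Hypothetical parser for a pushed feature file: one feature per line,
/// with an illustrative 200-row cap.
fn parse_feature_file(contents: &str) -> Result<Vec<String>, String> {
    let rows: Vec<String> = contents.lines().map(str::to_owned).collect();
    if rows.len() > 200 {
        return Err(format!("feature file too large: {} rows", rows.len()));
    }
    Ok(rows)
}

/// Try the newly pushed file first; if it fails validation, log the error
/// and keep serving from the last file that was accepted.
/// Returns the rows in effect and whether the new file was accepted.
fn load_with_fallback(new_file: &str, last_good: &str) -> (Vec<String>, bool) {
    match parse_feature_file(new_file) {
        Ok(rows) => (rows, true),
        Err(e) => {
            eprintln!("rejecting new feature file, keeping last-known-good: {e}");
            // last_good was validated when it was accepted, so this should
            // not fail; a real system would still handle the Err arm.
            (parse_feature_file(last_good).unwrap_or_default(), false)
        }
    }
}
```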
Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)
IDK Atlassian Statuspage clientele, but it's possible Cloudflare is much larger than usual.
> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
> Enabling more global kill switches for features
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
> Reviewing failure modes for error conditions across all core proxy modules
Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming Cloudflare has such boundaries at all. How are they going to contain the blast radius in the future?
This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.
Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.
I wonder why ClickHouse is used to store the feature flags here, as it has its own duplication footguns[0] which could also easily have led to a query blowing up 2-3x in size. OLTP/SQLite seems more suited, but I'm sure they have their reasons.
[0] https://clickhouse.com/docs/guides/developer/deduplication
In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.
Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.
In reference to fault isolation boundaries: I'm not familiar with their CI/CD; in theory the error could have been caught/prevented there, but that comes with a lot of "it depends", or it's tricky. It looks like they didn't go the extra mile on safety-sensitive areas. So, euphemistically speaking, they are now recalibrating their balance of safety measures.
Assuming something similar to Sentry would be in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well-defined clean crashes should in theory also stand out against all the random errors that start occurring all over the system as it begins to go down, precisely because it's always failing at the exact same point.
The issue here is about the system as a whole not any line of code.
At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").
The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.
If anyone at cloudflare is here please let me in that codebase :)
Unwrap gives you a stack trace, while a returned Err doesn't, so simply using a Result for that line of code could have made it even harder to diagnose.
`unwrap_or_default()` or other ways of silently eating the error would be less catastrophic immediately, but could still end up breaking the system down the line, and likely make it harder to trace the problem to the root cause.
The problem is deeper than an unwrap(), related to handling rollouts of invalid configurations, but that's not a 1-line change.
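One middle ground is returning an `Err` that names the broken invariant, which keeps the diagnosability without the crash. A sketch, with all names and signatures hypothetical (this is not Cloudflare's actual API):

```rust
/// Append feature names, enforcing a caller-supplied limit. On overflow,
/// return an error that spells out the broken invariant instead of
/// panicking, so the caller can log it and fall back.
fn append_with_names(
    values: &mut Vec<String>,
    names: &[String],
    limit: usize,
) -> Result<(), String> {
    if values.len() + names.len() > limit {
        return Err(format!(
            "feature count {} exceeds limit {} while appending {} names",
            values.len() + names.len(),
            limit,
            names.len()
        ));
    }
    values.extend_from_slice(names);
    Ok(())
}
```

The log line then points at the exact invariant that broke, which addresses the "Err is harder to diagnose" concern, though it still leaves open the question of what the caller should do next.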
This has to sting a bit after that post.
I don't use Rust, but a lot of Rust people say if it compiles it runs.
Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.
end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.
there's no bad language - just occasional hiccups from us users who use those tools.
Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.
Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)
This is the classic example of how, when something fails, we over-index on the failure cause, while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.
I postulate that whatever this Cloudflare outage cost, be it millions or hundreds of millions of dollars, it has been paid for many times over by the savings from safe memory access.
Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.
No language will prevent you from writing broken code if you're determined to do so, and no language is impossible to write correct code in if you make a superhuman effort. But most of life happens in the middle, and tools like Rust make a huge difference to how often a small mistake snowballs into a big one.
This is not a Rust problem. Someone consciously chose to NOT handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.
Anecdotally I can write code for several hours, deploy it to a test sandbox without review or running tests and it will run well enough to use it, without silly errors like null pointer exceptions, type mismatches, OOBs etc. That doesn't mean it's bug-free. But it doesn't immediately crash and burn either. Recently I even introduced a bug that I didn't immediately notice because careful error handling in another place recovered from it.
Do you grok what the issue was with the unwrap, though...?
Idiomatic Rust code does not use that. The fact that it's allowed in a codebase says more about the engineering practices of that particular project/module/whatever. Whoever put the `unwrap` call there had to contend with the notion that it could panic and they still chose to do it.
It's a programmer error, but Rust at least forces you to recognize "okay, I'm going to be an idiot here". There is real value in that.
could have been tight deadline, managerial pressure or just the occasional slip up.
Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).
A CrowdStrike-style config update that affects all nodes but obviously isn't tested in any QA or staged-rollout strategy beforehand (the application panicking straight away with this new file basically proves this).
Finally an error with bot management config files should probably disable bot management vs crash the core proxy.
I'm interested here why they even decided to name Clickhouse as this error could have been caused by any other database. I can see though the replicas updating causing flip / flopping of results would have been really frustrating for incident responders.
The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.
(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)
What would you propose to fix it? The fixed cost of being DDoS-proof is in the hundreds of millions of dollars.
I can imagine that this could easily lead to less visibility into issues.
I haven’t worked in Rust codebases, but I have never worked in a Go codebase where a `panic` in such a location would make it through code review.
Is this normal in Rust?
I get it, don’t pick languages just because they are trendy, but if any company’s use case is a perfect fit for Rust it’s cloudflare.
Of course, some users were still blocked, because the Turnstile JS failed to load in their browser but the subsequent siteverify check succeeded on the backend. But overall the fail-open implementation lessened impact to our customers nonetheless.
Fail-open with Turnstile works for us because we have other bot mitigations that are sufficient to fall back on in the event of a Cloudflare outage.
For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.
This simply means the exception-handling quality of your new FL2 is non-existent, and is not on par with, or logically equivalent to, FL.
I hope it was not because of AI driven efficiency gains.
There are plenty of resources, yet it's somehow never enough. You do tons of pretty amazing things with pretty amazing tools that also have notable shortcomings.
You're surrounded by smart people who do lots of great work, but you also end up in incident reviews where you find facepalm-y stuff. Sometimes you even find out it was a known corner case that was deemed too unlikely to prioritize.
The last incident for my team that I remember dealing with there ended up with my coworker and I realizing the staging environment we'd taken down hours earlier was actually the source of data for a production dashboard, so we'd lost some visibility and monitoring for a bit.
I've also worked at Facebook (pre-Meta days) and at Datadog, and I'd say it was about the same. Most things are done quite well, but so much stuff is happening that you still end up with occasional incidents that feel like they shouldn't have happened.
They are escape hatches. Without those your language would never take off.
But here's the thing. Escape hatches are like emergency exits. They are not to be used by your team to go to lunch in a nearby restaurant.
---
Cloudflare should likely invest in better linting and CI/CD alerts. Not to mention isolated testing i.e. deploy this change only to a small subset and monitor, and only then do a wider deployment.
Hindsight is 20/20 and we can all be smartasses after the fact of course. But I am really surprised because lately I am only using Rust for hobby projects and even I know I should not use `unwrap` and `expect` beyond the first iteration phases.
---
I have advocated for this before but IMO Rust at this point will benefit greatly from disallowing those unsafe APIs by default in release mode. Though I understand why they don't want to do it -- likely millions of CI/CD pipelines will break overnight. But in the interim, maybe a rustc flag we can put in our `Cargo.toml` that enables such a stricter mode? Or have that flag just remove all the panicky API _at compile time_ though I believe this might be a Gargantuan effort and is likely never happening (sadly).
In any case, I would expect many other failures from Cloudflare but not _this_ one in particular.
Bubbling up the error or None does not make the program correct. Panicking may be the only reasonable thing to do.
If panicking is guaranteed because of some input mistake to the system, your failure is in testing.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures
> As of 17:06 all systems at Cloudflare were functioning as normal
6 hours / 5 years gives ~99.98% uptime.

And here is the query they used** (OK, so it's not exactly):

SELECT * FROM feature JOIN permissions ON feature.feature_type_id = permissions.feature_type_id

someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using, much less the queries (but I do have an idea).

more edits: OK, apparently it's described later in the post as a query against ClickHouse's table metadata table, and because users were granted access to an additional database that was actually the backing store of the one they normally worked with, some row-level-security type of thing doubled up the rows. Not sure why querying system.columns is part of a production-level query, though; seems overly dynamic.
I don’t think the infrastructure has been as fully recovered as they think yet…
> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail
A configuration error can cause internet-scale outages. What an era we live in
Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?
I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!
I think it's quite rare for any company to have the same scale and size of storage in stage as in prod.
I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.
Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.
In theory there's a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected to not cause functional changes in behaviour.
Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain and error prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
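A minimal sketch of just the comparison step; representing a result set as a vector of string rows is an assumption for illustration:

```rust
/// Normalise a query result so that row order doesn't matter.
fn normalise(mut rows: Vec<Vec<String>>) -> Vec<Vec<String>> {
    rows.sort();
    rows
}

/// Golden-output check: equal up to row order. A permission change that
/// starts duplicating rows (as in this incident) fails the comparison
/// even though each individual row still looks valid.
fn matches_golden(actual: Vec<Vec<String>>, golden: Vec<Vec<String>>) -> bool {
    normalise(actual) == normalise(golden)
}
```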
I'd agree that the use of `unwrap` could possibly make sense in a place where you do want the system to fail hard. There's lot of good reasons to make the system fail hard. I'd lean towards an `expect` here, but whatever.
That said, the function already returns a `Result` and we don't know what the calling code looks like. Maybe it does do an `unwrap` there too, or maybe there is a safe way for this to log and continue that we're not aware of because we don't have enough info.
Should a system as critical as the CF proxy fail hard? I don't know. I'd say yes if it was the kind of situation that could revert itself (like an incremental rollout), but this is such an interesting situation since it's a config being rolled out. Hindsight is 20:20 obviously, but it feels like there should've been better logging, deployment, rollback, and parsing/validation capabilities, no matter what the `unwrap`/`Result` option is.
Also, it seems like the initial ClickHouse changes could've been tested much better, but I'm sure the CF team realizes that.
On the bright side, this is a very solid write up so quickly after the outage. Much better than those times we get it two weeks later.
Companies seem to place a lot of trust in configs being pushed automatically without human review into running systems. Considering how important these configs are, shouldn't they perhaps first be deployed to a staging/isolated network for a monitoring window before being pushed to production systems?
Not trying to pontificate here, these systems are more complicated than anything I have maintained. Just trying to think of best practices perhaps everyone can adopt.
I'm a little confused about how this was initially mistaken for an attack, though.
Is there no internal visibility into where 5xxs are being thrown? I'm surprised there isn't some kind of "this request terminated at the <bot checking logic>" error mapping that could have initially pointed you guys towards that rather than towards an attack.
Also a bit taken aback that `.unwrap()`s are ever allowed in such an important context.
would appreciate some insight!
2. Attacks that make it through the usual defences make servers run at rates beyond their breaking point, causing all kinds of novel and unexpected errors.
Additionally, attackers try to hit endpoints/features that amplify the severity of their attack by being computationally expensive, holding a lock, or triggering an error path that restarts a service — like this one.
1) Lack of validation of the configuration file.
Rolling out a config file across the global network every 5 minutes is extremely high risk. Even without hindsight, surely one would see the need for very careful validation of this file before taking on that risk?
There were several things "obviously" wrong with the file that validation should have caught:
- It was much bigger than expected.
- It had duplicate entries.
- Most importantly, when loaded into the FL2 proxy, the proxy would panic on every request. At the very least, part of the validation should involve loading the file into the proxy and serving a request?
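The first two checks can be sketched as a pre-deploy validation pass. This assumes the feature file is one entry per line; the limits and file format are assumptions for illustration, not Cloudflare's actual ones:

```rust
use std::collections::HashSet;

const MAX_FILE_BYTES: usize = 4_096; // stand-in for "much bigger than expected"
const MAX_ENTRIES: usize = 200;

fn validate_feature_file(contents: &str) -> Result<(), String> {
    // Reject a file far larger than expected before parsing anything.
    if contents.len() > MAX_FILE_BYTES {
        return Err(format!(
            "file is {} bytes, limit is {MAX_FILE_BYTES}",
            contents.len()
        ));
    }
    // Reject duplicate entries and enforce the entry-count limit.
    let mut seen = HashSet::new();
    for line in contents.lines() {
        if !seen.insert(line) {
            return Err(format!("duplicate entry: {line}"));
        }
    }
    if seen.len() > MAX_ENTRIES {
        return Err(format!("{} entries exceeds limit {MAX_ENTRIES}", seen.len()));
    }
    Ok(())
}

fn main() {
    assert!(validate_feature_file("feature_a\nfeature_b").is_ok());
    assert!(validate_feature_file("feature_a\nfeature_a").is_err());
    println!("validation checks passed");
}
```

The third point (actually loading the file into a proxy instance and serving a canary request) is the stronger test, since it exercises the real consumer rather than a parallel reimplementation of its limits.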
2) Very long time to identify and then fix such a critical issue.
I can't understand the complete lack of monitoring or reporting. A panic in Rust code, especially from an unwrap, is the application screaming that there's a logic error! I don't understand how that can be conflated with a DDoS attack. How were your logs not filled with backtraces pointing to the exact `unwrap` in question?
Then, once identified, why was it so hard to revert to a known good version of the configuration file? How did no one foresee the need to roll back this file when designing a feature that deploys a new one globally every 5 minutes?
So they basically hardcoded something, didn't bother to cover the overflow case with unit tests, didn't have basic error handling that would fall back and send logs/alerts to their internal monitoring system, and this is why half of the internet went down?
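The "fall back" part could look something like the following sketch: a bad config is rejected and the last known good one stays live, instead of every request panicking. All names and the toy limit are illustrative, not Cloudflare's code:

```rust
struct Proxy {
    features: Vec<String>,
}

fn parse_config(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw.lines().map(String::from).collect();
    if features.len() > 3 {
        // toy limit standing in for the preallocated feature budget
        return Err(format!("{} features exceeds limit of 3", features.len()));
    }
    Ok(features)
}

impl Proxy {
    /// Apply a new config only if it parses; otherwise log and keep
    /// serving with the last known good one (and alert, in real life).
    fn apply_config(&mut self, raw: &str) {
        match parse_config(raw) {
            Ok(features) => self.features = features,
            Err(e) => eprintln!("bad config rejected, keeping previous: {e}"),
        }
    }
}

fn main() {
    let mut proxy = Proxy { features: vec!["a".to_string()] };
    proxy.apply_config("a\nb");          // good config: applied
    proxy.apply_config("a\nb\nc\nd\ne"); // bad config: rejected, old kept
    assert_eq!(proxy.features, vec!["a".to_string(), "b".to_string()]);
    println!("proxy kept last known good config");
}
```

The trade-off, as other comments note, is that failing soft can silently mask a config pipeline that is broken, so the `eprintln!` would need to be a paging alert rather than a log line.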
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
but this is not mentioned at all in the timeline above. My best guess would be that the process got stuck in a tight restart loop and filled the available disk space with logs, but I'm happy to hear other guesses from people more familiar with Rust.
> As well as returning HTTP 5xx errors, we observed significant increases in latency of responses from our CDN during the impact period. This was due to large amounts of CPU being consumed by our debugging and observability systems, which automatically enhance uncaught errors with additional debugging information.
(Just above https://blog.cloudflare.com/18-november-2025-outage/#how-clo...)
The report actually seems to confirm this - it was indeed a crash on ingesting the bad config. However, I'm actually surprised that the long duration didn't come from "it takes a long time to restart the fleet manually" or "tooling to restart the fleet was bad".
The problem mostly seems to have been "we didn't know what was going on". A look into the proxy logs would hopefully have shown the stacktrace/unwrap, and metrics about the incoming requests would hopefully have shown that there was no abnormal amount of traffic coming in.
They just sell proxies, to whoever.
Why are they the only company doing ddos protection?
I just don't get it.
I wrote a book on feature stores for O'Reilly. The bad query they wrote in ClickHouse could also have been triggered by a different error: duplicate rows in the materialized feature data. For example, Hopsworks prevents duplicate rows by building on primary key uniqueness enforcement in Apache Hudi. In contrast, Delta Lake and Iceberg do not enforce primary key constraints, and neither does ClickHouse. So they could hit the same bug again due to a bug in feature ingestion, and given that they hacked together their feature store, it is not beyond the bounds of possibility.
Reference: https://www.oreilly.com/library/view/building-machine-learni...
What were the teams doing between 11:00 and 13:00? There is no explanation of what investigations were going on or why the root cause couldn't be identified sooner.
What I'm trying to say is that things would be much better if everyone took a chill pill and accepted the possibility that in rare instances, the internet doesn't work and that's fine. You don't need to keep scrolling TikTok 24/7.
> but my use case is especially important
Take a chill pill. Probably it isn't.
How about:
1. The permissions change project is paused or rolled back until
2. All impacted database interactions (SQL queries) are evaluated for improper assumptions, or better,
3. Their design that depends on database meta info and schema is replaced with one that uses specific tables and rows instead of treating the meta info as part of the application.
4. All hard-coded limits are centralized in a single global module, referenced from their users, and back-propagated to any separate generator processes that validate against the limit before pushing generated changes.
Cloudflare's incident report is written clearly and explicitly, so based on my own understanding, I’m going to try reproducing this outage. Already completed:
- CK cluster
- Permission change triggering data doubling
- Cache propagation
- Unaffected proxy services
- Proxy services with bot score errors
TODO:
- `unwrap` panic during pre-allocation of cache
- Full demonstration of the entire outage process
This is all gone. The internet is a centralised system in the hands of just a few companies. If AWS goes down, half the internet does. If Azure, Google Cloud, Oracle Cloud, Tencent Cloud or Alibaba Cloud goes down, a large part of the internet does.
Yesterday with Cloudflare down half the sites I tried gave me nothing but errors.
The internet is dead.
Shouldn't the architecture be set up in such a way that subcomponents can fail without impacting the critical function of the component?
However, I have a question from a release/deployment process perspective. Why was this issue not detected during internal testing? I didn't find the RCA covering this aspect. Doesn't Cloudflare have an internal test stage as part of its CI/CD pipeline? Looking at the description of the issue, it should have been immediately detected in an internal staging environment.
There never was an unbounded "select all rows from some table" without a "fetch first N rows only" or "limit N"?
If you knew that this design is rigid, why not leverage the query to actually enforce it?
What am I missing?
Anyway, regardless of which language you use to construct a SQL query, you're not obligated to put in a max row count.
All that said, to have an outage reported turned around practically the same day, that is this detailed, is quite impressive. Here's to hoping they make their changes from this learning, and we don't see this exact failure mode again.
I think this is happening way too frequently.
Meanwhile, VPSes and dedicated servers hum along without any issues.
I don't want to use Kubernetes, but if we have to build mission-critical systems, it doesn't seem like building on Cloudflare is going to cut it.
Gonna use that one at $WORK.
What could have prevented this failure?
Cloudflare's software could have included a check that refused to generate the feature file if its size exceeded the limit.
A testcase could have caught this.
It sounds like the change could've been rolled out more slowly, halted when the incident started and perhaps rolled back just in case.
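The first suggestion can be sketched as a guard in the file generator itself: refuse to publish a feature file that would blow past the consumers' limit, rather than shipping it and letting every proxy panic. The limit and the line-per-feature format are assumptions for illustration:

```rust
const MAX_OUTPUT_BYTES: usize = 1_024; // illustrative publish limit

/// Build the feature file, refusing to publish one that exceeds the
/// limit the consuming proxies were built around.
fn generate_feature_file(features: &[&str]) -> Result<String, String> {
    let out = features.join("\n");
    if out.len() > MAX_OUTPUT_BYTES {
        return Err(format!(
            "refusing to publish: {} bytes exceeds limit {MAX_OUTPUT_BYTES}",
            out.len()
        ));
    }
    Ok(out)
}

fn main() {
    assert!(generate_feature_file(&["feature_a", "feature_b"]).is_ok());

    // A doubled feature set, as in the outage, trips the guard here
    // instead of crashing the fleet downstream.
    let doubled: Vec<&str> = std::iter::repeat("feature_with_a_long_name")
        .take(100)
        .collect();
    assert!(generate_feature_file(&doubled).is_err());
    println!("generator refuses oversized files");
}
```

Checking at generation time and again at load time (belt and suspenders) means the two limits must be kept in sync, which is exactly the "centralize hard-coded limits" point another comment raises.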
If that's true, is there a way to tell (easily) whether a site is using cloudflare or not?
Just ping the host and see if the IP belongs to CF.
I'm impressed they were able to corral people this quickly.
The last thing we need here is for more of the internet to sign up for Cloudflare.
Reputationally this is extremely embarrassing for Cloudflare, but IMO they seem to be getting back on their feet. I was surprised to see not just one but two apologies to the internet. This just underscores how professional and dedicated the Cloudflare team is to ensuring a stable, resilient internet, and how embarrassed they must have been.
A reputational hit for sure, but the outcome is lessons learned and, hopefully, stronger resilience.
> Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare.
also cloudflare:
> The Cloudflare Dashboard was also impacted due to both Workers KV being used internally and Cloudflare Turnstile being deployed as part of our login flow.
Sounds like the ops team had one hell of a day.
Why have we built / permitted the building of / Subscribed to such a Failure-intolerant "Network"?
Even worse - the small botnet that controls everything.
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
Go to jeffblearning on LinkedIn. I took it down with 253 copies of a text file delivered through a vulnerability in Novo’s systems.
I’ve documented all of it.
It’s not done yet…
Even a simple key-value map per feature should have allowed for insertions as simple as a put/replace of the value and not appending to the file. That was not the case here, where Cloudflare kept appending to the file for any feature to be added. And I am assuming the features are bot attack patterns as features. Anyway, there is something fundamental here that Cloudflare should rethink. If someone can educate me on the design, I can continue reading the next few lines.
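The append-only failure mode described here goes away if the file is keyed per feature: a second write for the same feature replaces the value instead of adding a duplicate row. A minimal sketch with a made-up feature name:

```rust
use std::collections::HashMap;

fn main() {
    let mut features: HashMap<&str, f64> = HashMap::new();

    features.insert("bot_score_weight", 0.5);
    features.insert("bot_score_weight", 0.7); // replaces, does not append

    // Duplicate source rows can no longer grow the feature set.
    assert_eq!(features.len(), 1);
    assert_eq!(features["bot_score_weight"], 0.7);
    println!("duplicate insert replaced the value; no growth");
}
```

With a keyed structure, the duplicated ClickHouse rows would have collapsed back to one entry per feature instead of doubling the file past its limit.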
Didn't the services that were crashing due to OOM raise any alerts?
This is shitty at so many levels.
Big tech is a fucking joke.
...
(I'd pick Haskell, cause I'm having fun with it recently :P)
Excuse me, what did you just say? Who decided on “Cloudflare's importance in the Internet ecosystem”? Some see it differently, you know; there's no need for that self-assured arrogance of an inseminating alpha male.
No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.
Free fire
My dude, everything is a footgun if you hold it wrong enough
These folks weren't operating for charity. They were highly paid so-called professionals.
Who will be held accountable for this?
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
This is the first significant outage that has involved Rust code, and as you can see, `.unwrap()` is known to carry the risk of a panic and should never be used in production code.
> The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:
SELECT
name,
type
FROM system.columns
WHERE
table = 'http_requests_features'
order by name;
> Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.

- Their database permissions changed unexpectedly (??)
- This caused a 'feature file' to be changed in an unusual way (?!)
- Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
- Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
- They hit an internal application memory limit and that just... crashed the app
- The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
- After fixing it, they were vulnerable to a thundering herd problem
- Customers who were not using bot rules were not affected; Cloudflare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots
In terms of preventing this from a software engineering perspective: they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits, with neither a test for whether an input would hit a limit nor some kind of alarm to notify the engineers of the source of the problem. From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
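The circuit-breaker idea can be sketched as a simple gate on a staged rollout: propagate a change further only while the canary hosts' error rate stays below a threshold. The threshold and signal here are illustrative, not anyone's real deployment system:

```rust
const MAX_ERROR_RATE: f64 = 0.01; // illustrative 1% canary threshold

/// Decide whether a staged rollout may widen, based on canary health.
fn should_continue_rollout(canary_errors: u64, canary_requests: u64) -> bool {
    if canary_requests == 0 {
        return false; // no signal yet: don't widen the blast radius
    }
    (canary_errors as f64 / canary_requests as f64) < MAX_ERROR_RATE
}

fn main() {
    // 0.3% errors on the canaries: keep rolling out.
    assert!(should_continue_rollout(3, 1_000));
    // 50% errors, as every request panics: halt and roll back.
    assert!(!should_continue_rollout(500, 1_000));
    println!("circuit breaker checks passed");
}
```

A real implementation would also compare against the pre-rollout baseline and trigger an automatic revert, but even this crude gate would have stopped a config that made every request 5xx from reaching the whole fleet.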