"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."
This is where they admit that:
1. They deployed changes to their software directly to customer production machines;
2. They didn’t allow their clients any opportunity to test those changes before they took effect; and
3. This was cosmically stupid and they’re going to stop doing that.
Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.
> Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
They weren't doing any test deployments at all before blasting the world with an update? Reckless.
They have a staging environment at least, but there's no indication of what they were running in it or what testing was done there.
That said, maybe CrowdStrike should consider validating every step of the delivery pipeline before pushing to customers.
If they'd just had a lab of a couple dozen PCs acting as canaries they'd have caught this. Apparently that was too complicated or expensive for them.
I have a similar feeling.
At the very least perhaps have an "A" and a "B" update channel, where "B" is x hours behind A. This way if, in an HA configuration, one side goes down there's time to deal with it while your B-side is still up.
Being chronically exposed may be the right call, in the same way that Roman cities didn't have walls.
Compare this perspective from Matt Levine:
> So for instance if you run a ransomware business and shut down, like, a marketing agency or a dating app or a cryptocurrency exchange until it pays you a ransom in Bitcoin, that’s great, that’s good money. A crime, sure, but good money. But if you shut down the biggest oil pipeline in the U.S. for days, that’s dangerous, that’s a U.S. national security issue, that gets you too much attention and runs the risk of blowing up your whole business. So:
>> In its own statement, the DarkSide group hinted that an affiliate may have been behind the attack and that it never intended to cause such upheaval.
>> In a message posted on the dark web, where DarkSide maintains a site, the group suggested one of its customers was behind the attack and promised to do a better job vetting them going forward.
>> “We are apolitical. We do not participate in geopolitics,” the message says. “Our goal is to make money and not creating problems for society. From today, we introduce moderation and check each company that our partners want to encrypt to avoid social consequences in the future.”
> If you want to use their ransomware software to do crimes, apparently you have to submit a resume demonstrating that you are good at committing crimes. (“Hopeful affiliates are subject to DarkSide’s rigorous vetting process, which examines the candidate’s ‘work history,’ areas of expertise, and past profits among other things.”) But not too good! The goal is to bring a midsize company to its knees and extract a large ransom, not to bring society to its knees and extract terrible vengeance.
> We have talked about this before, and one category of crime that a ransomware compliance officer might reject is “hacks that are so big and disastrous that they could call down the wrath of the US government and shut down the whole business.” But another category of off-limits crime appears to be “hacks that are so morally reprehensible that they will lead to other criminals boycotting your business.”
>> A global ransomware operator issued an apology and offered to unlock the data targeted in a ransomware attack on Toronto’s Hospital for Sick Children, a move cybersecurity experts say is rare, if not unprecedented, for the infamous group.
>> LockBit’s apology, meanwhile, appears to be a way of managing its image, said [cybersecurity researcher Chester] Wisniewski.
>> He suggested the move could be directed at those partners who might see the attack on a children’s hospital as a step too far.
> If you are one of the providers, you have to choose your hacker partners carefully so that they do the right amount of crime: You don’t want incompetent or unambitious hackers who can’t make any money, but you also don’t want overly ambitious hackers who hack, you know, the US Department of Defense, or the Hospital for Sick Children. Meanwhile you also have to market yourself to hacker partners so that they choose your services, which again requires that you have a reputation for being good and bold at crime, but not too bold. Your hacker partners want to do crime, but they have their limits, and if you get a reputation for murdering sick children that will cost you some criminal business.
Absolutely this is what will happen.
I don't know much about how AV-definition-style features are handled across the cybersecurity industry, but I'd imagine no vendor does rolling updates today because rolling updates involve opt-in/opt-out, which could slow the vendor's response to an attack, which in turn affects their "reputation" as well.
"I bought Vendor-A's solution but I got hacked and had to pay ransomware" (with a side note: because I did not consume the latest critical update of the AV definitions) is what vendors worry about.
Now that this Global Outage happened, it will change the landscape a bit.
I seriously doubt that. Questions like "why should we use CrowdStrike" will be met with "suppose they've learned their lesson".
Is it really all that surprising? This is basically their business model: it's a fancy virus scanner that is supposed to instantly respond to threats.
I’d argue that anyone who agrees to this is the idiot. Sure, they have blame for being the source of the problem, but any CXO who signed off on software that a third party can update whenever they’d like is also at fault. It’s not an “if” situation, it’s a “when”.
This is part of the premise of EDR software.
If indeed this happens, I'd hail this event as a victory overall; but industry experience tells me that most of those companies will say "it'd never happen with us, we're a lot more careful", and keep doing what they're doing.
But canary / smoke tests, you can do, if the vendor provides the right tools.
It's a cycle: pick the latest release, do some small-cluster testing, including rollback testing, then roll out to 1%. If those machines are (mostly) still available in 5 minutes, roll out to another 2%; if the cumulative 3% is (mostly) still available in 5 minutes, roll out to another 4%; etc. If updates are fast and everything works, it goes quick. If there's a big problem, you'll still have a lot of working nodes. If there's a small problem, you have a small problem.
It's gotta be automated though, but with an easy way for a person to pause if something is going wrong that the automation doesn't catch. If the pace is several updates a day, that's too much for people, IMHO.
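Automated, a minimal version of that doubling rollout loop might look like the sketch below. All names and thresholds are illustrative (not any vendor's actual tooling), and a real system would gather agent health asynchronously rather than through a synchronous callback:

```python
import time

def staged_rollout(hosts, deploy, healthy_fraction, start=0.01,
                   wait_secs=300, min_healthy=0.95):
    """Exponential staged rollout: push to a small cohort, wait,
    check that the already-updated fleet is still healthy, then
    double the cohort size. Halts loudly on a health regression."""
    total = len(hosts)
    done = 0
    batch = max(1, int(total * start))  # e.g. start with 1% of the fleet
    while done < total:
        wave = hosts[done:done + batch]
        for h in wave:
            deploy(h)
        done += len(wave)
        time.sleep(wait_secs)  # give updated agents time to check back in
        if healthy_fraction(hosts[:done]) < min_healthy:
            raise RuntimeError(f"rollout halted after {done}/{total} hosts")
        batch *= 2  # 1% -> 2% -> 4% -> ...
    return done
```

The `raise` is the "easy way for a person to pause" hook: in practice it would page an operator and freeze the pipeline rather than just throw.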
The two golden rules are to let host owners control when to update whenever possible, and when it isn’t to deploy very very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism. So your change gets all the same deployment safety guardrails and automated tests and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop) rollback or at least pause the deployment.
In the past when I have designed update mechanisms I’ve included basic failsafes such as automated checking for a % failed updates over a sliding 24-hour window and stopping any more if there’s too many failures.
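A failsafe of that shape can be sketched as a small circuit breaker. The class name, thresholds, and window size here are hypothetical, not taken from any real product:

```python
import time
from collections import deque

class UpdateCircuitBreaker:
    """Track update results over a sliding time window and halt the
    rollout if the failure rate climbs too high."""
    def __init__(self, window_secs=24 * 3600, max_failure_rate=0.02,
                 min_samples=50, clock=time.time):
        self.window_secs = window_secs
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples   # don't judge on tiny samples
        self.clock = clock
        self.events = deque()            # (timestamp, ok) pairs

    def record(self, ok):
        now = self.clock()
        self.events.append((now, ok))
        # Evict results that have aged out of the window.
        while self.events and self.events[0][0] < now - self.window_secs:
            self.events.popleft()

    def should_halt(self):
        if len(self.events) < self.min_samples:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) > self.max_failure_rate
```

Every deployment worker calls `record()` after each host update and checks `should_halt()` before starting the next one.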
The only acceptable update strategy for all software regardless of size or importance
> Enhance existing error handling in the Content Interpreter.
That's it.
Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.
> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
Could it say any less? I hope the new check is a test fleet.
But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
That's what stood out to me. From the CS post: "Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published."
Lesson learned, a "Validator" that is not actually the same program that will be parsing/reading the file in production, is not a complete test. It's not entirely useless, but it doesn't guarantee anything. The production program could have a latent bug that a completely "valid" (by specification) file might trigger.
https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
That stood out to me as well.
Their response was the moral equivalent of Apple saying “iTunes crashes when you play a malformed mp3, so here’s how we’re going to improve how we test our mp3s before sending them to you”.
This is a security product that is expected to handle malicious inputs. If they can’t even handle their own inputs without crashing, I don’t like the odds of this thing being itself a potential attack vector.
And great point that it's not just about crashing on these updates, even if they are properly signed and secure. What does this say about other parts of the client code? And if they're not signed, which seems unclear right now, then could anyone who gains access to a machine running the client get it to start boot looping again by copying Channel File 291 into place? What else could they do?
Echoes of the Sony BMG rootkit.
https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...
The bug itself is not particularly interesting, nor is the fix for it.
The astounding thing about this issue is the scale of the damage it caused, and that scale is all due to the rollout process.
I would not want to do any of this directly on metal, where the only safety is what you make for yourself. But that's the line Crowdstrike are in.
* By EDR standards, at least, where "only" one reboot a week forced entirely by memory lost to an unkillable process counts as exceptionally good.
* Have a timeout
* Decrement the counter after a successful load and parse
* Check the counter on startup; if it's at, like, 3, maybe consider that you are crashing
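That crash-loop failsafe can be sketched in a few lines. Everything here (the state-file path, the threshold, the reset-to-zero instead of a literal decrement) is an illustrative assumption, not how any real driver does it:

```python
import json
import os

def safe_load(path, parse, state="boot_counter.json", max_crashes=3):
    """Bump a counter before parsing new content; clear it on success.
    If the counter is already high at startup, we apparently keep dying
    mid-parse, so skip the new content and stay on the known-good config."""
    count = 0
    if os.path.exists(state):
        with open(state) as f:
            count = json.load(f)["count"]
    if count >= max_crashes:
        return None  # crash loop detected: refuse this content
    with open(state, "w") as f:
        json.dump({"count": count + 1}, f)  # bump BEFORE the risky parse
    result = parse(path)  # if this kills the process, the bump survives
    with open(state, "w") as f:
        json.dump({"count": 0}, f)  # loaded and parsed fine: reset
    return result
```

The key property is that the counter is persisted before the dangerous step, so a hard crash (or the "have a timeout" case) still leaves evidence for the next boot.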
It compiled, so they shipped it to everyone all at once without ever running it themselves.
They fell short of "works on my machine".
> Software Resiliency and Testing
> * Improve Rapid Response Content testing by using testing types such as:
> * Local developer testing
So no one actually tested the changes before deploying?!
This also applies to all the architects and CTOs at these Fortune 500 companies who allowed these self-updating systems into their critical systems.
I would offer a copy of Antifragile to each of these teams: https://en.wikipedia.org/wiki/Antifragile_(book)
"Every captain goes down with every ship"
Claw back executive pay, stock, and bonuses imo and you'll see funded QA and CI teams.
That is just part of the basic process and is hardly the thing that ensures a problem like this doesn't happen.
I know insider threats are very hard to protect against in general but these companies must be the most juicy target for state actors. Imagine what you could do with kernel space code in emergency services, transport infrastructure and banks.
A good QA team could turn around a rapid response update with more than enough testing to catch screwups like this and even some rather more subtle ones in an hour or two.
What if they implemented a release process, and followed it? Like everyone else does. Hackers at the workplace, sigh.
How can they keep on such a Q&R manager? He'll cost them billions.
In this post-mortem there are a lot of words, but not one of them actually explains what the problem was. Which is: what was the process in place, and why did it fail?
They also say a "bug in the content validation". Like what kind of bug? Could it have been prevented with proper testing or code review?
Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.
> what kind of bug? Could it have been prevented with proper testing or code review?
It doesn't matter what the exact details of the bug are. A validator and the thing it tries to defend being imperfect mates is a failure mode. They happened to trip that failure mode spectacularly.
Also saying "proper testing and code review" in a post-mortem is useless like 95% of the time. Short of a culture of rubber-stamping and yolo-merging where there is something to do, it's a truism that any bug could have been detected with a test or caught by a diligent reviewer in code review. But they could also have been (and were) missed. "git gud" is not an incident prevention strategy, it's wishful thinking or blaming the devs unlucky enough to break it.
More useful as follow-ups are things like "this type of failure mode feels very dangerous, we can do something to make those failures impossible or much more likely to be caught"
You can't reliably fix problems you don't understand.
It appears the process was:
1. Channel files are considered trusted; so no need to sanity-check inputs in the sensor, and no need to fuzz the sensor itself to make sure it deals gracefully with corrupted channel files.
2. Channel files are trusted if they pass a Content Validator. No additional testing is needed; in particular, the channel files don't even need to be smoke-tested on a real system.
3. A Content Validator is considered 100% effective if it has been run on three previous batches of channel files without incident.
Now it's possible that there were prescribed steps in the process which were not followed; but those too are to be expected if there is no automation in place. A proper process requires some sort of explicit override to skip parts of it.
So they did not test this update at all, even locally. It's going to be interesting how this plays out in the courts. The contract they have with us limits their liability significantly, but this, surely, is gross negligence.
And then there will be the costs of litigation. It was crazy in the IT department over the weekend, but not much less crazy in our legal teams, who were being bombarded with pitches from law firms offering help in recovery. It will be a fun space to watch, and this 'we haven't tested because we, like, did that before and nothing bad happened' statement in the initial report will be quoted in many lawsuits.
E.g. "sensors"? I mean, how about hosts, machines, clients?
I try to use social work terms and principles in professional settings, which blows these people’s minds.
Advocacy, capacity evaluation, community engagement, cultural competencies, duty of care, ethics, evidence-based intervention, incentives, macro-, mezzo- and micro-practice, minimisation of harm, respect, self concept, self control etc etc
It means that my teams aren’t focussed on “nuking the bad guys from orbit” or whatever, but building defence in depth and indeed our own communities of practice (hah!), and using psychological and social lenses as well as tech and adversarial ones to predict, prevent and address disruptive and dangerous actors.
YMMV though.
always makes me laugh
The only relevant part you need to see:
>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Problematic content? Yeah, this is telling exactly nothing.
Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation on how that would prevent this from happening again.
Conspicuously absent:
— fixing whatever produced "problematic content"
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes
— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test
— allowing the sysadmins to roll back updates before the OS boots
— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients
This is a nothing sandwich, not an incident review.
It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.
I was not talking about the code that crashed.
I guess what I wrote was non-obvious enough that it needs an explanation:
— fixing whatever produced "problematic content":
The release doesn't talk about the subsystem that produced the "problematic content". The part that crashed was the interpreter (consumer of the content); the part that generated the "problematic content" might have worked as intended, for all we know.
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes:
I am not talking about fixing this particular crash.
I am talking about design choices that allow such crashes in principle.
In this instance, the interpreter seemed to have been reading memory addresses from a configuration file (or something that would be equivalent to doing that). Adding an additional check will fix this bug, but not the fundamental issue that an interpreter should not be doing that.
>The architectural changes are the more interesting bits, and they're covered reasonably well
They are not covered at all. Are we reading the same press release?
>Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code.
Yes, that's the problem I am pointing out: the "validator" and "interpreter" should be the same code. The "validator" can issue commands to a mock operating system instead of doing real API calls, but it should go through the input with the actual interpreter.
In other words, the interpreter should be a part of the validator.
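A toy sketch of that design: the validator runs the real interpreter against a mock OS that records calls instead of performing them. The interpreter logic and content format here are invented stand-ins for whatever the actual Content Interpreter does:

```python
class MockOS:
    """Records the API calls the interpreter would make,
    instead of performing them."""
    def __init__(self):
        self.calls = []

    def api_call(self, name, *args):
        self.calls.append((name, args))

def interpret(content, os_iface):
    """Toy interpreter: each entry names an API call and its args.
    This same function runs in production against the real OS."""
    for entry in content:
        if not isinstance(entry, dict) or "call" not in entry:
            raise ValueError(f"malformed entry: {entry!r}")
        os_iface.api_call(entry["call"], *entry.get("args", []))

def validate(content):
    """The validator IS the interpreter, pointed at a mock OS: any
    content that blows up here would also have blown up in the field."""
    try:
        interpret(content, MockOS())
        return True
    except Exception:
        return False
```

With this structure, a validator/interpreter mismatch on what counts as "well-formed" is impossible by construction, because there is only one parser.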
>It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.
Sure; that's my subjective assessment. Personally, I am very dissatisfied with their post-mortem. If you are happy with it, that's fair, but you'd need to say more if you want to make a point in addition to "the architectural changes are covered reasonably well".
Like, which specific changes those would be, for starters.
>Enhance existing error handling in the Content Interpreter.
They did write that they intended to fix the bugs in both the validator and the interpreter. Though it's a big mystery to me and most of the comments on the topic how an interpreter that crashes on a null template would ever get into production.
I strongly disagree.
Add additional validation and enhance error handling say as much as "add band-aids and improve health" in response to a broken arm.
Which is not something you'd want to hear from a kindergarten that sends your kid back to you with shattered bones.
Note that the things I said were missing are indeed missing in the "mitigation".
In particular, additional checks and "enhanced" error handling don't address:
— the fact that it's possible for content to be "problematic" for the interpreter, but not for the validator;
— the possibility for "problematic" content to crash the entire system still remaining;
— nothing being said about what made the content "problematic" (spoiler: a bunch of zeros, but they didn't say it), how that content was produced in the first place, and the possibility of it happening in the future still remaining;
— the fact that their clients aren't in control of their own systems, have no way to roll back a bad update, and can have their entire fleet disabled or compromised by CrowdStrike in an instant;
— the business practices and incentives that didn't result in all their "mitigation" steps (as well as steps addressing the above) being already implemented still driving CrowdStrike's relationship with its employees and clients.
The latter is particularly important. This is less a software issue, and more an organizational failure.
Elsewhere on HN and Reddit, people were writing that ridiculous SLAs, such as "4-hour response to a vulnerability", make it practically impossible to release well-tested code, and that reliance on a rootkit for security is little more than CYA — which means that the writing was on the wall, and this will happen again.
You can't fix bad business practices with bug fixes and improved testing. And you can't fix what you don't look into.
Hence my qualification of this "review" as a red herring.
Better not only fix this specific bug but continuously use fuzzing to find more places where external data (including updates) can trigger a crash (or worse RCE)
But it seems to me that putting the interpreter in a place in the OS where causing a system crash is behavior it's allowed to exhibit is a fundamental design choice that is not at all addressed by fuzzing.
Some updates from the hub page:
They published an "executive summary" in PDF format: https://www.crowdstrike.com/wp-content/uploads/2024/07/Crowd...
That includes a couple of bullet points under "Third Party Validation" (independent code/process reviews), which they added to the PIR on the hub page, but not on the dedicated PIR page.
> Updated 2024-07-24 2217 UTC
> ### Third Party Validation
> - Conduct multiple independent third-party security code reviews.
> - Conduct independent reviews of end-to-end quality processes from development through deployment.
> * Local developer testing
Yup... now that all machines are internet connected, telemetry has replaced QA departments. There are actual people in positions of power that think that they do not need QA and can just test on customers. If there is anything right in the world, crowdsuck will be destroyed by lawsuits and every decisionmaker involved will never work as such again.
If this is how they are going to publish what happened, I don't have any hope that they've actually learned anything from this event.
> Throughout this PIR, we have used generalized terminology to describe the Falcon platform for improved readability
Translation: we've filled this PIR with technobabble so that when you don't understand it, you won't ask questions for fear of appearing slow.
heh? it's not that long and very readable.
As I understand it, they're telling us that the outage was caused by an unspecified bug in the "Content Validator", and that the file that was shipped was done so without testing because it worked fine last time.
I think they wrote what they did because they couldn't publish the above directly without being rightly excoriated for it, and at least this way a lot of the people reading it won't understand what they're saying but it sounds very technical.
This information is not just for _you_.
To me this was a complete failure on the process and review side. If something so blatantly obvious can slip through, how could I ever trust them to prevent an insider from shipping a backdoor?
They are auto updating code with the highest privileges on millions of machines. I'd expect their processes to be much much more cautious.
An actual scenario: Some developer starts working on pre deployment validation of config files. Let's say in a pipeline.
Most of the time the config files are OK.
Management says: "Why are you spending so long on this project? The sprint plan said one week; we can't approve anything that takes more than a week."
Developer: "This is harder than it looks" (heard that before).
Management: "Well, if the config file is OK then we won't have a problem in production. Stop working on it".
Developer: Stops working on it.
Config file with a syntax error slips through, .. The rest is history
Should be the tl;dr. On Threads there's information about CrowdStrike slashing QA team numbers; whether that was a factor should be looked at.
To a degree it makes sense, because it's not unusual for a template generator to provide a null response if given invalid inputs. However, the Content Validator then took that null and published it instead of handling the null case as it should have.
“if (corrupt digital signature) return null;”
is the type of code I see buried in authentication systems, gleefully converting what should be a sudden stop into a shambling zombie of invalid state and null reference exceptions fifty pages of code later in some controller that’s already written to the database on behalf of an attacker.
If I peer into my crystal ball I see a vision of CrowdStrike error handling code quality that looks suspiciously the same.
(If I sound salty, it’s because I’ve been cleaning up their mess since last week.)
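The contrast between that anti-pattern and failing fast can be sketched in a few lines. The signature check here is a fake (a trailing marker), purely to illustrate the control flow, not any real verification scheme:

```python
class SignatureError(Exception):
    pass

def check_signature_null_style(blob):
    """The anti-pattern above: swallow the failure and hand back None,
    leaving every downstream caller to remember the null check. Forget
    one, and invalid state shambles on until it explodes elsewhere."""
    if not blob.endswith(b"SIGNED"):  # stand-in for real verification
        return None
    return blob[:-6]

def check_signature_fail_fast(blob):
    """Fail fast instead: a corrupt signature stops the pipeline right
    here, before anything acts on a half-validated artifact."""
    if not blob.endswith(b"SIGNED"):
        raise SignatureError("corrupt digital signature")
    return blob[:-6]
```

The fail-fast version turns "zombie of invalid state fifty pages later" into a stack trace at the exact point of failure.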
Wasn't 'Channel File 291' a garbage file filled with null pointers? Meaning it's problematic content in the same way as filling your parachute bag with ice cream and screws is problematic.
2) The things that did not fail went so great
3) Many many machines did not fail
4) macOS and Linux unaffected
5) Small lil bug in the content verifier
6) Please enjoy this $10 gift card
7) Every windows machine on earth bsod'd but many things worked
"On Wednesday, some of the people who posted about the gift card said that when they went to redeem the offer, they got an error message saying the voucher had been canceled. When TechCrunch checked the voucher, the Uber Eats page provided an error message that said the gift card “has been canceled by the issuing party and is no longer valid.”"
https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-ap...
That said, this incident review doesn't mention numbers, unless I missed it; how colossal of a fuck up it was.
The reality is that they don't apologize ("bad shit just happens"), they work their engineers to the grave, make no apology, and completely screw up. This reads like a minor bump in processes.
CrowdStrike engineered the biggest computer attack the world has ever seen, with the sole purpose of preventing those. They're slowly becoming the Oracle of security, and I see no sign of improvement here.
Source: work in a Windows shop and had a normal day.
* Their software reads config files to determine which behavior to monitor/block
* A "problematic" config file made it through automatic validation checks "due to a bug in the Content Validator"
* Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions
* The config file causes their software to perform an out-of-bounds memory read, which it does not handle gracefully
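A guarded read of the kind the sensor apparently lacked treats offsets from the channel file as untrusted and bounds-checks before dereferencing. This is toy Python over a byte buffer, standing in for what would be pointer arithmetic in the real driver:

```python
def read_field(buf, offset, length):
    """Read buf[offset:offset+length], but reject any field that falls
    outside the buffer instead of performing an out-of-bounds read."""
    if offset < 0 or length < 0 or offset + length > len(buf):
        raise ValueError(f"field [{offset}:{offset + length}] outside "
                         f"{len(buf)}-byte buffer")
    return buf[offset:offset + length]
```

In kernel code the equivalent check is cheap, and the failure mode becomes "reject this channel file" rather than "BSOD the host".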
That's crazy. How costly can it be to test the file fully in a CI job? I fail to see how this wasn't implemented already.
It didn't need a CI job. It just needed one person to actually boot and run a Windows instance with the Crowdstrike software installed: a smoke test.
TFA is mostly an irrelevant discourse on the product architecture, stuffed with proprietary CrowdStrike jargon, with about a couple of paragraphs dedicated to the actual problem; and they don't mention the non-existence of a smoke test.
To me, TFA is not a signal that Crowdstrike has a plan to remediate the problem, yet.
Everything else is smoke and the smell of sulfur.
The "What Happened on July 19, 2024?" section combined with the "Rapid Response Content Deployment" make it very clear to anyone reading that that is the case. Similarly, the discussion of the sensor release process in "Sensor Content" and lack of discussion of a release process in the "Rapid Response Content" section solidify the idea that they didn't consider validated rapid response content causing bad behavior as a thing to worry about.
Getting on your knees and admitting terrible fault with apologies galore isn't going to garner you any more sympathy.
While some affected users did have a file full of zeros, that was actually a result of the system in the process of trying to download an update, and not the version of the file that caused the crash.
Falcon configuration is shipped with both direct driver updates ("sensor content"), and out of band ("rapid response content"). "Sensor Content" are scripts (*) that ship with the driver. "Rapid response content" are data that can be delivered dynamically.
One way that "Rapid Response Content" is implemented is with templated "Sensor Content" scripts. CrowdStrike can keep the behavior the same but adjust the parameters by shipping "channel" files that fill in the templates.
"Sensor content", including the templates, are a part of the normal test and release process and goes through testing/verification before being signed/shipped. Customers have control over rollouts and testing.
"Rapid Response Content" is deployed through a different channel that customers do not have control over. Crowdstrike shipped a broken channel file that passed validation but was not tested.
They are going to fix this by adding testing of "rapid response" content updates and support the same rollout logic they do for the driver itself.
(*) I'm using the word "script" here loosely. I don't know what these things are, but they sound like scripts.
---
In other words, they have scripts that would crash given garbage arguments. The validator is supposed to check this before they ship, but the validator screwed it up (why is this a part of release and not done at runtime? (!)). It appears they did not test it, they do not do canary deployments or support rollout of these changes, and everything broke.
Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.
Would have happened a long time ago if it were that easy, no?
You would have to get into the supply chain to do much damage.
Otherwise, you would somehow need access to the hosts running the agent.
If you're a threat actor that already has access to hosts running CS, at a scale that would make the news, why would you blow your access on trying to ruin CS's reputation further?
Perhaps if you are a vendor of a competing or adjacent product that deploys an agent, you could deliberately try and crash the CS agent, but you would be caught.
This reads like a bunch of baloney to obscure the real problem. The only relevant part you need to see:
>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Problematic content? Yeah, this is telling exactly nothing.
Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation on how that would prevent this from happening again.
Conspicuously absent:
— fixing whatever produced "problematic content"
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes
— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test
— allowing the sysadmins to roll back updates before the OS boots
— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients
This is a nothing sandwich, not an incident review.
Please don't do this! It makes merging threads a pain because then we have to find the duplicate subthreads (i.e. your two comments) and merge the replies as well.
Instead, if you or anyone will let us know at hn@ycombinator.com which threads need merging, we can do that. The solution is deduplication, not further duplication!
Apologies for inadvertently adding work!
Somehow, I never realized that duplicate threads were merged (instead of one of them being nuked), because it seems like a lot of work in the first place.
Thanks for doing it!