"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."
This is where they admit that:
1. They deployed changes to their software directly to customer production machines;
2. They didn’t allow their clients any opportunity to test those changes before they took effect; and
3. This was cosmically stupid and they’re going to stop doing that.
Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.
> Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
They weren't doing any test deployments at all before blasting the world with an update? Reckless.
They have a staging environment at least, but there's no indication of what they were running in it or what testing was done there.
That said, maybe CrowdStrike should consider validating every step of the delivery pipeline before pushing to customers.
If they'd just had a lab of a couple dozen PCs acting as canaries they'd have caught this. Apparently that was too complicated or expensive for them.
I have a similar feeling.
At the very least perhaps have an "A" and a "B" update channel, where "B" is x hours behind A. This way if, in an HA configuration, one side goes down there's time to deal with it while your B-side is still up.
Being chronically exposed may be the right call, in the same way that Roman cities didn't have walls.
Compare this perspective from Matt Levine:
> So for instance if you run a ransomware business and shut down, like, a marketing agency or a dating app or a cryptocurrency exchange until it pays you a ransom in Bitcoin, that’s great, that’s good money. A crime, sure, but good money. But if you shut down the biggest oil pipeline in the U.S. for days, that’s dangerous, that’s a U.S. national security issue, that gets you too much attention and runs the risk of blowing up your whole business. So:
>> In its own statement, the DarkSide group hinted that an affiliate may have been behind the attack and that it never intended to cause such upheaval.
>> In a message posted on the dark web, where DarkSide maintains a site, the group suggested one of its customers was behind the attack and promised to do a better job vetting them going forward.
>> “We are apolitical. We do not participate in geopolitics,” the message says. “Our goal is to make money and not creating problems for society. From today, we introduce moderation and check each company that our partners want to encrypt to avoid social consequences in the future.”
> If you want to use their ransomware software to do crimes, apparently you have to submit a resume demonstrating that you are good at committing crimes. (“Hopeful affiliates are subject to DarkSide’s rigorous vetting process, which examines the candidate’s ‘work history,’ areas of expertise, and past profits among other things.”) But not too good! The goal is to bring a midsize company to its knees and extract a large ransom, not to bring society to its knees and extract terrible vengeance.
> We have talked about this before, and one category of crime that a ransomware compliance officer might reject is “hacks that are so big and disastrous that they could call down the wrath of the US government and shut down the whole business.” But another category of off-limits crime appears to be “hacks that are so morally reprehensible that they will lead to other criminals boycotting your business.”
>> A global ransomware operator issued an apology and offered to unlock the data targeted in a ransomware attack on Toronto’s Hospital for Sick Children, a move cybersecurity experts say is rare, if not unprecedented, for the infamous group.
>> LockBit’s apology, meanwhile, appears to be a way of managing its image, said [cybersecurity researcher Chester] Wisniewski.
>> He suggested the move could be directed at those partners who might see the attack on a children’s hospital as a step too far.
> If you are one of the providers, you have to choose your hacker partners carefully so that they do the right amount of crime: You don’t want incompetent or unambitious hackers who can’t make any money, but you also don’t want overly ambitious hackers who hack, you know, the US Department of Defense, or the Hospital for Sick Children. Meanwhile you also have to market yourself to hacker partners so that they choose your services, which again requires that you have a reputation for being good and bold at crime, but not too bold. Your hacker partners want to do crime, but they have their limits, and if you get a reputation for murdering sick children that will cost you some criminal business.
Absolutely this is what will happen.
I don't know much about how AV-definition-style features are handled across the cybersecurity industry, but I'd imagine no vendor does rolling updates today because rolling updates involve opt-in/opt-out, which could slow the vendor's response to an attack, which in turn affects their "reputation" as well.
"I bought Vendor-A's solution but I got hacked and had to pay ransomware" (with a side note: because I did not consume the latest critical update of the AV definitions) is what vendors worry about.
Now that this Global Outage happened, it will change the landscape a bit.
I seriously doubt that. Questions like "why should we use CrowdStrike" will be met with "suppose they've learned their lesson".
Is it really all that surprising? This is basically their business model: it's a fancy virus scanner that is supposed to instantly respond to threats.
I’d argue that anyone who agrees to this is the idiot. Sure, they have blame for being the source of the problem, but any CXO who signed off on software that a third party can update whenever they’d like is also at fault. It’s not an “if” situation, it’s a “when”.
This is part of the premise of EDR software.
If indeed this happens, I'd hail this event as a victory overall; but industry experience tells me that most of those companies will say "it'd never happen with us, we're a lot more careful", and keep doing what they're doing.
But canary / smoke tests, you can do, if the vendor provides the right tools.
It's a cycle: pick the latest release, do some small-cluster testing, including rollback testing, then roll out to 1%. If those machines are (mostly) still available in 5 minutes, roll out to another 2%; if the cumulative 3% is (mostly) still available in 5 minutes, roll out to another 4%; etc. If updates are fast and everything works, it goes quick. If there's a big problem, you'll still have a lot of working nodes. If there's a small problem, you have a small problem.
It's gotta be automated though, but with an easy way for a person to pause if something is going wrong that the automation doesn't catch. If the pace is several updates a day, that's too much for people, IMHO.
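Automated, a minimal version of that doubling rollout loop might look like the sketch below. All names and thresholds are illustrative (not any vendor's actual tooling), and a real system would gather agent health asynchronously rather than through a synchronous callback:

```python
import time

def staged_rollout(hosts, deploy, healthy_fraction, start=0.01,
                   wait_secs=300, min_healthy=0.95):
    """Exponential staged rollout: push to a small cohort, wait,
    check that the already-updated fleet is still healthy, then
    double the cohort size. Halts loudly on a health regression."""
    total = len(hosts)
    done = 0
    batch = max(1, int(total * start))  # e.g. start with 1% of the fleet
    while done < total:
        wave = hosts[done:done + batch]
        for h in wave:
            deploy(h)
        done += len(wave)
        time.sleep(wait_secs)  # give updated agents time to check back in
        if healthy_fraction(hosts[:done]) < min_healthy:
            raise RuntimeError(f"rollout halted after {done}/{total} hosts")
        batch *= 2  # 1% -> 2% -> 4% -> ...
    return done
```

The `raise` is the "easy way for a person to pause" hook: in practice it would page an operator and freeze the pipeline rather than just throw.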
The two golden rules are to let host owners control when to update whenever possible, and when it isn’t to deploy very very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism. So your change gets all the same deployment safety guardrails and automated tests and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop) rollback or at least pause the deployment.
In the past when I have designed update mechanisms I’ve included basic failsafes such as automated checking for a % failed updates over a sliding 24-hour window and stopping any more if there’s too many failures.
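A failsafe of that shape can be sketched as a small circuit breaker. The class name, thresholds, and window size here are hypothetical, not taken from any real product:

```python
import time
from collections import deque

class UpdateCircuitBreaker:
    """Track update results over a sliding time window and halt the
    rollout if the failure rate climbs too high."""
    def __init__(self, window_secs=24 * 3600, max_failure_rate=0.02,
                 min_samples=50, clock=time.time):
        self.window_secs = window_secs
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples   # don't judge on tiny samples
        self.clock = clock
        self.events = deque()            # (timestamp, ok) pairs

    def record(self, ok):
        now = self.clock()
        self.events.append((now, ok))
        # Evict results that have aged out of the window.
        while self.events and self.events[0][0] < now - self.window_secs:
            self.events.popleft()

    def should_halt(self):
        if len(self.events) < self.min_samples:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) > self.max_failure_rate
```

Every deployment worker calls `record()` after each host update and checks `should_halt()` before starting the next one.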
The only acceptable update strategy for all software regardless of size or importance
> Enhance existing error handling in the Content Interpreter.
That's it.
Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.
> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.
Could it say any less? I hope the new check is a test fleet.
But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".
That's what stood out to me. From the CS post: "Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published."
Lesson learned, a "Validator" that is not actually the same program that will be parsing/reading the file in production, is not a complete test. It's not entirely useless, but it doesn't guarantee anything. The production program could have a latent bug that a completely "valid" (by specification) file might trigger.
https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
That stood out to me as well.
Their response was the moral equivalent of Apple saying “iTunes crashes when you play a malformed mp3, so here’s how we’re going to improve how we test our mp3s before sending them to you”.
This is a security product that is expected to handle malicious inputs. If they can’t even handle their own inputs without crashing, I don’t like the odds of this thing being itself a potential attack vector.
And great point that it's not just about crashing on these updates, even if they are properly signed and secure. What does this say about other parts of the client code? And if they're not signed, which seems unclear right now, then could anyone who gains access to a machine running the client get it to start boot looping again by copying Channel File 291 into place? What else could they do?
Echoes of the Sony BMG rootkit.
https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...
The bug itself is not particularly interesting, nor is the fix for it.
The astounding thing about this issue is the scale of the damage it caused, and that scale is all due to the rollout process.
I would not want to do any of this directly on metal, where the only safety is what you make for yourself. But that's the line Crowdstrike are in.
* By EDR standards, at least, where "only" one reboot a week forced entirely by memory lost to an unkillable process counts as exceptionally good.
* Have a timeout
* Decrement the counter after a successful load and parse
* Check the counter on startup; if it's at, like, 3, maybe consider that you are crashing
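That crash-loop failsafe can be sketched in a few lines. Everything here (the state-file path, the threshold, the reset-to-zero instead of a literal decrement) is an illustrative assumption, not how any real driver does it:

```python
import json
import os

def safe_load(path, parse, state="boot_counter.json", max_crashes=3):
    """Bump a counter before parsing new content; clear it on success.
    If the counter is already high at startup, we apparently keep dying
    mid-parse, so skip the new content and stay on the known-good config."""
    count = 0
    if os.path.exists(state):
        with open(state) as f:
            count = json.load(f)["count"]
    if count >= max_crashes:
        return None  # crash loop detected: refuse this content
    with open(state, "w") as f:
        json.dump({"count": count + 1}, f)  # bump BEFORE the risky parse
    result = parse(path)  # if this kills the process, the bump survives
    with open(state, "w") as f:
        json.dump({"count": 0}, f)  # loaded and parsed fine: reset
    return result
```

The key property is that the counter is persisted before the dangerous step, so a hard crash (or the "have a timeout" case) still leaves evidence for the next boot.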
It compiled, so they shipped it to everyone all at once without ever running it themselves.
They fell short of "works on my machine".
> Software Resiliency and Testing
> * Improve Rapid Response Content testing by using testing types such as:
> * Local developer testing
So no one actually tested the changes before deploying?!
This also applies to all the architects and CTOs at these Fortune 500 companies who allowed these self-updating systems into their critical systems.
I would offer a copy of Antifragile to each of these teams: https://en.wikipedia.org/wiki/Antifragile_(book)
"Every captain goes down with every ship"
Claw back executive pay, stock, and bonuses imo and you'll see funded QA and CI teams.
That is just part of the basic process and is hardly the thing that ensures a problem like this doesn't happen.
I know insider threats are very hard to protect against in general but these companies must be the most juicy target for state actors. Imagine what you could do with kernel space code in emergency services, transport infrastructure and banks.
A good QA team could turn around a rapid response update with more than enough testing to catch screwups like this and even some rather more subtle ones in an hour or two.
What if they implemented a release process, and followed it? Like everyone else does. Hackers at the workplace, sigh.
How can they keep on such a Q&R manager? He'll cost them billions.
In this post-mortem there are a lot of words, but not one of them actually explains what the problem was. Which is: what was the process in place, and why did it fail?
They also say a "bug in the content validation". Like what kind of bug? Could it have been prevented with proper testing or code review?
Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.
> what kind of bug? Could it have been prevented with proper testing or code review?
It doesn't matter what the exact details of the bug are. A validator and the thing it tries to defend being imperfect mates is a failure mode. They happened to trip that failure mode spectacularly.
Also saying "proper testing and code review" in a post-mortem is useless like 95% of the time. Short of a culture of rubber-stamping and yolo-merging where there is something to do, it's a truism that any bug could have been detected with a test or caught by a diligent reviewer in code review. But they could also have been (and were) missed. "git gud" is not an incident prevention strategy, it's wishful thinking or blaming the devs unlucky enough to break it.
More useful as follow-ups are things like "this type of failure mode feels very dangerous, we can do something to make those failures impossible or much more likely to be caught"
You can't reliably fix problems you don't understand.
It appears the process was:
1. Channel files are considered trusted; so no need to sanity-check inputs in the sensor, and no need to fuzz the sensor itself to make sure it deals gracefully with corrupted channel files.
2. Channel files are trusted if they pass a Content Validator. No additional testing is needed; in particular, the channel files don't even need to be smoke-tested on a real system.
3. A Content Validator is considered 100% effective if it has been run on three previous batches of channel files without incident.
Now it's possible that there were prescribed steps in the process which were not followed; but those too are to be expected if there is no automation in place. A proper process requires some sort of explicit override to skip parts of it.
So they did not test this update at all, even locally. It's going to be interesting how this plays out in the courts. The contract they have with us limits their liability significantly, but this, surely, is gross negligence.
And then there will be the costs of litigation. It was crazy in the IT department over the weekend, but not much less crazy in our legal teams, who were being bombarded with pitches from law firms offering help in recovery. It will be a fun space to watch, and this 'we haven't tested because we, like, did that before and nothing bad happened' statement in the initial report will be quoted in many lawsuits.
E.g. "sensors"? I mean, how about hosts, machines, clients?
I try to use social work terms and principles in professional settings, which blows these people’s minds.
Advocacy, capacity evaluation, community engagement, cultural competencies, duty of care, ethics, evidence-based intervention, incentives, macro-, mezzo- and micro-practice, minimisation of harm, respect, self concept, self control etc etc
It means that my teams aren’t focussed on “nuking the bad guys from orbit” or whatever, but building defence in depth and indeed our own communities of practice (hah!), and using psychological and social lenses as well as tech and adversarial ones to predict, prevent and address disruptive and dangerous actors.
YMMV though.
always makes me laugh
The only relevant part you need to see:
>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Problematic content? Yeah, this is telling exactly nothing.
Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation on how that would prevent this from happening again.
Conspicuously absent:
— fixing whatever produced "problematic content"
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes
— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test
— allowing the sysadmins to roll back updates before the OS boots
— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients
This is a nothing sandwich, not an incident review.
It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.
I was not talking about the code that crashed.
I guess what I wrote was non-obvious enough that it needs an explanation:
— fixing whatever produced "problematic content":
The release doesn't talk about the subsystem that produced the "problematic content". The part that crashed was the interpreter (consumer of the content); the part that generated the "problematic content" might have worked as intended, for all we know.
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes:
I am not talking about fixing this particular crash.
I am talking about design choices that allow such crashes in principle.
In this instance, the interpreter seemed to have been reading memory addresses from a configuration file (or something that would be equivalent to doing that). Adding an additional check will fix this bug, but not the fundamental issue that an interpreter should not be doing that.
>The architectural changes are the more interesting bits, and they're covered reasonably well
They are not covered at all. Are we reading the same press release?
>Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code.
Yes, that's the problem I am pointing out: the "validator" and "interpreter" should be the same code. The "validator" can issue commands to a mock operating system instead of doing real API calls, but it should go through the input with the actual interpreter.
In other words, the interpreter should be a part of the validator.
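A toy sketch of that design: the validator runs the real interpreter against a mock OS that records calls instead of performing them. The interpreter logic and content format here are invented stand-ins for whatever the actual Content Interpreter does:

```python
class MockOS:
    """Records the API calls the interpreter would make,
    instead of performing them."""
    def __init__(self):
        self.calls = []

    def api_call(self, name, *args):
        self.calls.append((name, args))

def interpret(content, os_iface):
    """Toy interpreter: each entry names an API call and its args.
    This same function runs in production against the real OS."""
    for entry in content:
        if not isinstance(entry, dict) or "call" not in entry:
            raise ValueError(f"malformed entry: {entry!r}")
        os_iface.api_call(entry["call"], *entry.get("args", []))

def validate(content):
    """The validator IS the interpreter, pointed at a mock OS: any
    content that blows up here would also have blown up in the field."""
    try:
        interpret(content, MockOS())
        return True
    except Exception:
        return False
```

With this structure, a validator/interpreter mismatch on what counts as "well-formed" is impossible by construction, because there is only one parser.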
>It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.
Sure; that's my subjective assessment. Personally, I am very dissatisfied with their post-mortem. If you are happy with it, that's fair, but you'd need to say more if you want to make a point in addition to "the architectural changes are covered reasonably well".
Like, which specific changes those would be, for starters.
>Enhance existing error handling in the Content Interpreter.
They did write that they intended to fix the bugs in both the validator and the interpreter. Though it's a big mystery to me and most of the comments on the topic how an interpreter that crashes on a null template would ever get into production.
I strongly disagree.
Add additional validation and enhance error handling say as much as "add band-aids and improve health" in response to a broken arm.
Which is not something you'd want to hear from a kindergarten that sends your kid back to you with shattered bones.
Note that the things I said were missing are indeed missing in the "mitigation".
In particular, additional checks and "enhanced" error handling don't address:
— the fact that it's possible for content to be "problematic" for the interpreter, but not for the validator;
— the possibility for "problematic" content to crash the entire system still remaining;
— nothing being said about what made the content "problematic" (spoiler: a bunch of zeros, but they didn't say it), how that content was produced in the first place, and the possibility of it happening in the future still remaining;
— the fact that their clients aren't in control of their own systems, have no way to roll back a bad update, and can have their entire fleet disabled or compromised by CrowdStrike in an instant;
— the business practices and incentives that didn't result in all their "mitigation" steps (as well as steps addressing the above) being already implemented still driving CrowdStrike's relationship with its employees and clients.
The latter is particularly important. This is less a software issue, and more an organizational failure.
Elsewhere on HN and Reddit, people were writing that ridiculous SLAs, such as "4-hour response to a vulnerability", make it practically impossible to release well-tested code, and that reliance on a rootkit for security is little more than CYA — which means that the writing was on the wall, and this will happen again.
You can't fix bad business practices with bug fixes and improved testing. And you can't fix what you don't look into.
Hence my qualification of this "review" as a red herring.
Better not only fix this specific bug but continuously use fuzzing to find more places where external data (including updates) can trigger a crash (or worse RCE)
But it seems to me that putting the interpreter in a place in the OS where causing a system crash is behavior it's allowed to exhibit is a fundamental design choice that is not at all addressed by fuzzing.
Some updates from the hub page:
They published an "executive summary" in PDF format: https://www.crowdstrike.com/wp-content/uploads/2024/07/Crowd...
That includes a couple of bullet points under "Third Party Validation" (independent code/process reviews), which they added to the PIR on the hub page, but not on the dedicated PIR page.
> Updated 2024-07-24 2217 UTC
> ### Third Party Validation
> - Conduct multiple independent third-party security code reviews.
> - Conduct independent reviews of end-to-end quality processes from development through deployment.
> * Local developer testing
Yup... now that all machines are internet connected, telemetry has replaced QA departments. There are actual people in positions of power that think that they do not need QA and can just test on customers. If there is anything right in the world, crowdsuck will be destroyed by lawsuits and every decisionmaker involved will never work as such again.
If this is how they are going to publish what happened, I don't have any hope that they've actually learned anything from this event.
> Throughout this PIR, we have used generalized terminology to describe the Falcon platform for improved readability
Translation: we've filled this PIR with technobabble so that when you don't understand it, you won't ask questions for fear of appearing slow.
heh? it's not that long and very readable.
As I understand it, they're telling us that the outage was caused by an unspecified bug in the "Content Validator", and that the file that was shipped was done so without testing because it worked fine last time.
I think they wrote what they did because they couldn't publish the above directly without being rightly excoriated for it, and at least this way a lot of the people reading it won't understand what they're saying but it sounds very technical.
This information is not just for _you_.
To me this was a complete failure on the process and review side. If something so blatantly obvious can slip through, how could I ever trust them to prevent an insider from shipping a backdoor?
They are auto updating code with the highest privileges on millions of machines. I'd expect their processes to be much much more cautious.
An actual scenario: Some developer starts working on pre deployment validation of config files. Let's say in a pipeline.
Most of the time the config files are OK.
Management says: "Why are you spending so long on this project? The sprint plan said one week; we can't approve anything that takes more than a week."
Developer: "This is harder than it looks" (heard that before).
Management: "Well, if the config file is OK then we won't have a problem in production. Stop working on it".
Developer: Stops working on it.
Config file with a syntax error slips through, .. The rest is history
Should be the tl;dr. On Threads there's information about CrowdStrike slashing QA team numbers; whether that was a factor should be looked at.
To a degree it makes sense, because it's not unusual for a template generator to provide a null response if given invalid inputs. However, the Content Validator then took that null and published it instead of handling the null case as it should have.
“if (corrupt digital signature) return null;”
is the type of code I see buried in authentication systems, gleefully converting what should be a sudden stop into a shambling zombie of invalid state and null reference exceptions fifty pages of code later in some controller that’s already written to the database on behalf of an attacker.
If I peer into my crystal ball I see a vision of CrowdStrike error handling code quality that looks suspiciously the same.
(If I sound salty, it’s because I’ve been cleaning up their mess since last week.)
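The contrast between that anti-pattern and failing fast can be sketched in a few lines. The signature check here is a fake (a trailing marker), purely to illustrate the control flow, not any real verification scheme:

```python
class SignatureError(Exception):
    pass

def check_signature_null_style(blob):
    """The anti-pattern above: swallow the failure and hand back None,
    leaving every downstream caller to remember the null check. Forget
    one, and invalid state shambles on until it explodes elsewhere."""
    if not blob.endswith(b"SIGNED"):  # stand-in for real verification
        return None
    return blob[:-6]

def check_signature_fail_fast(blob):
    """Fail fast instead: a corrupt signature stops the pipeline right
    here, before anything acts on a half-validated artifact."""
    if not blob.endswith(b"SIGNED"):
        raise SignatureError("corrupt digital signature")
    return blob[:-6]
```

The fail-fast version turns "zombie of invalid state fifty pages later" into a stack trace at the exact point of failure.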
Wasn't 'Channel File 291' a garbage file filled with null pointers? Meaning it's problematic content in the same way as filling your parachute bag with ice cream and screws is problematic.
2) The things that did not fail went so great
3) Many many machines did not fail
4) macOS and Linux unaffected
5) Small lil bug in the content verifier
6) Please enjoy this $10 gift card
7) Every windows machine on earth bsod'd but many things worked
"On Wednesday, some of the people who posted about the gift card said that when they went to redeem the offer, they got an error message saying the voucher had been canceled. When TechCrunch checked the voucher, the Uber Eats page provided an error message that said the gift card “has been canceled by the issuing party and is no longer valid.”"
https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-ap...
That said, this incident review doesn't mention numbers, unless I missed it; how colossal of a fuck up it was.
The reality is that they don't apologize ("bad shit just happens"), they work their engineers to the grave, make no apology, and completely screw up. This reads like a minor bump in processes.
CrowdStrike engineered the biggest computer attack the world has ever seen, with the sole purpose of preventing those. They're slowly becoming the Oracle of security, and I see no sign of improvement here.
Source: work in a Windows shop and had a normal day.
* Their software reads config files to determine which behavior to monitor/block
* A "problematic" config file made it through automatic validation checks "due to a bug in the Content Validator"
* Further testing of the file was skipped because of "trust in the checks performed in the Content Validator" and successful tests of previous versions
* The config file causes their software to perform an out-of-bounds memory read, which it does not handle gracefully
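A guarded read of the kind the sensor apparently lacked treats offsets from the channel file as untrusted and bounds-checks before dereferencing. This is toy Python over a byte buffer, standing in for what would be pointer arithmetic in the real driver:

```python
def read_field(buf, offset, length):
    """Read buf[offset:offset+length], but reject any field that falls
    outside the buffer instead of performing an out-of-bounds read."""
    if offset < 0 or length < 0 or offset + length > len(buf):
        raise ValueError(f"field [{offset}:{offset + length}] outside "
                         f"{len(buf)}-byte buffer")
    return buf[offset:offset + length]
```

In kernel code the equivalent check is cheap, and the failure mode becomes "reject this channel file" rather than "BSOD the host".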
That's crazy. How costly can it be to test the file fully in a CI job? I fail to see how this wasn't implemented already.
It didn't need a CI job. It just needed one person to actually boot and run a Windows instance with the Crowdstrike software installed: a smoke test.
TFA is mostly an irrelevant discourse on the product architecture, stuffed with proprietary CrowdStrike jargon, with about a couple of paragraphs dedicated to the actual problem; and they don't mention the non-existence of a smoke test.
To me, TFA is not a signal that Crowdstrike has a plan to remediate the problem, yet.
Everything else is smoke and the smell of sulfur.
The "What Happened on July 19, 2024?" section combined with the "Rapid Response Content Deployment" make it very clear to anyone reading that that is the case. Similarly, the discussion of the sensor release process in "Sensor Content" and lack of discussion of a release process in the "Rapid Response Content" section solidify the idea that they didn't consider validated rapid response content causing bad behavior as a thing to worry about.
Getting on your knees and admitting terrible fault with apologies galore isn't going to garner you any more sympathy.
While some affected users did have a file full of zeros, that was actually a result of the system in the process of trying to download an update, and not the version of the file that caused the crash.
Falcon configuration is shipped with both direct driver updates ("sensor content"), and out of band ("rapid response content"). "Sensor Content" are scripts (*) that ship with the driver. "Rapid response content" are data that can be delivered dynamically.
One way that "Rapid Response Content" is implemented is with templated "Sensor Content" scripts. CrowdStrike can keep the behavior the same but adjust the parameters by shipping "channel" files that fill in the templates.
"Sensor content", including the templates, are a part of the normal test and release process and goes through testing/verification before being signed/shipped. Customers have control over rollouts and testing.
"Rapid Response Content" is deployed through a different channel that customers do not have control over. Crowdstrike shipped a broken channel file that passed validation but was not tested.
They are going to fix this by adding testing of "rapid response" content updates and support the same rollout logic they do for the driver itself.
(*) I'm using the word "script" here loosely. I don't know what these things are, but they sound like scripts.
---
In other words, they have scripts that would crash given garbage arguments. The validator is supposed to check this before they ship, but the validator screwed it up (why is this a part of release and not done at runtime? (!)). It appears they did not test it, they do not do canary deployments or support rollout of these changes, and everything broke.
Corrupting these channel files sounds like a promising way to attack CS, I wonder if anyone is going down that road.
Would have happened a long time ago if it were that easy, no?
You would have to get into the supply chain to do much damage.
Otherwise, you would somehow need access to the hosts running the agent.
If you're a threat actor that already has access to hosts running CS, at a scale that would make the news, why would you blow your access on trying to ruin CS's reputation further?
Perhaps if you are a vendor of a competing or adjacent product that deploys an agent, you could deliberately try and crash the CS agent, but you would be caught.
This reads like a bunch of baloney to obscure the real problem. The only relevant part you need to see:
>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Problematic content? Yeah, this is telling exactly nothing.
Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation on how that would prevent this from happening again.
Conspicuously absent:
— fixing whatever produced "problematic content"
— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes
— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test
— allowing the sysadmins to roll back updates before the OS boots
— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients
This is a nothing sandwich, not an incident review.
Please don't do this! It makes merging threads a pain because then we have to find the duplicate subthreads (i.e. your two comments) and merge the replies as well.
Instead, if you or anyone will let us know at hn@ycombinator.com which threads need merging, we can do that. The solution is deduplication, not further duplication!
Apologies for inadvertently adding work!
Somehow, I never realized that duplicate threads were merged (instead of one of them being nuked), because it seems like a lot of work in the first place.
Thanks for doing it!