Yes, they aren’t perfect. They do some things that I disagree with.
But overall they prove themselves worthy of my trust, specifically because of the engineering mindset the company shares and how seriously they take things like this.
Thank you for the blog post!
- Insist that you have better integrity than your competitors
- Share a few operational investigations after your latest security event
What Cloudflare doesn't do is provide their SOC risk analysis as a PCI DSS payment card processor. Cloudflare doesn't explain why they ignored or failed to identify the elevated accounts, or how those accounts became compromised to begin with. They just explain remediation without accountability.
They mention a third-party audit was conducted, but that's not because they care about you. It's because PCI DSS mandates that when an organization of any level experiences a data breach or cyber attack that compromises payment card information, it needs to pass a yearly on-site audit to ensure PCI compliance. If they didn't, the major credit houses would stop processing their payments.
It's a very good writeup, as these things go. Cloudflare is huge. An ancillary system of theirs got popped as a sequela of the Okta breach. They reimaged every machine in their fleet and burned all their secrets. People are going to find ways to snipe at them, because that's fun to do, but none of those people are likely to handle an incident like this better.
I am not a Cloudflare customer (technically, I am a competitor). But my estimation of them went up (I won't say by how much) after reading this writeup.
They clearly defined the scope of impact and demonstrated that none of this touches systems in scope for PCI. There was no breach of change management inside Bitbucket, and none of the edge servers processing cardholder data were impacted. They will have plenty of artifacts to demonstrate that by bringing in an external firm. So I'm really not clear why you're bringing up PCI at all here. They made it clear no cardholder data was impacted, so your point about the required "on-site" audits is moot.
Cloudflare operates two entirely different scopes for PCI. The first is as a Merchant, where you, the customer, pay for the services; this is a very small scope of systems. The second is as a Service Provider that processes cards over the network. The network is built such that it is not feasible to exfiltrate card data from it. There are many reasons why, but they demonstrate year over year that this is not something that is reasonably possible. You can review their PCI AoC and get the (albeit limited) details to understand this better. Or you could get their SOC 2 Type 2 report, which covers many aspects of the edge network's control environment with much better testing detail. After reading that, you can come back to the blog and see that clearly no PCI-scoped systems were impacted in a way that would require any on-prem audit to occur.
And they are not a card network. They are a PCI Service Provider because cards transit their network. They are not at risk of being unable to process payments or transactions for their Merchant scope even if there are issues with their Service Provider scope, because, again, these are two separate PCI audits, testing two different sets of systems and controls.
And, as an aside, Cloudflare effectively always has on-prem PCI audits, because the PCI QSAs need to physically visit Cloudflare datacenters to validate not only the software side of compliance but also the datacenters deployed globally.
They did, and they admitted that it was their fault. I have to give them credit for that much.
> They did this by using one access token and three service account credentials that had been taken, and that we failed to rotate, after the Okta compromise of October 2023...The one service token and three accounts were not rotated because mistakenly it was believed they were unused. This was incorrect and was how the threat actor first got into our systems and gained persistence to our Atlassian products. Note that this was in no way an error on the part of AWS, Moveworks or Smartsheet. These were merely credentials which we failed to rotate.
Like, I get cynicism, but they very clearly explained the lead-up to the accounts being compromised and the mistakes that caused it. They took full accountability for it, which is frankly more than most companies dealing with security incidents manage. This entire write-up goes beyond most companies' obligations or responses.
No competitors were mentioned in Cloudflare's article, and they explained what kind of information was breached, none of it payment/card info... so I doubt you even read past the first few paragraphs/conclusion.
No payment card information was compromised.
I don't believe they didn't lose anything. That's not how this works, and most Jira/Confluence instances I've seen are loaded with secrets.
For a nation state actor, the easiest way to accomplish that is to send one of their loyal citizens to become an employee of the target company and then have the person send back "information about the architecture, security, and management" of the target company.
Fun (but possibly apocryphal) fact: more than a decade ago in a social gathering of SREs at Google, several admitted to being on the payroll of some national intelligence bureaus.
They had government engagements with Google's consent, and all those various engagements could be disclosed to each other?
If not, what kind of drugs were flowing at this social gathering, to cause such an orgy of bad OPSEC?
Australians get the 'opportunity' to be part of that sort of espionage as a base-level condition of citizenship [0].
As an upside, I guess it helps with encouraging good practices around zero trust processes and systems dev.
[0]: https://en.wikipedia.org/wiki/Mass_surveillance_in_Australia...
Precisely. Particularly in the case of US businesses. Why bother picking a lock when you have both the key and permission?
Code red is a standard term in emergency response that means smoke/fire. In general, in order to “redirect” that much effort one must do some paperwork to prove the urgency and immediacy of the threat.
The MO screams China to me but I wouldn’t read anything into the name “code red” which would have been selected before they identified the specific threat actor anyway.
I'm curious if they're rethinking being on Okta.
Okta deserves criticism for their failure, but this feels like Cloudflare punching down to shift blame for a miss on their part.
January 2022: https://blog.cloudflare.com/cloudflare-investigation-of-the-...
October 2023: https://blog.cloudflare.com/how-cloudflare-mitigated-yet-ano...
If anything Okta is a bigger company (by revenue, by employee count) and they were founded a year earlier.
It's fair to "punch down" IMO, as that's how the credentials were originally compromised. I'd agree with you if CF were trying to minimize their own mistake, but that doesn't seem to be what is happening here.
I am grandfathered into an old MacBook that has absolutely no management software on it, from the “Early Days” when there was no IT and we just got brand-new untouched laptops.
They offered me an upgrade to an M1/M2 pro, but I refused, saying that I wasn’t willing to use Okta’s login system if I have my own personal passwords or keys anywhere on my work computer.
Since that would hugely disrupt my work, I can’t upgrade. Maybe I can use incidents like this to justify my beliefs to the IT department…
Okta doesn't make device management software; that's made by companies like Jamf. Okta can integrate with them, but Okta isn't what manages your laptop at all.
> I wasn’t willing to use Okta’s login system if I have my own personal passwords or keys anywhere on my work computer.
Do not do this; it's not a personal device.
Well... don't do that? Why would you ever have personal anything on a work computer?
Eh? So why weren't they revoked entirely? I'm sure something's just unsaid there, or lost in communication or something, but as written that doesn't really make sense to me?
That is, I'd expect there was a flag in a database somewhere saying that those service accounts were "abandoned" or "cleaned up" or some other non-active status, but that this assertion was incorrect. Then they probably rotated all the passwords for active accounts, but skipped the inactive ones.
Speaking purely about PKI and certificate revocation, because that's the only similar context that I really know about, there is generally a difference between allowing certificates to expire, vs allowing them to be marked as "no longer used", vs fully revoking them: a certificate authority needs to do absolutely nothing in the first case, can choose to either do nothing or revoke in the second case, and must actively maintain and broadcast that revocation list for the third case. When someone says "hey I accidentally clobbered that private key can I please have a new cert for this new key," you generally don't add the old cert to the revocation list because why would you.
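For the curious, the three states look roughly like this in code. A minimal sketch using Python's `cryptography` package (file names are hypothetical, and `not_valid_after_utc` needs a fairly recent version of the library):

```python
from datetime import datetime, timezone
from cryptography import x509

# Hypothetical files: a previously issued cert and the issuing CA's CRL.
cert = x509.load_pem_x509_certificate(open("old-cert.pem", "rb").read())
crl = x509.load_pem_x509_crl(open("issuing-ca-crl.pem", "rb").read())

# Case 1: expired. The CA does nothing; verifiers reject on the date alone.
expired = cert.not_valid_after_utc < datetime.now(timezone.utc)

# Case 3: revoked. The CA must publish this, and every verifier must check it.
revoked = crl.get_revoked_certificate_by_serial_number(cert.serial_number) is not None

# Case 2 ("no longer used") exists only in the CA's own bookkeeping; relying
# parties can't observe it, which is why a clobbered-key reissue usually
# doesn't bother adding the old cert to the CRL.
print(f"expired={expired} revoked={revoked}")
```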
Great call out too
> Note that this was in no way an error on the part of AWS, Moveworks or Smartsheet. These were merely credentials which we failed to rotate.
I.e., instead of 'because they were mistakenly thought to be unused' you can say 'because they were mistakenly thought to be OK to leave as unused' (or something less awkward, depending on exactly what the scenario was), and there's no more blame there? And if you really want to emphasise blamelessness, you can say how your processes and training failed to sufficiently encourage least privilege, etc.
Because if you take it exactly as written, it's just too weird. I'm not a security expert with something to teach Cloudflare; "maybe don't leave secrets lying around that aren't actually needed for anything" is not news to many people, and they surely have many actual security people, for whom that would not even be a fizzbuzz-level interview question, reviewing any kind of secret storage or revocation policy/procedure. And there's also the mentioned third-party audit.
> The threat actor also attempted to access a console server in our new, and not yet in production, data center in São Paulo. All attempts to gain access were unsuccessful. To ensure these systems are 100% secure, equipment in the Brazil data center was returned to the manufacturers. The manufacturers’ forensic teams examined all of our systems to ensure that no access or persistence was gained. Nothing was found, but we replaced the hardware anyway.
They didn't have to go this far. It would have been really easy not to. But they did and I think that's worthy of kudos.
Getting in at the "ground floor" of a new datacentre build is pretty much the ultimate exploit. Imagine getting in at the centre of a new Meet-Me room (https://en.wikipedia.org/wiki/Meet-me_room) and having persistent access to key switches there.
Cloudflare datacentres tend to be at the hub of insane amounts of data traffic. The fact that the attacker knew how valuable a "pre-production" data centre is means that Cloudflare probably realized themselves that it would be 100% game over if someone managed to get a foothold there before the regular security systems were set up. It would be a company-ending event if someone managed to install themselves inside a data centre while it was being built/brought up.
Also remember: at the beginning of data centre builds, all switches/equipment have default/blank root passwords (admin/admin), and all switch/equipment firmware is old and full of exploits (you either go into each one and update the firmware one by one, or hook them up to automation for fleet-wide patching). Imagine this exploit taking place before the automation services had a chance to patch all the firmware... that's a "return all devices to make sure the manufacturer ships us something new" event.
It takes some honesty and good values from someone in the decision-making chain to go ahead with such a comprehensive plan. It's sad, because this should be table stakes, as you correctly say; but having seen many other cases, I think that although they did "the expected", it's definitely above and beyond what peers have done.
It wouldn't. Most people like to assume the impact of breaches to be what it should be, not what it actually is.
Look at the 1-year stock chart of Okta and, without looking up the actual date, tell me when the breach happened/was disclosed.
Given they got out of Cloudbleed without any real damage, let alone lasting damage, I disagree.
(I don't disagree with your point about how bad of a problem this would be, I'm just insisting that security failure is not taken seriously at all by anyone)
I can just imagine the attackers licking their lips when they first breached the data center.
Good reminder to use "Full (Strict)" SSL in Cloudflare. Then even if they do get compromised, your reverse-proxied traffic still won't be readable. (Of course other things you might use Cloudflare for could be vectors, though.)
This wouldn't get you much. We already assume the network is insecure. This is why TLS is a thing (and mTLS for those who are serious).
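The "serious" end of that is the origin refusing any TLS client that can't present a certificate signed by a CA it trusts (what Cloudflare brands Authenticated Origin Pulls). A minimal stdlib sketch of such an origin, with hypothetical file paths:

```python
import http.server
import ssl

# Require clients to present a cert signed by a CA we trust; bare TLS
# connections are rejected during the handshake. All paths hypothetical.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain("origin-cert.pem", "origin-key.pem")  # the origin's own cert
ctx.load_verify_locations("proxy-client-ca.pem")          # CA that signs client certs
ctx.verify_mode = ssl.CERT_REQUIRED

httpd = http.server.HTTPServer(("0.0.0.0", 8443), http.server.SimpleHTTPRequestHandler)
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
httpd.serve_forever()
```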
Aha, the old replace-your-trusted-hardware trick.
Now imagine, instead of Steve from HR's laptop, it's one of Cloudflare's servers.
If you assume that they only accessed what you can prove they accessed, you've left a hole for them to live in. It should require a quorum of people to say you DON'T need to do this.
Of course, this is ideal world. I'm glad my group is afforded the time to implement features with no direct monetary or user benefit.
If Cloudflare is in a position where their security team can make a call to rotate every secret and reimage every machine, and then that happens in some reasonable amount of time, that's pretty impressive.
It's good that you think you can absorb a complicated security task, but it's useless if you have no way to test or verify the action.
You find some scary things when you go looking for how exactly some written-by-greybeard script is authenticating against your started-in-1990s datastore.
https://github.com/BishopFox/sliver
"Since the Smartsheet service account had administrative access to Atlassian Jira, the threat actor was able to install the Sliver Adversary Emulation Framework, which is a widely used tool and framework that red teams and attackers use to enable “C2” (command and control), connectivity gaining persistent and stealthy access to a computer on which it is installed. Sliver was installed using the ScriptRunner for Jira plugin."
https://blog.cloudflare.com/thanksgiving-2023-security-incid...
Obviously a customer data breach would be worse, but this is really no bueno.
The customer data next year is not the same as the customer data this year.
In Atlassian's Confluence, even the built-in Apache Lucene search engine can leak sensitive information, and that kind of access by an attacker can be very hard to track or identify. They don't even have to open a Confluence page if the sensitive information is already shown on the search results page.
This is odd to me - unused credentials should probably be deleted, not rotated.
1: "what are these accounts?"
2: "oh they're unused, they don't even appear in the logs"
1: "we should rotate them"
2: "no, let's keep those rando accounts with the old credentials, the ones we think might be compromised ... y' know, for reasons"
?
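To be fair to the delete-don't-rotate position, it's mechanizable. A sketch against AWS IAM (these are real boto3 calls; the 90-day staleness policy and the commented-out delete are my assumptions):

```python
from datetime import datetime, timedelta, timezone
import boto3

iam = boto3.client("iam")
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

for user in iam.list_users()["Users"]:
    for key in iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]:
        last = iam.get_access_key_last_used(AccessKeyId=key["AccessKeyId"])
        used = last["AccessKeyLastUsed"].get("LastUsedDate")  # absent if never used
        if used is None or used < cutoff:
            print(f"stale: {user['UserName']} / {key['AccessKeyId']} -> delete, don't rotate")
            # iam.delete_access_key(UserName=user["UserName"],
            #                       AccessKeyId=key["AccessKeyId"])
```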
But I think they should have put honeypots on them and then waited to see what the attackers did. Honeypots also discourage attackers from continuing, for fear of being discovered.
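Canary credentials are cheap, too: plant a decoy key that nothing legitimate ever uses, then alarm on any activity for it. A hedged sketch (the decoy key ID is obviously made up; filtering CloudTrail lookup_events on AccessKeyId is a real API):

```python
import boto3

DECOY_KEY_ID = "AKIAEXAMPLEDECOY0001"  # hypothetical planted credential

ct = boto3.client("cloudtrail")
events = ct.lookup_events(
    LookupAttributes=[{"AttributeKey": "AccessKeyId", "AttributeValue": DECOY_KEY_ID}],
    MaxResults=50,
)["Events"]

if events:
    # Nothing legitimate uses this key, so ANY hit means someone found it.
    print(f"canary tripped: {len(events)} event(s) on decoy key -- page someone")
```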
Am I missing something here?
There’s no machine cert used? AuthN tokens aren’t cryptographically bound?
This doesn’t meet my definition of ZT, it seems more like “we don’t have a VPN”
If they had a VPN in place secured with machine certs, that would be yet another layer for an attacker to defeat.
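On the "cryptographically bound tokens" question upthread: the usual shape of that fix is DPoP-style proof-of-possession, where every request carries a signature made with a key that never leaves the machine, so a stolen bearer token is useless by itself. A hedged sketch with PyJWT (the key path, URL, and claim set are illustrative, not any particular vendor's protocol):

```python
import time
import uuid
import jwt  # PyJWT

# Hypothetical machine-bound private key, e.g. provisioned by device management.
with open("machine-key.pem") as f:
    machine_key = f.read()

# A one-time proof covering exactly this request; the server verifies it with
# the machine's registered public key and rejects stale or replayed proofs.
proof = jwt.encode(
    {
        "htm": "GET",                             # HTTP method this proof covers
        "htu": "https://wiki.internal.example/",  # target URI
        "iat": int(time.time()),
        "jti": str(uuid.uuid4()),                 # nonce against replay
    },
    machine_key,
    algorithm="RS256",
)
# Sent alongside the session token; copying the token alone gets an attacker nothing.
```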
It also highlights the need for the entire industry to move faster away from long-lived service account credentials (access tokens) and toward federated workload identity systems like OpenID Connect in the software supply chain.
These tokens too often provide elevated privileges in devops tools while bypassing MFA, and in many cases are rotated yearly. GitHub [1], GitLab, and AzDO now support OIDC, so update your service connections now! (A rough sketch of the pattern follows the footnote.)
Note: I'm not familiar with this incident and don't know whether that is precisely what happened here, or whether OIDC would have prevented the attack.
Devsecops and Zero Trust are often-abused buzzwords, but the principles are mature and can significantly reduce blast radius.
[1] https://docs.github.com/en/actions/deployment/security-harde...
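To make the pattern concrete, here's a hedged sketch of a CI job trading its short-lived OIDC identity token for cloud credentials, so there's no stored service-account secret to leak or forget to rotate. It uses the environment variables GitHub Actions documents for its OIDC endpoint and the real STS AssumeRoleWithWebIdentity call; the role ARN is hypothetical.

```python
import json
import os
import urllib.request

import boto3

# Fetch the job's OIDC token from the runner-provided endpoint.
req = urllib.request.Request(
    os.environ["ACTIONS_ID_TOKEN_REQUEST_URL"] + "&audience=sts.amazonaws.com",
    headers={"Authorization": f"Bearer {os.environ['ACTIONS_ID_TOKEN_REQUEST_TOKEN']}"},
)
oidc_token = json.load(urllib.request.urlopen(req))["value"]

# Trade it for cloud credentials that expire in minutes, not months.
creds = boto3.client("sts").assume_role_with_web_identity(
    RoleArn="arn:aws:iam::123456789012:role/ci-deploy",  # hypothetical role
    RoleSessionName="ci-job",
    WebIdentityToken=oidc_token,
)["Credentials"]
```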
The very nature of Jira and Confluence (both terrible products, btw) is to collect documentation. I'm assuming it was an internal Jira/Confluence for engineering teams, but still. There have got to be addresses, passwords, service account info, all kinds of info. If it was a tech support server then it's impossible to assert that you didn't lose customer data.
So we have this double standard where you pay for this product that is designed to house your deepest secrets and most cherished organizational information, that's so important to you that you run on-premises servers to keep it safe, but it's not important enough to constitute a real "breach".
You're lying. Either the server contained junk of no value in which case it wouldn't have existed in the first place, or you actually did lose something of value that you won't identify to us. Nobody sets up on-prem Jira just to leave it empty and never put secrets in it.
Viewing the HTML shows it's got an empty body tag, and a single script in the <head> with a URL of https://static.cloudflareinsights.com/beacon.min.js/v84a3a40...
EDIT: re-opened the link a few minutes later and now I see the post
The thoroughness is pretty amazing
Whenever some shitty Australian telco gets owned, people are angry and call them incompetent and idiots; it's nice to see Cloudflare gets owned in style with much more class and expertise.
Like the rest of the HN crowd, this incident has only increased my trust in Cloudflare.
Okta hitting everywhere
This seems incredibly wasteful.
Replacing an entire datacenter is effectively tossing tens of millions of dollars of compute hardware.
> To ensure these systems are 100% secure, equipment in the Brazil data center was returned to the manufacturers.
It doesn't say whether that was all the equipment, which would have been very helpful to know. But if it's just two or three access devices sitting on the border, it's not so bad.
Also, the manufacturer likely just sold the hardware to a different customer, sounds like it was pretty new and unused anyway. Just flash the firmware and you're good.
Anyway, I really hope that the hardware isn't just tossed into the recycling, but provided to schools and other places that could put them to good use.
Why didn't they start this effort BEFORE there was an incident?
> we undertook a comprehensive effort to rotate every production credential (more than 5,000 individual credentials)
Bearer credentials should already be rotated on a regular basis. Why did they wait until an incident to do this?
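Agreed, and it's automatable. A sketch of scheduled rotation with AWS Secrets Manager (real APIs; the tagging convention is my assumption, and each secret needs a rotation Lambda configured for rotate_secret to succeed):

```python
import boto3

sm = boto3.client("secretsmanager")

# Rotate everything tagged as a bearer credential (hypothetical tag scheme).
for page in sm.get_paginator("list_secrets").paginate():
    for secret in page["SecretList"]:
        tags = {t["Key"]: t["Value"] for t in secret.get("Tags", [])}
        if tags.get("credential-type") == "bearer":
            sm.rotate_secret(SecretId=secret["ARN"])  # kicks off the rotation Lambda
```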
> To ensure these systems are 100% secure
Nothing is 100% secure. Not being able to see and acknowledge that is a huge red flag.
> Nothing was found, but we replaced the hardware anyway.
Well that is just plain stupid and wasteful.
> We also looked for software packages that hadn’t been updated
Why weren't you looking for that prior to the incident?
> we were (for the second time) the victim of a compromise of Okta’s systems which resulted in a threat actor gaining access to a set of credentials.
And yet they continue using Okta. The jokes just write themselves.
> The one service token and three accounts were not rotated because mistakenly it was believed they were unused.
Wait, wait, wait. You KNEW the accounts with remote access to your systems were UNUSED and yet they remained active? Hahahahaha.
> The wiki searches and pages accessed suggest the threat actor was very interested in all aspects of access to our systems: password resets, remote access, configuration, our use of Salt, but they did not target customer data or customer configurations.
Totally makes sense; I'm sure the attacker was just a connoisseur of credentials and definitely did not want to target customer data.
Since they didn’t really have reason to believe my data was accessed, maybe that’s ok. I know from firsthand experience how hard rotating all your credentials across the whole org is.
The rule does not set any specific timeline between the incident and the materiality determination, but the materiality determination should be made without 'unreasonable delay'.
Blog: The threat actor (TA) accessed Okta’s customer support system and viewed files uploaded by Cloudflare (CF) as support cases.
Why was the session token part of the support files uploaded to Okta support? Does Okta require it for troubleshooting?
TA hijacked a session token of a CF employee from a support ticket.
Blog: Using the token extracted from Okta, the TA accessed Cloudflare’s Okta and compromised two separate Cloudflare employee accounts within the Okta platform.
How did this happen? Was the stolen token privileged? Also, why only two employee accounts? Were these different employees, or was one the same employee whose token was compromised? What does Okta employee account compromise mean: did the TA reset the password and MFA, and how? Or was there no MFA?
Blog: TA used stolen credentials to get access to our Atlassian server and accessed some documentation and a limited amount of source code.
Did the stolen employee credentials only have access to Atlassian?
Blog: TA gained access to a set of credentials
Does this mean multiple credentials got uploaded to the Okta support system?
Blog: The Okta compromise was in October, but the threat actor only began targeting CF systems using stolen credentials from the Okta compromise in mid-November.
Does this mean the compromised token was long-lived?
Blog: We failed to rotate one service token and three service accounts (out of thousands) of credentials that were leaked during the Okta compromise.
I didn't get this. Does this mean that, over time, CF employees had uploaded support info for thousands of apps managed by Okta? Which credentials did CF rotate initially after the Okta compromise?
Leaked Credentials: 1. Moveworks service token that granted remote access into our Atlassian system.
Is this service token a bearer token, and without expiry? Is this like an API key?
TA accessed Atlassian Jira and Confluence using the Moveworks service token to authenticate through the gateway.
2. A service account used by the SaaS-based Smartsheet application that had administrative access to our Atlassian Jira instance,
So here, the Smartsheet SaaS was given access to the on-prem Atlassian Jira instance? What kind of trust is that? Is this also managed through Okta? And how does filing a support case include a service account? Does "service account" here again mean some kind of hardcoded API key without expiry?
TA used the Smartsheet service account to gain access to the Atlassian suite. They used Smartsheet credentials to create an Atlassian account that looked like a normal Cloudflare user. They added this user to a number of groups within Atlassian so that they’d have persistent access to the Atlassian environment.
Since the Smartsheet service account had administrative access to Atlassian Jira, the TA was able to install the Sliver Adversary Emulation Framework, a widely used tool and framework that red teams and attackers use to enable "C2" (command and control) connectivity, gaining persistent and stealthy access to a computer on which it is installed. Sliver was installed using the ScriptRunner for Jira plugin. This allowed them continuous access to the Atlassian server, and they used this to attempt lateral movement. With this access, the TA attempted to gain access to a non-production console server in our São Paulo, Brazil data center, due to a non-enforced ACL. The access was denied, and they could not access any global networks.
3. A Bitbucket service account, which was used to access our source code management system
4. Credentials for an AWS environment that had no access to the global network and no customer or sensitive data.
Were these AWS access keys? Also, it looks like these keys did provide access to the AWS account. That means the access key didn’t require MFA.
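The standard guardrail for the human side of that is a policy that denies everything when MFA isn't present; aws:MultiFactorAuthPresent is a real IAM condition key, though it doesn't help for headless service keys. A sketch with hypothetical user/policy names:

```python
import json
import boto3

# Deny all actions unless the request was made with MFA present.
deny_without_mfa = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
    }],
}

boto3.client("iam").put_user_policy(
    UserName="ops-user",            # hypothetical
    PolicyName="deny-without-mfa",  # hypothetical
    PolicyDocument=json.dumps(deny_without_mfa),
)
```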
The only production system the TA could access using the stolen credentials was our Atlassian environment.
Mitigations:
Blog: We decided a huge effort was needed to further harden our security protocols to prevent the threat actor from being able to get that foothold had we overlooked something from our log files.
What does hardening security protocols mean here? Is it techniques for D&R, or something else?
Blog: We undertook a comprehensive effort to rotate every production credential (more than 5,000 individual credentials)
I believe this means forced resets on employees (Okta users), right?
> Over the next day, the threat actor viewed 120 code repositories (out of a total of 11,904 repositories)
> They accessed 36 Jira tickets (out of a total of 2,059,357 tickets) and 202 wiki pages (out of a total of 14,099 pages).
Is it just me, or do 12K git repos and 2 million Jira tickets sound like a crazy lot? 15K wiki pages is not that high, though.
> Since the Smartsheet service account had administrative access to Atlassian Jira, the threat actor was able to install the Sliver Adversary Emulation Framework, which is a widely used tool and framework that red teams and attackers use to enable “C2” (command and control) connectivity, gaining persistent and stealthy access to a computer on which it is installed. Sliver was installed using the ScriptRunner for Jira plugin.
> This allowed them continuous access to the Atlassian server, and they used this to attempt lateral movement. With this access the Threat Actor attempted to gain access to a non-production console server in our São Paulo, Brazil data center due to a non-enforced ACL.
Ouch. Full access to a server OS is always scary.
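A cheap detection for this particular vector is diffing the instance's installed plugins against an allowlist. A hedged sketch against Jira Server/DC's Universal Plugin Manager REST endpoint; the host, credentials, allowlist, and exact response shape are assumptions that can vary by version:

```python
import requests

ALLOWED = {"com.onresolve.jira.groovy.groovyrunner"}  # e.g. ScriptRunner itself

resp = requests.get(
    "https://jira.internal.example/rest/plugins/1.0/",  # hypothetical host
    auth=("audit-bot", "app-password"),                 # hypothetical credentials
    timeout=30,
)
resp.raise_for_status()

# Assumes the UPM response lists plugins with "key" and "userInstalled" fields.
installed = {p["key"] for p in resp.json()["plugins"] if p.get("userInstalled")}
for key in sorted(installed - ALLOWED):
    print(f"unexpected user-installed plugin: {key}")
```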
It can also happen in franken-build systems which encourage decoupling by making separate repos: one repo that defines a service’s API, containing just proto (for example). A second repo that contains generated client code, a third with generated server code, a fourth for implementation, a fifth which supplies integration test harnesses, etc…
Sound insane? It is! But it's also how an awful lot of stuff worked at AWS, just as an example.
2M tickets - in my 4.5y at present company we've probably averaged about 10 engineers and totalled 4.5k tickets. Cloudflare has been around longer, has many more engineers, might use it for HR, IT, etc. too, might have processes like every ticket on close opens a new one for the reporter to test, etc. It sounds the right sort of order of magnitude to me.
Source: former Cloudflare employee
They accessed 36 Jira tickets (out of a total of 2,059,357 tickets) and 202 wiki pages (out of a total of 194,100 pages)
I think my org has on the order of 3 repositories per dev? They seem to have 3200 employees, with what I assume to be a slightly higher rate of devs, so you’d expect around 6-7 thousand?
2M Jira tickets is probably easily achieved if you create tickets using any automated process.