- the notification was a week ago to a small mailing list, which is tucked away on their site
- no notification to the address you register when you go to download Salt (at least I never received an email, but I still get plenty of marketing spam)
- no posts on social media as far as I can tell: I couldn't find a tweet, anything on Reddit, or anything on HN.
- they only blogged about it on their official site yesterday, way after damage had been done
- one week's notice between the initial announcement and the patch coming out. The patch being released is basically a disclosure of the vulnerability
- the patch was released late Thursday or early Friday depending on your timezone, giving attackers a weekend head start
- the official salt docker images were only patched yesterday
- You can't get a patch for older versions without filling out a form and supplying details
- Ubuntu and other repositories are still vulnerable
Not trying to downplay the critical nature of the vulnerability but the ones that were compromised by this issue have deeper security issues to deal with.
You seem to subscribe to the "hard shell, soft gooey center" network security philosophy. Should people expose an Oracle server to the internet? Absolutely not. Does moving it behind a firewall change the fact that every mildly skilled exploit developer is sitting on an Oracle 0day? Absolutely not.
People have legitimate reasons for exposing Salt to the internet. I do. It's how I bootstrap random VMs and bare metal from the internet. But in my case the attack was mitigated by the fact that Salt cascades changes in a bunch of other systems and re-masters minions to a host only reachable over a tunnel. I blew away the internet master, restored from a backup, and patched.
> the ones that were compromised by this issue have deeper security issues to deal with
Or it was just another Monday. When you become sufficiently large you deal with incidents on a daily basis. Kudos to the people who publicly postmortem and talk about what went well and what didn't.
(For the record, I've already been working for a few months on a move to Ansible for non-security reasons)
It's far too easy to make something internet-visible. They could have set up a simple check to see if the service is reachable from the internet, and refused to work if it was.
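A rough sketch of what such a check could look like, assuming the service knows the address it is about to bind to. The function names and the `allow_public` override are hypothetical, not anything Salt provides, and this only inspects the bind address: it cannot see firewalls, NAT, or port forwarding, so it would catch the obvious cases and nothing more.

```python
import ipaddress


def is_publicly_routable(bind_address: str) -> bool:
    """Return True if binding to this address could expose the service publicly."""
    addr = ipaddress.ip_address(bind_address)
    # 0.0.0.0 and :: bind every interface, so treat them as potentially public.
    if addr.is_unspecified:
        return True
    # is_global is False for private, loopback, link-local and reserved ranges.
    return addr.is_global


def check_bind_address(bind_address: str, allow_public: bool = False) -> None:
    """Refuse to start on a public address unless the operator opted in."""
    if is_publicly_routable(bind_address) and not allow_public:
        raise RuntimeError(
            f"refusing to listen on {bind_address}: address may be internet-visible"
        )


if __name__ == "__main__":
    check_bind_address("10.0.0.5")  # private address, passes silently
    try:
        check_bind_address("0.0.0.0")
    except RuntimeError as exc:
        print(exc)
```

Making the operator pass an explicit override keeps the "I really do bootstrap machines over the internet" case possible while making the exposed setup a deliberate choice rather than a default.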
> Use a hardened bastion server or a VPN to restrict direct access to the Salt master from the internet
Is this SSH access or is this access to the salt master from minions? Or just access in general?
While your other points may be valid, one week should be plenty of time between announcement and patch. Any longer and I would call the timetable problematic.
One week is nothing compared to what it would take to upgrade your configuration management system.
isn't really salt's problem though.. same could be said for relying on any distro-provided package
A number of people have carefully reviewed the payload that was deployed to servers, especially during what we're calling v1-v4 of the attack. (v5 onwards got more complex, but that wasn't until Monday, with variability for timezone.)
> Nobody has any idea what was run on the servers ...
Well that's not true - there's a number of victims that have useful IDS tools, including auditd, plus the review of binaries and shell scripts deployed, etc.
Some of us also have netflow collection at the edge, and can review connections initiated from within our networks.
> ... once the initial attack script was deployed it downloaded and executed new scripts every 60s and then removed themselves.
I don't think any of us have found scripts that removed themselves. While that may sound naive, there's a few researchers that have been analysing these tools, including via large honeypot networks, and this just hasn't (at least for the first 2-3 days) been a profile of the attack.
Thankfully - and I appreciate it's very weird to say this - the initial attacks were very much vanilla cryptocurrency mining opportunities. It could have been a lot worse, and Algolia's assessment matches a lot of other independent assessments on this front.
You said the v5 of the attack got more sophisticated. How do we know there wasn't a "v0" that was even more sophisticated and innocuous? You can't trust the server logs. Firewall tables were flushed, SELinux was disabled. It's just really hard to say the full extent of damages.
I'll try to give you some insight as I'm a security engineer at Algolia.
Your concern is valid, and it's true, we cannot know for sure. That's the reason why, as explained in the blog post, we are reinstalling all impacted servers and rotating our secrets. If our assumption is false, this should contain the issue.
That being said, we have good reasons to make that assumption.
- Our analysis of the incident and of how the malware behaved on our systems didn't find any evidence of access to or transfer of data.
- There are other public analyses of the malware. Other companies that were hit have reached the same conclusions as us, and you can have a look at https://saltexploit.com/ which is maintaining an interesting list of what is known about the attack, how it behaved, and how it's evolving fast to adapt.
I hope this answers your concern.
(The coin mining could be a cover like you mention, but it seems unlikely since it naturally draws attention.)
So this means they had Salt master ports publicly accessible? Why would anyone have salt ports open/exposed to public/internet?
If you're bootstrapping random servers, this is a fine approach.
The whole Salt connection methodology is 'trust on first connect' (a bit like the default SSH) with a manual stage in accepting an incoming request and the connection stream is encrypted.
If you're using salt to bootstrap your VPN servers or network appliances then it's understandable that you'd have it exposed to a more public network, and the documentation was clear that this was fine.
Not everything is a virtual machine on a cloud provider.
In light of this attack, maybe going forward have a setup script that creates an SSH tunnel back to a machine that can talk to the salt-master for you. You could then have VPN, but if it's flakey at all, it could cost the ability to update machines.
Or perhaps (and I say this as a saltstack user) ansible really is the more secure model for those scenarios.
Define "random". I think there is an alternative method that doesn't involve exposing your CM server on the Internet for almost any definition of random. In the Algolia case that's pretty certain, because they now filter access by IP (so they KNOW the IPs).
Even with a zero-trust network or the BeyondCorp idea, I still find the extra layer of protection a VPC gives really valuable. A few years ago there was an issue with the K8s API server, and updating k8s isn't a walk in the park. I felt relaxed back then because we had everything inside a VPC.
You can use SSH or a VPN to access services inside the VPC. But any tool that has permission to manage your infrastructure should never be exposed to the internet.
Same thing with Jenkins: if you are using Jenkins to manage Terraform or trigger Ansible/Salt/Chef runs, make sure Jenkins is not reachable from the internet. Use a different method to route webhooks into it.
Imo this is THE lesson to learn from this story.
Secondary: salt and ansible are not very mature yet.
What issues do you have with Ansible?
For Jenkins it's a bit more complicated because of GitHub webhooks, although they do publish their IPs in a programmatic form so you can whitelist them.
1. Configure a webhook override in Jenkins, so Jenkins registers something like https://ci-webhook.domain.com as the GitHub webhook.
2. This ci-webhook is a simple webapp that validates the webhook and, if it's valid (signed by the correct key), writes the payload to an SQS queue.
3. A small daemon, running on the same Jenkins master, pulls from the SQS queue and replays the payload to the local Jenkins.
I used to rely on the GitHub IP whitelist, but one day I realized anyone can hit my Jenkins by using GitHub.
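For the validation in step 2, here is a minimal sketch, assuming the webhook is configured with a shared secret so that GitHub sends an HMAC-SHA256 of the request body in the `X-Hub-Signature-256` header. The function names are hypothetical, and `enqueue` is a stand-in for whatever writes to the SQS queue, not a real SQS call.

```python
import hashlib
import hmac
from typing import Callable


def is_valid_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check a header of the form 'sha256=<hexdigest>' against the raw body."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison; compare bytes so odd header contents cannot raise.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def relay_webhook(
    secret: bytes,
    payload: bytes,
    signature_header: str,
    enqueue: Callable[[bytes], None],
) -> bool:
    """Forward the payload to the queue only if the signature checks out."""
    if not is_valid_signature(secret, payload, signature_header):
        return False
    enqueue(payload)  # stand-in for the SQS write described in step 2
    return True
```

Unlike the IP whitelist, this ties each request to a secret only your repository's webhook knows, so another GitHub user pointing a webhook at the same URL fails the check.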
It creates a very high value target that is difficult to secure.
I prefer a model where the management commands are signed at a management workstation and those commands are pushed by the server and authenticated at the managed node against a security policy.
I’d consider open sourcing something based on them if there’s sufficient interest.
Perhaps as an integration for one of the major players.
(Disclaimer: long-time operator and fledgling programmer)
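To make the signed-command model above concrete, here is a toy sketch: the workstation signs, the server only relays, and the node verifies and checks the action against a local policy. Everything in it is an assumption for illustration. In particular it uses an HMAC with a key shared between workstation and nodes as a stand-in for real signatures, because that is what the Python standard library offers; a real design would want asymmetric signatures so a compromised node cannot forge commands for other nodes. `ALLOWED_ACTIONS` and the message layout are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical node-side policy: the only actions this node will ever run.
ALLOWED_ACTIONS = {"restart_service", "apply_state"}


def sign_command(key: bytes, action: str, target: str) -> dict:
    """Workstation side: serialize the command and attach an authentication tag."""
    body = json.dumps({"action": action, "target": target}, sort_keys=True)
    tag = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_command(key: bytes, message: dict) -> dict:
    """Node side: reject anything unauthenticated or outside the local policy."""
    expected = hmac.new(key, message["body"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), message["tag"].encode()):
        raise PermissionError("bad signature")
    command = json.loads(message["body"])
    if command["action"] not in ALLOWED_ACTIONS:
        raise PermissionError("action not permitted by policy")
    return command
```

The relaying server never holds the key, so compromising it lets an attacker drop or replay messages but not invent new ones. Replay protection (a nonce or timestamp in the body) is left out of this sketch.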
IMHO the two main advantages in favor of Algolia are the sane defaults for relevancy and speed, and the fact that the service is hosted and can grow with your business without dedicated engineers to manage both the configuration and the infrastructure.
Also, on top of the Algolia services per se (search, analytics, recommendation, etc.), we're providing a lot of backend and frontend libraries which one would otherwise need to reimplement when using an Elasticsearch- or Solr-based implementation.
You don't have to look that far to find problems with that:
https://github.com/saltstack/salt/commit/5dd304276ba5745ec21...
Also think about how many years this vuln has been present and exposed. Who's to know blackhats haven't sat on this 0day for years, quietly compromising private keys and other data? Spooky.
These choices all impact the reliability and security of the resulting system, especially the following:
* do they rely on SSH, or have they implemented their own authentication / authorization techniques? (personally I would be very reluctant to trust anything that just listens on a network port for deployment commands and isn't SSH;)
* do the agents run with full `root` privileges, or is there a builtin mechanism that allows the agent to act only in a limited capacity, within the confines of a set of whitelisted actions? (perhaps even requiring a secondary authentication mechanism for certain "sensitive" actions, for example something integrated with `sudo`, that provides a sort of 2-factor-authentication with a human in the loop;)
* do the operators have enough "visibility" into what is happening during the deployments? (more specifically, are the deployment scripts easily auditable or are they a spaghetti of dependencies? are the concrete actions to be taken clearly described, or are they hidden in the source code of the tool?)
* are there builtin mechanisms to "verify" the results of the deployments?
* and building upon the previous item, are there mechanisms to continuously "verify" that the deployment hasn't changed behind the scenes?
I understand that some of these features wouldn't have directly prevented this particular case, but they would have helped with alerting and diagnosis.
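As an illustration of the last two items, a continuous "verify" step can be as simple as recording a hash of each deployed file and periodically comparing against it. This is a minimal sketch under that assumption, with hypothetical function names; it is not how any particular tool does it, and the manifest would have to be stored somewhere the managed host cannot rewrite.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(paths: Iterable[str]) -> Dict[str, str]:
    """Record the expected digest of every deployed file right after a deployment."""
    return {str(p): file_digest(Path(p)) for p in paths}


def find_drift(manifest: Dict[str, str]) -> List[str]:
    """Return the files that are missing or whose contents no longer match."""
    drifted = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or file_digest(path) != expected:
            drifted.append(name)
    return sorted(drifted)
```

Run from cron or a monitoring agent, a non-empty result from `find_drift` is the kind of alert that would have flagged dropped binaries or modified configs early, even if it could not have stopped the initial compromise.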