However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary. So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?
The only other designs I'm familiar with that sort of come close are MicroVMs. Those have the downside of actually needing to run a VM though, and most (all?) cloud providers don't allow nested virtualization so you're stuck running on an enormous bare metal box.
There are AWS and GCP instance types with nested virtualization that'll let you run Firecracker. DigitalOcean apparently supports it everywhere.
The reason I mention this is that, sadly, bare metal instance types are only ever the largest size of a given family, which is cost prohibitive for most users. And even if cost isn't an issue, they take much, much longer to start (like 10-20+ minutes) and they actually fail to start far too frequently. It's really a shame that all instance types other than bare metal have virtualization extensions disabled, otherwise we'd be operating far more workloads in Firecracker or Kata. We operate huge Kubernetes clusters so the cost is roughly the same whether it's fewer big instances or more smaller instances, but those startup times and reliability are terrible for autoscaling.
Please, AWS, bring nested virtualization to all Nitro instance types!
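For anyone who wants to check whether a given instance actually exposes virtualization extensions before trying Firecracker or Kata, here's a rough Python sketch. It assumes an x86 Linux guest, where the CPU flags `vmx` (Intel) or `svm` (AMD) show up in /proc/cpuinfo and KVM is exposed as /dev/kvm. The function names are made up for illustration.

```python
import os


def cpu_virt_flags(cpuinfo_text):
    """Return the hardware virtualization flags found in /proc/cpuinfo text.

    Assumption: x86 layout, where each CPU has a "flags : ..." line.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    # vmx = Intel VT-x, svm = AMD-V
    return flags & {"vmx", "svm"}


def kvm_usable(dev_path="/dev/kvm"):
    """True if the KVM device node exists and this user can open it read/write."""
    return os.path.exists(dev_path) and os.access(dev_path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            found = cpu_virt_flags(f.read())
    except OSError:
        found = set()
    print("virtualization flags:", sorted(found) or "none")
    print("/dev/kvm usable:", kvm_usable())
```

If the flags are missing inside the guest, the hypervisor isn't passing the extensions through, and no amount of configuration in the guest will bring them back.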
Yes, because systems that are designed with these kinds of security boundaries in mind already look like containers -- they're a natural match to actual capability-based systems like, for example, plan9's.
The problem here stems entirely from trying to keep these globally-overriding capabilities like CAP_SYS_ADMIN and CAP_DAC_OVERRIDE while also allowing users to create their own namespaces. None of these CVEs were a thing as long as only root could create new user namespaces, and now that normal users can, all these areas where things weren't checked are coming out of the woodwork.
But a ground up capability-based system avoids this kind of problem by simply making it impossible to elevate to a privilege level like 'root' on POSIX systems, and so namespacing within those systems is incredibly natural to the point that it didn't really get a name (containers) until one was needed for Linux's cognitive dissonance around the idea.
Here is what capabilities are.
https://en.wikipedia.org/wiki/Capability-based_security
I don't think what you're advocating for makes a ton of sense tbh. You're basically saying "just make it impossible to privesc", which, yeah, that would be nice... but it's not like you can just do that.
I think your point is more that least privilege should be more common - that way exploits have less impact. I agree. That said, Linux Capabilities are extremely coarse, and most container escapes involve owning the Kernel, which, in a real Capabilities model, would be the trusted broker of capabilities to begin with.
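To make "coarse" concrete: the whole effective capability set of a process is one bitmask, visible as the `CapEff:` line in /proc/<pid>/status, and CAP_SYS_ADMIN is a single bit in it. A minimal Python sketch that decodes a few of those bits follows. The bit numbers are the ones from linux/capability.h; the table is deliberately partial and the function name is just for illustration.

```python
# Bit positions from linux/capability.h (partial table, for illustration only).
CAP_BITS = {
    "CAP_DAC_OVERRIDE": 1,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_ADMIN": 21,
}


def effective_caps(status_text):
    """Return the names from CAP_BITS that are set in the CapEff mask.

    status_text is the content of /proc/<pid>/status.
    """
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}
    return set()


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print(sorted(effective_caps(f.read())))
    except OSError:
        print("no /proc/self/status on this system")
```

One bit decides whether a process may do everything gated on CAP_SYS_ADMIN, which is a long and unrelated list of operations.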
Containers rely on the Linux kernel. The Linux kernel is shit, in terms of security, for a number of reasons. So all one requires is to own the kernel, and there are a lot of ways to do that. Containers block some system calls and can lower attack surface to a degree, which is great - I think it's a huge win that containers are so popular and, finally, some degree of isolation is widespread.
We'll be stuck in retroactive security mode until developers care to change that, especially ones with influence like kernel maintainers.
> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?
Absolutely not. We'd have ended up with something like Firecracker or gVisor. The issues with containers are fundamental to the concept of having a shared Linux kernel, which is basically what makes a container a container.
> If not, is there any realistic way to go from where we are to where we should be?
Use Firecracker or gVisor.
> Those have the downside of actually needing to run a VM though
I think at this point VMs are not that big of a deal. It's clearly good enough for the vast majority of people who are running on the cloud.
> don't allow nested virtualization so you're stuck running on an enormous bare metal box.
This part is a bummer.
The other option though is to just not care if your OS gets owned. Split your services up, move capabilities across other boundaries like mTLS.
You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes can then turn around and suddenly write virtualization layers without security holes.
It might have looked like FreeBSD jails or Illumos / Solaris Zones. Both of which are containers designed as a security boundary from the start.
> However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary.
It has to be secure. Browsers are using pretty much the same technologies (seccomp-bpf, cgroups, namespaces, etc) to tightly sandbox JavaScript from websites. Browsers run wildly untrusted code from all over the web, and are expected to withstand many forms of malware without letting any of it escape the sandbox.
If containers can't be made secure, we have bigger problems.
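As a quick way to see whether a process is actually running under a seccomp filter, the kernel reports the mode in the `Seccomp:` field of /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter). A small Python sketch, with a made-up function name:

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            value = int(line.split()[1])
            return SECCOMP_MODES.get(value, "unknown")
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print("seccomp:", seccomp_mode(f.read()))
    except OSError:
        print("no /proc/self/status on this system")
```

Seeing "filter" only says a filter is installed. It says nothing about how tight that filter is.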
> So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?
No! Linux and Unix APIs are a mess of patchworks. They are pretty much insecure by default, with rare exceptions.
We could make a new platform with a saner API and make it run on top of Linux, and write new backend services targeting it. I think WASI may just be that. The only problem is that wasm has some overhead and doesn't have access to all CPU features.
That said, if even VMs are only best-effort security boundaries, then AppArmor/SELinux-type restrictions put in place on the host should be the main hard security boundary IMO.
I disagree with that idea. The actual threat may be as limited in capabilities as a standard bug. Let's say you have a problem with your webapp where you can read an arbitrary file, but nothing else. Containers are a perfect protection in this case if you want to isolate the app from any other services running on the host (monitoring, provisioning, etc.).
There's no perfection and defence in depth is what we need to use everywhere. Unless you can break through all layers at the same time, imperfect layers are a valid improvement. See how many default protections you have to turn off to even make this bug viable.
https://cloud.google.com/compute/docs/instances/nested-virtu...
Yes. The only difference is that the Linux-based systems and tools, as opposed to Zones or Jails, were the first pivot to a developer-focused view rather than that of the sysadmin. This utility is why containers gained critical mass, not because the security-focused foundations of other implementations were an impediment.
My opinion: want security? Separate (bare metal) machines. Period.
i.e. root inside a container is root on the host; the container itself doesn't help that. But other security features, that are applied to the processes within the container when the container is created, might.
"Fortunately, the default security hardenings in most container environments are enough to prevent container escape. Containers running with AppArmor or SELinux are protected. "
So, all that hard work on SELinux continues to pay off.
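If you want to know whether a given process is actually covered by one of those, the current LSM label is readable from /proc/<pid>/attr/current. Here is a rough Python sketch that guesses the LSM from the label's shape. It assumes the usual formats: AppArmor reports `<profile> (<mode>)` or `unconfined`, SELinux reports `user:role:type:level`. The function name and the classification rules are my own simplification, and it does not check whether SELinux is in permissive mode.

```python
def classify_label(raw):
    """Guess (lsm, confined) from the content of /proc/<pid>/attr/current.

    Simplified heuristic: it only looks at the label's shape and ignores
    SELinux permissive mode.
    """
    label = raw.strip("\x00\n ")
    if not label:
        return ("unknown", False)
    if label == "unconfined":
        return ("apparmor", False)
    if label.endswith(")") and " (" in label:
        # AppArmor: "<profile> (<mode>)", e.g. "docker-default (enforce)"
        mode = label[label.rindex("(") + 1:-1]
        return ("apparmor", mode == "enforce")
    parts = label.split(":")
    if len(parts) >= 4:
        # SELinux: user:role:type:level; the type is the process domain
        return ("selinux", parts[2] != "unconfined_t")
    return ("unknown", False)


if __name__ == "__main__":
    try:
        with open("/proc/self/attr/current") as f:
            print(classify_label(f.read()))
    except OSError:
        # Reading can fail when no LSM provides this attribute.
        print(("unknown", False))
```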
I really wanted an audit mode that could also say "this command will unlock the specific thing I just blocked".
That was a few years ago. Since then, I've turned off selinux whenever I'm getting screwed by some opaque process, stuff starts working, and closing it back down while leaving what I need open remains impossible black magic.
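For what it's worth, the information needed for that kind of hint is already in the AVC denial record in the audit log, and the stock `audit2allow` tool works from those records. Below is a toy Python sketch of the idea, not a replacement for it: it pulls the source type, target type, class and permissions out of one denial line and prints the matching `allow` rule. The sample log line in the usage note is invented for illustration, and the function name is hypothetical.

```python
import re

# Matches the parts of an AVC denial needed to build an allow rule.
AVC_RE = re.compile(
    r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}.*?"
    r"scontext=(?P<scontext>\S+)\s+tcontext=(?P<tcontext>\S+)\s+tclass=(?P<tclass>\S+)"
)


def suggest_allow_rule(avc_line):
    """Return a policy 'allow' rule for one AVC denial line, or None if it doesn't parse."""
    m = AVC_RE.search(avc_line)
    if m is None:
        return None
    # SELinux contexts are user:role:type:level; the rule is written over types.
    source_type = m["scontext"].split(":")[2]
    target_type = m["tcontext"].split(":")[2]
    perms = m["perms"].split()
    perm_text = perms[0] if len(perms) == 1 else "{ " + " ".join(perms) + " }"
    return f"allow {source_type} {target_type}:{m['tclass']} {perm_text};"


if __name__ == "__main__":
    sample = (
        'type=AVC msg=audit(1644000000.123:456): avc:  denied  { read } for  '
        'pid=1234 comm="httpd" name="index.html" dev="dm-0" ino=5678 '
        "scontext=system_u:system_r:httpd_t:s0 "
        "tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=0"
    )
    print(suggest_allow_rule(sample))
```

The rule tells you what would unblock the denial. Whether you should allow it is a separate question that no tool answers for you.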
SELinux remains inscrutable and unusable to the lay person. Microsoft had the same problem with Windows XP, especially after Service Pack 2 introduced the Windows Firewall: it was difficult to debug, and applications didn't prompt to open ports or have an API to do so. So many a lay person posted on forums "disable firewall".
Users don't care why their tools don't work; they don't understand why, or how to fix it. Technically complex SELinux audit tutorials are not helpful. There needs to be real, genuine attention to user experience: an almost tutorial-like CLI command. Something so simple anyone could safely make a program run. Whether that program is safe itself is another question, and users should be told that too.
See how easy that was?
The article doesn't do much better on that front, but it is in there at least.
sources: https://bugzilla.redhat.com/show_bug.cgi?id=2051505 https://lwn.net/Articles/883949/
- Stable: 5.16.6
- LTS (for Alpine Linux): 5.15.20
Alpine 3.15 main is currently at 5.15.16 and thus vulnerable.
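A minimal Python sketch of that comparison, using only the two fixed releases listed above. Any other kernel series comes back as unknown rather than guessed. The function names are made up, and the version number alone isn't conclusive since distro kernels often carry backported fixes.

```python
import re

# Fixed releases as listed above, keyed by (major, minor) series.
FIXED = {(5, 16): (5, 16, 6), (5, 15): (5, 15, 20)}


def parse_release(release):
    """Turn a kernel release string such as '5.15.16-0-lts' into (5, 15, 16)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(m[1]), int(m[2]), int(m[3] or 0))


def is_patched(release):
    """True/False for the series listed in FIXED, None for any other series."""
    version = parse_release(release)
    fixed = FIXED.get(version[:2])
    if fixed is None:
        return None
    return version >= fixed


if __name__ == "__main__":
    import platform

    print(platform.release(), "->", is_patched(platform.release()))
```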