
New Linux vulnerability affecting cgroups: can containers escape?

(unit42.paloaltonetworks.com)

123 points · zelivans · 4y ago · 82 comments


xxpor4y ago· 25 in thread

Back in the day, people insisted that containers were not security boundaries and should not be treated as such. They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary. So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?

The only other design I'm familiar with that sort of comes close is MicroVMs. Those have the downside of actually needing to run a VM though, and most (all?) cloud providers don't allow nested virtualization so you're stuck running on an enormous bare metal box.

I don't think the industry is moving towards deepening dependence on container/jail interfaces for multitenant workloads --- virtualization has gotten incredibly cheap. So these issues are mostly problems for internal data center segregation and blast radius reduction. It's not nothing, they're important security problems, but unless you're doing something dubious, they shouldn't be existentially important.

There are AWS and GCP instance types with nested virtualization that'll let you run Firecracker. Digital Ocean apparently supports it everywhere.

paulfurtado4y ago

Slightly pedantic: ec2 doesn't actually support nested virtualization on any instance type I know of, but does have baremetal instance types that support virtualization.

The reason I mention this is because, sadly, baremetal instance types are only ever the largest size of a given family which is cost prohibitive for most users. And even if cost isn't an issue, they take much much longer to start (like 10-20+ minutes) and they actually fail to start far too frequently. It's really a shame that all instance types other than baremetal have virtualization extensions disabled, otherwise we'd be operating far more workloads in firecracker or kata. We operate huge kubernetes clusters so the cost is roughly the same whether it's fewer big instances or more smaller instances, but those startup times and reliability are terrible for autoscaling.

Please, AWS, bring nested virtualization to all nitro instance types!

stormbrew4y ago

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?

Yes, because systems that are designed with these kinds of security boundaries in mind already look like containers -- they're a natural match to actual capability-based systems like, for example, plan9's.

The problem here stems entirely from trying to keep these globally-overriding capabilities like CAP_SYS_ADMIN and CAP_DAC_OVERRIDE while also allowing users to create their own namespaces. All these CVEs weren't things as long as only root could create new userns', and now that normal users can, all these areas where things weren't checked are coming out of the woodwork.

But a ground up capability-based system avoids this kind of problem by simply making it impossible to elevate to a privilege level like 'root' on POSIX systems, and so namespacing within those systems is incredibly natural to the point that it didn't really get a name (containers) until one was needed for linux' cognitive dissonance around the idea.

staticassertion4y ago

You're confusing capabilities systems. Linux capabilities are not "capabilities", they're a misnomer. They're just groupings of privileges.

Here is what capabilities are.

https://en.wikipedia.org/wiki/Capability-based_security

I don't think what you're advocating for makes a ton of sense tbh. You're basically saying "just make it impossible to privesc", which, yeah, that would be nice... but it's not like you can just do that.

I think your point is more that least privilege should be more common - that way exploits have less impact. I agree. That said, Linux Capabilities are extremely coarse, and most container escapes involve owning the Kernel, which from a real Capabilities model would be the trusted broker of capabilities to begin with.

What do you think about Fuchsia ? It's fully capability-based: https://fuchsia.dev/fuchsia-src/concepts/components/v2/capab...

I'm not sure a year has gone by without a vulnerability that breaks shared-kernel isolation in reasonable configurations. Nobody was going to DAC or MAC out `waitid`, but `waitid` for a time would take a kernel address for its siginfo_t parameter.

staticassertion4y ago

It's not a binary thing. I would say something is a boundary if it requires an additional vulnerability to bypass. Containers these days fit that model. The nuance is how strong of a boundary it is.

Containers rely on the Linux kernel. The Linux kernel is shit, in terms of security, for a number of reasons. So all one requires is to own the kernel, and there are a lot of ways to do that. Containers block some system calls and can lower attack surface to a degree, which is great - I think it's a huge win that containers are so popular and, finally, some degree of isolation is widespread.

We'll be stuck in retroactive security mode until developers care to change that, especially ones with influence like kernel maintainers.

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

Absolutely not. We'd have ended up with something like Firecracker or GVisor. The issues with containers are fundamental to the concept of having a shared Linux kernel, which is basically what makes a container a container.

> If not, is there any realistic way to go from where we are to where we should be?

Use Firecracker or GVisor.

> Those have the downside of actually needing to run a VM though

I think at this point VMs are not that big of a deal. It's clearly good enough for the vast majority of people who are running on the cloud.

> don't allow nested virtualization so you're stuck running on an enormous bare metal box.

This part is a bummer.

The other option though is to just not care if your OS gets owned. Split your services up, move capabilities across other boundaries like mTLS.

gvisor doesn't require nested virtualization, right? If you're willing to take a tenable user-mode-Linux performance hit, you should be able to run it on anything?

throwawayboise4y ago

I'll quote Theo deRaadt here, he was talking about virtualization but I would guess the same could be said of containers:

You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes

richardfey4y ago

Who was he referring to?

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

It might have looked like FreeBSD jails or Illumos / Solaris Zones. Both of which are containers designed as a security boundary from the start.

I'm here to push back on the fabled security powers of ground-up security-focused shared-kernel isolation. People love to bring up Zones and Jails in these conversations, presumably since both are much more coherent designs than Linux namespaces, MAC, BPF and cgroups, which are now comparably (if not more) featureful, but shambolic and hard to reason about. But none of these systems are sufficient for multitenant isolation. It would not be OK to rely on Zones for a major multitenant compute workload.

pjmlp4y ago

Or HP-UX vaults grown out of Tru64.

nextaccountic4y ago

> Back in the day, people insisted that containers were not security boundaries and should not be treated as such. They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

> However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary.

It has to be secure. Browsers are using pretty much the same technologies (seccomp-bpf, cgroups, namespaces, etc) to tightly sandbox Javascript from websites. Browsers run wildly untrusted code from all over the web, and are expected to pass through many forms of malware, not letting them escape the sandbox.

If containers can't be made secure, we have bigger problems.

> So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

No! Linux and Unix APIs are a mess of patchworks. They are pretty much insecure by default, with rare exceptions.

We could make a new platform with a saner API and make it run on top of Linux, and write new backend services targeting it. I think WASI may just be that. The only problem is that wasm has some overhead / doesn't have access to all CPU features.

_8j504y ago

I think Unikernel VMs are the future. Build your app into One blob with no user/kernel space boundary that runs in a guest VM. No boot time or wasted memory/latency (context switch) issues.

That said, even VMs are best-effort security boundaries, so apparmor/selinux type restrictions put in place on the host should be the main hard security boundary IMO.

Good luck debugging that.

> They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

I disagree with that idea. The actual threat may be as limited in capabilities as a standard bug. Let's say you have a problem with your webapp where you can read an arbitrary file, but nothing else. Containers are a perfect protection in this case if you want to isolate the app from any other services running on the host (monitoring, provisioning, etc.).

There's no perfection and defence in depth is what we need to use everywhere. Unless you can break through all layers at the same time, imperfect layers are a valid improvement. See how many default protections you have to turn off to even make this bug viable.

GCP allows nested virtualization:

https://cloud.google.com/compute/docs/instances/nested-virtu...

At least some of the Azure series support nested virtualization. See https://docs.microsoft.com/en-us/azure/virtual-machines/dv4-.... There are a lot of series and I don’t know the breakdown but I would expect dsv4 to be one of the more widely used options because it is for generic CPU workloads.

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

Yes. The only difference is that the Linux-based systems and tools, as opposed to Zones or Jails, were the first pivot to a developer-focused view rather than that of the sysadmin. This utility is why containers gained critical mass, not because the security-focused foundations of other implementations were an impediment.

pjmlp4y ago

When developer productivity comes before sysadmins, that is when security goes south, as history has proven on desktop systems.

With Spectre we discovered that not even VMs are adequate security boundaries.

My opinion: want security? Separate (bare metal) machines. Period.

userbinator4y ago

...and in the spirit of the parent comment, Intel didn't intend for protected mode to be a security boundary either. The 286 and 386 programming manuals referred to the protections as a form of reducing the severity of bugs.

2OEH8eoCRo04y ago

How are they not a security boundary? Nearly everything is a security boundary using defense in depth no?

yrro4y ago

Security boundaries in Linux are UIDs/GIDs, capabilities, SELinux domains, and others. These can be applied to processes regardless of whether the process runs in a container.

i.e. root inside a container is root on the host; the container itself doesn't help that. But other security features, that are applied to the processes within the container when the container is created, might.
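To make the "root inside is root outside" point checkable: a process's UID mapping is readable from /proc/<pid>/uid_map, where each line is "inside-start outside-start count". Below is a minimal Python sketch that parses that file and reports whether UID 0 maps to UID 0. The function names are invented for illustration, and "outside" here means the parent user namespace, which is the host's initial namespace only when namespaces are not nested.

```python
def parse_uid_map(text):
    """Parse /proc/<pid>/uid_map contents into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def root_maps_to_parent_root(entries):
    """Return True if UID 0 in this namespace is also UID 0 in the parent namespace."""
    for inside, outside, count in entries:
        # Does this range cover UID 0?
        if inside <= 0 < inside + count:
            return outside - inside == 0
    return False  # UID 0 is not mapped at all


if __name__ == "__main__":
    try:
        with open("/proc/self/uid_map") as f:
            entries = parse_uid_map(f.read())
        print("root maps to parent root:", root_maps_to_parent_root(entries))
    except OSError as exc:
        print("could not read uid_map:", exc)
```

Without a user namespace the file typically holds a single identity line (0 0 4294967295). A rootless setup would instead show UID 0 mapped to some unprivileged UID in the parent namespace.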

frabbit4y ago· 7 in thread

Important note on this:

"Fortunately, the default security hardenings in most container environments are enough to prevent container escape. Containers running with AppArmor or SELinux are protected. "

So, all that hard work on SELinux continues to pay off.

concerned_user4y ago

Sadly, many answers to questions related to selinux issues, or howto's start with: Disable selinux.

Things may have changed, but the last few times I looked, it was breathtakingly hard to a) identify if /when selinux is what's screwing you, then b) get selinux to stop it.

I really wanted an audit mode that could also say "this command will unlock the specific thing I just blocked".

That was a few years ago. Since then, I've turned off selinux whenever I'm getting screwed by some opaque process, stuff starts working, and closing it back down while leaving what I need open remains impossible black magic.

AaronFriel4y ago

I strongly believe that software that works for users is better than software that doesn't, and it's clear that for most lay folks, SELinux is software that doesn't work.

SELinux remains inscrutable and unusable to the lay person. Microsoft had the same problem with Windows XP, especially after Service Pack 2 introduced the Windows Firewall: it was difficult to debug, and applications didn't prompt to open ports or have an API to do so. So many a lay person posted on forums "disable firewall".

Users don't care why their tools don't work, they don't understand why or how to fix it. Technically complex SELinux audit tutorials are not helpful. There needs to be real, genuine attention to user experience: an almost tutorial-like CLI command, something so simple anyone could safely make a program run. Whether that program is safe itself is another question, and users should be told that too.

I'd argue the vast majority of Linux desktop users (already a small group) don't use SELinux. So naturally when trying to help someone using something we don't have experience with and don't find necessary, that advice becomes more prevalent.

yrro4y ago

Which is an excellent indicator that the following advice is bad.

2OEH8eoCRo04y ago

Why is the Linux community full of horrible advice?

you are also safe if you are not running (EDIT: inside) the container as root, which is a common security practice for containers nowadays.

jeffbee4y ago· 3 in thread

This style of writing sucks, and the abuse of the meaningless term "container" does nothing to clear it up. To reduce this CVE to one sentence: a process running in the top level control group, which has the ability to create user namespace, can take over the machine, because the kernel fails to check for CAP_SYS_ADMIN.

See how easy that was?
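For anyone wanting to check the CAP_SYS_ADMIN part of that sentence on a live process: the kernel exposes the effective capability set as a hex mask on the CapEff: line of /proc/<pid>/status, and CAP_SYS_ADMIN is bit 21 in linux/capability.h. A minimal Python sketch follows (function names are made up for illustration). The mask is relative to the process's own user namespace, so a hit inside a fresh userns does not mean the process holds the capability on the host.

```python
CAP_SYS_ADMIN = 21  # bit index, per linux/capability.h


def effective_caps(status_text):
    """Extract the effective capability mask from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """Return True if capability bit `cap` is set in `mask`."""
    return bool(mask & (1 << cap))


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            mask = effective_caps(f.read())
        print("CAP_SYS_ADMIN in effective set:", has_cap(mask, CAP_SYS_ADMIN))
    except (OSError, ValueError) as exc:
        print("could not determine capabilities:", exc)
```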

stormbrew4y ago

You kind of missed the key to the whole thing here, though, which is that users are able to create userns' now by default. This is really important to understanding this and the last few container escape CVEs.

The article doesn't do much better on that front, but it is in there at least.

jackpirate4y ago

Isn't the whole purpose of this style of writing to define terms like "top level control group" and "CAP_SYS_ADMIN" for those people who don't already understand what they mean?

The article doesn't do that. It throws around jargon without defining it, or defining it vaguely or inaccurately.

encryptluks24y ago· 2 in thread

Podman and other container tools are now using user namespaces by default. I think it is clear there are some extra precautions needed, but ultimately the goal with running rootless containers is to improve security.

Podman also works fine rootless and with cgroups2, double win.

MonaroVXR4y ago

Does it support docker-compose?

leephillips4y ago· 2 in thread

“Container” seems to be used throughout to mean “Docker container”. There are other types of containers.

AaronFriel4y ago

I think container escape is well understood by most to mean (for Linux) cgroups and/or the stack most folks use (containerd, Docker). It's a generic but useful term, like VM escape, even though there are many kinds of virtual machine managers and hypervisors.

stormbrew4y ago

the other reply alluded to this, but to make it explicit: nothing about this CVE requires docker and it looks like you should be able to do it with a few syscalls in any process starting with a call to unshare(), unless something else (like selinux) is getting in your way.
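Whether an unprivileged process can make that unshare() call at all depends partly on sysctls. A rough Python sketch of checking the two I'm aware of: /proc/sys/user/max_user_namespaces (mainline; 0 disables user namespaces) and /proc/sys/kernel/unprivileged_userns_clone (a Debian/Ubuntu-specific knob, absent on other kernels). The helper names are hypothetical. A True result only means these sysctls don't forbid it; SELinux, AppArmor or seccomp can still block the call.

```python
from pathlib import Path


def read_int(path):
    """Read an integer sysctl file, returning None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def unprivileged_userns_allowed(max_user_namespaces, userns_clone):
    """Decide from the two sysctl values; None means 'knob not present'."""
    if max_user_namespaces == 0:
        return False  # user namespaces disabled outright
    if userns_clone == 0:
        return False  # distro knob forbids unprivileged creation
    return True  # no sysctl-level restriction found


if __name__ == "__main__":
    allowed = unprivileged_userns_allowed(
        read_int("/proc/sys/user/max_user_namespaces"),
        read_int("/proc/sys/kernel/unprivileged_userns_clone"),
    )
    print("sysctls allow unprivileged user namespaces:", allowed)
```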

suifbwish4y ago· 1 in thread

Couldn’t you protect against this sort of thing by using disposable VMs to host the containers? Sure it would be an extra layer of resources but it would double the complexity of the attack required to breach the physical node.

yjftsjthsd-h4y ago

Correct on both counts; you can, and it hurts performance / resource use. There's also intermediate options like gvisor. In practice, the performance issues mean that most people don't bother.

norenh4y ago

Wow, all that long article and no mention of the versions affected. Fortunately it is mentioned in the redhat bugzilla for that CVE and there it states that it is fixed in stable kernel v5.16.6. I assume it is also fixed in the stable kernels released at the same time: 5.15.20, 5.10.97, and 5.4.177

sources: https://bugzilla.redhat.com/show_bug.cgi?id=2051505 https://lwn.net/Articles/883949/

throwaway9843934y ago

I consider cgroups an administrative layer, not a security layer. They're for keeping apps from accidentally blowing up the host, not to prevent them hacking it. If you want security with containers, use Firecracker.

fmajid4y ago

The article is maddening in its generic “update to the latest kernel” advice and not listing the specific kernel versions fixed. They are:

- Stable: 5.16.6

- LTS (for Alpine Linux): 5.15.20

Alpine 3.15 main is currently at 5.15.16 and thus vulnerable.
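A quick way to apply the version numbers quoted in this thread to a running box, as a Python sketch. The table uses 5.16.6 and 5.15.20 from above plus the 5.10.97 and 5.4.177 figures that norenh only assumed, so treat those two as unconfirmed. The function names are invented, and a distro kernel may carry a backported fix under an older version string, so "vulnerable" here really means "older than the upstream fix for that series".

```python
import platform
import re

# Patch level carrying the fix, per kernel series, as quoted in the thread.
# The 5.10 and 5.4 entries are an assumption made by a commenter.
FIXED_PATCH = {(5, 16): 6, (5, 15): 20, (5, 10): 97, (5, 4): 177}


def parse_version(release):
    """Turn '5.15.16-0-lts' into (5, 15, 16); return None if unparseable."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def status(release):
    """Return 'fixed', 'vulnerable', or 'unknown' for a kernel release string."""
    version = parse_version(release)
    if version is None:
        return "unknown"
    major, minor, patch = version
    fixed_patch = FIXED_PATCH.get((major, minor))
    if fixed_patch is None:
        return "unknown"  # series not covered by the table above
    return "fixed" if patch >= fixed_patch else "vulnerable"


if __name__ == "__main__":
    release = platform.release()
    print(release, "->", status(release))
```

With the Alpine example, status("5.15.16-0-lts") returns "vulnerable", matching the comment above.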
