Server-side sandboxing: Containers and seccomp

(figma.com)

226 points · emilsjolander · 2y ago · 44 comments

38 comments · 12 top-level

imiric2y ago· 8 in thread

I haven't used seccomp, but have recently been playing around with the Linux pledge port[1]. It has a very friendly UI, but I still struggled with allowing some complex apps to run at all, because of the sheer amount of syscalls and devices they required. Digging through a mountain of strace output is tedious...

Can someone with experience with both comment on how (the Linux port of) pledge compares to seccomp? Can it be considered a replacement at this point?

It seems like it could handle the last scenario described in the article fine, since it allows setting granular rwcx permissions on individual paths.

[1]: https://justine.lol/pledge/

Arch-TK2y ago

Pledge works well if the software developers implement it on their own application.

It also works well if the software developers document what syscalls they rely on and what permissions they need.

When it comes to retrofitting something like pledge (or seccomp) into an existing application when you've not developed it and/or can't easily tell what syscalls are being called then it's always a nightmare.

It doesn't really matter if it's pledge or seccomp at that point (although undoubtedly seccomp is far harder to make use of), if you're doing this kind of security by retroactive whitelist, you're going to have trouble making it work. It's going to take time and effort to implement.

rollcat2y ago

> When it comes to retrofitting something like pledge (or seccomp) into an existing application when you've not developed it and/or can't easily tell what syscalls are being called then it's always a nightmare.

Quite the contrary. If the software in question has been written in a remotely sane way, adding some basic pledge restrictions is a matter of adding one line: pledge("stdio rpath whatever you need", NULL) - it usually goes somewhere in main, after setup() but before while(!quit).

You can usually figure out the permission set within a few attempts, even without a very good understanding of the internals, as most (sane) programs will do only a couple of things: an httpd needs to accept connections, read static files, write logs, etc; a window manager needs to talk to X11, open font files, etc; of course there are also complex beasts like Chrome but that one has been done as well.

The *real* challenge is breaking up a complex program (e.g. a streaming music player) into separate processes that are each concerned with just one or two things, e.g. a separate process to make requests over the network, another to decode media, another to maintain an on-disk cache, and so on. Placing restrictions on these subprocesses is the easy part; figuring out where to draw these lines is what's hard.

https://man.openbsd.org/pledge.2

heroprotagonist2y ago

There's no tracing tool to build policy with pledge? Seems like an obvious area to add functionality if it doesn't exist.

Commercial tools have had it for a long time.. even automatic profiling. Either explicitly profile during a test stage, which is best, or profile-on-first-observation.

In the fully automatic mode, which is not optimal but is the least effort, any operation performed in the first XX minutes/hours/days is considered 'allowed behavior' and anything after that is denied. Then it will either enforce or 'wait-to-enforce', where enforcement mode only turns on if there are no policy violations in the next XX configurable units of time.
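To make the automatic mode concrete, here is a minimal Python sketch of the learn, wait-to-enforce, enforce sequence. Everything in it is hypothetical: `LearningPolicy`, its parameters, and the choice to let a violation during the waiting phase extend the allowlist and restart the quiet window are assumptions for illustration, not how any particular commercial tool behaves.

```python
class LearningPolicy:
    """Toy profile-on-first-observation policy (illustrative only)."""

    def __init__(self, learn_seconds, quiet_seconds=0.0):
        self.learn_seconds = learn_seconds    # length of the learning phase
        self.quiet_seconds = quiet_seconds    # violation-free time needed before enforcing
        self.allowed = set()                  # operations seen so far
        self.started_at = None                # time of the first observed operation
        self.quiet_since = None               # start of the current quiet window

    def check(self, operation, now):
        """Return True if `operation` is permitted at time `now` (in seconds)."""
        if self.started_at is None:
            self.started_at = now

        # Phase 1: learning. Everything observed becomes allowed behavior.
        if now - self.started_at < self.learn_seconds:
            self.allowed.add(operation)
            return True

        if self.quiet_since is None:
            self.quiet_since = self.started_at + self.learn_seconds

        # Phase 2: wait-to-enforce. Assumption: a violation is still permitted,
        # gets added to the policy, and restarts the quiet window.
        if now - self.quiet_since < self.quiet_seconds:
            if operation not in self.allowed:
                self.allowed.add(operation)
                self.quiet_since = now
            return True

        # Phase 3: enforcing. Only previously observed operations pass.
        return operation in self.allowed


if __name__ == "__main__":
    policy = LearningPolicy(learn_seconds=10, quiet_seconds=5)
    for op, t in [("read", 0), ("write", 9), ("openat", 12), ("read", 22), ("execve", 23)]:
        print(t, op, "allowed" if policy.check(op, t) else "denied")
```

With `quiet_seconds=0` this collapses to the plain "learn for a while, then deny everything new" mode.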

saagarjha2y ago

And it’s not just your application, either: it’s what you depend on. Surely your command line application that works with a handful of files is fine with read/write? Your libc might be using something else like iovec or pread/pwrite under the hood.

eyberg2y ago

We too ended up adding pledge and unveil to Nanos.

Seccomp and seccomp-bpf are indeed way too limiting. They weren't really designed for end app developers who are, imo, the ones that should be dictating the policy. The lack of pointer dereferencing makes it really difficult for application-level developers to create policies.

The promises arg in pledge, https://man.openbsd.org/pledge.2 , does a decent job of grouping related calls together but I think there is a ton of room to make all of this a lot better than it is today.

ShowalkKama2y ago

>Digging through a mountain of strace output is tedious

Did you consider logging the syscalls invoked during normal usage with 'strace --output=/some/file -f ...'? This + grep + uniq should make it really simple.
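A rough sketch of that grep + uniq step in Python, assuming the log uses the usual strace layout (an optional PID prefix, then `name(args) = result`). `count_syscalls` is a hypothetical helper and the sample lines are made up for illustration; real logs contain more line shapes than this handles.

```python
import re
from collections import Counter

# Syscall name at the start of a line, optionally preceded by a PID
# (as in a log written with -f and an output file) or by "[pid N]".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Count how often each syscall name appears in strace output lines.

    Lines that do not start a call ("<... read resumed>", signal lines,
    exit lines) are skipped, so an interrupted call is counted once.
    """
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample in the shape of a multi-process strace log.
SAMPLE = """\
1234 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
1234 read(3, "data"..., 832) = 832
1234 clone(child_stack=NULL, flags=CLONE_CHILD_SETTID|SIGCHLD) = 1235
1235 read(0,  <unfinished ...>
1234 pread64(3, "data"..., 784, 64) = 784
1235 <... read resumed>"x", 1) = 1
1235 +++ exited with 0 +++
1234 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1235} ---
""".splitlines()

if __name__ == "__main__":
    for name, n in sorted(count_syscalls(SAMPLE).items()):
        print(f"{n:4d} {name}")
```

To run it on a real log, pass an open file object instead of `SAMPLE`. The sorted names are a starting point for an allowlist, not a finished policy, since rarely taken code paths may not show up in one trace.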

tedunangst2y ago

Despite the availability of linux pledge, and frequent comments mentioning its existence, I'm not aware of many people using it.

jmprspret2y ago

I think it would acquire more usage if it was part of mainline Linux distros. As far as I can tell people must feel like this is some kind of optional, nonstandard thing.

It works well with OpenBSD because it's standard, and most if not all OpenBSD packages make use of it.

Though maybe I'm misunderstanding how Linux pledge works. I'm only familiar with the OpenBSD usage of it.

rubenfiszel2y ago· 6 in thread

If you are looking to self-host a scalable backend that runs arbitrary code in Python/TypeScript/Bash/Go with optional nsjail sandboxing, as Figma does, nsjail is what we use as the isolation layer at https://windmill.dev (open-source alternative to Retool/Airplane).

(Our python nsjail config for instance: https://github.com/windmill-labs/windmill/blob/main/backend/...)

jagrsw2y ago

nsjail author here (the original one, as the tool is also maintained by others), good job!

Irrelevant nit: .proto files are protobuf definition files (like this one: https://github.com/google/nsjail/blob/master/config.proto); a text representation of a specific protobuf's contents is typically called (as per man clang-format) .textpb, .pb.txt or .textproto - I use .config for examples distributed with nsjail, but it's licentia poetica :)

rubenfiszel2y ago

The wonders of HN strike again. Thank you for this amazing piece of technology that is nsjail. Nsjail is very core to our security, our multitenant setup would be so slow without it, and I think we're one of the applications that leverages it in a way that showcases nsjail to its full extent (as in, we beat containers/firecracker cold starts by a fair margin while keeping most of their benefits). That's one of the reasons we're an order of magnitude more efficient than Airplane, which uses Fargate under the hood. I would love to chat if you have time; my email is in my profile.

xyzzy_plugh2y ago

Oddly enough the canonical extension seems to now be none of those but .txtpb:

https://protobuf.dev/reference/protobuf/textformat-spec/#tex...

shooshx2y ago

Why not just use json?

freeney2y ago

Running arbitrary user code inside a jail that doesn’t isolate networking might not be enough isolation. Also, kernel mount namespace binds into the jailed env increase the attack surface. Great for some use-cases, but multi-tenant workloads might need a tighter setup? I'm definitely going to give Windmill a try. It looks really cool!

remram2y ago

Wow, this nsjail setup is now part of your opensource version? Last I tried Windmill there was no isolation mechanism for scripts on the free version.

lmeyerov2y ago· 3 in thread

Good intro. I'd be curious how they do the syscall tracing, eg, strace logs as part of CI?

Funnily enough, we've gone the reverse path for LLM AI-generated code sandboxing for louie.ai / Graphistry. We started with container isolation with careful network, volume, compute etc. enablement first, and are only now adding nsjail to the runners within the container as an extra defense layer.

The negative space is interesting too. We initially explored alternatives like wasm (too slow and underpowered for our generated python GPU analytics workloads) and firecracker vm (too unwieldy and unportable for our small team). As we do more k8s and enable more interactive data viz customization + web-scale static serving, would love to revisit both.

On which note, we have a bit of budget for someone to help harden the nsjail layer, if of interest!

hhh2y ago

I have yet to find a firecracker-style thing for k8s that is simple to deploy. Firekube seemed interesting, but is archived...

Liquid Metal from Weaveworks seems interesting but I don't even know where I would start.

flurie2y ago

Virtink[1] has been reasonably stable as long as you're okay with Cloud Hypervisor instead.

[1] https://github.com/smartxworks/virtink

remram2y ago

Kata just released a new version; it is the only thing that I've found easy to set up with k8s... though my experience running Docker-in-Kata hasn't been very good.

bdahz2y ago· 2 in thread

So what's the difference between nsjail[1] and bubblewrap[2]?

[1] https://github.com/google/nsjail [2] https://github.com/containers/bubblewrap

xyzzy_plugh2y ago

bubblewrap aims to be reasonably secure by default but leaves sleeping soundly at night as an exercise to the reader. It's not exhaustive. It's more of a blast radius/convenience tool. Conversely nsjail aspires to facilitate sleeping soundly out of the box, with security as the primary motivating factor.

ximm2y ago

I don't have extensive experience with nsjail, but from reading the docs it seems to me like nsjail covers namespaces, cgroups and virtual networking, while bwrap only covers namespaces. On the other hand, bwrap is deliberately kept simple because it is SUID.

IcyWindows2y ago· 2 in thread

Are operating systems failing at their jobs if one can't run independent workloads on them anymore?

It seems like something is broken, and we are all patching things up piecemeal.

saagarjha2y ago

It’s actually not that hard to make OSes that run completely independent workloads. The problem is that this is not useful.

dboreham2y ago

Yes and yes.

sargun2y ago· 2 in thread

Seccomp BPF is great. There were some recent issues due to io_uring and extensible syscalls, but I believe for now, those issues are avoidable.

I believe the next generation looks something like landlock (https://docs.kernel.org/userspace-api/landlock.html).

johnkoepi2y ago

I love the ideas behind Landlock, but setting aside the issues with the io_uring API, I don't fully see what the struggle is currently. Seccomp combined with AppArmor or SELinux is nowadays enough even for nested rootless containers, even nested inside standard runc setups. Both AppArmor and seccomp profiles are stackable. If you don't need to generate unique profiles for each container you should be fine...

zxcvgm2y ago· 1 in thread

It's pretty easy to apply seccomp to a process using systemd by adding SystemCallFilter= in its unit file. There's a reasonable set of permitted syscalls for general system processes, aptly called `@system-service`, but you can tweak that to suit your needs [1]. I generally use this, among other settings, to further lock down system services [2].

[1] https://www.freedesktop.org/software/systemd/man/latest/syst...

[2] https://www.redhat.com/sysadmin/mastering-systemd

CAP_NET_ADMIN2y ago

Yep, can recommend systemd in this case, really easy to apply basic hardening to services that just works.

nosefrog2y ago· 1 in thread

We had seccomp containers at Dropbox, and I remember Max Serrano helping me set that up with ReactServer :) Talented engineer, though I do remember that the jails were kind of a maintenance nightmare for the security team.

johnkoepi2y ago

"seccomp containers" sounds weird... like what is a container in Linux anyway :D

declan_roberts2y ago· 1 in thread

seccomp is heaps better than SELinux, but still too complicated to be using in everyday production unless you're truly on the "refine and secure" path or dealing with high-stakes sandboxing.

johnkoepi2y ago

These are two very different things. Everyone uses seccomp when they don't have an AppArmor-only profile, and sometimes they even use both.

minitoar2y ago

Nice, love seccomp though as with all things security it can be very fiddly.

SeriousM2y ago

Is there an isolation method close in functionality to nsjail but for .NET code? I know I can protect my AppDomain, but how do I protect the system/network from rogue .NET code?

baggy_trough2y ago

Check out systemd-nspawn. Built in and works great!
