Can someone with experience with both comment on how (the Linux port of) pledge compares to seccomp? Can it be considered a replacement at this point?
It seems like it could handle the last scenario described in the article fine, since it allows setting granular rwcx permissions on individual paths.
It also works well if the software developers document what syscalls they rely on and what permissions they need.
Retrofitting something like pledge (or seccomp) into an existing application that you didn't develop, and/or where you can't easily tell which syscalls are being called, is always a nightmare.
It doesn't really matter if it's pledge or seccomp at that point (although undoubtedly seccomp is far harder to make use of). If you're doing this kind of security by retroactive whitelist, you're going to have trouble making it work. It's going to take time and effort to implement.
Quite the contrary. If the software in question has been written in a remotely sane way, adding some basic pledge restrictions is a matter of adding one line: pledge("stdio rpath whatever you need", NULL) - it usually goes somewhere in main, after setup() but before while(!quit).
You can usually figure out the permission set within a few attempts, even without a very good understanding of the internals, as most (sane) programs will do only a couple of things: an httpd needs to accept connections, read static files, write logs, etc; a window manager needs to talk to X11, open font files, etc; of course there are also complex beasts like Chrome but that one has been done as well.
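A rough sketch of where that one line sits, written in Python via ctypes. This is an assumption for illustration only: Python has no standard pledge binding, `try_pledge`, `pledge_available` and `main` are hypothetical names, and the promise string is just an example. On a stock Linux libc there is no `pledge` symbol, so the call is skipped and `try_pledge` returns False.

```python
import ctypes
import os


def pledge_available() -> bool:
    """True if the C library loaded into this process exports pledge()."""
    return hasattr(ctypes.CDLL(None), "pledge")


def try_pledge(promises: str) -> bool:
    """Call pledge(promises, NULL) if libc has it; report whether it was applied."""
    libc = ctypes.CDLL(None, use_errno=True)
    if not hasattr(libc, "pledge"):
        return False  # e.g. stock Linux libc: nothing to call
    libc.pledge.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
    libc.pledge.restype = ctypes.c_int
    if libc.pledge(promises.encode(), None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return True


def main() -> None:
    # setup(): open config files, sockets, logs ... (elided, hypothetical)
    try_pledge("stdio rpath")  # the one line; the promise list is what you iterate on
    # while not quit: ... the main loop then runs with the reduced promise set
```

If the promise set is too narrow, the program is killed at the first disallowed call, which is what makes the "few attempts" loop workable.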
The *real* challenge is breaking up a complex program (e.g. a streaming music player) into separate processes that are concerned with just one or two things, e.g. separate process to make requests over the network, a separate one to decode media, another to maintain an on-disk cache, and so on. Placing restrictions on these subprocesses is the easy part; figuring out where to draw these lines is what's hard.
Commercial tools have had this for a long time, even automatic profiling. Either explicitly profile during a test stage, which is best, or profile-on-first-observation.
In the full automatic mode, which is not optimal but is least effort, any operation performed in the first XX minutes/hours/days is considered 'allowed behavior' and anything after that is denied. Then it will either enforce or 'wait-to-enforce', where enforcement mode only turns on if there are no policy violations in the next XX configurable units of time.
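To make the two modes concrete, here is a toy model of profile-on-first-observation with wait-to-enforce. It is a sketch, not how any particular commercial tool works: the class name, its fields, and the rule that a violation during the waiting period restarts the quiet timer are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class LearningPolicy:
    """Toy model: learn for a while, then enforce after a violation-free period."""

    learn_seconds: float  # length of the initial learning window
    quiet_seconds: float  # violation-free time required before enforcing (0 = enforce at once)
    start: float = 0.0    # timestamp at which observation began
    allowed: Set[str] = field(default_factory=set)
    last_violation: Optional[float] = None

    def enforcing(self, now: float) -> bool:
        """True once the learning window is over and the quiet period has elapsed."""
        learn_end = self.start + self.learn_seconds
        if now < learn_end:
            return False
        quiet_since = learn_end if self.last_violation is None else self.last_violation
        return now - quiet_since >= self.quiet_seconds

    def check(self, op: str, now: float) -> bool:
        """Return True if operation `op` is permitted at time `now`."""
        if now - self.start < self.learn_seconds:
            self.allowed.add(op)  # learning: whatever is seen becomes allowed behavior
            return True
        if op in self.allowed:
            return True
        if self.enforcing(now):
            return False          # enforcing: unknown operations are denied
        self.last_violation = now  # waiting: permit it, but restart the quiet timer
        return True
```

With `LearningPolicy(learn_seconds=60, quiet_seconds=30)`, an unknown operation at t=70 is let through but pushes the start of enforcement back to t=100.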
Seccomp and seccomp-bpf are indeed way too limiting. They weren't really designed for end app developers who are, imo, the ones that should be dictating the policy. The lack of pointer dereferencing (a filter sees a pointer argument only as a raw address, so it can't inspect something like a path string) makes it really difficult for application-level developers to write policies.
The promises arg in pledge (https://man.openbsd.org/pledge.2) does a decent job of grouping related calls together, but I think there is a ton of room to make all of this a lot better than it is today.
Did you consider logging the syscalls invoked during normal usage with 'strace --output=/some/dir -f ...'? This + grep + uniq should make it really simple.
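The grep + uniq step could look roughly like this in Python. It assumes the default strace line layout with a leading pid (as written by -f to an output file) and no timestamp options; `SAMPLE` is a hand-written excerpt in that style, not real output, and `syscalls_seen` is a hypothetical helper name.

```python
import re
from collections import Counter

# A syscall line looks like `1234 openat(AT_FDCWD, ...) = 3`; the pid prefix is optional.
# Signal lines (`--- SIGCHLD ... ---`), exit lines (`+++ exited ... +++`) and
# `<... read resumed>` continuations do not match, so each call is counted once.
CALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

SAMPLE = """\
1234 openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
1234 read(3, "host\\n", 131072) = 5
1234 write(1, "host\\n", 5) = 5
1235 read(0,  <unfinished ...>
1234 close(3) = 0
1235 <... read resumed>"", 4096) = 0
1234 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1235} ---
1234 +++ exited with 0 +++
"""


def syscalls_seen(lines):
    """Count how often each syscall name appears in strace output lines."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for name, n in sorted(syscalls_seen(SAMPLE.splitlines()).items()):
        print(f"{n:4d} {name}")
```

The resulting list only covers the code paths exercised while tracing, so rarely hit paths (error handling, log rotation) can still be missing from it.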
It works well with OpenBSD because it's standard, and most if not all OpenBSD packages make use of it.
Though maybe I'm misunderstanding how Linux pledge works. I'm only familiar with the OpenBSD usage of it.
(Our python nsjail config for instance: https://github.com/windmill-labs/windmill/blob/main/backend/...)
Irrelevant nit: .proto files are protobuf definition files (like this one: https://github.com/google/nsjail/blob/master/config.proto). A text representation of specific protobuf contents typically uses (as per man clang-format) the extension .textpb, .pb.txt, or .textproto. I use .config for the examples distributed with nsjail, but it's licentia poetica :)
https://protobuf.dev/reference/protobuf/textformat-spec/#tex...
Funnily enough, we've gone the reverse path for sandboxing LLM-generated code at louie.ai / Graphistry. We started with container isolation, with careful enablement of network, volume, compute, etc. first, and are only now adding nsjail to the runners within the container as an extra defense layer.
The negative space is interesting too. We initially explored alternatives like wasm (too slow and underpowered for our generated python GPU analytics workloads) and firecracker vm (too unwieldy and unportable for our small team). As we do more k8s and enable more interactive data viz customization + web-scale static serving, would love to revisit both.
On which note, we have a bit of budget for someone to help harden the nsjail layer, if of interest!
Liquid Metal from Weaveworks seems interesting but I don't even know where I would start.
[1] https://github.com/google/nsjail [2] https://github.com/containers/bubblewrap
It seems like something is broken, and we are all patching things up piecemeal.
I believe the next generation looks something like landlock (https://docs.kernel.org/userspace-api/landlock.html).
[1] https://www.freedesktop.org/software/systemd/man/latest/syst...