
0 points · kro · 1d ago · 0 comments

CAP_NET_ADMIN or CAP_SYS_ADMIN is required for this, so it would be "not as bad" as the others.


kam · 1d ago

Also "The page pool is only created on a real ZCRX-capable NIC (mlx5 ConnectX-6+, Intel E800, NFP)"

t0mas88 · 1d ago

It could work for container escape?

kro (OP) · 1d ago

Containers, even when running as root, are often stripped of these capabilities unless started with --privileged.

nyrikki · 15h ago

It is a minimal improvement due to the introduction of user namespaces and the fallout from local team convenience for Docker and thus OCI.

It is very important that you realize that any capability is a slice of superuser privileges, and there are no implicit protections, only explicit additional constraints that restrict it in reference to root.

Look at the bounding set for a normal user on a fresh install of a RHEL- or Debian-based system:

    $ grep ^Cap /proc/$$/status
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 000001ffffffffff

Note how trivial it is to gain all of those capabilities:

    $ podman unshare
    # grep ^Cap /proc/$$/status
    CapInh: 0000000000000000
    CapPrm: 000001ffffffffff
    CapEff: 000001ffffffffff
    CapBnd: 000001ffffffffff
    CapAmb: 0000000000000000
    # capsh --decode=000001ffffffffff
    0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore

The capabilities(7)[0] man page will help you with all of those.
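For readers without capsh at hand, the decoding step above can be sketched in a few lines of Python. This is only an illustration: the bit-to-name table is copied from the capsh output quoted above (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore), and the helper names `parse_cap_lines` and `decode_caps`, as well as the `unknown_bit_N` label for bits beyond the table, are made up for this example.

```python
from typing import Dict, List

# Capability names indexed by bit number, in the order capsh printed them above.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def parse_cap_lines(status_text: str) -> Dict[str, int]:
    """Extract the Cap* fields of a /proc/<pid>/status dump as integers."""
    caps = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            key, _, value = line.partition(":")
            caps[key] = int(value.strip(), 16)  # the kernel prints hex without 0x
    return caps


def decode_caps(mask: int) -> List[str]:
    """Return the capability names whose bits are set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                # Label invented for this sketch; newer kernels may define more bits.
                names.append(f"unknown_bit_{bit}")
        bit += 1
    return names


# The status lines shown above, after `podman unshare`.
sample = """CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
"""

caps = parse_cap_lines(sample)
print(len(decode_caps(caps["CapEff"])), "effective capabilities")
print("cap_net_admin" in decode_caps(caps["CapEff"]))
```

To inspect a live process instead of the pasted sample, the same function could be fed the contents of /proc/self/status on a Linux system.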

But capabilities are just per-thread attributes that split superuser (root) rights into vertical slices.

True, if a mechanism chooses to do additional tests based on credentials(7)[1], you can run with those elevated privileges within a lower bound, but that requires explicit coding.

Add in that LSMs suffer from both a lack of resources and upstream teams that won't provide guidance or are challenging to work with, and there are literally a hundred commands you can either abuse or just LD_PRELOAD into to get unrestricted userns, which lets you get around basic controls on clone()/unshare() that may be implemented.

    $ grep -ir "userns," /etc/apparmor.d/ | wc -l
    100

With AppArmor, every single browser (firefox, chrome, msedge, etc.) as well as busybox, slack, steam, visual studio, and more all have unrestricted user namespaces and the ability to gain the FULL set of capabilities in the bounding set.

If you run `busybox` on a Debian system, note that it includes nsenter and unshare, so you can't mask those, and yet busybox itself is unconstrained with elevated privileges.

TL;DR: don't assume that any capability check is in itself a gate, as there are so many ways, even for the user nobody, to gain capabilities.

[0] https://man7.org/linux/man-pages/man7/capabilities.7.html
[1] https://man7.org/linux/man-pages/man7/credentials.7.html

cyphar · 2h ago

1. The privilege check in question here is capable(CAP_NET_ADMIN), so it doesn't work in user namespaces.

2. Most sandboxes (including Docker and Podman) disable creating unprivileged user namespaces inside them via seccomp. In this mode, you end up with a more secure setup than requiring a privileged process to spawn containers (for one, it massively reduces the risk of confused deputy attacks against container runtimes). You can also restrict it with ucounts (as rough of a system as that is).

3. The kernel provides this facility and the feature was added back in early 2013 (before Docker existed and long before they added user namespace support, let alone rootless containers), so I don't understand why you think this is somehow the fault of OCI? We're just making something useful out of existing kernel infrastructure. Folks have asked the kernel to provide a knob to disable unprivileged user namespaces but the maintainer has refused to do so for years (the best you get is ucounts and seccomp). I would also prefer to have such a knob (or even adding a separate ucount with configurable per-user limits) but it's not up to me.

(Disclaimer: I implemented rootless containers for runc back in the day and work on OCI, so I do have some bias here.)
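Point 1 above rests on the difference between a check against the initial user namespace, capable(), and a check against a specific namespace, ns_capable(). The toy model below is a sketch of that distinction only, loosely following the namespace walk the kernel performs; it is not kernel code. The classes `UserNS` and `Cred`, the string capability names, and the simplified rules are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(eq=False)
class UserNS:
    """Toy user namespace: who created it and which namespace it lives in."""
    owner_uid: int
    parent: Optional["UserNS"] = None

    @property
    def level(self) -> int:
        return 0 if self.parent is None else self.parent.level + 1


@dataclass
class Cred:
    """Toy credentials: kernel-side uid, home namespace, effective caps."""
    euid: int
    user_ns: UserNS
    effective: FrozenSet[str]


INIT_USER_NS = UserNS(owner_uid=0)


def ns_capable(target: UserNS, cred: Cred, cap: str) -> bool:
    """Does cred hold cap with respect to the namespace target?"""
    ns = target
    while ns is not None:
        if ns is cred.user_ns:
            # Same namespace: the effective set decides.
            return cap in cred.effective
        if ns.level <= cred.user_ns.level:
            # Target is not below the caller's namespace: no privilege.
            return False
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # The creator of a child namespace holds all caps over it.
            return True
        ns = ns.parent
    return False


def capable(cred: Cred, cap: str) -> bool:
    """A capable()-style check is always made against the initial namespace."""
    return ns_capable(INIT_USER_NS, cred, cap)


# An unprivileged user (uid 1000) creates a user namespace and becomes
# "root" inside it, with a full effective set there.
child_ns = UserNS(owner_uid=1000, parent=INIT_USER_NS)
inside = Cred(euid=1000, user_ns=child_ns, effective=frozenset({"cap_net_admin"}))
real_root = Cred(euid=0, user_ns=INIT_USER_NS, effective=frozenset({"cap_net_admin"}))

print(ns_capable(child_ns, inside, "cap_net_admin"))  # check scoped to the child ns
print(capable(inside, "cap_net_admin"))               # check against the init ns
print(capable(real_root, "cap_net_admin"))
```

In this model the namespaced "root" passes the check scoped to its own namespace but fails the one made against the initial namespace, which is the behaviour point 1 describes for a capable(CAP_NET_ADMIN) gate.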
