Next thing you know someone will run as many as two whole "processes" in a container!
Having dispensed with that bit of bitter sarcasm: solving their local filesystem performance/security problems is great and all, but what I'd like to see for containers is to reuse an already-invented wheel, remote block devices, à la iSCSI and friends. I dream of getting there with Cloud Hypervisor or some such, where every container has a kernel that can transparently network-mount whatever it has the credentials to mount from whatever 'worker' node it happens to be running on.
This is my excited face. :|
There are other advantages — low fixed resource costs, global memory management and scheduling, no resource stranding, etc. — but the core intent of gVisor is to capture as many valuable semantics as possible (including the file system semantics) while adding a sufficiently hard security boundary.
I’m not saying moving the file system up into the sandbox (which is basically what a block device gives you) is bad, just that there are complex trade-offs. The gVisor root file system overlay is essentially that (the block device is a single sparse memfd, with metadata kept in memory), but applied only to the parts of the file system that are modified.
If your containers use many of the same base layers (e.g. the same Node or Python image), the code pages will be shared, as they would be shared with plain OS processes.
Running several processes in a container is the norm. First, you run with --init anyway, so there is a `tini` parent process inside. Then, Node workers and Java threads are pretty common.
Running several pieces of unrelated software in a container is less common, that's true.
Containers are a way to isolate processes better, and to package dependencies. You could otherwise be doing that with tools like selinux and dpkg, and by setting LD_nnn env variables. Containers just make it much easier.
I'm highly aware. The reason the word "process" is quoted in my highly down-voteable comment is the misuse of the term "process" by Docker et al. to mean "application." Google the "one process per container" mantra to see what I mean. Somehow the Docker crowd were oblivious to the 60+ year old concept of and terminology related to OS processes when they promulgated their guidance on how containers should be used.
I try not to indulge too many hang-ups in life, but that particular bit of damage is insufferable.
Whereas a simple 'we run everything in a VM' seems much simpler and less fragile.
'We run this process in a VM-like mode where Linux syscalls aren't allowed but instead we define a new syscall-like interface which goes to privileged host code' seems like a good compromise. But in this case, that host code should have special abilities to mmap files into the address space of the 'VM' to make IO fast and efficient.
One way to do this would be to use undefined instruction traps to enter a debugger, which could then implement a syscall-like API. That would make it portable to any OS, yet ultra fast.
To me this resolves a very narrow use case: running untrusted containers on trusted hosts. I imagine the main target users are people who want to offer a service like Fargate and run multiple customers on a single host. Why would they want to do that instead of separating customers with VMs? My suspicion is this has something to do with the increasing availability of very energy-efficient ARM servers that have hundreds of cores per socket. My impression is that traditional virtualisation on ARM is rarely used (I'm not sure why, as KVM supports it and ARM has had hw support for it since ARMv8.1). So "containers to the rescue".
Personally, I'd much rather the extra security that lets untrusted containers access the host's fs be implemented in the container runtime, not as a separate component. Or, for the "security issues" it addresses, perhaps even in the host's operating system?
Isn’t that exactly what the original gofer/RPC solution is? The gvisor container runtime operates in userland to ensure that compromises in the runtime don’t result in an immediate compromise of the system kernel.
But running in userland and intercepting syscalls that do IO always has significant performance implications, because all your IO now needs multiple copy operations to get into the reading process's address space: userland processes generally can't directly touch each other's address spaces (to ensure process isolation) without asking the kernel to come in and do all the heavy lifting.
So if you want fast local IO, you have to find a way to allow the untrusted processes in the container to make direct syscalls, so that you avoid all the additional high-latency hops in userland and let the kernel copy data directly into the target process's address space.
For the container runtime itself to magically provide direct host-fs access with native-level performance, it would have to operate as part of the kernel. Which is exactly how normal containers work, comes with a whole load of security risks, and is ultimately the reason gVisor exists.
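The extra-copy argument can be made concrete with a toy sketch (illustrative only, not gVisor code): in a broker/RPC model every read costs a request round-trip plus two data copies, whereas a locally held descriptor needs one copy.

```python
import os
import socket
import tempfile

# Stage a file standing in for the host file system.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"payload")
    path = f.name

broker, sandbox = socket.socketpair()

# RPC path: the sandbox asks, the broker reads, the broker ships bytes back.
sandbox.sendall(b"READ")            # a round-trip just to request the data
assert broker.recv(4) == b"READ"
with open(path, "rb") as src:
    buf = src.read()                # copy #1: file -> broker's buffer
broker.sendall(buf)                 # copy #2: broker -> socket -> sandbox
via_rpc = sandbox.recv(1024)

# Direct path: a single read on a locally held descriptor.
fd = os.open(path, os.O_RDONLY)
direct = os.read(fd, 1024)          # one copy: file -> sandbox's buffer
os.close(fd)

assert via_rpc == direct == b"payload"
os.unlink(path)
broker.close()
sandbox.close()
```

In the real system the copies happen via the kernel on every read and write, which is where the latency and throughput cost comes from.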
gVisor, if not using hw-backed virtualization, has absolutely horrendous performance because of, amongst other things, ptrace, which is one reason why this blogpost exists.
In particular, Firecracker runs on bare metal or on VMs that support nested virtualization, which unfortunately is not widely available in the clouds (and bare metal is expensive).
AWS Fargate (containers as a service for ECS/EKS) uses Firecracker under the hood, and you can easily have the container up for weeks, and probably even for months.
Similarly, Fly.io also uses Firecracker, and again, you can have weeks/months long uptime on containers.
The reason to have this in a separate process is so it can be audited "to death" because the code base is small.
gvisor itself is so big that doing an exhaustive audit is out of the question. Google has mostly switched to fuzzing because the code bases have all become too bloated to audit them properly.
The reason you have gvisor is to contain something you consider dangerous. If that contained code managed to break out and take over gvisor, it is still contained in the kernel level namespaces and still cannot open files unless the broker process agrees. That process better be as small as possible then, so we can trust it to not be compromisable from gvisor.
EDIT: Hmm looks like they aren't removing the broker process, just "reducing round-trips". Never mind then. That reduces the security cost to you not being able to take write access away at run time to a file that was already opened for writing.
Process isolation is not the only tool you have to build a secure architecture. In this case, capabilities are still limited by the FDs available in the first process (as well as by seccomp and the aforementioned namespacing and file system controls), and access to FDs is still mediated by the second process. There is no such thing as “being able to take access away … to a file that was already opened”, as this is simply not part of the threat model or security model being provided. You still need to be diligent about these other security mechanisms as well.
The idea that Google has given up and just does fuzzing is nonsense. Fuzzing is a great tool, and has become more common and standardized — that’s all. It is being added to the full suite of tools.
The old model however was that read and write were translated into RPC calls to the broker. In that model you can take write access away even after you have granted it to a process, because you never actually gave it the file; all writes still go through the broker process.
I think, in the context of security, this is like asking if it's worse to die by a car or die by a bus.
If you make reads less secure than writes, then you'd be weakening the Confidentiality aspect.
The rest of the identity theft and pillaging your accounts would require no security weaknesses, just things working correctly in presence of legitimate credentials.
Linux allows one process to send an opened file descriptor to another process over a domain socket with the SCM_RIGHTS message [1]. The DirectFS setup basically lets the Gofer process open a file on the host machine and ship the file descriptor to the sandbox process. The sandbox can then read and write directly on the local file system using that file descriptor.
How the heck can this be securely isolated? Via the magic of the pivot_root and umount Linux operations. First, Gofer only sends file descriptors for files the sandbox is permitted to access, like those under /sandbox/foobar/. Second, the Gofer process does a pivot_root to change its own file system root "/" to "/sandbox/foobar/". It then umounts its old "/" to make it completely inaccessible to any opened file descriptors. This prevents someone using an opened file descriptor to change directory to ../.., ../../etc/passwd, or anywhere else in the old root's directories.
I believe this is how it works, based on the reading of the blog post.
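The SCM_RIGHTS primitive itself is easy to demonstrate. A minimal sketch (not gVisor code; a socketpair stands in for the gofer-to-sandbox channel, and the file/message names are made up):

```python
import os
import socket
import tempfile

# Stand-in for the gofer <-> sandbox connection.
gofer_sock, sandbox_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Gofer side: open a host file on the sandbox's behalf.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello from the host fs")
    path = f.name

fd = os.open(path, os.O_RDONLY)
socket.send_fds(gofer_sock, [b"donated"], [fd])  # SCM_RIGHTS under the hood
os.close(fd)  # closes the gofer's copy; the in-flight duplicate survives

# Sandbox side: receive the descriptor and do I/O on it directly,
# with no further round-trips to the gofer.
msg, fds, flags, addr = socket.recv_fds(sandbox_sock, 1024, 1)
data = os.read(fds[0], 1024)

os.close(fds[0])
os.unlink(path)

assert msg == b"donated" and data == b"hello from the host fs"
```

`socket.send_fds`/`recv_fds` (Python 3.9+, Unix only) wrap the same `sendmsg`/`recvmsg` ancillary-data mechanism the gofer would use; the kernel duplicates the descriptor into the receiving process.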
"We recently landed support for directfs feature in runsc. This is a filesystem optimization feature. It enables the sandbox to access the container filesystem directly (without having to go through the gofer). This should improve performance for filesystem heavy workloads.
You can enable this feature by adding `--directfs` flag to the runtime configuration. The runtime configuration is in `/etc/docker/daemon.json` if you are using Docker. This feature is also supported properly on k8s.
We are looking for early adopters of this feature. You can file bugs or send feedback using this link. We look forward to hearing from you!
NOTE: This is completely orthogonal to the "Root Filesystem Overlay Feature" introduced earlier. You can stack these optimizations together for max performance."
[1] https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE/m/pqI...
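For reference, the Docker side of that runtime configuration might look roughly like this sketch of `/etc/docker/daemon.json` (the runtime name `runsc` and the binary path are assumptions from a typical runsc install; check the gVisor docs for your setup):

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--directfs"]
    }
  }
}
```

After restarting the Docker daemon, containers would opt in with `docker run --runtime=runsc ...`.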
> Directfs is a new filesystem access mode that uses these primitives to expose the container filesystem to the sandbox in a secure manner.
So, it's likely this is not a filesystem, but just an implementation detail.
IMO it doesn’t make much sense to call things that run on the CPU “direct.” Direct access to resources is the assumption if you are running on the CPU, right?