Next thing you know someone will run as many as two whole "processes" in a container!
Having dispensed with that bit of bitter sarcasm: solving their local filesystem performance/security problems is great and all, but what I'd like to see for containers is to reuse an already-invented wheel, remote block devices, à la iSCSI and friends. I dream of getting there with Cloud Hypervisor or some such, where every container has a kernel that can transparently network-mount whatever it has the credentials to mount from whatever 'worker' node it happens to be running on.
This is my excited face. :|
There are other advantages — low fixed resource costs, global memory management and scheduling, no resource stranding, etc. — but the core intent of gVisor is to capture as many valuable semantics as possible (including the file system semantics) while adding a sufficiently hard security boundary.
I’m not saying moving the file system up into the sandbox (which is basically what a block device gives you) is bad, just that there are complex trade-offs. The gVisor root file system overlay is essentially that (the block device is a single sparse memfd, with metadata kept in memory), but applied only to the parts of the file system that are modified.
If your containers use many of the same base layers (e.g. the same Node or Python image), the code pages will be shared, as they would be shared with plain OS processes.
Running several processes in a container is the norm. First, you run with --init anyway, so there is a `tini` parent process inside. Then, Node workers and Java threads are pretty common.
Running several pieces of unrelated software in a container is less common, that's true.
Containers are a way to isolate processes better, and to package dependencies. You could otherwise be doing that with tools like selinux and dpkg, and by setting LD_nnn env variables. Containers just make it much easier.
I'm highly aware. The reason the word "process" is quoted in my highly down-voteable comment is the misuse of the term "process" by Docker et al. to mean "application." Google the "one process per container" mantra to see what I mean. Somehow the Docker crowd were oblivious to the 60+ year old concept of and terminology related to OS processes when they promulgated their guidance on how containers should be used.
I try not to indulge too many hang-ups in life, but that particular bit of damage is insufferable.
Whereas a simple 'we run everything in a VM' seems much simpler and less fragile.
'We run this process in a VM-like mode where Linux syscalls aren't allowed but instead we define a new syscall-like interface which goes to privileged host code' seems like a good compromise. But in this case, that host code should have special abilities to mmap files into the address space of the 'VM' to make IO fast and efficient.
One way to do this would be to use undefined instruction traps to enter a debugger, which could then implement a syscall-like API. That would make it portable to any OS, yet ultra fast.
To me this resolves a very narrow use case: running untrusted containers on trusted hosts. I imagine the main target users are people who want to offer a service like Fargate and run multiple customers on a single host. Why would they want to do that instead of separating customers with VMs? My suspicion is this has something to do with the increasing availability of very energy-efficient ARM servers that have hundreds of cores per socket. My impression is that traditional virtualisation on ARM is rarely used (I'm not sure why, as KVM supports it and ARM has had hw support for it since ARMv8.1). So "containers to the rescue".
Personally, I'd much rather the extra security that lets untrusted containers access the host's fs be implemented in the container runtime, not as a separate component. Or, for the "security issues" it addresses, perhaps even in the host's operating system?
Isn’t that exactly what the original gofer/RPC solution is? The gvisor container runtime operates in userland to ensure that compromises in the runtime don’t result in an immediate compromise of the system kernel.
But running in userland and intercepting syscalls that do IO always has significant performance implications, because all your IO now needs multiple copy operations to get into the reading process's address space: userland processes generally can't directly touch each other's address spaces (to ensure process isolation) without asking the kernel to come in and do all the heavy lifting.
So if you want fast local IO, you have to find a way to allow the untrusted processes in the container to make direct syscalls, so that you avoid all the additional high-latency hops in userland and let the kernel copy data directly into the target process's address space.
For the container runtime itself to magically provide direct host-fs access with native-level performance, it would have to operate as part of the kernel. Which is exactly how normal containers work, comes with a whole load of security risks, and is ultimately the reason gVisor exists.
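The extra-copy argument can be made concrete with a toy sketch (illustrative only, not gVisor code): in a broker/RPC model every read costs a request round-trip plus two data copies, whereas a locally held descriptor needs one copy.

```python
import os
import socket
import tempfile

# Stage a file standing in for the host file system.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"payload")
    path = f.name

broker, sandbox = socket.socketpair()

# RPC path: the sandbox asks, the broker reads, the broker ships bytes back.
sandbox.sendall(b"READ")            # a round-trip just to request the data
assert broker.recv(4) == b"READ"
with open(path, "rb") as src:
    buf = src.read()                # copy #1: file -> broker's buffer
broker.sendall(buf)                 # copy #2: broker -> socket -> sandbox
via_rpc = sandbox.recv(1024)

# Direct path: a single read on a locally held descriptor.
fd = os.open(path, os.O_RDONLY)
direct = os.read(fd, 1024)          # one copy: file -> sandbox's buffer
os.close(fd)

assert via_rpc == direct == b"payload"
os.unlink(path)
broker.close()
sandbox.close()
```

In the real system the copies happen via the kernel on every read and write, which is where the latency and throughput cost comes from.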
gVisor, if not using hw-backed virtualization, has absolutely horrendous performance because of, amongst other things, ptrace, which is one reason why this blogpost exists.
In particular, Firecracker runs on bare metal or on VMs that support nested virtualization, which unfortunately is not widely available in the clouds (and bare metal is expensive).
AWS Fargate (containers as a service for ECS/EKS) uses Firecracker under the hood, and you can easily have the container up for weeks, and probably even for months.
Similarly, Fly.io also uses Firecracker, and again, you can have weeks/months long uptime on containers.
The reason to have this in a separate process is so it can be audited "to death" because the code base is small.
gvisor itself is so big that doing an exhaustive audit is out of the question. Google has mostly switched to fuzzing because the code bases have all become too bloated to audit them properly.
The reason you have gvisor is to contain something you consider dangerous. If that contained code managed to break out and take over gvisor, it is still contained in the kernel level namespaces and still cannot open files unless the broker process agrees. That process better be as small as possible then, so we can trust it to not be compromisable from gvisor.
EDIT: Hmm looks like they aren't removing the broker process, just "reducing round-trips". Never mind then. That reduces the security cost to you not being able to take write access away at run time to a file that was already opened for writing.
Process isolation is not the only tool you have to build a secure architecture. In this case, capabilities are still limited by the FDs available in the first process (as well as by seccomp and the aforementioned namespacing and file system controls), and access to FDs is still mediated by the second process. There is no such thing as “being able to take access away … to a file that was already opened”, as this is simply not part of the threat model or security model being provided. You still need to be diligent about these other security mechanisms as well.
The idea that Google has given up and just does fuzzing is nonsense. Fuzzing is a great tool, and has become more common and standardized — that’s all. It is being added to the full suite of tools.
The old model however was that read and write were translated into RPC calls to the broker. In that model you can take write access away even after you have granted it to a process, because you never actually gave it the file; all writes still go through the broker process.
I think, in the context of security, this is like asking if it's worse to die by a car or die by a bus.
If you make reads less secure than writes, then you'd be weakening the Confidentiality aspect.
The rest of the identity theft and pillaging your accounts would require no security weaknesses, just things working correctly in presence of legitimate credentials.
Linux allows one process to send an opened file descriptor to another process over a domain socket with the SCM_RIGHTS message [1]. The DirectFS setup basically lets the Gofer process open a file on the host machine and ship the file descriptor to the sandbox process. The sandbox can then read and write directly on the local file system using that file descriptor.
How the heck can this be securely isolated? Via the magic of the pivot_root and umount Linux operations. First, Gofer only sends file descriptors for files the sandbox is permitted to access, like those under /sandbox/foobar/. Second, the Gofer process does a pivot_root to change its own file system root "/" to "/sandbox/foobar/". It then umounts its old "/" to make it completely inaccessible to any opened file descriptors. This prevents someone using an opened file descriptor to change directory to ../.., ../../etc/passwd, or anywhere else in the old root's directories.
I believe this is how it works, based on the reading of the blog post.
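The SCM_RIGHTS primitive itself is easy to demonstrate. A minimal sketch (not gVisor code; a socketpair stands in for the gofer-to-sandbox channel, and the file/message names are made up):

```python
import os
import socket
import tempfile

# Stand-in for the gofer <-> sandbox connection.
gofer_sock, sandbox_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Gofer side: open a host file on the sandbox's behalf.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello from the host fs")
    path = f.name

fd = os.open(path, os.O_RDONLY)
socket.send_fds(gofer_sock, [b"donated"], [fd])  # SCM_RIGHTS under the hood
os.close(fd)  # closes the gofer's copy; the in-flight duplicate survives

# Sandbox side: receive the descriptor and do I/O on it directly,
# with no further round-trips to the gofer.
msg, fds, flags, addr = socket.recv_fds(sandbox_sock, 1024, 1)
data = os.read(fds[0], 1024)

os.close(fds[0])
os.unlink(path)

assert msg == b"donated" and data == b"hello from the host fs"
```

`socket.send_fds`/`recv_fds` (Python 3.9+, Unix only) wrap the same `sendmsg`/`recvmsg` ancillary-data mechanism the gofer would use; the kernel duplicates the descriptor into the receiving process.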
"We recently landed support for directfs feature in runsc. This is a filesystem optimization feature. It enables the sandbox to access the container filesystem directly (without having to go through the gofer). This should improve performance for filesystem heavy workloads.
You can enable this feature by adding `--directfs` flag to the runtime configuration. The runtime configuration is in `/etc/docker/daemon.json` if you are using Docker. This feature is also supported properly on k8s.
We are looking for early adopters of this feature. You can file bugs or send feedback using this link. We look forward to hearing from you!
NOTE: This is completely orthogonal to the "Root Filesystem Overlay Feature" introduced earlier. You can stack these optimizations together for max performance."
[1] https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE/m/pqI...
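For reference, the Docker side of that runtime configuration might look roughly like this sketch of `/etc/docker/daemon.json` (the runtime name `runsc` and the binary path are assumptions from a typical runsc install; check the gVisor docs for your setup):

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--directfs"]
    }
  }
}
```

After restarting the Docker daemon, containers would opt in with `docker run --runtime=runsc ...`.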
> Directfs is a new filesystem access mode that uses these primitives to expose the container filesystem to the sandbox in a secure manner.
So, it's likely this is not a filesystem, but just an implementation detail.
IMO it doesn’t make much sense to call things that run on the CPU “direct.” Direct access to resources is the assumption if you are running on the CPU, right?