But given that this article is about significantly improving gVisor's userland TCP performance, it seems like the netstack stuff causes major performance losses too.
I saw a GitHub link in another top article today, https://github.com/misprit7/computerraria, where the README's Pitch section feels very relevant to gVisor.
The netstack stuff here has nothing to do with the rest of gVisor.
How so? Besides being part of it, it at least falls into the same bucket of "bloated, slow userland implementations of things the kernel already handles well".
The gVisor/perf thing is a tendentious argument. You can have whatever opinion you like about whether running a platform under gVisor supervision is a good idea. But the post we're commenting on is obviously not about gVisor; it's about a library inside of gVisor that is probably a lot more popular than gVisor itself.
You'll note their Node/Ruby benchmarks showed a substantially bigger performance hit. That's because the other gVisor sandboxing functionality (general syscall interception + file I/O) has more of an impact on performance, but also because these are network-processing-bound applications (rare) that were still reaching high QPS in absolute terms for their respective runtimes (do you know many real-world Node apps doing 350-800 QPS per instance?).
Because Coder is not likely to be bottlenecked by CPU availability for networking, the resource overhead should be inconsequential; what really matters is the impact on user latency. That added latency is likely on the order of 1ms, for a round trip that is already spending probably 30-50ms at best in transit between client and server (given that Coder's server would be running in a datacenter with clients at home or the office), plus application logic overhead of at best 10ms. And that's very similar to a lot of gVisor netstack use cases, which is why it's not as big of a deal as you think it is.
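A rough back-of-envelope with those figures (the 1ms, 40ms, and 10ms numbers are my assumptions, not measurements):

    added_latency_ms = 1   # assumed netstack overhead per round trip
    transit_ms = 40        # middle of the 30-50ms client <-> datacenter range
    app_logic_ms = 10      # best-case application time
    overhead = added_latency_ms / (transit_ms + app_logic_ms)  # 0.02, i.e. ~2% worst case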
TL;DR: For the thing you'd actually care about in the Coder use case (round-trip latency), the perf hit of using gVisor netstack should be about 2% at most, and most likely much less. Either way it's small enough to be imperceptible to the actual human using the client.
But gVisor was using full runsc for the networking benchmarks I linked, and IIUC runc's networking should be sufficiently similar to unsandboxed networking that the runsc<->runc network performance difference should approximate the difference between gVisor netstack and vanilla kernel networking.
But after I left, I heard that a lot of the poor performance of Cloud Run is just plain old oversubscribed shared-core E2 stuff.