I took a short look at the benchmark setup (https://github.com/talawahtech/seastar/blob/http-performance...), and wonder whether some simplifications there inflate the performance numbers. The server here executes a single read() on the connection, and as soon as it receives any data it sends back the response headers. A real-world HTTP server needs to keep reading until all header and body data is consumed before responding.
Now, given that the benchmark probably sends tiny requests, the server might get everything in a single buffer. But every time it does not, the server will send back two responses to the client - and at that point the client will already have a response for its follow-up request before actually sending it, which inflates the numbers. It might be interesting to re-test with a proper HTTP implementation (at least read until the last 4 bytes received are \r\n\r\n, and assume the benchmark client will never send a body).
Such a bug might also lead to far more write() calls than are actually necessary to serve the workload, or to stalls due to full send or receive buffers - all of which might also have an impact on performance.
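For illustration, a minimal sketch of that "read until \r\n\r\n" loop (a hypothetical helper, not the benchmark's actual code; it assumes the client never sends a body, and only checks the buffer tail, so pipelined requests aren't handled):

```c
#include <string.h>
#include <unistd.h>

/* Read from fd until the buffer ends with the HTTP header terminator.
 * Returns the request length, or -1 on error, EOF, or overflow. */
static ssize_t read_request(int fd, char *buf, size_t cap)
{
    size_t n = 0;
    while (n < cap) {
        ssize_t r = read(fd, buf + n, cap - n);
        if (r <= 0)
            return -1; /* error or connection closed mid-request */
        n += (size_t)r;
        /* done once the data received so far ends with \r\n\r\n */
        if (n >= 4 && memcmp(buf + n - 4, "\r\n\r\n", 4) == 0)
            return (ssize_t)n;
    }
    return -1; /* request larger than buffer */
}
```

This is exactly the "at least" fix suggested above: it never responds early, but it still avoids a full HTTP parser.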
Skipping the parsing of the HTTP requests definitely gives a performance boost, but for this comparison both sides got the same boost, so I didn't mind being less strict. Seastar's HTTP parser was being finicky, so I chose the easy route and just removed it from the equation.
For reference though, in my previous post[2] libreactor was able to hit 1.2M req/s while fully parsing the HTTP requests using picohttpparser[3]. But that is still a very simple and highly optimized implementation. FYI, from what I recall, when I played with disabling HTTP parsing in libreactor, I got a performance boost of about 5%.
1. https://talawah.io/blog/linux-kernel-vs-dpdk-http-performanc...
2. https://talawah.io/blog/extreme-http-performance-tuning-one-...
It's not actually an HTTP server, though... For these purposes, it's essentially no more useful than netcat dumping out a preconfigured text file. Titling it "HTTP Performance showdown" is doubly bad here, since there are no real-world (or even moderately synthetic) HTTP requests happening; you just always get the same static set of data for every request, regardless of what that request is. Call it whatever you like, but that isn't HTTP. A key part of the performance equation on the web is the difference in response time involved in returning different kinds (sizes and types) of responses.
A more compelling argument could be made for the improved performance you can get by bypassing the kernel's networking stack, but this article isn't it. What this article demonstrates is that in this one very narrow case, where you always want to return the same static data, there are vast speed improvements to be had. It doesn't tell you anything useful about performance in basically 100% of real-world uses of the web, and its premise falls down when you consider that kernel interrupt speed is unlikely to be the bottleneck in most servers, even caches.
I'd really love to see this adapted to do actual web-server work and see what the difference is. A good candidate might be an in-memory static cache server of some kind. It would require URL parsing to serve up resources, but it would better emulate an environment that might benefit from this kind of change, and it would certainly be a real-world situation that many companies are familiar with. Like it or not, URL parsing is part of the performance equation when you're talking HTTP.
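Even the minimal version of that URL parsing is cheap, for what it's worth. A hypothetical helper for the cache-server idea (my own sketch, not from the article) that pulls the path out of the request line for use as a lookup key:

```c
#include <stddef.h>
#include <string.h>

/* Extract the request-target from an HTTP/1.1 request line like
 * "GET /foo/bar HTTP/1.1\r\n...". Writes a pointer to the path into
 * *path_out and returns its length, or -1 if the line is malformed. */
static int parse_path(const char *req, size_t len, const char **path_out)
{
    const char *sp1 = memchr(req, ' ', len);   /* space after the method */
    if (!sp1)
        return -1;
    const char *path = sp1 + 1;
    size_t rest = len - (size_t)(path - req);
    const char *sp2 = memchr(path, ' ', rest); /* space before "HTTP/1.1" */
    if (!sp2)
        return -1;
    *path_out = path;
    return (int)(sp2 - path);
}
```

The returned pointer/length pair can feed a hash-table lookup in an in-memory cache without any copying.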
The Data Plane Development Kit (DPDK) is an open source software project managed
by the Linux Foundation. It provides a set of data plane libraries and network
interface controller polling-mode drivers for offloading TCP packet processing
from the operating system kernel to processes running in user space. This offloading
achieves higher computing efficiency and higher packet throughput than is possible
using the interrupt-driven processing provided in the kernel.
https://en.wikipedia.org/wiki/Data_Plane_Development_Kit
https://www.packetcoders.io/what-is-dpdk/

Yeah, receiving packets is fast when you aren't doing anything with them.
Forwarders usually aren't doing much with packets - just reading a few fields and choosing an output port. These are very common functions in network cores (telcos, datacenters, ...).
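For context, a forwarder's inner loop under DPDK is pretty much just this (pseudocode sketch using DPDK's rte_eth_* burst API; initialization, error handling, and the actual header rewrite are elided):

```
/* pseudocode: classic l2fwd-style DPDK poll loop */
struct rte_mbuf *bufs[BURST_SIZE];

for (;;) {
    /* poll the NIC's RX ring directly - no interrupts, no syscalls */
    uint16_t nb_rx = rte_eth_rx_burst(in_port, queue, bufs, BURST_SIZE);

    for (uint16_t i = 0; i < nb_rx; i++) {
        /* "read a few fields, choose an output port" */
        rewrite_headers(bufs[i]);
    }

    uint16_t nb_tx = rte_eth_tx_burst(out_port, queue, bufs, nb_rx);

    /* free any mbufs the TX ring couldn't take */
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(bufs[i]);
}
```

Per-packet work is a handful of memory accesses, which is why the kernel's per-packet overhead dominates in this workload and bypassing it pays off so dramatically.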
DPDK is not well suited for endpoint application programming, even though you can still squeeze out some additional performance there.
But don't dismiss a framework that is currently so widely deployed just because you are not familiar with its most common use-case.
For these types of applications there are proprietary implementations that you can buy from vendors that are more suited to latency sensitive applications.
The next level of optimization after kernel bypass is to build or buy FPGAs which implement the wire protocol + transport as an integrated circuit.
Supercomputing people have done this repeatedly and successfully.
A more general-purpose approach is UDP-based applications.
I wouldn't do this with TCP, which I agree is complicated and difficult to get right.
You can use XDP as well and get zero copy, but only for a small subset of NICs, and you still need to implement your own network stack - I don't see much of a way to avoid it. There are existing TCP stacks for DPDK that you can use as well.
You can also mix both a bit: select where packets go with an eBPF program, and run higher-level stacks in user space, where user space is still dealing with raw packets rather than sockets.
It supports IOPOLL (polling of the socket) and SQPOLL (kernel-side polling of the submission queue), so hopefully the fact that the application driving it is in another thread wouldn't slow it down too much... With multi-shot accept/recv you'd only need to tell it to keep accepting connections on the listener fd, but I'm not sure if you can chain recvs onto the child fd automatically from the kernel yet... We live in interesting times!
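To make the multi-shot idea concrete, here's a pseudocode sketch in the style of liburing's helpers (io_uring_prep_multishot_accept and io_uring_prep_recv_multishot exist in recent liburing/kernels; buffer-ring setup and error handling are elided). As noted, there's no automatic chaining, so the recv has to be armed manually per accepted connection:

```
/* pseudocode: multishot accept + per-connection multishot recv */
struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

/* one SQE keeps producing a CQE per accepted connection until cancelled */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
io_uring_submit(&ring);

for (;;) {
    io_uring_wait_cqe(&ring, &cqe);
    if (is_accept_cqe(cqe)) {
        int conn_fd = cqe->res;
        /* no automatic chaining: arm a multishot recv ourselves.
         * multishot recv needs provided-buffer rings in practice;
         * that setup is elided here */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv_multishot(sqe, conn_fd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        io_uring_submit(&ring);
    } else {
        handle_data(cqe);   /* cqe->res = bytes received */
    }
    io_uring_cqe_seen(&ring, cqe);
}
```

With SQPOLL enabled on top of this, the steady state needs no syscalls at all: the kernel thread picks up submissions and the application just reaps completions.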
DPDK is huge and inflexible, it does a lot of things which I'd rather be in control of myself and I think it's easier to just do my own userspace vfio.
If the io_uring instance was configured for polling, by specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then min_complete has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If min_complete is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice.
... Well, TIL -- thanks! and the NAPI patch you pointed at looks interesting too.
> I am genuinely interested in hearing the opinions of more security experts on this (turning off speculative execution mitigations). If this is your area of expertise, feel free to leave a comment
Are these generally safe if you have a machine that does not have multi-user access and is in a security boundary?
If someone exploited a process with a dedicated unprivileged user, had legitimate limited access, or got into a container on a physical host, they might be able to leverage it for the forces of evil.
There’s really no such practical thing as single user Linux. If you’re running a network exposed app without dropping privileges, that’s a much bigger security risk than speculative execution.
Now, if you were skipping an OS and going full bare metal, then that could be different. But an audit for that would be a nightmare :).
Why would a regulatory framework care if a Linux box running one process was vulnerable to attacks that involve switching UIDs?
Conversely, why would that same regulatory framework not care if users of that network service were able to impersonate each other / access each others' data?
I've run production systems with mitigations off. All of the intended (login) users had authorized privilege escalation, so I wasn't worried about them. There was a single primary service daemon per machine, and if you broke into that, you wouldn't get anything really useful by escalating to root from there. And the systems where I made a particularly intentional decision were more or less syscall-limited; enabling mitigations significantly reduced capacity, so mitigations were disabled. (This was in line with guidance from the dedicated security team where I was working.)
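For anyone wanting to check or reproduce this, the usual knobs look something like the following (assuming a GRUB-booted Linux box; adapt to your bootloader):

```
# Check current status first - one file per vulnerability class:
#   grep . /sys/devices/system/cpu/vulnerabilities/*

# /etc/default/grub - add to the kernel command line,
# then run update-grub (or grub2-mkconfig) and reboot:
GRUB_CMDLINE_LINUX="... mitigations=off"
```

`mitigations=off` is the documented umbrella parameter; individual mitigations can also be toggled one at a time if you want to measure which ones cost you the most.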
Generally... it depends on what you mean by generally. In the casual sense of the word, speculative-execution attacks are not very common, so it can be said that most people are mostly safe from them independent of mitigations. Someone might also use "generally safe" to mean proven security against a whole class of attacks, in which case the answer would be no.
Anytime the discussion around turning off mitigations comes up, it's trumped by the "why don't we just leave them on and buy more computers" trump card.
An easier solution than mitigating is to just upgrade your AWS instance size.
Many intentionally run without mitigations.
If you just have a dumb "request/response" service you may not have to worry.
A database is an interesting case. Scylla uses CQL, which is a very limited language compared to something like JavaScript or SQL - there's no way to loop, as far as I know, for example. I would probably recommend not exposing your database directly to an attacker anyway; that seems like a niche scenario.
If you're just providing, say, a gRPC API that takes some arguments, places them (safely) into a Scylla query, and gives you some result, I don't think any of those mitigations are necessary and you'll probably see a really nice win if you disable them.
This is my understanding as a security professional who is not an expert on those attacks, because I frankly don't have the time. I'll defer to someone who has done the work.
Separately, (quoting the article)
> Let's suppose, on the one hand, that you have a multi-user system that relies solely on Linux user permissions and namespaces to establish security boundaries. You should probably leave the mitigations enabled for that system.
Please don't ever rely on Linux user permissions/namespaces for a multi-tenant system. It is not even close to sufficient, with or without those mitigations. It might be OK in situations where you also have strong auth (like ssh with FIDO2 MFA), but if your scenario is "run untrusted code" you can't trust the kernel to do isolation.
> On the other hand, suppose you are running an API server all by itself on a single purpose EC2 instance. Let's also assume that it doesn't run untrusted code, and that the instance uses Nitro Enclaves to protect extra sensitive information. If the instance is the security boundary and the Nitro Enclave provides defense in depth, then does that put mitigations=off back on the table?
Yeah that seems fine.
> Most people don't disable Spectre mitigations, so solutions that work with them enabled are important. I am not 100% sure that all of the mitigation overhead comes from syscalls,
This has been my experience and, based on how the mitigation works, I think that's going to be the case. The mitigations have been pretty brutal for syscalls - though I think the blame should fall on Intel, not on the mitigation that had to be placed on top of their mistake.
Presumably io_uring is the solution, although that has its own security issues... like an entirely new syscall interface with its own bugs, lack of auditing, no ability to seccomp io_uring calls, no meaningful LSM hooks, etc. It'll be a while before I'm comfortable exposing io_uring to untrusted code.
DPDK performs fairly well - even better, for the most part. For some years I maintained a modified ixgbe kernel driver for Intel NICs that let us do high-performance traffic capture. We finally moved to DPDK once it was stable enough and we needed to support more NICs, and in our comparisons we didn't see a performance hit.
Maybe manufacturer-made drivers can be better than DPDK, but if I had to guess, that would not be because of the abstractions but because of knowledge of the NIC architecture and parameters. I remember when we did a PoC of a Mellanox driver modification: a lot of the work to get high performance was understanding the NIC's options and tweaking them to get the most out of our use case.
In any case, the standard software API for RDMA is ibverbs. All adapters supporting RDMA (be it IB, RoCE or iWARP) will expose it. You can get cloud instances with RDMA on AWS and Azure.
Edit: Looks like they do but not for TCP from what I can find: https://static.googleusercontent.com/media/research.google.c...
> I am not 100% sure that all of the mitigation overhead comes from syscalls, but it stands to reason that a lot of it arises from security hardening in user-to-kernel and kernel-to-user transitions.
Will io_uring also be affected by Spectre mitigations, given that it has eliminated most kernel/user switches? And did anyone do a head-to-head comparison between io_uring and DPDK?
Optimizing raw HTTP seems to me like a huge waste of time by now. I say that as someone who has spent years optimizing raw HTTP performance. None of that matters these days.
https://www.adacore.com/papers/layered-formal-verification-o...
Might also give people here some ideas on how to combine symbolic execution, proof, C and SPARK code and how to gain confidence in each part of a network stack.
I think there's even some ongoing work climbing up the stack to HTTP, but I'm not sure of the plan (not involved).