That is to say, I suspect that if micro-optimisations can double our performance, they will be more complicated than just writing a customised ring0 that implements HTTP directly inside the network driver.
Here is how I'm looking at it:
• 10Gb/sec network port
• 4k max requests and responses
• == 1.3 million HTTP requests per second.
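
The arithmetic behind that figure can be sketched as follows. It is a rough upper bound: it assumes the whole link carries payload and ignores Ethernet, IP and TCP framing. The function name `max_requests_per_sec` is made up for illustration. One thing it makes visible is that 1.3 million per second corresponds to roughly 1 kB on the wire per request (about 1.25 million at 1000 bytes), while a full 4 KiB message in one direction works out to about 305k per second on the same port.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on messages per second for a link running at
 * `link_bits_per_sec`, when each message occupies `bytes_per_message`
 * on the wire. Framing overhead is deliberately ignored, so real
 * numbers will be somewhat lower. (Hypothetical helper, not taken
 * from any existing server.) */
uint64_t max_requests_per_sec(uint64_t link_bits_per_sec,
                              uint64_t bytes_per_message)
{
    assert(bytes_per_message > 0);
    uint64_t link_bytes_per_sec = link_bits_per_sec / 8;
    return link_bytes_per_sec / bytes_per_message;
}
```

For example, `max_requests_per_sec(10000000000ULL, 4096)` gives the 4 KiB case and `max_requests_per_sec(10000000000ULL, 1000)` the 1 kB case.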
The problem is that main memory is not much faster than our fastest network: about 15Gb/sec. So to reach that optimal 1.3 million, code and state have to stay entirely in L1, with the network buffers streamed across the CPU and each request answered in a single pass.
My dash server gets ~135k HTTP requests per second on localhost (I should be able to approach 300k/sec over a network if I ever get around to it). That's 22% of our optimal performance, and a lot better than any other HTTP server I'm aware of.
At this speed, one of those micro-optimisations, `writev()`, is actually slower than `write()`, likely because the simpler call has a shorter code path. It illustrates my concern nicely: we are close to the break-even point for the optimisations we can make. If we make our server bigger and more complicated, it might not make our programs any faster.
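
To make the comparison concrete, here is a minimal sketch of the two ways a response could be sent. This is not the dash server's code. The names `send_gathered` and `send_copied` and the caller-supplied `scratch` buffer are assumptions for illustration. The first variant hands the header and body to the kernel as two pieces with `writev()`. The second copies them into one buffer in user space and issues a plain `write()`. Which one is faster depends on message size and on the kernel, so it has to be measured.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Variant A: one writev() call gathering header and body.
 * Returns the number of bytes written, or -1 on error.
 * Short writes are not retried in this sketch. */
ssize_t send_gathered(int fd, const char *hdr, size_t hdr_len,
                      const char *body, size_t body_len)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;   /* writev() does not modify the data */
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;
    return writev(fd, iov, 2);
}

/* Variant B: copy both parts into a caller-supplied scratch buffer,
 * then issue a single write(). For small responses the extra memcpy
 * stays in cache. Short writes are not retried here either. */
ssize_t send_copied(int fd, char *scratch, size_t scratch_len,
                    const char *hdr, size_t hdr_len,
                    const char *body, size_t body_len)
{
    assert(hdr_len + body_len <= scratch_len);
    memcpy(scratch, hdr, hdr_len);
    memcpy(scratch + hdr_len, body, body_len);
    return write(fd, scratch, hdr_len + body_len);
}
```

Both functions put the same bytes on the wire, so they can be swapped in a benchmark loop without changing anything else in the server.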
That suggests to me that the solution is actually fewer, simpler syscalls, not more, bigger ones.