Better HN

geocar · 0 points · 10y ago

At some point, the complexity of your optimisations exceeds that of simply coding a state machine that operates directly on the network buffers themselves.

That is to say, I suspect that if micro-optimisations can double our performance, they will be more complicated than just writing customised ring-0 (kernel-mode) code that implements HTTP directly inside the network driver.
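A state machine that works directly on network buffers can be very small. The sketch below is an illustration only, not code from dash: a hypothetical `hdr_scan()` walks each received buffer once, looking for the `\r\n\r\n` that ends an HTTP request header, and keeps its entire state in one integer so that a terminator split across two reads is still found.

```c
#include <stddef.h>

/* Hypothetical header-terminator scanner. The whole parser state is one
 * small integer, so it easily stays in L1 cache. */
typedef struct {
    int matched; /* how many bytes of "\r\n\r\n" have been seen in a row */
} hdr_scanner;

/* Feed one network buffer. Returns the offset just past the "\r\n\r\n"
 * terminator if it ends inside this buffer, or -1 if more data is needed.
 * State carries over between calls, so a terminator split across two
 * packets is still detected. */
static long hdr_scan(hdr_scanner *s, const unsigned char *buf, size_t len)
{
    static const char term[] = "\r\n\r\n";

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c == (unsigned char)term[s->matched]) {
            s->matched++;
            if (s->matched == 4) {
                s->matched = 0; /* ready for the next request */
                return (long)(i + 1);
            }
        } else {
            /* A '\r' can always start a new match; anything else resets. */
            s->matched = (c == '\r') ? 1 : 0;
        }
    }
    return -1;
}
```

Feeding `GET / HTTP/1.1\r\nHost: x\r\n\r` returns -1 (more data needed); feeding the following `\nbody` then returns 1, the offset just past the terminator.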

Here is how I'm looking at it:

• 10Gb/sec network port

• 4k maximum request and response size

• = roughly 1.3 million HTTP requests per second.
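The arithmetic behind these figures is link bandwidth divided by message size. A minimal sketch, assuming `link_gbps` is the raw line rate in gigabits per second and ignoring Ethernet, IP and TCP framing (so the result is an upper bound); `max_msgs_per_sec()` is a hypothetical helper name:

```c
/* Back-of-envelope ceiling on messages per second for one direction of a
 * link. Assumes every message is exactly msg_bytes long and ignores all
 * protocol framing overhead. */
static double max_msgs_per_sec(double link_gbps, double msg_bytes)
{
    double bytes_per_sec = link_gbps * 1e9 / 8.0; /* bits -> bytes */
    return bytes_per_sec / msg_bytes;
}
```

With 4096-byte messages this gives about 305,000 messages per second in each direction of a full-duplex link. Because the result scales inversely with message size, it is very sensitive to what "4k" is taken to mean and to whether the rate is counted in bits or bytes.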

Now the problem is that main memory is not much faster than our fastest network (about 15Gb/sec). To reach that optimal 1.3 million requests per second, the code and state have to stay entirely in L1 while the network buffers stream across the CPU, and each response has to be produced in a single pass.
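To see why code and state must stay in L1, it helps to turn the target rate into a per-request budget. A sketch that takes the 15Gb/sec memory and 10Gb/sec network figures at face value; the helper names and the 3 GHz single-core clock used below are assumptions for illustration:

```c
/* Wall-clock time available per request at a given request rate. */
static double ns_per_request(double reqs_per_sec)
{
    return 1e9 / reqs_per_sec;
}

/* CPU cycles available per request, assuming one core at clock_ghz
 * does all the work. */
static double cycles_per_request(double reqs_per_sec, double clock_ghz)
{
    return clock_ghz * 1e9 / reqs_per_sec;
}

/* How many times each received byte could cross main memory before the
 * memory bus, rather than the network, becomes the bottleneck. */
static double memory_passes(double mem_gbps, double net_gbps)
{
    return mem_gbps / net_gbps;
}
```

At 1.3 million requests per second, `ns_per_request()` gives about 769 ns per request and `cycles_per_request(1.3e6, 3.0)` about 2,300 cycles, while `memory_passes(15.0, 10.0)` returns 1.5: each byte could cross main memory only about one and a half times, which is why a single pass matters.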

My dash server gets ~135k HTTP requests per second on localhost, and I should be able to approach 300k/sec over a network if I ever get around to it. That 300k/sec would be about 22% of our optimal performance, and a lot better than any other HTTP server I'm aware of.

At this speed, one of those micro-optimisations, `writev()`, is actually slower than `write()`, likely because the code path is shorter in the simpler codebase. It illustrates my concern nicely: we are close to the break-even point for the optimisations we can make, so making our server bigger and more complicated might not make our programs any faster.
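The two ways of sending a header plus a body look roughly like this in POSIX C. These are hypothetical helpers, not dash's code; which one is faster depends on the system and workload, and here the author reports `write()` winning. Neither helper retries short writes.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Variant 1: gather header and body with a single writev() call,
 * avoiding a copy. May return a short count; not retried here. */
static ssize_t send_writev(int fd, const char *hdr, size_t hlen,
                           const char *body, size_t blen)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr; /* iov_base is non-const void * */
    iov[0].iov_len = hlen;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = blen;
    return writev(fd, iov, 2);
}

/* Variant 2: copy both pieces into one caller-supplied buffer and issue
 * a plain write(). Returns -1 if the buffer is too small. May return a
 * short count; not retried here. */
static ssize_t send_write(int fd, char *scratch, size_t cap,
                          const char *hdr, size_t hlen,
                          const char *body, size_t blen)
{
    if (hlen + blen > cap)
        return -1;
    memcpy(scratch, hdr, hlen);
    memcpy(scratch + hlen, body, blen);
    return write(fd, scratch, hlen + blen);
}
```

Both produce the same bytes on the wire; the trade-off is an extra `memcpy()` in user space against a more involved kernel path for the scatter/gather list.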

That suggests to me that the solution is actually fewer, simpler syscalls, not more, bigger ones.
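One reading of "fewer, simpler syscalls" is to queue pipelined responses in one contiguous buffer and flush them with a single `write()`. A minimal sketch under that assumption; `out_batch`, `batch_add()` and `batch_flush()` are hypothetical names, and the 4096-byte buffer is chosen only to match the 4k figure above:

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical output batcher: responses are appended to one contiguous
 * buffer so several pipelined responses cost one syscall, not one each. */
typedef struct {
    char buf[4096];
    size_t len;
} out_batch;

/* Append a response; returns 0 on success, -1 if it does not fit
 * (the caller should flush first). */
static int batch_add(out_batch *b, const char *data, size_t n)
{
    if (n > sizeof b->buf - b->len)
        return -1;
    memcpy(b->buf + b->len, data, n);
    b->len += n;
    return 0;
}

/* Issue one write() for everything queued. Partial writes are not
 * retried; the unwritten tail is kept for the next flush. */
static ssize_t batch_flush(out_batch *b, int fd)
{
    ssize_t w = write(fd, b->buf, b->len);
    if (w > 0) {
        memmove(b->buf, b->buf + w, b->len - (size_t)w);
        b->len -= (size_t)w;
    }
    return w;
}
```

Two responses queued with `batch_add()` go out in one `write()`; if a response does not fit, the caller flushes first and tries again.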

vegardx · 10y ago

What kind of "main memory" are you talking about? Regular consumer-grade memory has a bandwidth at least ten times that of your 10Gb/s network interface. Change the bit to a byte and you're a little closer.
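This comparison can be checked with the usual peak-bandwidth formula: transfers per second × bytes per transfer × channels. A sketch assuming DDR3-1600 (1600 MT/s) on a standard 64-bit (8-byte) channel; the helper name is hypothetical, and the result is a theoretical peak, with sustained bandwidth lower in practice:

```c
/* Peak theoretical DRAM bandwidth in gigabits per second:
 * transfers/sec * bytes per transfer * channels * 8 bits per byte. */
static double dram_peak_gbps(double mega_transfers, int bus_bytes, int channels)
{
    return mega_transfers * 1e6 * bus_bytes * channels * 8.0 / 1e9;
}
```

`dram_peak_gbps(1600.0, 8, 1)` gives 102.4 Gb/s (12.8 GB/s) for a single channel, roughly ten times a 10Gb/s link, and twice that with two channels.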
