It really should be shameful to use unqualified adjectives in headline claims without also providing the supporting evidence.
I personally never care much about published benchmarks; it's much better to use the tool and see for myself. So I didn't think much about including a table of values, but I can understand how it may help.
Of course, it needs to pre-generate the file, and you need enough RAM both to run the server and to cache the file, but it needs almost zero CPU during the test run and can probably produce even more load than this io_uring tool.
So I just tried your tool and it just hangs. I see you're sending Connection: close with each request; can this be configured to send keep-alive or, even better, no Connection header at all? In HTTP/1.1 it's better not to send keep-alive/close at all. Never try to enforce it, as it is not mandatory.
A lot of servers (like the one I am using) just ignore the close and keep the connection open, so this may be the issue I am having.
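For reference, a minimal Python sketch of the difference on the wire. In HTTP/1.1 connections are persistent by default, so leaving out the Connection header already asks for keep-alive behaviour. build_request is a made-up helper for illustration, not something taken from the tool.

```python
from typing import Optional


def build_request(host: str, path: str = "/", connection: Optional[str] = None) -> bytes:
    """Build a minimal HTTP/1.1 GET request (hypothetical helper).

    With connection=None no Connection header is sent at all, which in
    HTTP/1.1 means the connection stays open by default.
    """
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    if connection is not None:
        # Only sent when the caller explicitly asks for it, e.g. "close".
        lines.append(f"Connection: {connection}")
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    print(build_request("localhost"))                      # no Connection header
    print(build_request("localhost", connection="close"))  # explicit close
```

A server that ignores the explicit close will leave a client waiting for EOF hanging, which matches the symptom described above.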
Try the -shutwr option if the server doesn't close the connection itself. I used it to test lots of exotic implementations, and there are weird things going on in overload situations and around connection management. Node.js, for example, started dropping connections on localhost(!!) under high load.
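To illustrate what a write-side shutdown looks like, here is a small Python sketch. It assumes -shutwr performs a half-close like shutdown(SHUT_WR); that is a guess from the option name, not something confirmed in this thread. A local socket pair stands in for client and server, and half_close_demo is a made-up name.

```python
import socket


def half_close_demo():
    """Show a half-close: the client stops writing but can still read."""
    client, server = socket.socketpair()
    try:
        client.sendall(b"request")
        # Assumption: this is roughly what a "-shutwr"-style option would do.
        client.shutdown(socket.SHUT_WR)

        received = server.recv(1024)   # the request bytes
        eof = server.recv(1024)        # b"" signals the peer stopped writing

        # The server side can still send its response afterwards.
        server.sendall(b"response")
        reply = client.recv(1024)
        return received, eof, reply
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(half_close_demo())
```

The point is that the server sees end-of-input even though it never closes the connection itself, while the response path stays open.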
The tool was built for high numbers of keepalive requests. If the server is too fast, just use more requests, e.g. -n 1000000 or something similar. Unfortunately, some servers close keepalive connections after a fairly small number of requests; nginx has a default of 1000, for example.
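As a rough back-of-the-envelope in Python: if the server closes every connection after a fixed number of requests, the number of connections a run has to open is just a ceiling division. connections_needed is a hypothetical helper, and the "exactly limit requests per connection" behaviour is an assumption.

```python
def connections_needed(total_requests: int, per_connection_limit: int) -> int:
    """Connections required if each one is closed after a fixed request count."""
    if total_requests < 0 or per_connection_limit <= 0:
        raise ValueError("need total_requests >= 0 and per_connection_limit > 0")
    # Ceiling division without floating point.
    return -(-total_requests // per_connection_limit)


if __name__ == "__main__":
    # One million requests against a limit of 1000 requests per connection.
    print(connections_needed(1_000_000, 1000))
```

So with a limit of 1000, a run of one million requests still pays for connection setup a thousand times.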
This is just a simple tool I hacked together as a student to collect some data. I didn't spend any time making it more accessible or user friendly, sorry.
In fact, in all cases, you don't know when the syscall actually starts execution, even with regular calls. The only thing you're sure of is that the kernel "knows" about the syscall you want. However, you have absolutely no indication of whether it has started to run or not.
The real question is: are the classical measures accurate? All we have is an upper bound on the time it took: I fired the write at t0 and finished reading the response at t1. This does not really change with io_uring. Batches will mostly change one fact: multiple measurements will share a t0, and possibly a t1 when multiple replies arrive at once.
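A minimal Python sketch of that batching effect, from the caller's side. submit and wait_one are placeholder callables standing in for "fire the whole batch" and "finish reading the next reply"; they are assumptions for illustration, not any real io_uring binding.

```python
import time
from typing import Callable, List


def timed_batch(submit: Callable[[], None],
                wait_one: Callable[[], None],
                n: int) -> List[float]:
    """Return an upper bound on the latency of each of n batched requests."""
    t0 = time.perf_counter()       # one t0 shared by the whole batch
    submit()                       # hand all n requests over at once
    latencies = []
    for _ in range(n):
        wait_one()                 # returns once the next reply has been read
        t1 = time.perf_counter()   # replies arriving together get near-equal t1
        latencies.append(t1 - t0)
    return latencies


if __name__ == "__main__":
    # Dummy callables, just to show the shape of the result.
    print(timed_batch(lambda: None, lambda: time.sleep(0.001), 3))
```

Each value is still "fired at t0, finished reading at t1", so it bounds the latency from above exactly as in the unbatched case.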
Is it important? Yes and no. The most important thing in such benchmarks is that the added delay stays consistent between measurements, and that you know when that consistency starts to break down. So it's important if you're chasing every µs in the stack, but not if your goal is lowering the p99, which shows up under heavy load. In that case, consistency between measurements is paramount in order to get histograms and such that make sense.
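For the p99 part, a small Python sketch using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different numbers). percentile is a made-up helper, and the sample values are invented purely to show the call.

```python
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 0 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(samples)
    # rank = ceil(p * n / 100), computed with integers only.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Invented latencies in microseconds: mostly fast, with a slow tail.
    latencies_us = [100.0] * 98 + [900.0, 5000.0]
    print(percentile(latencies_us, 50), percentile(latencies_us, 99))
```

A constant offset added to every sample shifts all percentiles by the same amount, which is why a consistent measurement overhead matters less than one that varies under load.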
When I have measured latency in the past, I normally did it from the perspective of the caller, not the server.
In most cases this is over the network, a named pipe, or a Unix socket file.
I guess it should be possible to run multiple runtimes inside a program that run independently.