For something like networking, if you are maximizing packets per second, you'll hit kernel limits[1] very quickly and instead have to start leveraging features like GSO/GRO or completely bypass the network stack.
[1] https://access.redhat.com/solutions/4723221
Go should reconsider support. They should have a 'go' at it.
With RDMA, DPDK, and io_uring, it's really up to the user to do the memory isolation.
In io_uring's case, though, you can't do much because the rings are in the kernel.
I'm hopeful, though, that with LLMs things will get better.
But it's just a hard problem to solve. It's very difficult to do in the kernel itself, and folks don't really even understand tuning for it.
The two things that make io_uring fast are chaining of operations and zero-syscall mode. The former would require every async IO framework and library to be rewritten to make use of it, and then all user-facing apps would need to be rewritten too, since all you'd get now are completions of operations instead of waiting until you can run an operation.
Know that the increase in CPU utilization may mean you've improved the performance of your "database server," because now your CPU cores are waiting less on IO. It also may not mean this, but just looking at htop won't tell you either way.
At the same load, how did latency look for A vs B?
What were throughput and latency like at maximum load for A vs B? For whichever one had the smaller max throughput, what did latency look like for the other option at that same throughput?
For bonus points while testing: is there another observable metric that indicates available capacity, if CPU % free is less useful?
I took a (very brief) look at the github repo [1], it doesn't look like you're doing anything with cpu pinning.
You can probably eke (thanks) out a bit more performance if you cpu pin your threads and cpu pin your listen sockets (sockopt SO_INCOMING_CPU).
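A minimal sketch of the pinning idea above, assuming Linux and one listener per worker thread. The helper name `pinned_listener` is made up for illustration, and the fallback value 49 for `SO_INCOMING_CPU` is the asm-generic Linux constant (some architectures differ, and older Python versions don't export it). How the kernel uses the hint across a `SO_REUSEPORT` group depends on the kernel version, so treat this as a starting point to measure, not a guaranteed win.

```python
import os
import socket

# Assumption: Linux asm-generic value, used only if Python doesn't export it.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def pinned_listener(cpu, host="127.0.0.1", port=0):
    """Pin the calling thread to `cpu` and open a listen socket hinted to it.

    Returns (socket, hinted); `hinted` is False if the kernel refused the
    SO_INCOMING_CPU option.
    """
    # pid 0 means "the calling thread" for sched_setaffinity.
    os.sched_setaffinity(0, {cpu})

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets every worker bind its own socket to the same port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU, cpu)
        hinted = True
    except OSError:
        hinted = False
    sock.bind((host, port))
    sock.listen(128)
    return sock, hinted
```

In a real proxy each worker thread would call this once with its own CPU and the shared port, then run its accept loop on the returned socket.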
If you also cpu align your outgoing sockets, you should get a significant boost, but afaik, there's no great api for that. Linux does have an api for compatible NICs (traffic steering/flow steering) which can work, but if you know what hash your NIC uses (it's probably toeplitz) and you manage source port selection to your backend, you can pick ports that will hash properly.
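The port-picking trick can be sketched roughly as follows, assuming the NIC really does use a Toeplitz hash over the IPv4/TCP 4-tuple. Everything NIC-specific here is an assumption: `EXAMPLE_KEY` is just an example 40-byte key, and the indirection table is a made-up round-robin one. The real key and table have to be read from your driver (for example with `ethtool -x`). Note that the hash is computed on the incoming packet, so for replies from the backend the source is the backend and the destination is the proxy.

```python
import ipaddress

# Assumption: example 40-byte RSS key; read the real one from your NIC.
EXAMPLE_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0d0ca2bcbae7b30b4"
    "77cb2da38030f20c6a42b73bbeac01fa"
)


def toeplitz_hash(key, data):
    """32-bit Toeplitz hash: XOR the key window at each set input bit."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def flow_hash(key, src_ip, src_port, dst_ip, dst_port):
    """Hash the 4-tuple in the order src addr, dst addr, src port, dst port."""
    data = (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + src_port.to_bytes(2, "big")
        + dst_port.to_bytes(2, "big")
    )
    return toeplitz_hash(key, data)


def pick_source_port(key, local_ip, backend_ip, backend_port,
                     want_queue, table, ports=range(32768, 61000)):
    """Find a local port whose reply traffic lands on `want_queue`.

    `table` models the NIC indirection table (hash -> RX queue); indexing it
    with `hash % len(table)` is a simplification.
    """
    for port in ports:
        # Replies arrive as backend -> proxy, so hash them in that direction.
        h = flow_hash(key, backend_ip, backend_port, local_ip, port)
        if table[h % len(table)] == want_queue:
            return port
    return None
```

The proxy would then `bind()` the outgoing socket to the returned port before `connect()`, on the thread pinned to the CPU that services that queue.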
The goal is for your proxy to be able to handle packets without any cross cpu communication.
I saw you mentioned Windows development elsewhere. You might be interested to know that Microsoft pioneered Receive Side Scaling and Send Side Scaling. If you try your proxy out on Windows, be sure to hook into those systems there.
The less work your proxy does, the more important avoiding cross core communication is.
The hardest part is going to be generating enough load. I had production load, which has the benefit that you don't need to generate it. Otoh, it was a transitional need, and I couldn't reasonably test above 50% of peak traffic on a single machine ... I hit that mark around the time traffic started dropping, and then it wasn't fun anymore.
But really, I want to sendfile with io_uring, but that's not supported yet.
My writeup, with extra buzzwords like Rust and kTLS: https://blog.habets.se/2025/04/io-uring-ktls-and-rust-for-ze...
It was on HN too: https://news.ycombinator.com/item?id=44980865
But it's a relevant reply to both comments, so copied here:
Yes, my understanding is that I should be able to emulate sendfile via splice. The problem with that is that splice requires one end to be a pipe. So I think this means two extra file descriptors per connection (one per side of the pipe). And per connection this adds 5 slots in the submit/completion queue, with a LINK dependency. Maybe the trade off is worth it. I've not done concrete experiments with it, but I'm guessing it would be if the saved copy_from_user is large enough.
So for optimal performance this may mean using write() for short files, and a pipe(), a pair of splice() calls, and a pair of close() calls, for larger files.
Edit: I guess I could save some ops by reusing pipes, but then I'd have to make sure to flush them. Would add some complexity.
Windows network development is really, really different from Unixy stuff. But you might have fun :)
This sent me down a rabbit hole of uring, kernel development and C. I've been a Rust and C++ dev for quite a few years now, but there's such a simplicity and even artistic feel to small(ish) C programs.
It's too low level; I can't even claim to understand everything, only the concept, so I will just share some links. It works only per NAPI epoll context, and one cannot easily control the NAPI ID, but if an entire machine is dedicated to a proxy, one can do a simple trick of assigning sockets by NAPI ID to dedicated pollers.
In my use case, it was not a proxy, but polling N sockets on a machine that then processes the received data. It does not look feasible for such a case; maybe round-robin polling of NAPI contexts from a single thread would work. What I would really want to have one day from the kernel is a way to easily tell it: trust me, I will poll this single socket eventually, never ever use the IRQ path for it.
Previous HN discussion of the kernel feature: https://news.ycombinator.com/item?id=43749271
Nice presentation by the Fastly contributor, with nice diagrams making the big picture much easier to understand: https://netdevconf.info/0x18/docs/netdev-0x18-paper10-talk-s...
LWN articles: https://lwn.net/Articles/1008399/, https://lwn.net/Articles/997491/, https://lwn.net/Articles/959462/
Kernel docs: https://docs.kernel.org/networking/napi.html#irq-mitigation
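A rough sketch of the "assign sockets by NAPI ID to dedicated pollers" trick, showing only the bookkeeping and not the busy-poll setup itself. The `NapiDispatcher` class and its first-come assignment policy are made up for illustration. The fallback value 56 for `SO_INCOMING_NAPI_ID` is the asm-generic Linux constant, and the option only works on kernels built with busy-poll support, so the lookup falls back to 0 (no NAPI ID known).

```python
import socket

# Assumption: Linux asm-generic value, used only if Python doesn't export it.
SO_INCOMING_NAPI_ID = getattr(socket, "SO_INCOMING_NAPI_ID", 56)


def napi_id_of(sock):
    """Return the NAPI ID the socket last received on, or 0 if unknown."""
    try:
        return sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_NAPI_ID)
    except OSError:
        return 0


class NapiDispatcher:
    """Map each NAPI ID to one poller so a poller only sees one RX context."""

    def __init__(self, num_pollers):
        self.num_pollers = num_pollers
        self._by_napi = {}

    def poller_for(self, napi_id):
        if napi_id == 0:
            return 0  # no NAPI ID yet: park the socket on poller 0
        if napi_id not in self._by_napi:
            # Hypothetical policy: hand out pollers in first-seen order.
            self._by_napi[napi_id] = len(self._by_napi) % self.num_pollers
        return self._by_napi[napi_id]
```

After `accept()`, the acceptor would call `dispatcher.poller_for(napi_id_of(conn))` and hand the connection to that poller's epoll instance.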
If you make one run on an FPGA, it'll be even more complex.
The lesson is that cutting through abstraction like a hot knife through butter is a necessary mindset for performance but also makes things more difficult. Sockets and thread-per-connection were good approaches when networks were very slow relative to CPUs, and they're still often the simplest approach today.
https://theconsensus.dev/p/2026/05/18/serving-files-three-wa...