Epoll vs. io_uring in Linux

(sibexi.co)

255 points · Sibexico · 3d ago · 68 comments

Uptrenda · 3d ago · 12 in thread

Yes, io_uring is significantly faster than epoll (I think I had like 20% faster req/s with io_uring). The catch is that it's kernel opt-in and disabled just about everywhere for security reasons. I think it has direct memory sharing between the kernel and userland, which is kind of yikes. There have been multiple exploits that hit io_uring in recent times. Because of this, even engineering projects that try to reach the highest performance possible (like Go) don't really bake io_uring in as a sane default. Though if you want to take the risk, you can always run it yourself for your favourite language. It is faster, but the cost is possible exploits.

Asmod4n · 3d ago

The main reason it gets disabled is fixed now: the latest RC got cBPF support, so you can restrict which ops can be run instead of just fully disabling it.

mort96 · 3d ago

Well, the reason it's disabled now is the recent history of pretty bad vulnerabilities. It probably needs to go a while without new vulnerabilities before it makes sense to enable by default. It's pretty complex, completely unsafe C code, after all.

tempaccount420 · 3d ago

It's not complex actually, but it is C...

Cloudef · 3d ago

It quite depends. I've had times when my POSIX emulation of io_uring (with poll, not epoll) was faster than io_uring. For large zero-copy buffers, however, io_uring is king. io_uring is also useful even for non-asynchronous I/O, as it can implement a chain of operations as a single atomic operation (mkdir + open it, for example).

For something like networking, if you are maximizing packets per second, you'll hit kernel limits[1] very quickly and instead have to start leveraging features like GSO/GRO or completely bypass the network stack.

1: https://github.com/axboe/liburing/discussions/1346

lukeh · 3d ago

Also it’s nice for things like SPI which have no user space non-blocking API.

nottorp · 2d ago

SPI the bus?

csdreamer7 · 3d ago

RHEL 9 and 10 now fully support io_uring by default. It is very recent, but this covers a lot of corporate Linux installs. Gemini 'said' Ubuntu and SuSE support it as well, but did not provide any links to prove it.

https://access.redhat.com/solutions/4723221

Go should reconsider support. They should have a 'go' at it.

insanitybit · 3d ago

It's still seccomp'd off in most environments because io_uring is still a seccomp bypass that doesn't play well with kernel security systems (audit subsystem), even if it weren't also like the #1 or #2 exploit vector for privesc.

Asmod4n · 3d ago

That’s solved as of last week, you can use cBPF now to disable functionality.

omcnoe · 3d ago

For a project like Go, wouldn't it be an option to do one-time io_uring feature detection at runtime startup? Exploits are an issue for the entire OS, not the program choosing to use io_uring, yeah?
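A minimal sketch of what such a one-time startup probe could look like, written in Python with `ctypes` rather than taken from any runtime. It assumes that `io_uring_setup` has syscall number 425 (true on the common Linux architectures, but worth checking against your platform's headers) and that calling it with a NULL `params` pointer fails with `EFAULT` when the syscall is reachable. The function name and the four result strings are made up for illustration.

```python
import ctypes
import errno
import os

# Assumption: 425 is the io_uring_setup syscall number on the common Linux
# architectures (x86_64, arm64, riscv); verify against your platform headers.
SYS_IO_URING_SETUP = 425


def probe_io_uring():
    """Return 'available', 'blocked', 'missing' or 'unknown' (hypothetical labels)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ctypes.set_errno(0)
    # entries=0 with a NULL params pointer cannot create a ring, so the errno
    # only tells us how far the call got before being rejected.
    ret = libc.syscall(
        ctypes.c_long(SYS_IO_URING_SETUP),
        ctypes.c_long(0),
        ctypes.c_void_p(None),
    )
    if ret >= 0:
        # Not expected with NULL params; close the descriptor if it happens.
        os.close(ret)
        return "available"
    err = ctypes.get_errno()
    if err in (errno.EFAULT, errno.EINVAL):
        return "available"  # the kernel got as far as validating arguments
    if err in (errno.EPERM, errno.EACCES):
        return "blocked"    # e.g. a seccomp filter or the io_uring_disabled sysctl
    if err == errno.ENOSYS:
        return "missing"    # no io_uring in this kernel, or a filter returning ENOSYS
    return "unknown"


# Probe once at startup, then choose the epoll or io_uring backend accordingly.
IO_URING_STATE = probe_io_uring()
```

The probe only says whether the syscall can be reached; it says nothing about which opcodes are permitted, so a real runtime would still need a fallback path per operation.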

happyPersonR · 3d ago

With any kind of poll-mode networking (RDMA, DPDK, io_uring), it's really kind of up to the user to do the memory isolation.

In io_uring's case, though, you can't do much because the rings are in the kernel.

I'm hopeful, though, that with LLMs things will get better.

But it's just a hard problem to solve. Very difficult to do in the kernel itself, and folks don't really even understand tuning for it.

kshri24 · 3d ago

The ring buffers are in shared memory, not kernel-private. Both rings (submission and completion) are shared between kernel space and user space. The user publishes requests as submission queue entries (updating the tail of the buffer while the kernel reads from the head); the kernel consumes them on its end and returns a completion queue event by publishing to the completion buffer. The user then pulls from that buffer in user space (specifically from the head; the kernel updates the tail).
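The head/tail protocol described above can be sketched as a toy model. This is plain Python with no shared memory, memory barriers, or real io_uring structures; `Ring` and `kernel_step` are hypothetical names used only to show which side advances which index.

```python
class Ring:
    """Toy single-producer/single-consumer ring with free-running indices."""

    def __init__(self, entries):
        # Power-of-two size so an index can be wrapped with a mask.
        assert entries > 0 and entries & (entries - 1) == 0
        self.mask = entries - 1
        self.slots = [None] * entries
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item):
        """Producer side: write a slot, then publish it by bumping tail."""
        if self.tail - self.head > self.mask:
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        """Consumer side: read the slot at head, then release it."""
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def kernel_step(sq, cq):
    """Stand-in for the kernel: consume submissions, publish completions."""
    done = 0
    # Only take a request while there is room to post its completion.
    while sq.head != sq.tail and cq.tail - cq.head <= cq.mask:
        request = sq.pop()
        cq.push(("done", request))
        done += 1
    return done


sq, cq = Ring(4), Ring(4)
for i in range(3):
    sq.push(("read", i))      # user: producer of the submission queue
completed = kernel_step(sq, cq)
results = [cq.pop() for _ in range(completed)]  # user: consumer of the completion queue
```

In the real interface each side owns exactly one index per ring, which is why the two sides can exchange entries without a lock; the model keeps that ownership but ignores everything about ordering and visibility between CPUs.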

mrlonglong · 3d ago · 12 in thread

Boost asio if you love C++ and asynchronous networking.

DmitryOlshansky · 2d ago

I recently replaced Asio with a straight epoll event loop and got about 16% better RPS. That is for a reasonably sized SQL server, so be careful with nice precanned libraries.

MathMonkeyMan · 3d ago

I switched out Asio's epoll backend for its io_uring backend in a database server and CPU utilization shot up. It probably depends on usage and the specifics of how it's integrated into the event code.

Asmod4n · 3d ago

No async I/O framework exists that uses everything io_uring can do; they are all built around the poll model. As such, io_uring will always come out worse behind those poll-like abstractions.

The two things that make io_uring fast are chaining of operations and zero-syscall mode. To use the former, every async I/O framework/library would need to be rewritten, and then all user-facing apps would need to be rewritten too, since all you'd get now are completions of operations instead of waiting until you can run an operation.
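For readers who have not used either API, the readiness ("poll") model referred to above can be sketched with Python's `select.epoll`: the application first asks which descriptors are ready and only then issues the read and write itself. A completion-style API hands back the finished operation instead. The helper names here are hypothetical, and the example uses a local socket pair purely so that it is self-contained.

```python
import select
import socket


def handle_ready(epoll, conns):
    """Step 1: ask which fds are readable. Step 2: do the I/O ourselves."""
    handled = 0
    for fd, events in epoll.poll(timeout=1.0):
        if events & select.EPOLLIN:
            data = conns[fd].recv(4096)          # a separate syscall per ready fd
            if data:
                conns[fd].sendall(data.upper())  # and another one for the reply
                handled += 1
    return handled


def demo():
    client, server = socket.socketpair()
    server.setblocking(False)
    epoll = select.epoll()
    try:
        epoll.register(server.fileno(), select.EPOLLIN)
        client.sendall(b"ping")
        handled = handle_ready(epoll, {server.fileno(): server})
        reply = client.recv(4096)
    finally:
        epoll.close()
        client.close()
        server.close()
    return handled, reply
```

Code structured like `handle_ready` is what would have to change to benefit from chained or batched submissions, because its control flow is built around "tell me when I may act" rather than "tell me when it is done".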

vlovich123 · 3d ago

That’s paradoxically what you can expect on a busy server - your CPU can spend time doing work that would have been previously IO wait time. Of course, it could be a bug in the implementation where you’re spinning doing no work erroneously, but depends on the details.

saghm · 3d ago

Yeah, the explanation that I've usually heard for this sort of thing is that it's intended to get back CPU time that's lost when too many system threads are blocking to keep something on every core even during I/O (or pay for it in terms of the context switching overhead if you compensate for this with an extremely large number of system threads). The theory is that you'll avoid idle CPU compared to the common "one thread per core" way of doing things due to some of them being idle during I/O, at the cost of using some extra CPU to handle more things in user space. Obviously how much this helps can vary between use cases, but the measure of how much it's helping (or if it's maybe not helping at all!) is throughput, not CPU utilization.

FooBarWidget · 3d ago

This makes no sense. Epoll is already non-blocking; you never waste time waiting for I/O as long as there is work to do. io_uring only boosts CPU efficiency (batching of syscalls, for example); it does not reduce blocking.

topspin · 2d ago

Classic.

Know that the increase in CPU utilization may mean you've improved the performance of your "database server," because now your CPU cores are waiting less on IO. It also may not mean this, but just looking at htop won't tell you either way.

toast0 · 3d ago

In addition to the other discussion: it's important to measure outcomes and not just look at the CPU meter...

At the same load, how did latency look for A vs B?

What were throughput and latency like at maximum load for A vs B? For whichever one had the smaller max throughput, what did latency look like for the other option at that load?

For bonus points while testing: is there another observable metric to indicate available capacity, if CPU % free is less useful?

LoganDark · 3d ago

Boost is so inconvenient, they're huge dynamic libraries that are a pain to build and use. Even when I was already using CMake, getting Boost installed in a way where it could be discovered was super annoying. (I was on Mac, though)

Chaosvex · 3d ago

Asio also comes in standalone form and both versions are header-only. Not necessarily directly related to your comment but adding it on, anyway.

wavemode · 3d ago

Some (most?) Boost libraries are header-only. Including Boost.Asio nowadays.

cherryteastain · 3d ago

You can statically link boost

toast0 · 3d ago · 7 in thread

> But my students weren’t as happy as I was - they wanted to build something genuinely useful, and they were really disappointed that our “product” had strong architectural limits and couldn’t outperform titans like nginx and haproxy.

I took a (very brief) look at the github repo [1], it doesn't look like you're doing anything with cpu pinning.

You can probably eke (thanks) out a bit more performance if you cpu pin your threads and cpu pin your listen sockets (sockopt SO_INCOMING_CPU).

If you also cpu align your outgoing sockets, you should get a significant boost, but afaik, there's no great api for that. Linux does have an api for compatible NICs (traffic steering/flow steering) which can work, but if you know what hash your NIC uses (it's probably toeplitz) and you manage source port selection to your backend, you can pick ports that will hash properly.

The goal is for your proxy to be able to handle packets without any cross cpu communication.

[1] https://github.com/sibexico/TinyGate

Sibexico (OP) · 3d ago

Basically, v0 and v1 of the repo are completely different implementations, written almost from scratch. I'm now working on the third implementation, I believe the last one. :) Completely different architectural choices were made.

toast0 · 3d ago

If it's still running on more than a single core, and your students want it to go faster, aligning the work to cpus will almost certainly be useful.

I saw you mentioned Windows development elsewhere. You might be interested to know that Microsoft pioneered Receive Side Scaling and Send Side Scaling. If you try your proxy out on Windows, be sure to hook into those systems there.

The less work your proxy does, the more important avoiding cross core communication is.

camkego · 3d ago

Pin threads to cores, and make sure threads on different cores aren't writing to the same 64- or 128-byte block. Look up "false sharing".
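The pinning half of that advice can be sketched with the Linux-only `os.sched_setaffinity` call. The helper names are hypothetical, and the choice of the lowest allowed CPU is arbitrary; a real proxy would map each worker to the core that services its NIC queue.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling thread to one CPU and return the resulting mask."""
    os.sched_setaffinity(0, {cpu})  # 0 means "the caller"
    return os.sched_getaffinity(0)


def demo():
    original = os.sched_getaffinity(0)
    target = min(original)          # arbitrary: lowest CPU we are allowed to use
    try:
        pinned = pin_to_cpu(target)
    finally:
        os.sched_setaffinity(0, original)  # restore the previous mask
    return pinned == {target}
```

Avoiding false sharing is a separate, data-layout concern (padding or aligning per-thread state to the cache line size) that pinning alone does not address.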

iamcreasy · 2d ago

Thanks for the write-up. So the first version was synchronous, the second version used epoll, and the third one will use io_uring?

ahepp · 3d ago

I would be interested to see benchmarks for that patch

toast0 · 2d ago

I don't have the right setup to make good benchmarks for this right now, but when I had the chance to put it into practice, the improvement between no CPU alignment and full alignment was quite large. That was on 28-core machines (with 16 NIC queues); many years ago, but IIRC, I got at least 10x the connections/sec out of the boxes after tuning, and after tuning 12 cores were idle ... the machines were repurposed; if they had been ordered for this, they should have had one core per NIC queue in a single socket. The difference is likely smaller on a 4-core machine as described in the article.

The hardest part is going to be generating enough load. I had production load, which has the benefit that you don't need to generate it. Otoh, it was a transitional need, and I couldn't reasonably test above 50% of peak traffic on a single machine ... I hit that mark around the time traffic started dropping, and then it wasn't fun anymore.

jibal · 3d ago

eke

thomashabets2 · 3d ago · 4 in thread

I've not yet tested the shared buffers for my io_uring-based web server, but that's because instead of reading from a file and writing, I send directly from an mmapped region.

But really, I want to sendfile with io_uring, but that's not supported yet.

My writeup, with extra buzzwords like Rust and kTLS: https://blog.habets.se/2025/04/io-uring-ktls-and-rust-for-ze...

It was on HN too: https://news.ycombinator.com/item?id=44980865

Ne02ptzero · 3d ago

FYI you can do something sendfile-ish with io_uring, since splice(2) is implemented. Not as user-friendly as sendfile, but it should work fairly similarly.

thomashabets2 · 3d ago

Oops, I actually replied to the wrong comment. I replied here: https://news.ycombinator.com/item?id=48617774

But it's a relevant reply to both comments, so copied here:

Yes, my understanding is that I should be able to emulate sendfile via splice. The problem with that is that splice requires one end to be a pipe. So I think this means two extra file descriptors per connection (one per side of the pipe). And per connection this adds 5 slots in the submit/completion queue, with a LINK dependency. Maybe the trade off is worth it. I've not done concrete experiments with it, but I'm guessing it would be if the saved copy_from_user is large enough.

So for optimal performance this may mean using write() for short files, and a pipe(), a pair of splice() calls, and a pair of close() calls, for larger files.

luke5441 · 3d ago

On the Linux side sendfile is implemented via splice. So it is a more generic API that covers the sendfile case.

thomashabets2 · 3d ago

Yes, my understanding is that I should be able to emulate sendfile via splice. The problem with that is that splice requires one end to be a pipe.

So I think this means two extra file descriptors per connection (one per side of the pipe). And per connection this adds 5 slots in the submit/completion queue, with a LINK dependency. Maybe the trade off is worth it. I've not done concrete experiments with it, but I'm guessing it would be if the saved copy_from_user is large enough.

So for optimal performance this may mean using write() for short files, and a pipe(), a pair of splice() calls, and a pair of close() calls, for larger files.

Edit: I guess I could save some ops by reusing pipes, but then I'd have to make sure to flush them. Would add some complexity.

up2isomorphism · 3d ago · 3 in thread

The author takes a very benchmark-focused view of this topic, which only tells part of the story, particularly for complex systems. Note that a number of very similar interfaces existed on other platforms like Windows long before io_uring, but that does not make Linux's I/O system worse or slower than those platforms. In almost all cases, a fast server is likely to be fast with either a multiplexing or an async API if implemented correctly.

Sibexico (OP) · 3d ago

I'm not a Windows developer; I mostly work with Linux and FreeBSD. Thanks for the point, I'll look at how it works on Windows systems.

muststopmyths · 3d ago

Equivalent in Windows is Registered I/O (RIO) for sockets.

Windows network development is really, really different from Unixy stuff. But you might have fun :)

RossBencina · 3d ago

There is no benchmark in the post. There is analysis, discussion and code examples for epoll and io_uring usage.

spliffedr · 3d ago · 1 in thread

Take a look at https://github.com/concurrencykit/ck and https://github.com/microsoft/mimalloc; they will fit well for a zero-copy and memory-aligned reverse proxy. Also, if you want to add DDoS protection and more advanced L4 stuff, check out https://docs.ebpf.io/ebpf-library/libxdp/libxdp/

Sibexico (OP) · 3d ago

Yeah, the plan was to apply optimizations at the other levels first, then move on to allocators. I'm studying allocators with my students right now; the previous post on the blog was about a custom allocator in Zig.

GalaxyNova · 3d ago · 1 in thread

The year is 2050; there are 20 different ways to poll a socket on Linux.

Uptrenda · 3d ago

Yes, even for io_uring: single-shot, and then multishot to go even faster.

witx · 3d ago

Such a great article!

This sent me down a rabbit hole of io_uring, kernel development and C. I've been a Rust and C++ dev for quite a few years now, but there's such a simplicity and even an artistic feel to small(ish) C programs.

buybackoff · 3d ago

In the context of a proxy one should mention epoll_wait busy poll. I've recently dived into this when reviewing low-latency options, and found that it's almost possible to do user space busy polling just for simple sockets, no DPDK/VMA/io_uring needed, and Fastly contributed to this and uses it.

It's too low-level; I cannot even say that I understand everything, only the concept, so I will just share some links. It works only per NAPI epoll context, and one cannot easily control the NAPI ID, but if an entire machine is dedicated to a proxy, one can do a simple trick of assigning sockets by NAPI ID to dedicated pollers.

In my use case it was not a proxy, but N sockets being polled on a machine that then processes the received data. It does not look feasible for such a case; maybe round-robin polling of NAPI contexts from a single thread would work. What I would really like to have one day from the kernel is a way to easily tell it: trust me, I will poll this single socket eventually, never ever use the IRQ path for it.

Previous HN discussion of the kernel feature: https://news.ycombinator.com/item?id=43749271

Nice presentation by the Fastly contributor, with nice diagrams making the big picture much easier to understand: https://netdevconf.info/0x18/docs/netdev-0x18-paper10-talk-s...

LWN articles: https://lwn.net/Articles/1008399/, https://lwn.net/Articles/997491/, https://lwn.net/Articles/959462/

Kernel docs: https://docs.kernel.org/networking/napi.html#irq-mitigation

inigyou · 3d ago

If you write one with DPDK, it'll be infinitely more complex but you'll have the opportunity to blow away nginx in performance.

If you make one run on an FPGA, it'll be even more complex.

The lesson is that cutting through abstraction like a hot knife through butter is a necessary mindset for performance but also makes things more difficult. Sockets and thread-per-connection were good approaches when networks were very slow relative to CPUs, and they're still often the simplest approach today.

eatonphil · 3d ago

I was also always curious about this and recently wrote a few implementations of an http file server to teach myself the key differences.

https://theconsensus.dev/p/2026/05/18/serving-files-three-wa...

gafferongames · 3d ago

Just use AF_XDP
