undefined | Better HN

0 pointsbitwize10y ago0 comments

The difference between Windows Overlapped I/O and POSIX AIO is that on Windows it's a black box, so people can pretend it's magical.

No. The difference is that in Windows you can check for completion and set up an overlapped I/O operation in one system call. Requiring multiple system calls to do the same thing means more unnecessary context switches, and the possibility of race conditions especially in multithreaded code. That and, as trentnelson stated, the Windows implementation is well integrated with the kernel's filesystem cache. Linux userspace solutions? Hahaha.

Supplying this capability as a primitive rather than requiring userland hacks is the right way to do it from an application developer's perspective.

0 comments

6 comments · 4 top-level

makomk10y ago· 2 in thread

It's only the right way to do it from an application developer's perspective if the behaviour of the official OS-approved interface matches what their application needs, which is unlikely to be the case here. For example, according to https://support.microsoft.com/en-us/kb/156932 there's a fixed-sized pool of threads used to fetch data into cache to fill async I/O requests. If you try to have too many async I/O requests for the number of threads (which can be as small as 3) the excess are automatically converted into synchronous, blocking requests. This renders the API useless for something like an event-driven web server, because any file I/O call could block the main thread until it completes.

(Also, curiously when the data's in the cache that page shows a performance penalty for async reads that complete synchronously from the cache compared to sync reads. Wonder why.)

trentnelson10y ago

I comment on the advice given on that page here: https://news.ycombinator.com/item?id=11867375

I've never had a single issue with "async things suddenly becoming synchronous which blocks the main thread" -- if you architect things properly that just never happens, blocking operations are off the hot path, and when you absolutely must block, IOCP's concurrency self-awareness kicks in and another thread is scheduled to run on the core, ensuring that each core has one (and only one) runnable thread.

trentnelson10y ago

> (Also, curiously when the data's in the cache that page shows a performance penalty for async reads that complete synchronously from the cache compared to sync reads. Wonder why.)

Because a synchronous operation is always faster than an overlapped operation if it can be completed synchronously.

Lots of stuff happens behind the scenes when an overlapped operation occurs.

wahern10y ago

As load increases, polling in Unix asymptotically approaches one additional syscall per thread for any number of sockets. That's because a single poll returns all the ready sockets--you're not polling each socket individually before each I/O request[1]. That means if your thread has 10,000 sockets, the amoritized cost per I/O operation is <= 1/10,000th of a syscall.

As for IOCP being "well integrated", what does that even mean? In Windows when file I/O can't be satisfied from the buffer cache, Windows uses a thread pool to do the I/O (presuming it just doesn't block your thread; see https://support.microsoft.com/en-us/kb/156932), just like you'd do it in Unix. There's nothing magical about that thread pool other than that the threads aren't bound to a userspace context. Maybe you mean that the kernel can adjust the number of slave threads so that there aren't too many outstanding synchronous I/O requests? But the Linux I/O scheduler can implement similar logic when queueing and prioritizing requests. It's six of one and a half-dozen of the other.

[1] At least, assuming you're doing it correctly. But sadly many libraries do it incorrectly. For example, I once audited for a startup Zed Shaw's C-based non-blocking I/O and coroutine library. IIRC, he had devised an incredibly complex hack to fallback to poll(2) instead of epoll(2) because in his tests epoll(2) didn't scale when sockets were heavily used; he only saw epoll scale for HTTP sockets where clients were long-polling. But the problem was that every time he switched coroutine contexts, he was deleting and re-adding descriptors, which completely negated all the benefits of epoll. Why did do this? Presumably because to use epoll properly you need to persist the event polling. But if application code closes a descriptor, the user-space event management state will fall out of sync with the kernel-space state, which is bad news. He tried to design his coroutine and yielding API to be as transparent as possible. But you can't do that. Performant use of epoll requires sacrificing some abstraction, similar to the hassles IOCP causes with buffer management.

The benefit of IOCP isn't performance--whether it's more performant or not is context-dependent. The biggest benefit of IOCP, IMO, is that it's the defacto standard API to use. You don't need to choose between libevent, libev, Zed's Shaws library, or the other thousands of similar libraries. On Windows everybody just uses IOCP and they can expect very good results.

The myth that IOCP is intrinsically better, or intrinsically faster, is a result of what I like to call kernel fetishism--that things are always faster and better when run in kernel space. But that's just a myth. IOCP nails down a very popular and very robust design pattern for highly concurrent network servers, but it's not necessarily the best pattern. And sticking to IOCP imposes many unseen costs. For example, it makes it more difficult to mix and match libraries each doing I/O because when you're having to juggle callbacks from many different libraries your code quickly becomes obtuse and brittle. It also demands a highly threaded environment with lots of shared state, but that likewise leads to very complex and bug prone code.

trentnelson10y ago

Running out of places to upvote you.

tremon10y ago

Application developer is not the only perspective in the world however.

j / k navigate · click thread line to collapse

0 comments

6 comments · 4 top-level

makomk10y ago· 2 in thread

(Also, curiously when the data's in the cache that page shows a performance penalty for async reads that complete synchronously from the cache compared to sync reads. Wonder why.)

trentnelson10y ago

I comment on the advice given on that page here: https://news.ycombinator.com/item?id=11867375

trentnelson10y ago

> (Also, curiously when the data's in the cache that page shows a performance penalty for async reads that complete synchronously from the cache compared to sync reads. Wonder why.)

Because a synchronous operation is always faster than an overlapped operation if it can be completed synchronously.

Lots of stuff happens behind the scenes when an overlapped operation occurs.

wahern10y ago

trentnelson10y ago

Running out of places to upvote you.

tremon10y ago

Application developer is not the only perspective in the world however.

j / k navigate · click thread line to collapse