There must be some reason why no OS offers non-blocking disk I/O in this way, but I don't know what it is.
To an extent this is a design rooted in a world where disks are much faster than network access. But perhaps that world is coming back with SSDs, NVMe, etc.
Now, with NT, you have ReadFileEx() and WriteFileEx(), and a user can call them such that the semantics are: "hey, try and read this, if you can do it immediately without blocking, great... if you have to block, then do whatever you need to do in the background to make that happen, but still return to me without blocking".
That, and that alone, is the key difference between the inherently synchronous I/O model of UNIX, and the inherently asynchronous I/O model of NT. The entire NT I/O subsystem, cache manager, driver API, memory management, APCs, scheduling et al. is predicated on the notion of every I/O request being asynchronous.
If everything happens to be in the right spot at the right time, sometimes an I/O call can be synchronous (i.e. user->kernel->user without a context switch due to a required wait). In every other case, the kernel won't be able to complete it there and then, so, it checks to see if the user still wants that read or write call to return immediately -- which implies "asynchronous I/O" (referred to as "overlapped" I/O in NT parlance, because you're overlapping an I/O request with more compute).
Windows kernel drivers are fundamentally more complex than corresponding Linux drivers because the kernel's I/O model is fundamentally more sophisticated -- everything is packet driven (the "I/O request packet", or IRP). Your driver's read/write entry points need to be able to query the incoming I/O request and determine whether the user wants sync or async, how the call has to be returned so that the I/O manager can furnish the correct behavior to all the other pieces of the subsystem (and potentially other drivers that are layered higher and lower), and a huge number of other subtle details.
The added complexity is required because the fundamental I/O model is asynchronous. In the UNIX synchronous I/O model, there's simply no semantic concept -- at the driver level, at the kernel level, or in the APIs exposed to the user -- to say "here, read() this and return immediately -- if it can be done synchronously, great, if not, kick it off in the background and give me some opaque structure back I can use in the future to check on the completion of the operation".
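To make that "complete it now if you can, otherwise hand me something to check later" contract concrete, here's a rough user-space model of it in Python. This is only a sketch of the semantics, not the NT API: CACHE, device_ready, slow_read() and try_read() are all hypothetical names I made up, and a Future stands in for the opaque structure you'd get back.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-ins: CACHE is data that is already resident, and
# device_ready is set when the pretend device finishes its work.
CACHE = {"hot.txt": b"cached bytes"}
device_ready = threading.Event()


def slow_read(name):
    """Pretend device read: cannot finish until the 'device' is ready."""
    device_ready.wait()
    return b"bytes for " + name.encode()


def try_read(name, pool):
    """Return a Future right away; the caller never blocks in here.

    If the data is already available, the Future is completed before we
    return (the "it happened to be synchronous" case). Otherwise the read
    is queued on the pool and the caller gets a pending Future back.
    """
    if name in CACHE:
        done = Future()
        done.set_result(CACHE[name])
        return done
    return pool.submit(slow_read, name)


with ThreadPoolExecutor(max_workers=2) as pool:
    hot = try_read("hot.txt", pool)
    cold = try_read("cold.txt", pool)
    hot_was_immediate = hot.done()    # completed inside try_read()
    cold_was_immediate = cold.done()  # still pending in the background
    # The caller is free to overlap more compute here.
    device_ready.set()                # the pretend device finishes
    hot_data = hot.result()
    cold_data = cold.result()         # only waits if still pending
```

The point is that the caller uses one call shape for both outcomes and only finds out afterwards whether the request completed on the spot or went pending.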
The other huge advantage of NT is the notion of thread-agnostic I/O. That is, the thread that initiates one of these asynchronous read requests doesn't have to be the same thread that completes it. Although it sounds simple, that's one of those tip-of-the-iceberg technical things where there are so many pieces behind the scenes that need to cooperate to facilitate the functionality. I talk a little bit about thread-agnostic I/O here: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-....
So, to summarize, all discussions regarding asynchronous I/O and M:N threading on UNIX are sort of fundamentally flawed because the underlying primitives can't express what is actually needed (an asynchronous I/O subsystem at the kernel level, thread-agnostic completion-oriented I/O, and ideally, thread pools + completion ports) to achieve the end goal: optimally using your underlying hardware :-)
(Optimal hardware usage necessitates one thread running per core, and the ability for any one of these threads to continue program logic upon completion of an I/O request, regardless of whether or not they were the thread to initiate that request.)
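A toy illustration of that last point, again as a Python model rather than anything NT-specific: one worker per core pulls completion packets off a shared queue, and whichever worker gets a packet carries on with the program logic. The queue is only standing in for a completion port, and completion_port, worker() and the packet layout are hypothetical names for this sketch.

```python
import os
import queue
import threading

completion_port = queue.Queue()  # stands in for a completion port
results = []
results_lock = threading.Lock()


def worker():
    """Take the next completion packet, whoever initiated the request."""
    while True:
        packet = completion_port.get()
        if packet is None:  # shutdown sentinel
            break
        initiator_id, payload = packet
        # "Continue program logic" on whichever thread picked this up.
        with results_lock:
            results.append((initiator_id, threading.get_ident(), payload.upper()))


# One worker per core, as described above.
workers = [threading.Thread(target=worker) for _ in range(os.cpu_count() or 1)]
for w in workers:
    w.start()

# The main thread initiates every request and completes none of them.
main_id = threading.get_ident()
for payload in ("a", "b", "c"):
    completion_port.put((main_id, payload))

# One sentinel per worker, queued after the real packets.
for _ in workers:
    completion_port.put(None)
for w in workers:
    w.join()
```

Every entry in results records the main thread as initiator and a different thread as the one that finished the work, which is all "thread-agnostic" means here.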