Io_uring is not (only) a generic asynchronous syscall facility

(fancl20.github.io)

90 points · fancl20 · 4y ago · 23 comments

23 comments · 7 top-level

_vvhw · 4y ago · 6 in thread

This is a short article but the way it introduces the concepts of control plane and data plane is worth thinking about.

To elaborate:

In the abstract, the control plane is just anything that's not in the critical data path: it's safety critical but not performance critical, whereas the data plane has the inverse profile. For example, it would be fine to have (and one would want) plenty of assertions in the control plane for safety.

Whereas the data plane is in the critical path with huge volumes flowing through it. Here in the data plane one would want to optimize for performance: cache misses, context switches, branch mispredicts etc. The data plane would be like a water mains pipeline, there's nothing inside to obstruct flow. The control plane would be all the safety checks you do outside the pipeline as an operator, the little control box that adjusts pressure and controls the pipeline.

Having this clean split between control plane and data plane in the design provides both safety and performance without compromising either.

As a concrete example of this technique, in TigerBeetle [1], we have around 10,000 financial transactions in a batch. A single-threaded control plane is responsible for switching each of these batches through the consensus protocol, and we amortize all runtime bounds checks, assertions, syscalls and I/O across the batch, so that these become almost free, yet we have literally hundreds of assertions, and we're doing O_DSYNC on every write. And of course, all of this is running through io_uring so we can drive decent I/O depth to NVMe or SSD without the overhead of context switches and the complexity of any user space thread pool.

io_uring is such a great design. Not only because it's fast, but because it's so simple with such a clean separation between control and data planes. It makes pure thread-per-core designs easily achievable with fantastic performance.

[1] https://github.com/coilhq/tigerbeetle

secondcoming · 4y ago

We used a similar approach when we designed the system for video playback for a now-dead mobile phone OS. That had the added requirement of content protection, so user-land should never be able to access video frames, only tell the data plane where to put its data.

test_epsilon · 4y ago

I disagree they're so cleanly split, that io-uring is so different in that particular split, or that it matters so much.

1. "Control plane" is how an application specifies what action they want, and "data plane" is (more or less) the result. A read(2) system call is perfectly cleanly separated in those concerns (input arguments and return value for control, returned data for data). Can't get cleaner than that.

2. Performance critical. Control is highly, highly performance critical. Some applications can generate a large amount of independent operations or very large chunks, but latency is very often the performance limiter. And that tends to be increasingly true as parallelism increases and bandwidth increases, while latencies tend to improve at a much lower rate and sometimes stand still or go backwards (whether you're looking at DRAM, NAND, disk, network, or communication across cores in a single node).

3. On the data side of performance. Well it's really fuzzy because control is data, and data often controls control, you have metadata etc. But quite often data is easier to deal with, if it's parallelizable, prefetchable, predictable, linear. In CPUs for example, i$ misses are often a worse problem to have than d$ misses because running out of instructions means the whole pipeline empties out and shuts down, but stores can be buffered and a stalled load often still leaves independent useful work to do. It's common that CPUs will prefer instruction lines in their unified cache levels for this reason.

4. Data is absolutely safety critical. I'm not sure why it couldn't be if control is. Sure you could say you have redundancy in your data, but you can also have checks in your data to help ensure the control was correct (for simple example a block of data can store its own location as well, so if control logic goes wrong and reads the wrong block, it could try to recover). That said data tends to be easier to recover from than arbitrary complex logic, but that doesn't mean the data is less critical.

5. Where does metadata sit in here? In io_uring, you could have ops that open files, read directories, etc. This is no more a clean split than traditional unix APIs IMO.
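The self-locating block mentioned in point 4 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not something described in the thread: the 64-byte block size, the header layout, and the names `encode_block` and `decode_block` are hypothetical. Each block records the address it was written for, so a misdirected read is detected by the data itself.

```python
import struct
import zlib

BLOCK_SIZE = 64                    # hypothetical block size for this sketch
HEADER = struct.Struct("<QI")      # block address (u64), CRC32 of payload (u32)
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size


def encode_block(address: int, payload: bytes) -> bytes:
    """Prefix the payload with the address it is meant to live at."""
    assert len(payload) <= PAYLOAD_SIZE
    padded = payload.ljust(PAYLOAD_SIZE, b"\0")
    return HEADER.pack(address, zlib.crc32(padded)) + padded


def decode_block(expected_address: int, block: bytes) -> bytes:
    """Return the padded payload, or raise if this is not the block we asked for."""
    address, checksum = HEADER.unpack_from(block)
    payload = block[HEADER.size:]
    if zlib.crc32(payload) != checksum:
        raise ValueError("payload corrupted")
    if address != expected_address:
        raise ValueError(f"misdirected read: wanted {expected_address}, got {address}")
    return payload


# A dict stands in for the storage device: address -> encoded block.
disk = {addr: encode_block(addr, f"record {addr}".encode()) for addr in range(4)}
```

If faulty control logic asks for block 2 but fetches the bytes of block 3, `decode_block(2, disk[3])` raises instead of silently returning the wrong record, which gives the caller a chance to retry or repair.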

io-uring is nice because its submission and completion model minimizes overhead and allows asynchronous, parallel, and out-of-order operations. No surprise: it's modeled after high-performance IO device command and completion queues.
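A toy model of that submission and completion pattern, as a sketch only: this is plain Python, not the io_uring API. The `Sqe` and `Cqe` classes, the `latency` field, and `toy_kernel_drain` are hypothetical stand-ins. The one detail borrowed from io_uring is the `user_data` tag, which is echoed back in the completion so the application can match results that arrive out of order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sqe:                 # toy submission queue entry
    user_data: int         # opaque tag echoed back in the completion
    op: str
    latency: int           # simulated device latency, an assumption of this sketch


@dataclass
class Cqe:                 # toy completion queue entry
    user_data: int
    result: str


submission_queue = deque()
completion_queue = deque()


def toy_kernel_drain() -> None:
    """Take every queued request in one go and complete the fastest first."""
    batch = list(submission_queue)
    submission_queue.clear()
    for sqe in sorted(batch, key=lambda s: s.latency):
        completion_queue.append(Cqe(sqe.user_data, f"{sqe.op} done"))


# The application queues three requests, then crosses into the "kernel" once.
submission_queue.extend([
    Sqe(user_data=1, op="read", latency=30),
    Sqe(user_data=2, op="write", latency=10),
    Sqe(user_data=3, op="fsync", latency=20),
])
toy_kernel_drain()

# Completions arrive in latency order, not submission order.
completion_order = [cqe.user_data for cqe in completion_queue]
```

Three operations cost one crossing instead of three, and nothing forces the slow read to finish before the fast write is reported.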

_vvhw · 4y ago

> Performance critical. Control is highly, highly performance critical.

I am using the term "control plane" in the typical systems sense, where the control plane is not the performance critical data pipeline. The control plane by definition is outside the critical request path. Relative to the performance profile of the data plane, the control plane is not performance critical.

The control plane is usually several orders of magnitude less demanding in terms of resources (CPU, memory, network, storage) than the data plane it controls. One very obvious example of this is something like ZooKeeper, where a tiny metadata system can be responsible for switching gigantic data plane clusters.

Another basic example, the control plane handles "configuration" such as the routing table in a routing protocol, and changes to this configuration synced out of band, whereas the data plane does the rapid switching based on this table, or the control plane decides whether to switch the data plane on/off. These decisions are relatively cheap but can have serious consequences.
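A rough Python sketch of the routing-table example, under stated assumptions: the exact-match table, the port range, and the names `control_plane_install` and `data_plane_forward` are all hypothetical. The control plane does every check up front and publishes a read-only snapshot; the data plane only performs lookups against whatever snapshot is current.

```python
from types import MappingProxyType

# Read-only snapshot consulted by the data plane; replaced wholesale on update.
_routes = MappingProxyType({})


def control_plane_install(new_routes: dict) -> None:
    """Rare, slow path: validate everything, then publish a new snapshot."""
    global _routes
    for destination, port in new_routes.items():
        assert isinstance(destination, str) and destination, "empty destination"
        assert isinstance(port, int) and 0 <= port < 64, f"bad port {port}"
    assert "default" in new_routes, "table needs a default route"
    _routes = MappingProxyType(dict(new_routes))


def data_plane_forward(destinations: list) -> list:
    """Hot path: one snapshot load, then lookups only, no validation."""
    routes = _routes
    default = routes["default"]
    return [routes.get(dest, default) for dest in destinations]


control_plane_install({"default": 0, "10.0.0.1": 3, "10.0.0.2": 7})
ports = data_plane_forward(["10.0.0.2", "192.168.1.1", "10.0.0.1"])
```

A rejected update never reaches the data plane: the assertion fires before the snapshot is swapped, so forwarding continues on the last good table.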

Finally, a good example is also Amazon, where their distributed systems are famous for having control planes that always do constant work regardless of the data plane to avoid bimodal behavior.

If your system has the control plane and the data plane showing similar performance profiles then these concepts may have been conflated or not exploited to their potential.

> Data is absolutely safety critical.

Again, I think you're missing the safety that a clear definition of a control plane gives to the data pipeline it manages. As long as you treat both planes as "effectively the same" or of no consequence, you won't see how to exploit these different concepts to achieve both performance AND safety.

test_epsilon · 4y ago

> I am using the term "control plane" in the typical systems sense, where the control plane is not the performance critical data pipeline. The control plane by definition is outside the critical request path. Relative to the performance profile of the data plane, the control plane is not performance critical.

That doesn't make sense because you're talking about io-uring itself having separation between control and data, however it is exclusively involved with the request path. It seems like you made this confusion with your first post, I'm just replying to what you wrote.

jstimpfle · 4y ago

Maybe one could say that io_uring's asynchronous completion model is a better split because there's no control dependency (blocking syscalls, context switches)?

test_epsilon · 4y ago

Linux has had non-blocking / asynchronous IO APIs for many years already, so as a _general_ statement that is not one of the new interesting things about io_uring. But that is something it does well when you get into specifics (both in improving overheads of currently supported operations and expanding the types of operations that can be submitted this way).

bob1029 · 4y ago · 4 in thread

> io_uring is more than a generic asynchronous syscall facility. It's the state-of-the-art asynchronous interface for communication between subsystems implemented between the kernel and the userspace.

The most essential concept here for me is the ring buffer. The most effective practical implementation I am aware of is the LMAX Disruptor, as this implementation can essentially funnel multiple threads into a single writer w/ aggregate rates of tens-to-hundreds of millions of operations per second. If you wanted to construct any arbitrary application in which multiple participants (threads/callers/services) are synchronized against some context, this library could serve as the foundation for that.

I don't even reach directly for OS-level I/O acceleration capabilities and I can get ridiculous numbers. I simply aggregate writes in software using the buffer, and I am still able to completely saturate NVMe devices from languages like C# using stupid-simple calls like File.WriteAllBytes().

Batching is a hell of a thing when you are going for maximum throughput. Processing a ring buffer that is always full with a single thread in a hot loop is the best situation you could ever hope to find yourself in. Nothing will ever give you faster serialized throughput without resorting to compressed data structures and other dark(er) arts.

gpderetta · 4y ago

Note that the Disruptor is just one of the many applications for a ring buffer; in particular it is a multi-producer-to-multicast best-effort link (i.e. slow readers will miss packets), but not all ring buffer use cases can be covered by it.

In particular the disruptor wouldn't be appropriate to implement io_uring, where there is only one consumer, only one producer[1] and packet loss is not tolerated.

[1] some external synchronization can of course be used to handle multiple producers, but that's not io_uring's concern.
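For contrast with a best-effort multicast link, here is a minimal Python sketch of a single-producer, single-consumer ring that refuses new entries when full instead of overwriting unread ones. The `SpscRing` class is hypothetical and is not how io_uring is implemented; a real shared-memory ring would also need atomic loads and stores with the right memory ordering, which this single-threaded sketch ignores.

```python
class SpscRing:
    """Fixed-size single-producer, single-consumer ring that never drops."""

    def __init__(self, capacity: int) -> None:
        assert capacity > 0 and capacity & (capacity - 1) == 0, "power of two"
        self._slots = [None] * capacity
        self._mask = capacity - 1
        self._head = 0   # next slot to read; advanced only by the consumer
        self._tail = 0   # next slot to write; advanced only by the producer

    def try_push(self, item) -> bool:
        if self._tail - self._head == len(self._slots):
            return False              # full: the producer must retry later
        self._slots[self._tail & self._mask] = item
        self._tail += 1               # publish only after the slot is filled
        return True

    def try_pop(self):
        if self._head == self._tail:
            return None               # empty (so None cannot be a payload here)
        item = self._slots[self._head & self._mask]
        self._head += 1
        return item


ring = SpscRing(4)
accepted = [ring.try_push(n) for n in range(6)]   # the last two are refused
drained = [ring.try_pop() for _ in range(4)]
```

Because each index has exactly one writer, the producer and consumer never contend on the same variable, and backpressure (a `False` from `try_push`) takes the place of packet loss.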

bob1029 · 4y ago

> In particular the disruptor wouldn't be appropriate to implement io_uring

Completely agree. It's at a higher level of abstraction and intersects only in some conceptual ways with how the OS operates.

> [...] where there is only one consumer, only one producer[1] and packet loss is not tolerated.

My agreement above still standing, I feel like this case is actually covered by the disruptor as well. It is a trivial subset of the multi-producer, single-consumer problem that it was originally built to handle. If you made some minor modification like "packet loss is tolerated", then you have a solid case.

memco · 4y ago

> The most effective practical implementation I am aware of is the LMAX Disruptor, as this implementation can essentially funnel multiple threads into a single writer w/ aggregate rates of tens-to-hundreds of millions of operations per second.

Do you have more info on this? I work on magnetic tapes where any writes must be sequential but I want to process data to it and from it in parallel where possible.

bob1029 · 4y ago

No specific references aside from my own experimentation around these ideas. The core essence is as follows:

- Use an append-only log structure as the exclusive means of interfacing with the storage medium. If you are concerned about future cleanup (i.e. on block-wise rewritable media), segment this log out into file chunks of reasonable size. Cleanup of old segments would involve scanning each and rewriting alive data to the front of the log.

- Consider using a key-value abstraction as the basis for all of this, as it allows for trivially constructing dynamically-programmed tree structures with ideal locality-of-reference semantics (i.e. the splay tree and friends).

- All writes to the append-only log conclude with a consistent snapshot of system state. This is where the magic happens. The ring buffer results in batching of transaction requests (e.g. SetValue, GetValue) such that a single final byte array can be constructed all at once that ultimately goes out to disk. You may have 1000 transactions bundled up into 10KB of contiguous bytes. This allows you to say things like "transactions per block i/o". Callers into the system do a busy wait against a boolean status flag on the transaction object (or more ideally, structs in a fixed-sized array). This certainly consumes more power, but you wanted to go fast, right?
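The batching step in the last bullet can be sketched in Python roughly as follows. This is an illustration under assumptions, not the commenter's code: the `Transaction` class, the length-prefixed record format, and `commit_batch` are hypothetical, and the trailing state snapshot and the ring buffer feeding the batch are left out.

```python
import os
import struct
import tempfile
from dataclasses import dataclass


@dataclass
class Transaction:
    key: bytes
    value: bytes
    done: bool = False     # callers would busy-wait on this flag


def encode(txn: Transaction) -> bytes:
    # Length-prefixed key and value; a hypothetical record format.
    return struct.pack("<HH", len(txn.key), len(txn.value)) + txn.key + txn.value


def commit_batch(fd: int, batch: list) -> int:
    """Append the whole batch with one write and one fsync; return bytes written."""
    buffer = b"".join(encode(txn) for txn in batch)
    written = os.write(fd, buffer)
    assert written == len(buffer), "short writes are not handled in this sketch"
    os.fsync(fd)
    for txn in batch:
        txn.done = True    # release waiting callers only after the data is durable
    return written


batch = [Transaction(f"k{i}".encode(), b"v" * 6) for i in range(1000)]
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "segment-0.log")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        bytes_written = commit_batch(fd, batch)
    finally:
        os.close(fd)
    log_size = os.path.getsize(path)
```

The cost of the write and the fsync is paid once per batch, so the per-transaction overhead shrinks as the batch grows.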

Nican · 4y ago · 3 in thread

My knowledge about kernels is very limited, but is this not the whole principle behind Google's Fuchsia?

cmrdporcupine · 4y ago

As far as I understand it, this kind of asynchronous callback based I/O was a hallmark of VMS's IO subsystem (1st release 1977...); and Unix's "get a byte, get a byte, get a byte byte byte" model was a major criticism that the VMS folks had about it.

https://retrocomputing.stackexchange.com/questions/14150/how...

So, no, I don't think you need to jump all the way to microkernel/message-passing (like Fuchsia) to get this kind of model.

twic · 4y ago

It sounds exactly like every microkernel architecture to me. Subsystems are processes, processes communicate by sending asynchronous messages, there is a way to wait for a reply to a message to allow synchronous interaction, there is a way to share or transfer ownership of memory between processes to allow bulk data transfer.

Message sending might be implemented on top of memory mapping, but doesn't need to be.

fancl20 · OP · 4y ago

That's exactly the point I want to express - this type of communication design is not bound to microkernels or io_uring or any particular system. It's a general architecture pattern that appears in any cross-subsystem communication. Hardware DMA also has a similar design.

matesz · 4y ago · 2 in thread

Has anybody tried to generalize IO_URING a bit further with something like language interoperability - platform-specific swift/kotlin/objective-c/javascript with closer-to-the-metal languages like c/rust/c++ for instance?

Let's say we have 2 main threads, one for platform language (we can't change that) and one for native language. Both of these threads share submission and completion queue buffers - so let's say Swift thread could submit to Rust thread what would be better suited for Rust and Rust could submit to Swift what had to be done in Swift. Just like in IO_URING - application thread (Rust) submits what has to be done in kernel (Swift) and perhaps the other way around too. Same for Android and Web. Of course sharing memory could be done also on other data like UI elements tree etc.

I tried to test that, but gave up once I was getting some weird out of memory errors in Xcode when I tried to start threads from the rust side of things. Xcode debugger could step into rust code amazingly though.

I'm really just guessing and have pretty much no experience in something like language interop, but I can't stop thinking it would be a good way to go. So, do you think IO_URING type abstraction could be a good approach to doing cross-platform development in general?

eska · 4y ago

Aren't you back at the actor model at that point, where each thread has a message queue and can push tasks/results to the other thread's queue? That is indeed getting popular.

jstimpfle · 4y ago

In my limited experience trying to make the Actor Model work, with the AM there is the danger of running into OOP type problems (overemphasis on encapsulation and isolated implementations) which quickly results in an unmaintainable mess. I've watched some presentations of success stories employing the AM, however my hunch is that the use there was a lot more limited and controlled, basically with a close to 1:1 mapping of physical devices to actors. Which I'm not sure is what most people think the Actor Model is.

The Actor Model, in my mind, is the idea of writing "synchronous"/"blocking" code for a number of agent implementations, and then getting that code to scale by scheduling as many of these agents as possible with some fiber or green threads magic.

IO completion queues are the actual mechanism that allows distributed computation (where "distributed" here includes normal syscall interactions with the OS) without deadlocks or bad performance due to high latencies.

sunmag · 4y ago · 1 in thread

Looks a lot like NVMe: https://nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf

jnwatson · 4y ago

It is actually the structure that most data-driven hardware devices use. A ring buffer of descriptors has been used from the early days of ethernet devices at least.

dang · 4y ago

Recent thread on the article this one is responding to:

Io_uring is not an event system - https://news.ycombinator.com/item?id=27540248 - June 2021 (134 comments)

nahuel0x · 4y ago

It's like there is a micro-kernel growing inside Linux.
