To elaborate:
In the abstract, the control plane is just anything that's not in the critical data path: it's safety critical but not performance critical, whereas the data plane has the inverse profile. For example, it would be fine to have (and one would want) plenty of assertions in the control plane for safety.
The data plane, by contrast, is in the critical path with huge volumes flowing through it. Here one would want to optimize for performance: cache misses, context switches, branch mispredicts etc. The data plane would be like a water mains pipeline, where there's nothing inside to obstruct flow. The control plane would be all the safety checks you do outside the pipeline as an operator, the little control box that adjusts pressure and controls the pipeline.
Having this clean split between control plane and data plane in the design provides both safety and performance without compromising either.
As a concrete example of this technique, in TigerBeetle [1] we have around 10,000 financial transactions in a batch. A single-threaded control plane is responsible for switching each of these batches through the consensus protocol, and we amortize all runtime bounds checks, assertions, syscalls and I/O across the batch so that these become almost free. Yet we have literally hundreds of assertions, and we're doing O_DSYNC on every write. And of course, all of this runs through io_uring, so we can drive decent I/O depth to NVMe or SSD without the overhead of context switches or the complexity of a user-space thread pool.
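To make the amortization idea concrete, here is a minimal Python sketch. It is not TigerBeetle's code: `Transfer`, `Ledger`, and `commit` are hypothetical names, the batch limit only echoes the figure quoted above, and a list append stands in for a durable write. The point is just that the checks and the simulated I/O sit at the batch boundary, while the per-item loop carries none of them.

```python
from dataclasses import dataclass

BATCH_MAX = 10_000  # illustrative, echoing the ~10,000 figure above


@dataclass(frozen=True)
class Transfer:  # hypothetical record type
    debit: int
    credit: int
    amount: int


class Ledger:  # hypothetical, for illustration only
    def __init__(self, account_count):
        self.balances = [0] * account_count
        self.log = []     # stands in for durable storage
        self.writes = 0   # number of simulated I/O calls

    def commit(self, batch):
        # Control plane: safety checks run once, at the batch boundary.
        assert 0 < len(batch) <= BATCH_MAX
        n = len(self.balances)
        assert all(0 <= t.debit < n and 0 <= t.credit < n and t.amount > 0
                   for t in batch)
        total_before = sum(self.balances)

        # Data plane: a tight loop with no checks and no I/O inside it.
        balances = self.balances
        for t in batch:
            balances[t.debit] -= t.amount
            balances[t.credit] += t.amount

        # One invariant check and one simulated write for the whole batch.
        assert sum(balances) == total_before
        self.log.append(tuple(batch))
        self.writes += 1
```

With a batch of N transfers, the invariant check and the write are paid once rather than N times, which is where the "almost free" comes from.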
io_uring is such a great design. Not only because it's fast, but because it's so simple with such a clean separation between control and data planes. It makes pure thread-per-core designs easily achievable with fantastic performance.
1. "Control plane" is how an application specifies what action they want, and "data plane" is (more or less) the result. A read(2) system call is perfectly cleanly separated in those concerns (input arguments and return value for control, returned data for data). Can't get cleaner than that.
2. Performance critical. Control is highly, highly performance critical. Some applications can generate a large number of independent operations or very large chunks, but latency is very often the performance limiter. And that tends to be increasingly true as parallelism increases, bandwidth increases, but latencies tend to improve at a much lower rate and sometimes stand still or go backwards (whether you're looking at DRAM, NAND, disk, network, or communication across cores in a single node).
3. On the data side of performance. Well it's really fuzzy because control is data, and data often controls control, you have metadata etc. But quite often data is easier to deal with, if it's parallelizable, prefetchable, predictable, linear. In CPUs for example, i$ misses are often a worse problem to have than d$ misses because running out of instructions means the whole pipeline empties out and shuts down, but stores can be buffered and a stalled load often still leaves independent useful work to do. It's common that CPUs will prefer instruction lines in their unified cache levels for this reason.
4. Data is absolutely safety critical. I'm not sure why it couldn't be if control is. Sure you could say you have redundancy in your data, but you can also have checks in your data to help ensure the control was correct (for simple example a block of data can store its own location as well, so if control logic goes wrong and reads the wrong block, it could try to recover). That said data tends to be easier to recover from than arbitrary complex logic, but that doesn't mean the data is less critical.
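The "block stores its own location" idea in point 4 can be sketched roughly like this. The header layout (a 64-bit address plus a CRC32) and the function names are my own assumptions for illustration, not any particular storage format:

```python
import struct
import zlib

# Hypothetical header: block address (u64) and CRC32 of the payload (u32).
HEADER = struct.Struct("<QI")


def encode_block(address, payload):
    """Prefix the payload with the address it is meant to live at."""
    return HEADER.pack(address, zlib.crc32(payload)) + payload


def decode_block(expected_address, raw):
    """Detect both corrupted data and a misdirected read by the control logic."""
    address, checksum = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if address != expected_address:
        raise ValueError(f"misdirected read: wanted {expected_address}, got {address}")
    if zlib.crc32(payload) != checksum:
        raise ValueError("payload checksum mismatch")
    return payload
```

If the control logic asks for block 7 but is handed block 8, `decode_block(7, ...)` raises instead of silently returning the wrong data, so the caller gets a chance to retry or recover.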
5. Where does metadata sit in here? In io_uring, you could have ops that open files, read directories, etc. This is no more a clean split than traditional unix APIs IMO.
io_uring is nice because its submission and completion model minimizes overhead and it allows asynchronous, parallel and out-of-order operations. No surprise, it's modeled after high performance IO device command and completion queues.
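A toy model of that submission/completion shape, for illustration only. This is not the io_uring ABI or liburing: `Ring`, `submit`, `kernel_run`, and `reap` are made-up names, and the "kernel" deliberately services entries in reverse order to show why each completion carries the caller's tag.

```python
from collections import deque


class Ring:
    """Toy paired submission/completion queues (hypothetical, not io_uring)."""

    def __init__(self):
        self.sq = deque()  # submission queue: (user_data, op, args)
        self.cq = deque()  # completion queue: (user_data, result)

    def submit(self, user_data, op, *args):
        # Queue work without waiting for it; user_data tags the request.
        self.sq.append((user_data, op, args))

    def kernel_run(self):
        # Stand-in for the kernel: services entries newest-first to
        # demonstrate that completions may arrive out of order.
        while self.sq:
            user_data, op, args = self.sq.pop()
            self.cq.append((user_data, op(*args)))

    def reap(self):
        # Collect every available completion in one pass.
        done = list(self.cq)
        self.cq.clear()
        return done
```

The caller matches results to requests by the tag rather than by position, which is what lets many operations be in flight at once.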
I am using the term "control plane" in the typical systems sense, where the control plane is not the performance critical data pipeline. The control plane by definition is outside the critical request path. Relative to the performance profile of the data plane, the control plane is not performance critical.
The control plane is usually several orders of magnitude less demanding in terms of resources (CPU, memory, network, storage) than the data plane it controls. One very obvious example of this is something like ZooKeeper, where a tiny metadata system can be responsible for switching gigantic data plane clusters.
Another basic example: the control plane handles "configuration" such as the routing table in a routing protocol, with changes to this configuration synced out of band, whereas the data plane does the rapid switching based on this table. Or the control plane decides whether to switch the data plane on or off. These decisions are relatively cheap but can have serious consequences.
Finally, a good example is also Amazon, where their distributed systems are famous for having control planes that always do constant work regardless of the data plane to avoid bimodal behavior.
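As a rough sketch of what "constant work" can look like (my own illustration, not Amazon's implementation): the control plane pushes a full, fixed-size snapshot on every cycle, whether zero entries changed or all of them did. `TABLE_SLOTS`, `build_snapshot`, and `push_cycle` are hypothetical names, and plain dicts stand in for data plane nodes.

```python
TABLE_SLOTS = 8  # hypothetical fixed capacity of the configuration table


def build_snapshot(routes):
    """Return exactly TABLE_SLOTS entries, padding unused slots."""
    assert len(routes) <= TABLE_SLOTS
    entries = sorted(routes.items())
    padding = [("", "")] * (TABLE_SLOTS - len(entries))
    return entries + padding


def push_cycle(routes, data_plane_nodes):
    """Push the whole snapshot to every node; the work does not depend on churn."""
    snapshot = build_snapshot(routes)
    for node in data_plane_nodes:
        node.clear()
        node.update((prefix, target) for prefix, target in snapshot if prefix)
    return len(snapshot)  # entries sent per node, identical on every cycle
```

Because a quiet cycle and a busy cycle do the same amount of work, there is no second, rarely exercised mode for the system to fall into under load.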
If your system has the control plane and the data plane showing similar performance profiles then these concepts may have been conflated or not exploited to their potential.
> Data is absolutely safety critical.
Again, I think you're missing the safety that a clear definition of a control plane gives to the data pipeline it manages. As long as you treat both planes as "effectively the same" or of no consequence, you won't see how to exploit these different concepts to achieve both performance AND safety.
That doesn't make sense because you're talking about io_uring itself having separation between control and data, however it is exclusively involved with the request path. It seems like you introduced this confusion in your first post; I'm just replying to what you wrote.
The most essential concept here for me is the ring buffer. The most effective practical implementation I am aware of is the LMAX Disruptor, as this implementation can essentially funnel multiple threads into a single writer w/ aggregate rates of tens-to-hundreds of millions of operations per second. If you wanted to construct any arbitrary application in which multiple participants (threads/callers/services) are synchronized against some context, this library could serve as the foundation for that.
I don't even reach directly for OS-level I/O acceleration capabilities and I can get ridiculous numbers. I simply aggregate writes in software using the buffer, and I am still able to completely saturate NVMe devices from languages like C# using stupid-simple calls like File.WriteAllBytes().
Batching is a hell of a thing when you are going for maximum throughput. Processing a ring buffer that is always full with a single thread in a hot loop is the best situation you could ever hope to find yourself in. Nothing will ever give you faster serialized throughput without resorting to compressed data structures and other dark(er) arts.
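Here is a minimal sketch of the batching behaviour described above. To be clear about the assumptions: this is not the LMAX Disruptor, which is a lock-free Java library; `BatchingRing` is a hypothetical class, and a plain lock stands in for the Disruptor's sequence claiming. What it does show is the shape: many producers publish into a fixed ring, and one consumer takes everything available in a single pass.

```python
import threading


class BatchingRing:
    """Bounded multi-producer ring drained in batches by one consumer."""

    def __init__(self, capacity):
        # Power-of-two capacity lets a sequence map to a slot with a mask.
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self.slots = [None] * capacity
        self.mask = capacity - 1
        self.head = 0  # next sequence number to consume
        self.tail = 0  # next sequence number to publish
        self.lock = threading.Lock()  # stand-in for lock-free claiming

    def publish(self, item):
        with self.lock:
            if self.tail - self.head == len(self.slots):
                return False  # ring is full; the producer must retry
            self.slots[self.tail & self.mask] = item
            self.tail += 1
            return True

    def drain(self):
        # The single consumer takes every published item as one batch.
        with self.lock:
            batch = [self.slots[seq & self.mask]
                     for seq in range(self.head, self.tail)]
            self.head = self.tail
            return batch
```

The consumer's hot loop is then just "drain, process the batch, issue one write", so the busier the producers are, the larger each batch gets and the fewer writes are needed per item.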
In particular the disruptor wouldn't be appropriate to implement io_uring, where there is only one consumer, only one producer[1] and packet loss is not tolerated.
[1] some external synchronization can of course be used to handle multiple producers, but that's not io_uring's concern.
Completely agree. It's at a higher level of abstraction and intersects only in some conceptual ways with how the OS operates.
> [...] where there is only one consumer, only one producer[1] and packet loss is not tolerated.
My agreement above still standing, I feel like this case is actually covered by the disruptor as well. It is a trivial subset of the multi-producer, single-consumer problem that it was originally built to handle. If you made some minor modification like "packet loss is tolerated", then you have a solid case.
Do you have more info on this? I work on magnetic tapes where any writes must be sequential but I want to process data to it and from it in parallel where possible.
- Use an append-only log structure as the exclusive means of interfacing with the storage medium. If you are concerned about future cleanup (i.e. on block-wise rewritable media), segment this log out into file chunks of reasonable size. Cleanup of old segments would involve scanning each and rewriting alive data to the front of the log.
- Consider using a key-value abstraction as the basis for all of this, as it allows for trivially constructing dynamically-programmed tree structures with ideal locality-of-reference semantics (i.e. the splay tree and friends).
- All writes to the append-only log conclude with a consistent snapshot of system state. This is where the magic happens. The ring buffer results in batching of transaction requests (e.g. SetValue, GetValue) such that a single final byte array can be constructed all at once that ultimately goes out to disk. You may have 1000 transactions bundled up into 10KB of contiguous bytes. This allows you to say things like "transactions per block i/o". Callers into the system do a busy wait against a boolean status flag on the transaction object (or more ideally, structs in a fixed-sized array). This certainly consumes more power, but you wanted to go fast, right?
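The points above can be sketched roughly as follows. This is a simplified illustration under my own assumptions: the record format, `AppendOnlyKV`, and `replay` are hypothetical, an in-memory buffer stands in for the tape or file, and the trailing state snapshot, segment cleanup, and busy-wait flags are left out. It only shows a batch of key-value writes becoming one contiguous buffer and one sequential append.

```python
import io
import struct

RECORD = struct.Struct("<HH")  # hypothetical header: key length, value length


class AppendOnlyKV:
    def __init__(self, storage=None):
        # BytesIO stands in for a file or other sequential medium.
        self.storage = io.BytesIO() if storage is None else storage
        self.index = {}        # key -> latest value
        self.block_writes = 0  # number of sequential appends issued

    def commit(self, transactions):
        """Encode a list of (key, value) pairs into one buffer, append once."""
        buf = bytearray()
        for key, value in transactions:
            buf += RECORD.pack(len(key), len(value)) + key + value
        self.storage.seek(0, io.SEEK_END)
        self.storage.write(buf)
        self.block_writes += 1
        self.index.update(transactions)
        return len(buf)


def replay(raw):
    """Rebuild the index by scanning the log front to back."""
    index, offset = {}, 0
    while offset < len(raw):
        key_len, value_len = RECORD.unpack_from(raw, offset)
        offset += RECORD.size
        key = raw[offset:offset + key_len]
        offset += key_len
        index[key] = raw[offset:offset + value_len]
        offset += value_len
    return index
```

Dividing the number of transactions in a batch by `block_writes` gives the "transactions per block i/o" figure mentioned above, and `replay` shows why later records for the same key simply win.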
https://retrocomputing.stackexchange.com/questions/14150/how...
So, no, I don't think you need to jump all the way to microkernel/message-passing (like Fuchsia) to get this kind of model.
Message sending might be implemented on top of memory mapping, but doesn't need to be.
Let's say we have 2 main threads, one for platform language (we can't change that) and one for native language. Both of these threads share submission and completion queue buffers - so let's say Swift thread could submit to Rust thread what would be better suited for Rust and Rust could submit to Swift what had to be done in Swift. Just like in IO_URING - application thread (Rust) submits what has to be done in kernel (Swift) and perhaps the other way around too. Same for Android and Web. Of course sharing memory could be done also on other data like UI elements tree etc.
I tried to test that, but gave up once I started getting some weird out-of-memory errors in Xcode when I tried to start threads from the Rust side of things. The Xcode debugger could step into Rust code amazingly well, though.
I'm really just guessing and have pretty much no experience in something like language interop, but I can't stop thinking it would be a good way to go. So, do you think IO_URING type abstraction could be a good approach to doing cross-platform development in general?
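For what it's worth, the shape of that idea can be mocked up in a few lines. This is only a single-language stand-in, written in Python with two threads and thread-safe queues rather than Swift and Rust sharing memory. `worker`, `run_demo`, and the handler names are all made up for the example.

```python
import queue
import threading


def worker(submissions, completions, handlers):
    """The 'other side': services submitted requests and posts completions."""
    while True:
        entry = submissions.get()
        if entry is None:  # sentinel: no more work
            break
        request_id, op, payload = entry
        completions.put((request_id, handlers[op](payload)))


def run_demo():
    submissions, completions = queue.Queue(), queue.Queue()
    handlers = {"upper": str.upper, "length": len}  # hypothetical operations
    thread = threading.Thread(target=worker,
                              args=(submissions, completions, handlers))
    thread.start()

    # The submitting side queues requests tagged with an id, then moves on.
    submissions.put((1, "upper", "hello"))
    submissions.put((2, "length", "hello"))
    submissions.put(None)
    thread.join()

    # Completions are matched back to requests by id.
    return dict(completions.get() for _ in range(2))
```

A real cross-language version would also have to settle on a shared memory layout and on who owns each buffer, which this sketch sidesteps entirely.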
The Actor Model, in my mind, is the idea of writing "synchronous"/"blocking" code for a number of agent implementations, and then getting that code to scale by scheduling as many of these agents as possible with some fiber or green threads magic.
IO completion queues are the actual mechanism that allows distributed computation (where "distributed" here includes normal syscall interactions with the OS) without deadlocks or bad performance due to high latencies.
Io_uring is not an event system - https://news.ycombinator.com/item?id=27540248 - June 2021 (134 comments)