From 12M ops/s to 305M ops/s on a lock-free ring buffer.
In this post, I walk you step by step through implementing a single-producer, single-consumer (SPSC) queue from scratch.
This pattern is widely used to share data between threads in latency-critical environments.