I was doing this in 1992, so it's at least 12 years older than the 2004 implementation. I suspect it was being done long before that. Back then the read and write indexes were being updated by separate processors (even more fun, processors with different endianness) with no locking. The only assumption being made was that updates to the read/write pointers were atomic (in this case 'atomic' meant that the two bytes making up a 16-bit counter were written atomically). Comically, on one piece of hardware this was not the case, and I spent many hours inside the old Apollo works outside Boston with an ICE and a bunch of logic analyzers figuring out what the hell was happening on some weird EISA bus add-on to some HP workstation.
It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.
Edit: having had this discussion I've realized that Juho's implementation is different from the 1992 implementation I was using because he doesn't ever reset the read/write indexes. Oops.
Try late 1960s. Generally known then, widely used.
For an interesting proof about tokens in ring buffers, check out https://www.cs.utexas.edu/users/EWD/ewd04xx/EWD426.PDF, which, for 1974, has an interesting bit of multiprocessing.
The cost of a mask can probably be entirely buried in the instruction pipeline, so that it's hardly any more expensive than whatever it costs just to move from one register to another.
Modulo requires division. Division requires a hardware algorithm that iterates, consuming multiple cycles (pipeline stall).
That's for "normal" ring buffers. I suspect that the design described in the article can be implemented for non power-of-two without division but I'll need to think about the details.
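As a concrete sketch of what the thread is comparing (the names and buffer size here are made up), advancing an index with a modulo versus a mask:

```c
#include <stdint.h>

#define BUF_SIZE 16u  /* must be a power of two for the mask version */

/* modulo version: works for any size, but compiles to a divide
   (or a multiply/shift sequence when the size is a compile-time constant) */
static uint32_t advance_mod(uint32_t i) {
    return (i + 1) % BUF_SIZE;
}

/* mask version: a single AND, only valid when BUF_SIZE is a power of two */
static uint32_t advance_mask(uint32_t i) {
    return (i + 1) & (BUF_SIZE - 1);
}
```

Both wrap 15 back to 0 for a 16-element buffer; the difference is only in what the compiler emits.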
movl $-252645135, %edx
movl %edi, %eax
mull %edx
movl %edx, %eax
shrl $4, %eax
movl %eax, %edx
sall $4, %edx
addl %edx, %eax
subl %eax, %edi

BTW: read and write pointers, power of two, did that in BeOS (1998), and many sound drivers did it earlier than that. To me, that seemed like the obvious way to do it when I needed it.
- subtracting buffer size from both pointers once the read pointer has wrapped.
- choosing a longer int for the math operation where possible
That seems a small price for the freedom to be able to choose an appropriate buffer size.
Last time I checked, the operation cost was 26k cycles on my PIC.
Using a 2^n + mask made my queue perform 10 times faster (if not more).
IIRC the way we made this really fast was to write the buffer backwards. That way you can detect wrapping around the buffer because DEC will underflow and set the sign flag. Then you can JS to whatever code needs to ADD back the buffer length to handle the wraparound.
But 2^n has another problem (back in that era): buffer size. You are stuck with 1K, 2K, 4K, etc. buffers. When memory is tight you likely need something very specific, so you end up with the solution we had.
But, hey, if memory is free use 2^n bytes for your buffer.
Probably. Use the Ian Knot: http://www.fieggen.com/shoelace/ianknot.htm
Seriously, spend 20 mins practising this, and you'll never go back to the clumsy old way again.
I usually tie this knot twice over the lifetime of a pair of shoes. Once when I get them, and once more when they're worn in and need to be tightened.
The really nice thing about this knot is that it also looks good, so you can use it on both running shoes and dress shoes.
It makes no sense to teach the more common shoe tying knots.
Since learning the Ian knot (and correct starting knot) I can honestly say I enjoy tying my shoes every day and relish the opportunity to tie a bow at any other time.
I use the "Two Loop Shoelace Knot Bad Technique 1" from that page.
In recent years I've been wearing shoes with a different fastening mechanism, but I have to tie some dress shoes for a wedding tomorrow, so this is very timely knowledge!
More importantly, I can't seem to get an Ian knot very tight. Does this get better over time?
So many people walk around assuming they need to do complicated double knots to stop their shoelaces untying themselves. If only they knew they were doing Granny Knots, and that a standard knot is perfectly secure if tied properly.
I like this kind of article and enjoyed this particular one, but the long discussion above about the "right" way to do it goes some way to justifying why so many people are happy to do it the "wrong" way.
I've implemented and used ring buffers the "wrong" way many times (with the modulus operator as well!) and the limitations of this method have never been a problem or bottleneck for me, while its simplicity means that it's easier to write and understand than almost any other data structure.
In most practical applications, it's memory barriers that you really have to worry about.
I must admit that I never actually benchmarked my implementation properly -- it might be interesting to see if there are actual trade-offs between mmap vs. copying. (I'm guessing that nothing can beat MMU support, but I think the MMU also supports copy operations, so...?)
EDIT: Oops, I see they use mirrored memory here as well.
https://www.kernel.org/doc/Documentation/circular-buffers.tx...
Note that wake_up() does not guarantee any sort of barrier unless something
is actually awakened. We therefore cannot rely on it for ordering. However,
there is always one element of the array left empty. Therefore, the
producer must produce two elements before it could possibly corrupt the
element currently being read by the consumer. Therefore, the unlock-lock
pair between consecutive invocations of the consumer provides the necessary
ordering between the read of the index indicating that the consumer has
vacated a given element and the write by the producer to that same element.

EMPTY -> (read == write)
FULL -> (read == (write + SIZE) % (2 * SIZE))

Basically you're full if you're at the same relative index but on different laps, and you're empty if you're at the same relative index on the same lap. If you do this with a power-of-two SIZE then the 'lap' is just the single bit above the size bits.

(The trick is that SIZE has to be a power of two, or else when you increment from 2^32-1 to 0, your pointers will jump to a different position in the array.)
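A sketch of that lap scheme: indices are kept in [0, 2*SIZE), so one extra bit records which lap each pointer is on (SIZE and the helper names here are my own, not from the comment):

```c
#include <stdint.h>

#define SIZE 8u  /* power of two */

/* indices run from 0 to 2*SIZE-1; the extra range records the lap */
static uint32_t inc(uint32_t i)  { return (i + 1) % (2u * SIZE); }
static uint32_t slot(uint32_t i) { return i % SIZE; }  /* position in the array */

static int rb_empty(uint32_t r, uint32_t w) { return r == w; }
static int rb_full(uint32_t r, uint32_t w)  { return r == (w + SIZE) % (2u * SIZE); }
```

With r == w == 3 the buffer is empty; with r == 3 and w == 11 (same slot, different lap) it is full.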
Because it's easier to understand at first glance, has no performance penalty, and for most busy programmers that often wins.
The real reason to stick with the first approach is that your static analysis tools won't freak out that you have intentional unsigned int overflow. Heck, some compilers will now scream at you for doing this. Then what happens when someone goes to port your code to a language with stricter overflow behavior? It won't work.
IMO even in realtime systems, I don't use this. Heck, the linux kernel even uses the original version.
This question needs little context to be relevant, so long as the topic is "computer programming".
Certainly not limited to writing ring buffers. It could be an apropos comment in almost any discussion.
Of course in many cases, the part about "no performance penalty" does not apply. Performance is a routine trade off for some other perceived gain.
Using the mask method is slick (I'd cache that mask with the array to reduce runtime calculations), but it's definitely going to add cognitive overhead and get messy if you want to make it lockless with CAS semantics.
So, store size-1 instead of size, and add one when asked for the size? I can see that, though I'm not confident it's worth the conceptual overhead.
If you mean storing it in addition to the size, I think that's a bad trade - cache is far more precious than many decrements.
Of course, if the size is fixed at compile time, the mask will probably be stored baked into the instructions (andl <const>, ...).
Doesn't it break the order invariant of the buffer, though? I can't see a way to do this without the risk of getting reads of newer data prior to older data. That's probably fine in many cases, but something like non-timestamped-debugging strikes me as a case where I'd want to know that the data arrived in the order I'm seeing.
No, if you increment the read pointer prior to the write pointer, the read pointer will still point at the oldest valid value in the buffer.
So, in pseudo code:
    if (w + 1 >= r) {
        r = w + 2
    }
    w++
    b[w - 1] = value
For a debugging ring buffer (i.e. looking at it in a core file), you have the last value of the write pointer, so you can simply read from write pointer + 1 back around to the write pointer and have your messages in order. This makes the assumption that there are no readers of the debug buffer, so you're only having to deal with the one pointer.

When that's the case, a ring buffer is a great choice. It's not required, though - the writer could block when it detects a full buffer.
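A hedged sketch of that overwrite-the-oldest idea, using free-running indices and a power-of-two size (the names and the exact full check are mine, not the parent's code):

```c
#include <stdint.h>

#define N 4u  /* power of two */

static int buf[N];
static uint32_t r, w;  /* free-running read/write indices */

/* push always succeeds; when full, the oldest entry is dropped */
static void push(int value) {
    if (w - r == N)
        r++;                  /* advance read past the element being overwritten */
    buf[w & (N - 1)] = value;
    w++;
}
```

After pushing 1 through 6 into the 4-slot buffer, r indexes the oldest surviving value (3), and walking from r to w-1 yields 3, 4, 5, 6 in order.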
When pushing D in their example they overwrite the value about to be read, and items are now out of order.
But maybe I'm missing something, I lost interest at all the bit-twiddling.
We've been using similar code in PortAudio since the late 90s[0]. I'm pretty sure Phil Burk got the idea from his hardware work.
[0] https://app.assembla.com/spaces/portaudio/git/source/master/...
No, this is a well known construct in digital design. Basically, for a 2^N deep queue you only need two N+1 bit variables:
http://www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIF...
PicoLisp has a built-in dynamic fifo function: http://software-lab.de/doc/refF.html#fifo
Great! Just don't use it if the indices are N bits wide and the array has 2^N elements. :)
Not unheard of. E.g. tiny embedded system. 8 bit variables, 256 element buffer.
I guess that's the reason most people don't do it: they'd rather waste O(1) space than waste mental effort on trying to save it.
It's a dynamically sized ring buffer with an optimization analogous to that of C++ strings; if the required capacity is small enough, the buffer is stored inline in the object rather than in a separate heap-allocated object. So something in the spirit of (but not exactly like):
struct rb {
    union {
        Value* array;
        // Set N such that this array uses the same amount of space as the pointer.
        Value inline_array[N];
    };
    uint16_t read;
    uint16_t write;
    uint16_t capacity;
};
You'd dynamically switch between the two internal representations, and choose whether to read from array or inline_array based on whether capacity is larger than N. In this setup it'd be pretty common for N to be 1. Having to add a special case to every single method would kind of suck; generic code that could handle any size seemed like a nice property to have.

I've always been doing it the "wrong" way, mostly on embedded systems. My classic application is a ring buffer for the received characters over a serial port. What's nice is that this sort of data structure doesn't need a mutex or such to protect access. Only the ISR changes the head, and only the main routine changes the tail.
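A minimal sketch of that ISR/main-loop pattern (hypothetical names; this assumes a single-core MCU where interrupt/main ordering makes the volatile accesses sufficient - on a multicore system you'd need the memory barriers discussed elsewhere in the thread):

```c
#include <stdint.h>

#define N 64u  /* power of two */

static volatile uint8_t rx_buf[N];
/* free-running indices: only the ISR writes head, only main writes tail */
static volatile uint32_t head, tail;

/* called from the UART receive interrupt */
static void isr_put(uint8_t c) {
    if (head - tail != N) {          /* drop the byte if the buffer is full */
        rx_buf[head & (N - 1)] = c;
        head++;
    }
}

/* called from the main loop; returns -1 when empty */
static int main_get(void) {
    if (head == tail)
        return -1;
    uint8_t c = rx_buf[tail & (N - 1)];
    tail++;
    return c;
}
```

Because each index has exactly one writer, neither side ever races on the same variable, which is why no mutex is needed.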
http://stackoverflow.com/questions/1583123/circular-buffer-i...
size() { return write - read; }
0 - UINT_MAX - 1 = ?

[EDIT] Changed constant to reflect use of unsigned integers, which I forgot to specify initially.
What I find interesting are the trade-offs: machine vs explicit integer wrap-around and buffers with maximum ~size(int)/2 vs ~size(int).
(0 - (2^32 - 1)) % 2^32 = 1

PS. No wrap-around either, for different reasons.
You'll have to explain that to me, since I can't assign `x = 2^32` without wraparound when x is an unsigned 32 bit integer.
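To make the arithmetic concrete (a sketch with made-up values): unsigned subtraction in C is defined modulo 2^32, so write - read gives the correct element count even after write has wrapped past zero and is numerically smaller than read:

```c
#include <stdint.h>

/* correct even when write has wrapped and is numerically smaller than read */
static uint32_t rb_size(uint32_t read, uint32_t write) {
    return write - read;
}
```

For example, with read = 0xFFFFFFFE and write = 2 (write wrapped after four pushes), the result is 4.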
I understand not wanting to waste one slot. A third variable (first, last, count) isn't too bad. But if you really hate that third variable, why not just use first and count variables? You can then compute last from first and count, and the two boundary cases show up as count = 0 and count = capacity.
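The first + count layout suggested here might look like this (a sketch; the names and capacity are mine):

```c
#include <stdint.h>

#define CAP 8u

struct rb {
    int data[CAP];
    uint32_t first;  /* index of the oldest element */
    uint32_t count;  /* stored elements: 0 means empty, CAP means full */
};

/* the write position is derived rather than stored */
static uint32_t rb_next(const struct rb *b) {
    return (b->first + b->count) % CAP;
}
```

Empty and full are unambiguous (count == 0 vs count == CAP), at the cost of one modulo whenever the write position is needed.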
I think he addressed that in the post:
The most common use for ring buffers is for it to be the intermediary between a concurrent reader and writer (be it two threads, two processes sharing memory, or a software process communicating with hardware). And for that, the index + size representation is kind of miserable. Both the reader and the writer will be writing to the length field, which is bad for caching. The read index and the length will also need to always be read and updated atomically, which would be awkward.
0xffffffff % 7 = 3, but (0xffffffff + 1) % 7 = 0.
Why would someone do this instead of re-using previous (or third-party) implementations? Of course unless it's all in different languages, but I don't think that's the case here.
I didn't even know what a ring buffer was
where do I dispose of my programmer membership card?
edit : lol, what a hostile reaction...
pls explain I'd love to hear