You don't even need to have specific knowledge of the hardware as long as you can identify cross-over points where one algorithm starts significantly outperforming others. An old HPC trick is to write software that thoroughly measures several algorithm strategies on your specific hardware environment and then code-gens an algorithm that has an optimal set of strategies and strategy-switching thresholds. The meta-algorithm is mostly focused on cheaply detecting these thresholds at runtime.
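A minimal sketch of that trick in Python (the toy strategies and names here are my own illustration, not any particular HPC codebase): measure each strategy at several sizes, record the first cross-over, and dispatch on it cheaply at runtime.

```python
import random
import timeit

def insertion_sort(a):
    # Simple quadratic sort: low constant overhead, fast for tiny inputs.
    a = list(a)
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def find_crossover(sizes=(4, 8, 16, 32, 64, 128, 256)):
    # Benchmark both strategies on this machine; return the first size
    # where the general-purpose sort (Timsort) wins.
    for n in sizes:
        data = [random.random() for _ in range(n)]
        t_simple = timeit.timeit(lambda: insertion_sort(data), number=200)
        t_general = timeit.timeit(lambda: sorted(data), number=200)
        if t_general < t_simple:
            return n
    return sizes[-1]

THRESHOLD = find_crossover()

def adaptive_sort(a):
    # The "meta-algorithm": one cheap size check picks the strategy.
    return insertion_sort(a) if len(a) < THRESHOLD else sorted(a)
```

A real code-gen version would emit the threshold as a compile-time constant instead of a global, but the measurement loop is the same idea.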
Where possible, I always liked arrays of function pointers for each next power of 2 in the input sizes. A few popcounts and a jump later, and you're executing the right routine. It's not great if you're throwing a general-purpose tool against tiny problems, but it's not a huge amount of overhead either, and it's dead simple to code.
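A rough Python analogue of that dispatch table (the routines are placeholders; in C you'd index an array of function pointers with a bit-scan or popcount of the input size):

```python
# Hypothetical size-specialized routines -- stand-ins for real kernels.
def tiny_kernel(xs):
    return sorted(xs)

def small_kernel(xs):
    return sorted(xs)

def large_kernel(xs):
    return sorted(xs)

# One slot per power of two of the input size; index 0..32 covers
# every length up to 2**32.
DISPATCH = [tiny_kernel] * 5 + [small_kernel] * 5 + [large_kernel] * 23

def run(xs):
    # (n - 1).bit_length() is the Python analogue of the popcount/bit-scan
    # trick: it maps n to the index of the next power of two.
    idx = max(len(xs) - 1, 0).bit_length()
    return DISPATCH[idx](xs)
```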
There's a massive performance benefit to doing this, at the cost of implementation complexity. I haven't studied the implementations or tried my hand at one, but I get the impression that these are tough to implement correctly in a way that takes full advantage of the hardware.
(In that sense, it's awesome that the researchers also did the legwork to implement and maintain a library!)
The core here, the bitmap storage plus the basic optimization, is simple but solid and general. Mathematically, it takes less space to store data as positions in a bitmap rather than as fully spelled-out associations, and it takes even less space to segregate the bitmaps into chunks.
This will hold, and be smaller and faster, on any computer. So it's not a special case built on heuristics, tied either to the frequencies or sample characteristics of some particular dataset, or to specific CPU peculiarities.
More exotic optimizations piled on top, sure. But "compressed bitmaps" themselves as a concept, not.
Given a large number of 32-bit integers, radix sort is indeed significantly faster.
Whereas I would reach for a comparison sort if I had a large number of arbitrary-length Unicode strings that I wanted to sort in a case-insensitive order.
Also, I found timsort faster than radix sort when there was a small number (as I recall, <100 or so) of elements.
While Roaring Bitmap saves space and performs well, my gut feeling is that there are better ways to achieve the same goals, especially since it involves sorting on insertion and binary search on lookup, both on the sorted top-level list and on the sorted lower-bit integers.
Also, it works well on 32-bit integers and probably 64-bit integers, but it's not clear it can scale to bigger integers.
Nevertheless it's an excellent algorithm that works well in practice.
> Bitmaps are arrays of bits used to store sets of integers. They work by setting the Nth bit when an integer N is in the set
They waste space. But if you were to compress them, they'd compress very well. As in, for example, if you were to compress them using roaring bitmaps.
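As a toy illustration of the "set the Nth bit" idea (my own sketch, not the Roaring layout):

```python
class Bitmap:
    # A plain, uncompressed bitmap: bit N is set iff integer N is in the set.
    def __init__(self):
        self.bits = 0

    def add(self, n):
        self.bits |= 1 << n

    def __contains__(self, n):
        return (self.bits >> n) & 1 == 1
```

The waste is easy to see: storing just {0, 1_000_000} still needs a million bits of address space, which is exactly what chunked/compressed schemes like Roaring avoid.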
I wouldn't use the word exotic but rather specialized and/or optimized.
In fact, later versions of Roaring use run-length-encoding (RLE) containers in addition to array and bitmap containers, a technique shared with more traditional/general compression algorithms.
In effect rather than achieving "very good" compression, Roaring is very close to "optimal" compression for sparse bitmaps.
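The run container idea can be sketched in a few lines (a simplified illustration of RLE over a sorted container, not Roaring's actual container code):

```python
def to_runs(sorted_vals):
    # Encode a sorted list of integers as (start, length) runs.
    # Dense, clustered data collapses into very few runs.
    runs = []
    for v in sorted_vals:
        if runs and v == runs[-1][0] + runs[-1][1]:
            # Extends the current run by one.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Starts a new run at v.
            runs.append((v, 1))
    return runs
```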
sorted-list/bitmap/runs, in a two level tree. cool.
it's technically missing a sorted complement-list, i.e. storing only the items that are absent. But worst-case runs only use twice as much space, so that's more complexity without much savings, especially considering real workloads.
performs better with sequential ids than random uuids because you use fewer pages
further research
investigating the effect of adding more layers to the tree
using additional set representations e.g. balanced tree
https://github.com/apache/lucene/blob/84cae4f27cfd3feb3bb42d...
the Judy project seems inactive to me:
https://sourceforge.net/p/judy/bugs/
this short HN thread resonates with the intuition of judy1 being stalled; its last optimistic comment points to an orphaned comment in the big list above
On the surface it's just a very optimized 256-ary radix trie. I think it would take some software archeology to determine whether there was more to it than that, and whether its assumptions still hold on today's processors.
It's not written in a highly unstable language. The safety of every bridge you drive over was probably partially validated using Fortran numerical code from the 1980s.
Today it'd make much more sense to implement ART, which is dramatically simpler.
Today we open sourced about $25m in R&D (more to come)! They take a bit of harnessing, but once you do the results are amazing. We have released an OLAP database called FeatureBase, built entirely on bitmaps (well bitmaps in a b-tree)...
Here is a bit (pun intended) more about our big bet on bitmaps: https://www.featurebase.com/blog/bitmaps-making-real-time-an...
What would an array of pointers to X containers which are N size each look like on disk?
To deserialize, read each container's prefix, its length and its container id, and then read the rest of the container according to that length. The top-level list is sorted by container id; it can be reconstructed by inserting each container id, along with a pointer to the container's memory, into the sorted list.
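A hedged sketch of that on-disk layout (the field order and widths here are assumptions for illustration, not Roaring's portable serialization format):

```python
import struct

def serialize(containers):
    # containers: sorted list of (container_id, payload_bytes) pairs.
    # Each record: u16 container id, u32 payload length, then the payload.
    out = bytearray()
    for cid, payload in containers:
        out += struct.pack("<HI", cid, len(payload))
        out += payload
    return bytes(out)

def deserialize(buf):
    # Walk the records; the resulting list is already sorted by id
    # because it was written in sorted order.
    containers, off = [], 0
    while off < len(buf):
        cid, length = struct.unpack_from("<HI", buf, off)
        off += struct.calcsize("<HI")
        containers.append((cid, buf[off:off + length]))
        off += length
    return containers
```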
It's funny that in the end Bitmaps essentially are a modified data structure and process.
The way things are going, I imagine someone will at some point take all the different tricks we have of inserting, sorting, finding, deleting data, take millions of datasets and run some machine learning type process to create libraries that can perform these operations optimally.
Most developers do not have a good intuition for the efficiency and scalability of sequential brute-force on modern architectures. There is a cross-over threshold but I think many people would be surprised at how high it is in practice.
I wonder if it would be worth the trouble to code-gen a bunch of variants with things like 8-bit entries, and benchmark to death to determine the optimal cutover points...
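As a small illustration of that kind of cutover benchmark, here's a throwaway comparison of brute-force scan against binary search (Python overheads dominate, so the actual numbers differ wildly from native code, but the shape of the experiment is the same):

```python
import bisect
import random
import timeit

def linear_contains(xs, v):
    # Sequential brute force: trivially predictable memory access.
    for x in xs:
        if x == v:
            return True
    return False

def binary_contains(xs, v):
    # O(log n), but branchy and cache-unfriendly at small n.
    i = bisect.bisect_left(xs, v)
    return i < len(xs) and xs[i] == v

# Benchmark both at growing sizes to locate the cross-over empirically.
for n in (16, 64, 256, 1024):
    xs = sorted(random.sample(range(10 * n), n))
    v = xs[n // 2]
    t_lin = timeit.timeit(lambda: linear_contains(xs, v), number=2000)
    t_bin = timeit.timeit(lambda: binary_contains(xs, v), number=2000)
    print(n, round(t_lin / t_bin, 2))
```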
Having said that, I applaud that this type of data structure is being investigated. An efficient bitset should be present in every collections library.
E.g. a thing that legacy can do but the new www (Solr) currently can't is allow streaming downloads of more than 5 million documents, since maintaining a roaring bitmap costs very little memory but depends on a docid-to-key/value-store mapping. Our extensions allowed this and gave us a very easy-to-use REST API.