It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented block bloom filters from there.
Overall, I think block Bloom filters should be the default most people reach for. They completely solve the cache locality issue (a single cache miss per element lookup) at the cost of only a ~10–15% space increase. I had a simple implementation running at something like 20ns per query with maybe k=9; a naive Bloom filter would be about 9x that.
There’s some discussion in the article about using a single hash to come up with various indexing locations, but it’s simpler to just think of block bloom filters as:
1. Hash-0 gets you the block index
2. Hash-1 through hash-k get you the bits inside the block
If your implementation slices up a single hash to divide it into multiple smaller hashes, that’s fine.
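A minimal sketch of that layout (my own illustrative code, not the article's): one 64-bit hash supplies everything, with the high bits serving as hash-0 for the block index and a double-hashing slice of the same hash standing in for hash-1 through hash-k:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Block Bloom filter sketch: one cache-line-sized block per element.
struct BlockBloom {
    static constexpr int kWordsPerBlock = 8;  // 8 x 64 bits = one 64-byte cache line
    static constexpr int kProbes = 8;         // "hash-1 .. hash-k"
    std::vector<uint64_t> words;              // blocks laid out contiguously
    size_t num_blocks;

    explicit BlockBloom(size_t blocks)
        : words(blocks * kWordsPerBlock, 0), num_blocks(blocks) {}

    // hash-0: high bits of the one hash pick the block.
    size_t block_of(uint64_t h) const {
        return ((h >> 32) % num_blocks) * kWordsPerBlock;
    }

    void insert(uint64_t h) {
        size_t base = block_of(h);
        // hash-1..k via double hashing over slices of the same 64-bit hash.
        uint32_t h1 = (uint32_t)h, h2 = ((uint32_t)(h >> 16)) | 1;
        for (int i = 0; i < kProbes; ++i) {
            uint32_t bit = (h1 + i * h2) % 512;  // bit position within the 512-bit block
            words[base + (bit >> 6)] |= 1ull << (bit & 63);
        }
    }

    bool maybe_contains(uint64_t h) const {  // touches exactly one cache line
        size_t base = block_of(h);
        uint32_t h1 = (uint32_t)h, h2 = ((uint32_t)(h >> 16)) | 1;
        for (int i = 0; i < kProbes; ++i) {
            uint32_t bit = (h1 + i * h2) % 512;
            if (!(words[base + (bit >> 6)] & (1ull << (bit & 63)))) return false;
        }
        return true;
    }
};
```

The remix scheme (h1 + i·h2) is just the standard Kirsch–Mitzenmacher trick; any way of slicing the one hash into k sub-hashes works the same way.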
I also find the use of atomics to build the filter confusing here. If you're doing a join, you're presumably doing a batch of hashes, so it'd be much more efficient to partition your Bloom filter, lock the partitions, and do a bulk insertion.
Bulk insertion: yes, if there are many keys, bulk insertion is faster. For xor filters, I used radix sort before insertion [4] (I should have documented the code better), but for fuse filters and blocked Bloom filters it might not be worth it, unless the filter is huge.
[1] https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-... [2] https://github.com/FastFilter/fastfilter_cpp/blob/master/src... [3] https://github.com/FastFilter/fastfilter_java/blob/master/fa... [4] https://github.com/FastFilter/fastfilter_cpp/blob/master/src...
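As a rough single-threaded sketch of the partition-then-bulk-insert idea (illustrative only; the grouping pass stands in for the radix sort, and each partition's loop could be handed to its own thread without atomics since the word ranges are disjoint):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Partitioned bulk insertion sketch: group hashes by destination partition
// first, then insert one partition at a time. Writes for each partition hit
// a disjoint range of words, so no atomics are needed per partition.
constexpr int kNumParts = 16;
constexpr int kK = 3;  // probes per key

inline size_t part_of(uint64_t h) { return h % kNumParts; }

void insert_one(std::vector<uint64_t>& filter, size_t words_per_part, uint64_t h) {
    size_t base = part_of(h) * words_per_part;
    uint32_t h1 = (uint32_t)(h >> 32), h2 = ((uint32_t)h) | 1;
    for (int i = 0; i < kK; ++i) {
        uint32_t bit = (h1 + i * h2) % (uint32_t)(words_per_part * 64);
        filter[base + (bit >> 6)] |= 1ull << (bit & 63);
    }
}

void bulk_insert(std::vector<uint64_t>& filter, std::vector<uint64_t> hashes) {
    size_t words_per_part = filter.size() / kNumParts;
    // Grouping pass (a radix sort by partition index would do the same job):
    std::sort(hashes.begin(), hashes.end(),
              [](uint64_t a, uint64_t b) { return part_of(a) < part_of(b); });
    for (uint64_t h : hashes) insert_one(filter, words_per_part, h);
}

bool maybe_contains(const std::vector<uint64_t>& filter, uint64_t h) {
    size_t words_per_part = filter.size() / kNumParts;
    size_t base = part_of(h) * words_per_part;
    uint32_t h1 = (uint32_t)(h >> 32), h2 = ((uint32_t)h) | 1;
    for (int i = 0; i < kK; ++i) {
        uint32_t bit = (h1 + i * h2) % (uint32_t)(words_per_part * 64);
        if (!(filter[base + (bit >> 6)] & (1ull << (bit & 63)))) return false;
    }
    return true;
}
```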
I was also limited by the constraint of legacy code. This project was not a complete rewrite, but just an idea: "can we use more information from the 32-bit hash we receive without regressing any perf?" We didn't have time for deep research or a rewrite, so I just wanted to show the result of this small exercise in how we can make things run better without rewriting the world.
I think this depends on how big your filters are. Most people think of Bloom filters as having hundreds of thousands of elements, but I frequently find them useful all the way down to 32 bits (!). (E.g., there are papers showing chained hash tables where each bucket has a co-sited tiny Bloom filter to check whether it's worth probing the chain.) In the "no man's land" in between, with a couple of tens of thousands of buckets, the blocking seems to be mostly negative; it only makes sense as long as you actually keep missing the cache.
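For a sense of how tiny that gets, here's a hedged sketch of a per-bucket 32-bit filter guarding a chain (the names and layout are mine, not taken from those papers):

```cpp
#include <cstdint>
#include <vector>

// A 32-bit Bloom filter (k=1) sitting next to each hash-table bucket:
// check one bit before bothering to walk the collision chain.
struct Bucket {
    uint32_t filter = 0;          // one bit per element's secondary hash
    std::vector<uint64_t> chain;  // stand-in for the bucket's collision chain
};

inline uint32_t tag(uint64_t h) { return 1u << ((h >> 32) & 31); }

void bucket_insert(Bucket& b, uint64_t key_hash) {
    b.filter |= tag(key_hash);
    b.chain.push_back(key_hash);
}

bool bucket_find(const Bucket& b, uint64_t key_hash) {
    if (!(b.filter & tag(key_hash))) return false;  // definite miss, skip the chain
    for (uint64_t h : b.chain)
        if (h == key_hash) return true;
    return false;
}
```

The filter costs 4 bytes per bucket and turns most misses into a single AND instead of a pointer chase.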
I doubt their hardware is any faster shuffling bits in a uint32 than a uint64, and using uint64 should have a decent benefit to the false positive rate...
1. Intra-element collision rate goes down: 1/32 = 3.1% vs 1/64 = 1.6%, a 1.5 percentage point difference. An intra-element collision doesn't guarantee a FP, though!
2. A 64-bit word would have less variance, so filter behavior would be more predictable.
But there is a downside: when switching from a 32-bit to a 64-bit word we also halve the number of array elements, doubling the contention. We populate the filter from up to 128 threads during the join phase. When the build side isn't significantly smaller than the probe side, that contention can outweigh the FP improvement.
With blocks this small there's also no reason not to optimize the number of hash functions (albeit this brings back the specter of saturation). There are no cache misses to worry about; all positions can be checked with a single mask.
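A sketch of the single-mask check (illustrative; the bit-derivation scheme is my own choice, not from the thread): build the k-bit mask once, then membership is one load, one AND, one compare:

```cpp
#include <cstdint>

// Single-word "block": all k probe positions live in one 64-bit word,
// so a lookup needs no loop over memory, just a mask comparison.
constexpr int kNumProbes = 4;

inline uint64_t mask_of(uint64_t h) {
    uint64_t m = 0;
    uint32_t h1 = (uint32_t)h, h2 = ((uint32_t)(h >> 32)) | 1;
    for (int i = 0; i < kNumProbes; ++i)
        m |= 1ull << ((h1 + i * h2) & 63);  // k bit positions within the word
    return m;
}

inline void word_insert(uint64_t& word, uint64_t h) { word |= mask_of(h); }

inline bool word_check(uint64_t word, uint64_t h) {
    uint64_t m = mask_of(h);
    return (word & m) == m;  // all k positions checked with a single AND
}
```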
https://github.com/apache/parquet-format/blob/master/BloomFi...
https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...
The first one is useful for grasping the idea; the second is more comprehensive. Both issue multiple bit loads but try to make them as independent as possible, as far as I can understand.
Also, the hash function has a huge effect on Bloom filter performance. I was getting 2x the perf using xxhash3 instead of wyhash, even though wyhash is the faster hash function afaik.
But another approach is to use C++ templating so you can have, say, 10 different 'fixed-size' implementations with no additional overhead, and select the most suitable size at runtime.
For the couple of kilobytes of extra code size, this optimisation has to be worth it assuming table size is variable and there are some stats to give cardinality estimates...
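Something like the following is what I'd imagine (purely illustrative; the sizes, k=1, and the factory thresholds are made up): each instantiation bakes its size in so indexing compiles down to a constant mask, and a runtime factory picks from a cardinality estimate:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Common interface so the runtime can hold any of the fixed-size variants.
struct IBloom {
    virtual ~IBloom() = default;
    virtual void insert(uint64_t h) = 0;
    virtual bool maybe_contains(uint64_t h) const = 0;
};

// Size is a template parameter: the modulo becomes a compile-time mask.
template <size_t LogWords>
struct FixedBloom : IBloom {
    static constexpr uint64_t kMask = (1ull << LogWords) - 1;
    uint64_t words[1ull << LogWords] = {};

    void insert(uint64_t h) override {
        words[(h >> 6) & kMask] |= 1ull << (h & 63);  // k=1 to keep the sketch short
    }
    bool maybe_contains(uint64_t h) const override {
        return (words[(h >> 6) & kMask] & (1ull << (h & 63))) != 0;
    }
};

// Runtime selection from a cardinality estimate (thresholds are arbitrary).
std::unique_ptr<IBloom> make_bloom(size_t expected_keys) {
    if (expected_keys <= 1000)  return std::make_unique<FixedBloom<8>>();
    if (expected_keys <= 10000) return std::make_unique<FixedBloom<11>>();
    return std::make_unique<FixedBloom<16>>();
}
```

The one virtual call per lookup is the price for the dispatch; in a batched probe loop you'd devirtualize once per batch rather than per key.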
Oh, and I don’t find myself actually implementing any of these very often or knowing that they are in use. I occasionally use things like APPROX_COUNT_DISTINCT in Snowflake[1], which is a HyperLogLog (linked in the Wiki).
[0]: https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...
[1]: https://docs.snowflake.com/en/sql-reference/functions/approx...
I've gotten interview questions best solved with them a few times; a Microsoft one involved spell-checking in extremely limited memory, and the interviewer told me that they'd actually been used for that back in the PDP era.
I learned about them after the first time I got grilled and rejected. Sucks to be the first company that grilled me about it; thanks for the tip, though. You just didn't stick around long enough to see how fast I learn.
https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2...
The ligature ← for << was a bit confusing.
And a nitpick: finding a match in a Bloom filter isn't a false positive; it's inconclusive.
Since Bloom filters are only designed to give negatives, never positives, the concept of a false positive is nonsensical. Yeah, I get what you mean, but language is important.
Back in the 1980s or earlier it was called a "false drop".
Knuth, for example, talks about it in "The Art of Computer programming", v3, section 6.5, "Retrieval on Secondary Keys", using cookie ingredients. (See https://archive.org/details/fileorganisation0000thar/mode/2u... for Tharp using the same example.)
Bloom filters are a type of superimposed coding.
But there are benchmark numbers at least, so maybe they only used it for the prose