It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented block bloom filters from there.
Overall, I think block Bloom filters should be the default most people reach for. They completely solve the cache locality issue (a single cache miss per element lookup) at the cost of only a ~10–15% space increase. I had a simple implementation running at something like 20ns per query with maybe k=9; a naive Bloom filter would be about 9x that.
There’s some discussion in the article about using a single hash to come up with various indexing locations, but it’s simpler to just think of block bloom filters as:
1. Hash-0 gets you the block index
2. Hash-1 through hash-k get you the bits inside the block
If your implementation slices up a single hash to divide it into multiple smaller hashes, that’s fine.
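A minimal sketch of that layout (my own illustrative code, not the article's): one 64-bit hash supplies everything, with the high bits serving as hash-0 for the block index and a double-hashing slice of the same hash standing in for hash-1 through hash-k:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Block Bloom filter sketch: one cache-line-sized block per element.
struct BlockBloom {
    static constexpr int kWordsPerBlock = 8;  // 8 x 64 bits = one 64-byte cache line
    static constexpr int kProbes = 8;         // "hash-1 .. hash-k"
    std::vector<uint64_t> words;              // blocks laid out contiguously
    size_t num_blocks;

    explicit BlockBloom(size_t blocks)
        : words(blocks * kWordsPerBlock, 0), num_blocks(blocks) {}

    // hash-0: high bits of the one hash pick the block.
    size_t block_of(uint64_t h) const {
        return ((h >> 32) % num_blocks) * kWordsPerBlock;
    }

    void insert(uint64_t h) {
        size_t base = block_of(h);
        // hash-1..k via double hashing over slices of the same 64-bit hash.
        uint32_t h1 = (uint32_t)h, h2 = ((uint32_t)(h >> 16)) | 1;
        for (int i = 0; i < kProbes; ++i) {
            uint32_t bit = (h1 + i * h2) % 512;  // bit position within the 512-bit block
            words[base + (bit >> 6)] |= 1ull << (bit & 63);
        }
    }

    bool maybe_contains(uint64_t h) const {  // touches exactly one cache line
        size_t base = block_of(h);
        uint32_t h1 = (uint32_t)h, h2 = ((uint32_t)(h >> 16)) | 1;
        for (int i = 0; i < kProbes; ++i) {
            uint32_t bit = (h1 + i * h2) % 512;
            if (!(words[base + (bit >> 6)] & (1ull << (bit & 63)))) return false;
        }
        return true;
    }
};
```

The remix scheme (h1 + i·h2) is just the standard Kirsch–Mitzenmacher trick; any way of slicing the one hash into k sub-hashes works the same way.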
I also find the use of atomics to build the filter confusing here. If you're doing a join, you're presumably doing a batch of hashes, so it'd be much more efficient to partition your Bloom filter, lock the partitions, and do a bulk insertion.
Bulk insertion: yes, if there are many keys, bulk insertion is faster. For xor filters, I used radix sort before insertion [4] (I should have documented the code better), but for fuse filters and blocked Bloom filters it might not be worth it, unless the filter is huge.
[1] https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-... [2] https://github.com/FastFilter/fastfilter_cpp/blob/master/src... [3] https://github.com/FastFilter/fastfilter_java/blob/master/fa... [4] https://github.com/FastFilter/fastfilter_cpp/blob/master/src...
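As a rough single-threaded sketch of the partition-then-bulk-insert idea (illustrative only; the grouping pass stands in for the radix sort, and each partition's loop could be handed to its own thread without atomics since the word ranges are disjoint):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Partitioned bulk insertion sketch: group hashes by destination partition
// first, then insert one partition at a time. Writes for each partition hit
// a disjoint range of words, so no atomics are needed per partition.
constexpr int kNumParts = 16;
constexpr int kK = 3;  // probes per key

inline size_t part_of(uint64_t h) { return h % kNumParts; }

void insert_one(std::vector<uint64_t>& filter, size_t words_per_part, uint64_t h) {
    size_t base = part_of(h) * words_per_part;
    uint32_t h1 = (uint32_t)(h >> 32), h2 = ((uint32_t)h) | 1;
    for (int i = 0; i < kK; ++i) {
        uint32_t bit = (h1 + i * h2) % (uint32_t)(words_per_part * 64);
        filter[base + (bit >> 6)] |= 1ull << (bit & 63);
    }
}

void bulk_insert(std::vector<uint64_t>& filter, std::vector<uint64_t> hashes) {
    size_t words_per_part = filter.size() / kNumParts;
    // Grouping pass (a radix sort by partition index would do the same job):
    std::sort(hashes.begin(), hashes.end(),
              [](uint64_t a, uint64_t b) { return part_of(a) < part_of(b); });
    for (uint64_t h : hashes) insert_one(filter, words_per_part, h);
}

bool maybe_contains(const std::vector<uint64_t>& filter, uint64_t h) {
    size_t words_per_part = filter.size() / kNumParts;
    size_t base = part_of(h) * words_per_part;
    uint32_t h1 = (uint32_t)(h >> 32), h2 = ((uint32_t)h) | 1;
    for (int i = 0; i < kK; ++i) {
        uint32_t bit = (h1 + i * h2) % (uint32_t)(words_per_part * 64);
        if (!(filter[base + (bit >> 6)] & (1ull << (bit & 63)))) return false;
    }
    return true;
}
```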
I was also limited by the constraint of legacy code. This project was not a complete rewrite, but just an idea: "can we use more information from the 32-bit hash we receive without regressing any perf?" We didn't have time for deep research or a rewrite, so I just wanted to show the result of this small exercise in how we can make things run better without rewriting the world.
I think this depends on how big your filters are. Most people think of Bloom filters as having hundreds of thousands of elements, but I frequently find them useful all the way down to 32 bits (!). (E.g., there are papers showing chained hash tables where each bucket has a co-sited tiny Bloom filter to check whether it's worth probing the chain.) In the "no man's land" in between, with a couple of tens of thousands of buckets, the blocking seems to be mostly negative; it only makes sense as long as you actually keep missing the cache.
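For a sense of how tiny that gets, here's a hedged sketch of a per-bucket 32-bit filter guarding a chain (the names and layout are mine, not taken from those papers):

```cpp
#include <cstdint>
#include <vector>

// A 32-bit Bloom filter (k=1) sitting next to each hash-table bucket:
// check one bit before bothering to walk the collision chain.
struct Bucket {
    uint32_t filter = 0;          // one bit per element's secondary hash
    std::vector<uint64_t> chain;  // stand-in for the bucket's collision chain
};

inline uint32_t tag(uint64_t h) { return 1u << ((h >> 32) & 31); }

void bucket_insert(Bucket& b, uint64_t key_hash) {
    b.filter |= tag(key_hash);
    b.chain.push_back(key_hash);
}

bool bucket_find(const Bucket& b, uint64_t key_hash) {
    if (!(b.filter & tag(key_hash))) return false;  // definite miss, skip the chain
    for (uint64_t h : b.chain)
        if (h == key_hash) return true;
    return false;
}
```

The filter costs 4 bytes per bucket and turns most misses into a single AND instead of a pointer chase.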
I doubt their hardware is any faster shuffling bits in a uint32 than a uint64, and using uint64 should have a decent benefit to the false positive rate...
1. Intra-element collision rate goes down: 1/32 = 3.1% vs 1/64 = 1.6%, a 1.5 percentage point difference. An intra-element collision doesn't guarantee a FP, though!
2. A 64-bit word would have less variance, so filter behavior would be more predictable.
But there is a downside: when switching from a 32-bit to a 64-bit word we also halve the number of array elements, doubling the contention. We populate the filter from up to 128 threads during the join phase. When the build side isn't significantly smaller than the probe side, that contention can outweigh the FP improvement.
With blocks this small there's also no reason not to optimize the number of hash functions (albeit this brings back the specter of saturation). There are no cache misses to worry about; all positions can be checked with a single mask.
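A sketch of the single-mask check (illustrative; the bit-derivation scheme is my own choice, not from the thread): build the k-bit mask once, then membership is one load, one AND, one compare:

```cpp
#include <cstdint>

// Single-word "block": all k probe positions live in one 64-bit word,
// so a lookup needs no loop over memory, just a mask comparison.
constexpr int kNumProbes = 4;

inline uint64_t mask_of(uint64_t h) {
    uint64_t m = 0;
    uint32_t h1 = (uint32_t)h, h2 = ((uint32_t)(h >> 32)) | 1;
    for (int i = 0; i < kNumProbes; ++i)
        m |= 1ull << ((h1 + i * h2) & 63);  // k bit positions within the word
    return m;
}

inline void word_insert(uint64_t& word, uint64_t h) { word |= mask_of(h); }

inline bool word_check(uint64_t word, uint64_t h) {
    uint64_t m = mask_of(h);
    return (word & m) == m;  // all k positions checked with a single AND
}
```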
https://github.com/apache/parquet-format/blob/master/BloomFi...
https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...
The first one is useful for grasping the idea; the second is more comprehensive. Both issue multiple bit loads but try to make them as independent as possible, as far as I can understand.
Also, the hash function has a huge effect on Bloom filter performance. I was getting 2x the perf using xxhash3 instead of wyhash, even though wyhash is the faster hash function afaik.
But another approach is to use C++ templating so you can have, say, 10 different 'fixed-size' implementations with no additional overhead, and select the most suitable size at runtime.
For the couple of kilobytes of extra code size, this optimisation has to be worth it assuming table size is variable and there are some stats to give cardinality estimates...
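Something like the following is what I'd imagine (purely illustrative; the sizes, k=1, and the factory thresholds are made up): each instantiation bakes its size in so indexing compiles down to a constant mask, and a runtime factory picks from a cardinality estimate:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Common interface so the runtime can hold any of the fixed-size variants.
struct IBloom {
    virtual ~IBloom() = default;
    virtual void insert(uint64_t h) = 0;
    virtual bool maybe_contains(uint64_t h) const = 0;
};

// Size is a template parameter: the modulo becomes a compile-time mask.
template <size_t LogWords>
struct FixedBloom : IBloom {
    static constexpr uint64_t kMask = (1ull << LogWords) - 1;
    uint64_t words[1ull << LogWords] = {};

    void insert(uint64_t h) override {
        words[(h >> 6) & kMask] |= 1ull << (h & 63);  // k=1 to keep the sketch short
    }
    bool maybe_contains(uint64_t h) const override {
        return (words[(h >> 6) & kMask] & (1ull << (h & 63))) != 0;
    }
};

// Runtime selection from a cardinality estimate (thresholds are arbitrary).
std::unique_ptr<IBloom> make_bloom(size_t expected_keys) {
    if (expected_keys <= 1000)  return std::make_unique<FixedBloom<8>>();
    if (expected_keys <= 10000) return std::make_unique<FixedBloom<11>>();
    return std::make_unique<FixedBloom<16>>();
}
```

The one virtual call per lookup is the price for the dispatch; in a batched probe loop you'd devirtualize once per batch rather than per key.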
Oh, and I don’t find myself actually implementing any of these very often or knowing that they are in use. I occasionally use things like APPROX_COUNT_DISTINCT in Snowflake[1], which is a HyperLogLog (linked in the Wiki).
[0]: https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...
[1]: https://docs.snowflake.com/en/sql-reference/functions/approx...
I've gotten interview questions best solved with them a few times; a Microsoft one involved spell-checking in extremely limited memory, and the interviewer told me that they'd actually been used for that back in the PDP era.
I learned about them after the first time I got grilled and rejected. Sucks to be the first company that grilled me about it; thanks for the tip, though. You just didn't stick around long enough to see how fast I learn.
https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2...
The ligature ← for << was a bit confusing.
And a nitpick: finding a match in a Bloom filter isn't a false positive; it's inconclusive.
Since Bloom filters are only designed to give negatives, never positives, the concept of a false positive is nonsensical. Yeah, I get what you mean, but language is important.
Back in the 1980s or earlier it was called a "false drop".
Knuth, for example, talks about it in "The Art of Computer programming", v3, section 6.5, "Retrieval on Secondary Keys", using cookie ingredients. (See https://archive.org/details/fileorganisation0000thar/mode/2u... for Tharp using the same example.)
Bloom filters are a type of superimposed coding.
But there are benchmark numbers at least, so maybe they only used it for the prose