Bloom Filters for the Perplexed

(sagi.io)

259 points | kedmi | 8y ago | 45 comments

voidmain | 8y ago

Bloom filters are a nice data structure, and you should absolutely have them in your toolbox, but if you go looking for a reason to use one you are likely to wind up making things worse. The following is not valid reasoning: "Bloom filters are efficient. Therefore if I can find a way to use a bloom filter, my solution will be efficient."

The "SSH keys" protocol in the article seems like an example of this. It doesn't make any sense. Why would the server send the client a Bloom filter if the client has already told it what key it wants to check? The server only has to send one bit back to the client! And if the goal is to not trust the server with the client's (public) key, this protocol doesn't accomplish that either.

And if you do for some reason have to transmit the entire database of compromised SSH keys in a way that permits only membership tests, a Bloom filter isn't the most compact way to do it! For example, off the top of my head, you could calculate a (15+N)-bit hash for each element of the list, sort the hashes, and rice code the deltas. That would take very roughly 32768 * (N+2) bits and give about 1 in 2^N false positives. So for N=13 it is about the size of the bloom filter in the article but gives a false positive rate 8 times lower. This data structure isn't random access like a Bloom filter, but that doesn't matter for something you are sending over the network (which is always linear in the size of the data anyway).
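A minimal sketch of the "sort the hashes and rice code the deltas" idea, under stated assumptions: SHA-256 truncation stands in for the unspecified hash, the names `build` and `contains` are hypothetical, and the output is a Python string of '0'/'1' characters for readability (a real encoder would pack the bits). `extra_bits` plays the role of N and must be at least 1.

```python
import hashlib


def hash_bits(item: bytes, bits: int) -> int:
    """Truncate SHA-256 to its top `bits` bits (bits <= 64). Assumed hash."""
    return int.from_bytes(hashlib.sha256(item).digest()[:8], "big") >> (64 - bits)


def build(items, log2_count: int, extra_bits: int) -> str:
    """Rice-code the deltas between the sorted (log2_count + extra_bits)-bit hashes."""
    hashes = sorted({hash_bits(x, log2_count + extra_bits) for x in items})
    out, prev = [], 0
    for h in hashes:
        delta, prev = h - prev, h
        quotient = delta >> extra_bits
        remainder = delta & ((1 << extra_bits) - 1)
        # Unary quotient, a terminating 0, then a fixed-width remainder.
        out.append("1" * quotient + "0" + format(remainder, f"0{extra_bits}b"))
    return "".join(out)


def contains(encoded: str, item: bytes, log2_count: int, extra_bits: int) -> bool:
    """Sequentially decode deltas until the target hash is reached or passed."""
    target = hash_bits(item, log2_count + extra_bits)
    i, current = 0, 0
    while i < len(encoded):
        quotient = 0
        while encoded[i] == "1":
            quotient += 1
            i += 1
        i += 1  # skip the terminating 0
        remainder = int(encoded[i:i + extra_bits], 2)
        i += extra_bits
        current += (quotient << extra_bits) | remainder
        if current == target:
            return True   # possibly a false positive (hash collision)
        if current > target:
            return False  # hashes are sorted, so it cannot appear later
    return False
```

Because the deltas sum to less than 2^(log2_count + extra_bits), the unary parts add up to at most 2^log2_count bits in total, which is where the rough (N+2) bits per element comes from. Lookup is a linear scan, matching the remark that the structure is not random access.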

gopalv | 8y ago

> The server only has to send one bit back to the client!

As much as the example is a bad one because it leaks server-side info to an unauthenticated client, I've had scenarios where, if you have more than 3 ssh keys in your key-chain, all ssh login attempts fail after 3 keys are tried. I end up writing ~/.ssh/config entries, a lot of them, for the client to remember which key to try first.

My favourite real-life bloom-filter story is the "unsafe URLs" list that is in Chrome - the "Safe Browsing Bloom" is a neat way to send out obscured information about the bad URLs without actually handing out a list to a user. The web URLs or domains which find a match in this, do need to be checked upstream, but it avoids having to check for every single request with a central service.

On a similar note, been playing with a variant of bloom filters at work called a Bloom-1 filter [1] which works much faster than a regular bloom filter which has a lot of random memory access for 1 bit reads.

[1] - https://github.com/prasanthj/bloomfilter/blob/master/core/sr...

papercrane | 8y ago

> The "SSH keys" protocol in the article seems like an example of this. It doesn't make any sense. Why would the server send the client a Bloom filter if the client has already told it what key it wants to check? The server only has to send one bit back to the client! And if the goal is to not trust the server with the client's (public) key, this protocol doesn't accomplish that either.

There is a footnote on the sequence diagram that the key is not sent to the server on the initial request. Rather the client just does a simple GET. Since it's just sending a static file the client could cache the bloom filter.

chrisweekly | 8y ago

+1 insightful and educational -- in more ways than one!

> "sort the hashes, and rice code the deltas"

Being quite sure that wasn't a racial slur, I looked it up:

> "Rice coding (invented by Robert F. Rice) denotes using a subset of the family of Golomb codes to produce a simpler (but possibly suboptimal) prefix code. Rice used this set of codes in an adaptive coding scheme; "Rice coding" can refer either to that adaptive scheme or to using that subset of Golomb codes." (https://en.wikipedia.org/wiki/Golomb_coding)

malkia | 8y ago

When I first heard of Bloom filters, I thought of the Bloom effect - https://en.wikipedia.org/wiki/Bloom_(shader_effect) - and we used to call these things filters for a while - then a friend of mine told me about the other meaning :)

QuercusMax | 8y ago

I think the first thing any article about Bloom filters should mention is that they were invented by a guy named Burton Bloom. They don't have anything to do with "blooming" of any sort.

Very similar with Shellsort, which was designed by Donald Shell.

jszymborski | 8y ago

I always thought it was called shell sort because it evoked the image of a shell game and the pointer was like a shell covering the array element!!

Thanks for the edification!!!

thefalcon | 8y ago

I'm also always having to perform the manual translation in my head from shader effect when I read this term.

captaintacos | 8y ago

Some weeks ago someone posted a very useful interactive demo of a bloom filter (implemented in js) that you might want to play with after reading this article: https://www.jasondavies.com/bloomfilter/

tzs | 8y ago

How does a Bloom filter with k hash functions hashing to a shared table of m bits compare to just using k hash functions each hashing into its own separate hash table of m/k bits?

cmurphycode | 8y ago

Having many filters is worse, but not hugely so. I believe the problem is reducible to the fact that you'll have many smaller filters thus the random distribution hurts you a bit more.

The key word to google is "blocked bloom filter" e.g. as proposed in http://algo2.iti.kit.edu/documents/cacheefficientbloomfilter...

Here's a nice paper with some improvements http://tfk.mit.edu/pdf/bloom.pdf

We use blocked bloom filters for a couple of reasons, but one major benefit is the memory locality (our "bloom filter" is 32GB or larger, so it's handy & fast to be able to address it with separate "pages" which are really just individual bloom filters.)
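A rough sketch of the blocked layout described above, not the poster's implementation: one hash picks a block, and all k probe bits for an item land inside that block, so a lookup touches a single small region of memory. The class name, the 512-bit block size, k = 6, and the use of SHA-256 with double hashing are all assumptions made for illustration.

```python
import hashlib


class BlockedBloomFilter:
    """Each block is an independent small Bloom filter stored as one int bitset."""

    def __init__(self, num_blocks: int, block_bits: int = 512, k: int = 6):
        self.num_blocks = num_blocks
        self.block_bits = block_bits
        self.k = k
        self.blocks = [0] * num_blocks

    def _block_and_mask(self, item: bytes):
        digest = hashlib.sha256(item).digest()
        # The first 8 bytes choose the block; the next 16 drive double hashing.
        block = int.from_bytes(digest[:8], "big") % self.num_blocks
        h1 = int.from_bytes(digest[8:16], "big")
        h2 = int.from_bytes(digest[16:24], "big") | 1  # odd step
        mask = 0
        for i in range(self.k):
            mask |= 1 << ((h1 + i * h2) % self.block_bits)
        return block, mask

    def add(self, item: bytes) -> None:
        block, mask = self._block_and_mask(item)
        self.blocks[block] |= mask

    def might_contain(self, item: bytes) -> bool:
        block, mask = self._block_and_mask(item)
        return self.blocks[block] & mask == mask
```

The trade-off matches the comment: blocks fill unevenly, so the false positive rate is somewhat worse than one large filter of the same total size, in exchange for locality.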

openasocket | 8y ago

It leads to more false positives. Following the derivation from wikipedia (https://en.wikipedia.org/wiki/Bloom_filter), if we have a bloom filter with k hash functions, m bits, and contains n elements, the probability of a false positive is

(1 - (1 - 1/m)^(kn))^k ~ (1 - exp(-nk/m))^k

With your idea (let's call that Bloom Filter'), we have k hash tables with m/k bits each, so the probability that a given bit of one of the hash tables is 1 after inserting n elements is 1 - (1 - k/m)^n, and the odds that one random position in each of the k hash tables is 1 (i.e. a false positive) is:

(1 - (1 - k/m)^n)^k ~ (1 - exp(-nk/m))^k

Which has the same approximation as the regular filter, because (1 - k/m)^n ~ exp(-nk/m) as well. So to see which one has the lower false positive rate for a given m and k we have to compare the exact expressions. We do some algebra, assuming that your Bloom Filter' has a lower false positive rate than the regular Bloom Filter, and try to get a contradiction:

(1 - (1 - k/m)^n)^k < (1 - (1 - 1/m)^(kn))^k

1 - (1 - k/m)^n < 1 - (1 - 1/m)^(kn)

(1 - 1/m)^(kn) < (1 - k/m)^n

(1 - 1/m)^k < 1 - k/m

This is a contradiction, because Bernoulli's inequality gives (1 - 1/m)^k >= 1 - k/m for every k >= 1. Thus, Bloom Filter' never has a better false positive rate than a regular Bloom Filter with the same m and k, although for k much smaller than m the difference is tiny.

P.S. I just did the math in the last 10 minutes, so there could be mistakes. This also only shows your system is less accurate if it has the same m and k as a regular bloom filter, but maybe your system becomes more accurate if using a different value of k. I'm checking that possibility now.

UPDATE: I tried to find the optimum value of k given n and m for bloom filter' (for regular bloom filters the optimum is k = (m/n) ln 2). Since both filters share the approximation (1 - exp(-nk/m))^k, the optimum sits at essentially the same k, and at any given k the exact rate of bloom filter' is at least that of the regular filter by the argument above. Thus, the bloom filter' will never outperform a regular bloom filter with optimal k, but it comes very close.
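A quick numeric check of the two exact expressions from the Wikipedia-style derivation, (1 - (1 - 1/m)^(kn))^k for the shared table and (1 - (1 - k/m)^n)^k for k separate tables, may help here. The function names and the values m = 10000, n = 1000 are arbitrary assumptions chosen only for illustration.

```python
import math


def fp_shared(m: int, k: int, n: int) -> float:
    """False positive rate for one shared table of m bits (independence assumed)."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def fp_separate(m: int, k: int, n: int) -> float:
    """False positive rate for k separate tables of m/k bits each."""
    return (1 - (1 - k / m) ** n) ** k


def fp_approx(m: int, k: int, n: int) -> float:
    """The exponential approximation, which both exact forms converge to."""
    return (1 - math.exp(-n * k / m)) ** k


if __name__ == "__main__":
    m, n = 10_000, 1_000
    for k in (1, 4, 7, 10):
        print(k, fp_shared(m, k, n), fp_separate(m, k, n), fp_approx(m, k, n))
```

For these parameters the separate-table variant comes out slightly higher at each k, and both variants are minimized near k = (m/n) ln 2.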

caraffle | 8y ago

What you're suggesting sounds like a count min sketch, like a 2D bloom filter.

jason_slack | 8y ago

Can anyone tell me about a use case in game design?

andybak | 8y ago

That's an interesting question. I wonder if there's some potential application in collision detection.

Godel_unicode | 8y ago

If you have a large number of moving parties you could make bloom filters of the tiles they're going to move through and then compare them to cull pairs which will never share tiles along their paths perhaps?
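One way that culling idea might look, as a sketch only: each path becomes a fixed-size bit signature of its tiles, and two paths can share a tile only if their signatures have a bit in common. A single hash per tile (k = 1) is assumed here because the intersection test only needs one shared bit; `M_BITS`, the function names, and the (x, y) tile representation are all hypothetical.

```python
import hashlib

M_BITS = 1024  # signature size per path (assumed)


def tile_bit(tile) -> int:
    """Map an (x, y) tile to one bit position using a truncated SHA-256."""
    x, y = tile
    digest = hashlib.sha256(f"{x},{y}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % M_BITS


def path_filter(tiles) -> int:
    """Build the bit signature for all tiles a party will move through."""
    bits = 0
    for tile in tiles:
        bits |= 1 << tile_bit(tile)
    return bits


def may_share_tile(a: int, b: int) -> bool:
    """False means the paths definitely share no tile; True means maybe."""
    return (a & b) != 0


def candidate_pairs(paths):
    """Return index pairs that survive the cull and need an exact check."""
    filters = [path_filter(p) for p in paths]
    return [(i, j)
            for i in range(len(filters))
            for j in range(i + 1, len(filters))
            if may_share_tile(filters[i], filters[j])]
```

Pairs that survive still need an exact test, since unrelated tiles can hash to the same bit.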

notgood | 8y ago

This link[0] has an example; so basically almost anything in a game where it has to look up massive amounts of data (e.g. logs), the example in the article is to quickly check if the user has already seen an item (in a game with thousands of items [e.g. because those were pseudo-randomly generated, think RPGs]). But it's easy to think of further applications: quickly check if the player already played this chess game before (all pieces were in the same position) to make sure the enemy does something smarter this time (because that time the enemy lost), and such.

[0] https://blog.demofox.org/2015/02/08/estimating-set-membershi...

hvidgaard | 8y ago

That is not a good usecase for bloomfilters. With a bloomfilter you can tell if something has not been added/done. But you cannot tell if the opposite is true. In technical terms, false negatives are not possible, but false positives absolutely are, and should be expected. It can tell you: "This is not in the set", or "This is possibly in the set".

With that in mind, bloomfilters should be used for things where you are only interested to know if it is not in the set, or when you can tolerate a given false positive rate and size it accordingly.

The first example will give false positives and be hostile to players that have not attacked. That might be okay, but it is not what you expect, and you might as well use probability for that. For the second example I suppose you can use it to only try things you haven't tried before, but it seems weird.

That said, the link you provided is actually a good post about the topic, and includes some good insight. It's not trying to hide that bloom filters are No or Maybe.
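The "No or Maybe" behaviour described above can be seen in a small filter. This is a minimal sketch, assuming SHA-256 with double hashing to derive the k positions; the class and method names are illustrative and not taken from the linked post.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k indexes from two 64-bit slices of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        """False is a definite 'no'; True only means 'maybe'."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Every added item reports True afterwards, so there are no false negatives. An item that was never added usually reports False, but occasionally reports True, at a rate set by m, k, and the number of items inserted.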

Retric | 8y ago

It's not clear to me that the added complexity aka risk of bugs is worth it in your case for any reasonable game design.

In the chess example, knowing it's possible to lose from X positions is almost meaningless most of the time. The issue is search space sizes are either too large to be useful or small enough for brute force.

jason_slack | 8y ago

This is a great application and something I can actually learn and use now. What a way to motivate learning :-)

oli5679 | 8y ago

Every month or so there is a new version of an article like this posted on hn

endorphone | 8y ago

At this point I believe these articles are the result of people who got over their own confusion/misunderstanding by writing a new article on bloom filters. Eventually there will be a 1:1 ratio of bloom filter articles / developers.

lalaithion | 8y ago

So, bloom filters are the equivalent of Monads? https://byorgey.wordpress.com/2009/01/12/abstraction-intuiti...

bradleyjg | 8y ago

I read a few of these articles a couple of years ago. Along with a few of the inevitable cuckoo hash filter rejoinders.

I think they are neat algorithms and I'm glad to have come across them. But that said, I have yet to find a problem in my day to day work which required set membership, with space at a premium, and where false positives were acceptable. So I've never used either in anger.

manish_gill | 8y ago

I use them at work as a cache during the data ingestion phase (analytics). I have to store a unique URL for each page the user is at, and each page generates a lot of requests. So I store the URLs inside a Bloom Filter, hitting the DB only when the contain() returns False. It's a neat little thing that saves me thousands of unnecessary database hits per second.
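The pattern described above might look roughly like this. It is a sketch under assumptions: the `UrlDeduper` name, the plain Python set standing in for the database, and the filter sizing are all hypothetical, and a "maybe" answer is treated as "already stored", so a rare false positive means one URL is skipped.

```python
import hashlib

M_BITS = 1 << 16  # filter size in bits (assumed)
K_HASHES = 5      # probes per URL (assumed)


def positions(url: str):
    """Derive K_HASHES bit positions from one SHA-256 digest (double hashing)."""
    digest = hashlib.sha256(url.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % M_BITS for i in range(K_HASHES)]


class UrlDeduper:
    def __init__(self, db: set):
        self.db = db  # stand-in for the real database
        self.bits = bytearray(M_BITS // 8)
        self.db_writes = 0

    def ingest(self, url: str) -> bool:
        """Write the URL to the DB only if the filter says it is definitely new."""
        pos = positions(url)
        if all(self.bits[p // 8] & (1 << (p % 8)) for p in pos):
            return False  # "maybe seen": skip the database entirely
        for p in pos:
            self.bits[p // 8] |= 1 << (p % 8)
        self.db.add(url)
        self.db_writes += 1
        return True
```

Repeated requests for the same page never reach the database after the first one, which is where the savings come from.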

3131s | 8y ago

I have used them for text segmentation. It's an extremely quick way to test for membership on a set (30+ million tokens in my case) that would otherwise be too large to hold in main memory.

arielweisberg | 8y ago

If you are bored with bloom and cuckoo filters then check out quotient filters. Quotienting was one of those mind blown things for me.

Jake232 | 8y ago

They're very useful in large scale web crawling/scraping. I use them for a number of things in this field.

mlevental | 8y ago

you scoff but this is a very thorough presentation of a bloomfilter - very few of these sorts of articles actually cover the computation of the probability bounds.

niftich | 8y ago

Yeah, there have been a good number [1] and a steady stream of submissions about Bloom filters (and truly, the inevitable re-riff about Cuckoo filters), but this article is toward the higher end of the quality scale.

It's a bit odd that a data structure attracts this kind of attention, but not all of it is about self-discovery, and the fact that people feel like writing about them suggests that they either consider them a novelty, or expect members of their intended audience to consider them as such. Hopefully with time, we will reach a saturation point where most people (including beginners) are familiar with Bloom filters because they've been formally taught or read one of these articles.

[1] https://hn.algolia.com/?query=bloom+filter&sort=byDate&type=...

foo101 | 8y ago

The Wikipedia article on bloom filter is already pretty thorough and discusses the computation of the probability bounds as well as the optimal parameters: https://en.wikipedia.org/wiki/Bloom_filter

dang | 8y ago

Wasn't that more like 2011? I was thinking "oldie".
