For example, if you search for "iPhone 5S", the filter determines whether to show something like this http://i.imgur.com/Dp3y1Gi.png (not sure if sponsored makes a difference here, possibly a bad example).
> a "simple bloom filter" for displaying product results at the top of a search page if it determines you have searched for a product.
"have searched" is not clear to me. I was expecting Google would just spit out a JSON response with all the search query for page 1 (including sponsored ads). Can you please elaborate on this please?
No. I think you might be parsing his statement wrong - it's not "have searched" as in searched for previously, it's "have searched for a PRODUCT" as in, your search is for a product (something purchasable in a store, preferably an online store) as opposed to something else.
The Bloom filter figures out if your search is for a product, and if it is, puts that box w/ sponsored links to stores that sell that product up at the top of your results.
That the amount of information required to say whether an element is in a set or not is significantly less than the set itself.
That is just weird, but amazing.
If someone asks me about weird and wonderful things in computer science, this is what I talk about.
Like reviewing a passenger list where everyone's name is abbreviated to initials.
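To make that concrete, here's a toy sketch of the add/query cycle (Python; the class name, sizes, and the salted-hash trick are my own choices, not from any particular library):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k indices by salting a single hash; real implementations
        # usually use faster non-cryptographic hashes or double hashing.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "probably is".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("iPhone 5S")
print(bf.might_contain("iPhone 5S"))  # True
print(bf.might_contain("toaster"))    # almost certainly False
```

Note the asymmetry: a negative answer is exact, only positives are probabilistic — which is exactly why the "passenger list of initials" analogy works.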
(HyperLogLogs are vaguely similar data structures in that you hash things a bunch into a compact summary, except instead of set-membership queries, they're better at cardinality estimation.)
[0] http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...
http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-...
The Aggregate Knowledge blog is mind blowing for everything sketch related.
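For intuition, here's a crude sketch of the HyperLogLog idea (Python; illustrative only — the function name is mine, and real implementations add bias corrections for the small- and large-range regimes):

```python
import hashlib

def rough_cardinality(items, bucket_bits=10):
    """Crude HyperLogLog-style estimate: hash each item, route it to one of
    m = 2**bucket_bits registers, and keep the max "rank" (position of the
    first 1-bit in the rest of the hash) seen per register."""
    m = 1 << bucket_bits
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        bucket = h & (m - 1)        # low bits pick a register
        rest = h >> bucket_bits
        rank = 1                    # position of the first 1-bit
        while rest & 1 == 0 and rank < 64:
            rest >>= 1
            rank += 1
        registers[bucket] = max(registers[bucket], rank)
    # Harmonic mean of per-register estimates, with the standard alpha constant.
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(rough_cardinality(range(100000))))  # typically within a few percent of 100000
```

The striking part is the same as with Bloom filters: 1024 small registers summarize the cardinality of an arbitrarily large stream.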
[0] http://www.cs.sunysb.edu/~algorith/
[1] http://www.cs.sunysb.edu/~algorith/files/dictionaries.shtml
[2] https://www.springer.com/computer/theoretical+computer+scien...
[3] http://xlinux.nist.gov/dads/
[4] https://en.wikipedia.org/wiki/Dictionary_of_Algorithms_and_D...
But, your question is entirely valid and I'm not really sure why you couldn't do that. It might simply be that the kinds of hash functions used in bloom filters (non-cryptographic) are not well distributed.
[1]: http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4060...
http://hur.st/bloomfilter?n=4&p=1.0E-20
Here you can give it a false positive rate you're happy with and a number of elements. It'll give you the optimal size and number of hashes to use. Your suggestion would use far too much space.
The nice thing about Bloom filters is that (if your hash functions are good) you can compute the vector length from the number of set elements and the false-positive probability that is acceptable in your application.
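Concretely, the standard formulas are m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hashes, which is what calculators like the one linked above compute. A quick sketch (Python; function name is mine):

```python
import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash count k for n elements at an
    acceptable false-positive rate p (standard Bloom filter formulas)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k)  # about 9.6 million bits (~1.2 MB) and 7 hashes
```

So a million elements at a 1% false-positive rate costs roughly 9.6 bits per element, regardless of how big the elements themselves are.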
Here's a paper using Bloom filters in metagenomic classification (my relative* area): http://bioinformatics.oxfordjournals.org/content/26/13/1595....
In this vein, a friend and I are researching/implementing a probabilistic key-value store for some bioinformatics applications (one is metagenomic organism and gene identification). It's fast and space-efficient, just like a BF (though obviously less so as it stores keys and not single-set-membership). Any use cases for that kind of thing in your sub-field? Always trying to figure out interesting new applications (we aren't ready to write it all up yet, but hope to at some point not too far down the road).
* I'm really just a dabbler / don't have a formal bioinformatics background. My friend is the genetics PhD.
Fun to implement, I'd recommend having a go. Here's our terrible Clojure code: https://gist.github.com/timruffles/7195405
One thing I don't quite get - as you're hashing into the same bit array for each of the hashes - you must get false positives from combinations of hashes. So say you chose a bad combination of hash algorithms where one key hashed into bits a and b using the two different hashes respectively. Maybe some other key might hash into b and a. With a totally suboptimal choice of hash algorithms you could end up with a 50% error rate. Or am I misunderstanding something?
If you get a false back, you can guarantee that some item is not in the filter. If you get a true back, you can be fairly certain it is. Once you have a degree of certainty, you can do a more thorough check for membership while cutting out large numbers of negatives.
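On the "bad combination of hash algorithms" worry upthread: a common way to sidestep choosing k independent hashes is double hashing (Kirsch & Mitzenmacher), which derives all k indices from two base hashes. A sketch (Python; the function name and the choice of SHA-256 halves as base hashes are my own):

```python
import hashlib

def bloom_positions(item, k, m):
    """Derive k bit positions via double hashing:
    g_i(x) = h1(x) + i * h2(x) mod m.
    The derived indices behave nearly as well as k independent hashes."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, so strides vary
    return [(h1 + i * h2) % m for i in range(k)]

print(bloom_positions("iPhone 5S", k=3, m=1024))
```

This doesn't eliminate false positives — two keys can still land on the same set of bits, which is exactly what the standard false-positive formula accounts for — but it avoids hand-picking k hash algorithms that might correlate badly.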
"The more hash functions you have, the slower your bloom filter, and the quicker it fills up. *If you have too few, however, you may suffer too many false positives.*"
Just made me wonder - couldn't badly chosen hash functions actually give you more false positives?

E.g. one particular Dutch natural language parser uses a left-corner algorithm. To prune the search space, the parser removes hypotheses with unlikely splines, since they will probably not lead to a good parse (the probability of a spline occurring in a good parse is estimated by parsing a large corpus).
Since this parser is written in Prolog, splines are represented as Prolog terms. So a natural approach is to assert likely splines as Prolog facts. Although most Prolog implementations do first-argument indexing and the splines are ground (contain no variables), it's not an ideal lookup:
- Asserting all the terms requires a lot of memory.
- Lookup requires hashing the term and possibly doing multiple comparisons when there is more than one term with that hash.
This is also an application where a false positive does not matter - it means that you will not prune a particular spline. It can increase parsing time (since it increases the number of hypotheses), but since it prunes fewer splines, you are not throwing away parses.
For this reason, we replaced the 'splines as Prolog facts' approach with a Bloom filter: you hash the Prolog term one or more times and use constant-time lookup in a bit array. This change made this functionality faster and a lot more compact[2].
As a matter of fact, I implemented this so that at compile time the hashes are computed, the array is constructed and then written as a C file. This is linked in to the Prolog module that does the guided parsing. So, in binary versions of the parser the Bloom filter is actually never computed[3].
[1] There is an understandable description here: http://www.let.rug.nl/vannoord/Presentations/Athens09/Main/p...
[2] Of course, a middle way would be to assert the hashes as facts. But it's still not nearly as efficient.
[3] Fun fact: SICStus and SWI-Prolog use different term hashing methods, so the Bloom filter is Prolog-implementation specific ;).
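The "compute the filter at build time and link in a C table" step can be sketched like this (Python; a toy packer with made-up names, not the actual parser tooling):

```python
def emit_bloom_as_c(bits, name="bloom_bits"):
    """Pack a list of booleans into bytes and emit a C array definition,
    mirroring the 'compute at compile time, write a C file' idea above."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 1 << (i % 8)  # bit i -> byte i//8, bit i%8
    body = ", ".join(f"0x{b:02x}" for b in packed)
    return (f"const unsigned char {name}[{len(packed)}] = {{ {body} }};\n"
            f"const unsigned long {name}_len_bits = {len(bits)}UL;")

# A 9-bit example filter with bits 0, 3 and 8 set:
print(emit_bloom_as_c([True, False, False, True, False, False, False, False, True]))
```

The binary then just indexes into the linked-in array at query time; nothing about the filter is recomputed at startup.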
http://www.coolsnap.net/kevin/?p=19
P.S. Loved the tutorial.
Used to! I had them in earlier versions of the article but they changed that code.
The conditions are relevant for some cloud-managed appliances, mobile-phone applications, and sometimes caching techniques in your own distributed applications.