For example, if you search for "iPhone 5S", the filter determines whether to show something like this http://i.imgur.com/Dp3y1Gi.png (not sure if sponsored makes a difference here, possibly a bad example).
> a "simple bloom filter" for displaying product results at the top of a search page if it determines you have searched for a product.
"have searched" is not clear to me. I was expecting Google would just spit out a JSON response with all the search query for page 1 (including sponsored ads). Can you please elaborate on this please?
No. I think you might be parsing his statement wrong - it's not "have searched" as in searched for previously, it's "have searched for a PRODUCT" as in, your search is for a product (something purchasable in a store, preferably an online store) as opposed to something else.
The Bloom filter figures out if your search is for a product, and if it is, puts that box w/ sponsored links to stores that sell that product up at the top of your results.
That the amount of information required to say whether an element is in a set or not is significantly less than the set itself.
That is just weird, but amazing.
If someone asks me about weird and wonderful things in computer science, this is what I talk about.
Like reviewing a passenger list where everyone's name is abbreviated to initials.
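To make that concrete, here's a toy sketch of the add/query cycle (Python; the class name, sizes, and the salted-hash trick are my own choices, not from any particular library):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k indices by salting a single hash; real implementations
        # usually use faster non-cryptographic hashes or double hashing.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "probably is".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("iPhone 5S")
print(bf.might_contain("iPhone 5S"))  # True
print(bf.might_contain("toaster"))    # almost certainly False
```

Note the asymmetry: a negative answer is exact, only positives are probabilistic — which is exactly why the "passenger list of initials" analogy works.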
(HyperLogLogs are vaguely similar data structures in that you hash things a bunch into a compact summary, except instead of set-membership queries, they're better at cardinality estimation.)
[0] http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinali...
http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-...
The Aggregate Knowledge blog is mind blowing for everything sketch related.
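For intuition, here's a crude sketch of the HyperLogLog idea (Python; illustrative only — the function name is mine, and real implementations add bias corrections for the small- and large-range regimes):

```python
import hashlib

def rough_cardinality(items, bucket_bits=10):
    """Crude HyperLogLog-style estimate: hash each item, route it to one of
    m = 2**bucket_bits registers, and keep the max "rank" (position of the
    first 1-bit in the rest of the hash) seen per register."""
    m = 1 << bucket_bits
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        bucket = h & (m - 1)        # low bits pick a register
        rest = h >> bucket_bits
        rank = 1                    # position of the first 1-bit
        while rest & 1 == 0 and rank < 64:
            rest >>= 1
            rank += 1
        registers[bucket] = max(registers[bucket], rank)
    # Harmonic mean of per-register estimates, with the standard alpha constant.
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(rough_cardinality(range(100000))))  # typically within a few percent of 100000
```

The striking part is the same as with Bloom filters: 1024 small registers summarize the cardinality of an arbitrarily large stream.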
[0] http://www.cs.sunysb.edu/~algorith/
[1] http://www.cs.sunysb.edu/~algorith/files/dictionaries.shtml
[2] https://www.springer.com/computer/theoretical+computer+scien...
[3] http://xlinux.nist.gov/dads/
[4] https://en.wikipedia.org/wiki/Dictionary_of_Algorithms_and_D...
But, your question is entirely valid and I'm not really sure why you couldn't do that. It might simply be that the kinds of hash functions used in bloom filters (non-cryptographic) are not well distributed.
[1]: http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4060...
http://hur.st/bloomfilter?n=4&p=1.0E-20
Here you can give it a false positive rate you're happy with and a number of elements. It'll give you the optimal size and number of hashes to use. Your suggestion would use far too much space.
The nice thing about Bloom filters is that (if your hash functions are good) you can compute the vector length from the number of set elements and the false-positive probability that is acceptable in your application.
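Concretely, the standard formulas are m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hashes, which is what calculators like the one linked above compute. A quick sketch (Python; function name is mine):

```python
import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash count k for n elements at an
    acceptable false-positive rate p (standard Bloom filter formulas)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k)  # about 9.6 million bits (~1.2 MB) and 7 hashes
```

So a million elements at a 1% false-positive rate costs roughly 9.6 bits per element, regardless of how big the elements themselves are.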
Here's a paper using Bloom filters in metagenomic classification (my relative* area): http://bioinformatics.oxfordjournals.org/content/26/13/1595....
In this vein, a friend and I are researching/implementing a probabilistic key-value store for some bioinformatics applications (one is metagenomic organism and gene identification). It's fast and space-efficient, just like a BF (though obviously less so as it stores keys and not single-set-membership). Any use cases for that kind of thing in your sub-field? Always trying to figure out interesting new applications (we aren't ready to write it all up yet, but hope to at some point not too far down the road).
* I'm really just a dabbler / don't have a formal bioinformatics background. My friend is the genetics PhD.
Fun to implement, I'd recommend having a go. Here's our terrible Clojure code: https://gist.github.com/timruffles/7195405
One thing I don't quite get - as you're hashing into the same bit array for each of the hashes - you must get false positives from combinations of hashes. So say you chose a bad combination of hash algorithms where one key hashed into bits a and b using the two different hashes respectively. Maybe some other key might hash into b and a. With a totally suboptimal choice of hash algorithms you could end up with a 50% error rate. Or am I misunderstanding something?
If you get a false back, you can guarantee that some item is not in the filter. If you get a true back, you can be fairly certain it is. Once you have a degree of certainty, you can do a more thorough check for membership while cutting out large numbers of negatives.
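On the "bad combination of hash algorithms" worry upthread: a common way to sidestep choosing k independent hashes is double hashing (Kirsch & Mitzenmacher), which derives all k indices from two base hashes. A sketch (Python; the function name and the choice of SHA-256 halves as base hashes are my own):

```python
import hashlib

def bloom_positions(item, k, m):
    """Derive k bit positions via double hashing:
    g_i(x) = h1(x) + i * h2(x) mod m.
    The derived indices behave nearly as well as k independent hashes."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, so strides vary
    return [(h1 + i * h2) % m for i in range(k)]

print(bloom_positions("iPhone 5S", k=3, m=1024))
```

This doesn't eliminate false positives — two keys can still land on the same set of bits, which is exactly what the standard false-positive formula accounts for — but it avoids hand-picking k hash algorithms that might correlate badly.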
"The more hash functions you have, the slower your bloom filter, and the quicker it fills up. *If you have too few, however, you may suffer too many false positives.*"
Just made me wonder - couldn't badly chosen hash functions actually give you more false positives?

E.g. one particular Dutch natural language parser uses a left-corner algorithm. To prune the search space, the parser removes hypotheses with unlikely splines, since they will probably not lead to a good parse (the probability of a spline occurring in a good parse is estimated by parsing a large corpus).
Since this parser is written in Prolog, splines are represented as Prolog terms. So a natural approach is to assert likely splines as Prolog facts. Although most Prolog implementations do first-argument indexing and the splines are ground (contain no variables), it's not an ideal lookup:
- Asserting all the terms requires a lot of memory.
- Lookup requires hashing the term and possibly doing multiple comparisons when there is more than one term with that hash.
This is also an application where a false positive does not matter - it means that you will not prune a particular spline. It can increase parsing time (since it increases the number of hypotheses), but since it prunes fewer splines, you are not throwing away parses.
For this reason, we replaced the 'splines as Prolog facts' approach with a Bloom filter: you hash the Prolog term one or more times and use constant-time lookup in a bit array. This change made this functionality faster and a lot more compact[2].
As a matter of fact, I implemented this so that at compile time the hashes are computed, the array is constructed and then written as a C file. This is linked in to the Prolog module that does the guided parsing. So, in binary versions of the parser the Bloom filter is actually never computed[3].
[1] There is an understandable description here: http://www.let.rug.nl/vannoord/Presentations/Athens09/Main/p...
[2] Of course, a middle way would be to assert the hashes as facts. But it's still not nearly as efficient.
[3] Fun fact: SICStus and SWI-Prolog use different term hashing methods, so the Bloom filter is Prolog-implementation specific ;).
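The "compute the filter at build time and link in a C table" step can be sketched like this (Python; a toy packer with made-up names, not the actual parser tooling):

```python
def emit_bloom_as_c(bits, name="bloom_bits"):
    """Pack a list of booleans into bytes and emit a C array definition,
    mirroring the 'compute at compile time, write a C file' idea above."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 1 << (i % 8)  # bit i -> byte i//8, bit i%8
    body = ", ".join(f"0x{b:02x}" for b in packed)
    return (f"const unsigned char {name}[{len(packed)}] = {{ {body} }};\n"
            f"const unsigned long {name}_len_bits = {len(bits)}UL;")

# A 9-bit example filter with bits 0, 3 and 8 set:
print(emit_bloom_as_c([True, False, False, True, False, False, False, False, True]))
```

The binary then just indexes into the linked-in array at query time; nothing about the filter is recomputed at startup.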
http://www.coolsnap.net/kevin/?p=19
P.S. Loved the tutorial.
Used to! I had them in earlier versions of the article but they changed that code.
The conditions are relevant for some cloud-managed appliances, mobile-phone applications, and sometimes caching techniques in your own distributed applications.