h1(i) = 0
h2(i) = 5
h3(i) = 7
and upon insertion, the bloom filter would resemble: 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1
to check if an item is the bloom filter, you check the corresponding bits. After you add items to a bloom filter there's some probability that the corresponding bits for items that haven't been added are also set. So it can sometimes lie to you and say something is included, even when it's not.IBLTs replace the bits with counts of items added, and a sum of the key and value. So where Cs is the count sum, Ks is the Key sum and Vs is the value sum, the filter might look like;
Cs=1 Ks=10 Vs=20 | Cs=0 Ks=0 Vs=0 | ...
the positions in the table are computed just as before; the count in the count increments for every item added, and the key and value "sums" are a union of the keys and values that have been added (this might be a simple rolling count of a small checksum, or an xor).Items can be removed from the table individually, by subtracting them from the relevant counts (just like a counting bloom filter) and rolling sums. A "dump" of the table can be derived by iterating through all possible values (you need to know what they might be) and deleting them one by one, and it takes trickery to support duplicate entries.
This seems like a big downside; if the input set is constrained and there is so small a finite set of inputs that you can just iterate over them, then I think it is easier to use composite numbers as your invertible lookup table.
Here's how; for each potential input element, assign it a unique prime number, keep this assignment in a static lookup table. Start the table with the value "1". To store an element, multiply the current table value by the element's value. A composite number will be generated that is the product of all elements inserted into the table.
To check for element presence; just use mod. Duplicates work; if the value is still congruent, then the element is still there. To remove an element, just divide by its value. A surprisingly large number of elements can be represented this way in a 64k-sized composite number.
A good example of this this technique is to model Internet AS paths; give every AS number a unique prime number (there are only about 100k of them) and model AS path as the composite of its component ASes; using vectorized DIV/MOD instructions it is incredibly fast to large graph operations.
In your example, just storing the static lookup table with one prime number for each key in the universe costs about |U| log |U| bits. Furthermore, each cell needs to store a product of these numbers, and even if you can bound with k the number of keys colliding in the same cell, you need k log |U| bits for each cell.
This is very inefficient, and if you think about it you can achieve the same functionality with a standard hash table, which takes less space. In general, every time you think "I could use prime factorization to store sets", there is usually a more efficient solution, probably involving bitvectors.
The usage scenario of IBLTs is quite different from the usual scenario of storing key-value mappings: it is useful when you have an algorithm that adds and delete keys, and at any time during the execution of the algorithm the number of distinct keys can be arbitrarily large, but you know that at the end of the algorithm the number of distinct keys is "small". With a standard hash table, you would need a space that is proportional to the maximum number of distinct elements reached during the execution of the algorithm; with IBLTs you only need space that is proportional to the final set size.
A toy example is the following: someone tells you a set of n numbers, and then tells you again all the numbers in the set except for k; you want to return the k missing numbers. With a hash table you would need O(n) space, with a IBLT you just need O(k).
To achieve this, they use a standard trick in Bloom-like data structures (and perfect hashes, etc.): the property that there is always a cell with exactly one element; this property is preserved when you remove that one element, so there will be another cell with exactly one element, and so on, until you can retrieve all of them. This property is guaranteed with random hash functions with the right table size, and can be proved by showing that the hypergraph induced by the hash mappings is "peelable", which is a generalization of a standard graph with no cycles, i.e. a tree: in a tree there is always a node with degree 1, and when you remove that you leave at least another node with degree 1, and so on.
It sounds like you're proposing that they iterate through all possible keys, check if the key is in the table, and if it is, dump that key and delete it. This would take O(2^b) time, where b is the number of bits in the key. Obviously, for reasonably-sized b, this can get a bit out of hand.
The paper takes a different approach. It looks for a cell with Cs = 1. Cs = 1 means Ks and Vs represent exactly a key-value pair. Dump that (Ks,Vs) pair, then delete (Ks,Vs) from the table. In the process of deleting, there will probably be some other cell that goes from Cs = 2 to Cs = 1. continue from there. Eventually, you should get all Cs = 0. If you ever get stuck, you're unable to dump the entire list.
That method takes O(n), where n is the number of elements in the table, and the fastest you could expect to dump any kind of list.
AS paths are ordered; what interesting graph algorithms could be performed on paths in which order has been lost?
And of course ideally they'd release source code of their implementations, and release all of their data, and so on... and we'd all ride ponies...
Not particularly bad going by the standard of some of the papers I've read.
1.3 Applications and Usage Cases
* Database Reconciliation
* Tracking Network Acknowledgments
* Oblivious Selection from a Table
Hacker News is "news" in the sense that a used clothing store near my house is called "New To You" :)
- users have various permissions
- other objects in the system require one or more permissions to be viewable by those users
you can accomplish this easily with joins alone, but it becomes a performance issue rather quickly. I have a large-scale system and my intuition was that supplemental indexes based on bloom filters could help me solve this problem. i haven't figured out an exact way to apply them however.