Given N objects, hash each object and store the lowest M (s.t. M << N) hashes, then calculate what percentage of hash space is covered and compare to how much of the hash space would be expected to be covered.
This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.
[1] My naive Python implementation for tests on the NLP WSJ corpus -- https://github.com/Smerity/Snippets/blob/master/analytics/ap...
[2]: Szl's implementation -- http://code.google.com/p/szl/source/browse/trunk/src/emitter...
I'm a bit confused -- isn't that exactly how the article proceeds?
"Size estimation" is sufficient; no need to get fancy with words you don't know the meaning of.
Cardinality refers to the size of any set. This includes finite, countable, and uncountable sets.
The sequence of cardinal numbers is transfinite: 0, 1, 2, ... , n, ..., aleph_0, aleph_1, ...
[1] http://en.wikipedia.org/wiki/Cardinality [2] http://en.wikipedia.org/wiki/Cardinal_number
If you want to write things that leave no room for you being mistaken, it is a good idea to double check what you are saying.
http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASim...
Note that this algorithms fails silently if there is not majority element, which limits the scope of its utility.
I assume that the reason it took so long to discover is that no one needed the solution before. Are there applications that can benefit form this algorithm?
In retrospect, it makes sense stochastically, but still a Damn Cool Algorithm indeed.
Quite a few cryptographers have learned this the hard way!
What's awesome is you can then make inferences about the original data from those hashes -- something that good hash functions are supposed to be resistant to, in isolation.
All the algorithms can process data in a streaming fashion, though - they only require a single pass.