Damn Cool Algorithms: Cardinality Estimation (opens in new tab)

(blog.notdot.net)

233 pointsnicksdjohnson13y ago30 comments

30 comments

25 comments · 7 top-level

Smerity13y ago· 6 in thread

Just to mention a modification on the "simple and intuitive cardinality estimator" that's far more accurate and actually used for Google's SZL:

Given N objects, hash each object and store the lowest M (s.t. M << N) hashes, then calculate what percentage of hash space is covered and compare to how much of the hash space would be expected to be covered.

This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

[1] My naive Python implementation for tests on the NLP WSJ corpus -- https://github.com/Smerity/Snippets/blob/master/analytics/ap...

[2]: Szl's implementation -- http://code.google.com/p/szl/source/browse/trunk/src/emitter...

nicksdjohnsonOP13y ago

I actually cover that in the post (well, more or less - I talk about hashing to remove bias, after talking about the 'min element' algorithm. According to the papers cited, though, taking the count of leading zeroes is more space efficient, allowing you to have more buckets in the same amount of space.

shardling13y ago

>This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

I'm a bit confused -- isn't that exactly how the article proceeds?

Smerity13y ago

The hashing isn't the point. The leading zeroes method is more likely to produce unreliable results when you can't guarantee the dataset will be large in advance whilst the "keep min M elements" method works fine for unexpectedly small sets. That's why it's used for SZL where the user can request approximate unique counts on any data regardless of size.

1 more reply

spicyj13y ago

Do you know off the top of your head if this is more or less efficient/accurate than the algorithm in the post?

robotresearcher13y ago

It's the same algorithm.

asharp13y ago

Does this method have a name/are there any reasonable papers out there describing it?

brilee13y ago· 5 in thread

A minor nitpick... 'cardinality' in mathematics refers to a set of infinities. The set of all integers has a certain cardinality, which is equal to the cardinality of all rational numbers, but which is smaller than the cardinality of all irrational numbers.

"Size estimation" is sufficient; no need to get fancy with words you don't know the meaning of.

lotharbot13y ago

His usage is correct.

Cardinality refers to the size of any set. This includes finite, countable, and uncountable sets.

The sequence of cardinal numbers is transfinite: 0, 1, 2, ... , n, ..., aleph_0, aleph_1, ...

http://en.wikipedia.org/wiki/Cardinality

http://en.wikipedia.org/wiki/Cardinal_number

code013y ago

The OP is correct. Cardinality in mathematics is the number of elements in a set[1]. What you are probably referring to is the concept of cardinal numbers[2]. I guess you can skip the smug tone.

[1] http://en.wikipedia.org/wiki/Cardinality [2] http://en.wikipedia.org/wiki/Cardinal_number

Sniffnoy13y ago

He can't even claim that. Cardinal numbers include the finite cardinals, after all.

arnarbi13y ago

> no need to get fancy with words you don't know the meaning of.

If you want to write things that leave no room for you being mistaken, it is a good idea to double check what you are saying.

https://www.google.com/webhp?#hl=en&q=cardinality&tb...

mikeash13y ago

The irony of your crushingly ignorant post nearly killed me.

psykotic13y ago· 3 in thread

I can't resist mentioning a favorite algorithm of mine for solving a related problem. It generalizes the algorithm for the classic problem of finding the majority element (one which occurs at least 50% of the time) with a single pass over a stream, using a constant amount of space, and a second sieving pass. It's one of those simple gems you wonder why it took so long to discover.

http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASim...

nicksdjohnsonOP13y ago

Very nice, thanks! It might be too similar for a post all of its own, but I think it'd be a worthy followup to this post.

1 more reply

lurker1413y ago

Direct link to the pre-generalized: https://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html

Note that this algorithms fails silently if there is not majority element, which limits the scope of its utility.

I assume that the reason it took so long to discover is that no one needed the solution before. Are there applications that can benefit form this algorithm?

beder13y ago

You can do a second pass over the list to check that your solution is actually a majority element. This maintains the linear time and constant space properties.

1 more reply

jallmann13y ago· 3 in thread

That is mind-blowing. Intuitively, a hash digest is completely uncorrelated with its input, but this shows that you can make inferences about a collection of data based on its hashes.

In retrospect, it makes sense stochastically, but still a Damn Cool Algorithm indeed.

Daniel_Newby13y ago

> ... you can make inferences about a collection of data based on its hashes.

Quite a few cryptographers have learned this the hard way!

lurker1413y ago

Isn't that the point? That hashing maps an arbitrary set of numbers to an approximately uniform scale-invariant distribution.

jallmann13y ago

That's not what I said. Generating a predictable distribution by hashing is old hat.

What's awesome is you can then make inferences about the original data from those hashes -- something that good hash functions are supposed to be resistant to, in isolation.

1 more reply

_delirium13y ago· 1 in thread

Fwiw, I believe the best estimator has been improved on since HyperLogLog, with a more recent result that is provably optimal (and slightly faster asymptotically, dropping the loglog factor), which perhaps more importantly can also process streamed data online: http://people.seas.harvard.edu/~minilek/papers/f0.pdf

nicksdjohnsonOP13y ago

A friend mentioned the existence of this, but I couldn't find it myself. Thanks for pointing it out.

All the algorithms can process data in a streaming fashion, though - they only require a single pass.

bochi13y ago

If you liked this post, you will probably find this article about other probabilistic data structures interesting too: http://highlyscalable.wordpress.com/2012/05/01/probabilistic...

tripzilch13y ago

Very cool algorithm. I wonder, can anyone explain in what kind of real-world application it would be used?

j / k navigate · click thread line to collapse

30 comments

25 comments · 7 top-level

Smerity13y ago· 6 in thread

Just to mention a modification on the "simple and intuitive cardinality estimator" that's far more accurate and actually used for Google's SZL:

This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

[1] My naive Python implementation for tests on the NLP WSJ corpus -- https://github.com/Smerity/Snippets/blob/master/analytics/ap...

[2]: Szl's implementation -- http://code.google.com/p/szl/source/browse/trunk/src/emitter...

nicksdjohnsonOP13y ago

shardling13y ago

>This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

I'm a bit confused -- isn't that exactly how the article proceeds?

Smerity13y ago

1 more reply

spicyj13y ago

Do you know off the top of your head if this is more or less efficient/accurate than the algorithm in the post?

robotresearcher13y ago

It's the same algorithm.

asharp13y ago

Does this method have a name/are there any reasonable papers out there describing it?

brilee13y ago· 5 in thread

"Size estimation" is sufficient; no need to get fancy with words you don't know the meaning of.

lotharbot13y ago

His usage is correct.

Cardinality refers to the size of any set. This includes finite, countable, and uncountable sets.

The sequence of cardinal numbers is transfinite: 0, 1, 2, ... , n, ..., aleph_0, aleph_1, ...

http://en.wikipedia.org/wiki/Cardinality

http://en.wikipedia.org/wiki/Cardinal_number

code013y ago

The OP is correct. Cardinality in mathematics is the number of elements in a set[1]. What you are probably referring to is the concept of cardinal numbers[2]. I guess you can skip the smug tone.

[1] http://en.wikipedia.org/wiki/Cardinality [2] http://en.wikipedia.org/wiki/Cardinal_number

Sniffnoy13y ago

He can't even claim that. Cardinal numbers include the finite cardinals, after all.

arnarbi13y ago

> no need to get fancy with words you don't know the meaning of.

If you want to write things that leave no room for you being mistaken, it is a good idea to double check what you are saying.

https://www.google.com/webhp?#hl=en&q=cardinality&tb...

mikeash13y ago

The irony of your crushingly ignorant post nearly killed me.

psykotic13y ago· 3 in thread

http://www.cs.yale.edu/homes/el327/datamining2011aFiles/ASim...

nicksdjohnsonOP13y ago

Very nice, thanks! It might be too similar for a post all of its own, but I think it'd be a worthy followup to this post.

1 more reply

lurker1413y ago

Direct link to the pre-generalized: https://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html

Note that this algorithms fails silently if there is not majority element, which limits the scope of its utility.

I assume that the reason it took so long to discover is that no one needed the solution before. Are there applications that can benefit form this algorithm?

beder13y ago

You can do a second pass over the list to check that your solution is actually a majority element. This maintains the linear time and constant space properties.

1 more reply

jallmann13y ago· 3 in thread

That is mind-blowing. Intuitively, a hash digest is completely uncorrelated with its input, but this shows that you can make inferences about a collection of data based on its hashes.

In retrospect, it makes sense stochastically, but still a Damn Cool Algorithm indeed.

Daniel_Newby13y ago

> ... you can make inferences about a collection of data based on its hashes.

Quite a few cryptographers have learned this the hard way!

lurker1413y ago

Isn't that the point? That hashing maps an arbitrary set of numbers to an approximately uniform scale-invariant distribution.

jallmann13y ago

That's not what I said. Generating a predictable distribution by hashing is old hat.

What's awesome is you can then make inferences about the original data from those hashes -- something that good hash functions are supposed to be resistant to, in isolation.

1 more reply

_delirium13y ago· 1 in thread

nicksdjohnsonOP13y ago

A friend mentioned the existence of this, but I couldn't find it myself. Thanks for pointing it out.

All the algorithms can process data in a streaming fashion, though - they only require a single pass.

bochi13y ago

If you liked this post, you will probably find this article about other probabilistic data structures interesting too: http://highlyscalable.wordpress.com/2012/05/01/probabilistic...

tripzilch13y ago

Very cool algorithm. I wonder, can anyone explain in what kind of real-world application it would be used?

j / k navigate · click thread line to collapse