Redis comes with both bitfields (see http://redis.io/commands/bitcount) and HyperLogLog counters (see http://redis.io/commands/pfcount), requires almost no setup, and has minimal overhead.
"Just add another database!"
> Some of you, who have ventured deep into the bowels of databases, will probably point out that doing something like this in a real setup is committing concurrency suicide. All updates to the same row will essentially be executed serially which is no bueno if you're trying to build a performant data pipeline.
Implements bloom filters
SELECT user_id IN (SELECT DISTINCT user_id FROM user_actions);
is not valid SQL. You may mean something like:
SELECT 123 IN (SELECT DISTINCT user_id FROM user_actions);
which is a strange query, as it's equivalent to:
SELECT 123 IN (SELECT user_id FROM user_actions);
SELECT <user_id> FROM (SELECT DISTINCT user_id FROM user_actions);
You're absolutely right that both those queries will give the same result. I guess I was trying to motivate the basic problem of finding whether some user exists in a set of users, and `SELECT DISTINCT` is the SQL way of representing a set.
Fixed the post, thanks!
It doesn't help that using unnecessary DISTINCTs in subqueries is a common performance problem in novice SQL. Why people do that I don't really understand, but they do.
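For anyone who wants to check the equivalence themselves, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the `membership` helper are made up for illustration, and SQLite is only a stand-in here; other engines may plan these queries differently. It runs the `IN` query with and without `DISTINCT`, plus an `EXISTS` form of the same question, and compares the answers.

```python
import sqlite3

# In-memory database with a hypothetical user_actions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_actions (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO user_actions VALUES (?, ?)",
    [(123, "click"), (123, "view"), (456, "click")],
)

# Three ways of asking "has this user done anything?".
QUERIES = {
    "in_distinct": "SELECT ? IN (SELECT DISTINCT user_id FROM user_actions)",
    "in_plain": "SELECT ? IN (SELECT user_id FROM user_actions)",
    "exists": "SELECT EXISTS (SELECT 1 FROM user_actions WHERE user_id = ?)",
}


def membership(user_id):
    """Run every query for one user_id and return the boolean answers."""
    return {
        name: bool(conn.execute(sql, (user_id,)).fetchone()[0])
        for name, sql in QUERIES.items()
    }


present = membership(123)  # user with rows in the table
absent = membership(789)   # user with no rows
```

All three forms agree for both users, which is the point being made above: the DISTINCT adds nothing to the result of a membership test.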
That's the thing about probabilistic data structures - I've never seen a real-world performance problem in SQL where they would have been helpful. I really would like to have an "aha" moment where somebody shows me one.
Probabilistic data structures do seem like a natural match for streaming databases, but that's different.
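For readers who haven't met one, this is roughly what a Bloom filter looks like. It is a toy sketch, not the implementation from the post or from any particular library: the class name, the sizes (1024 bits, 3 hashes), and the use of SHA-256 with a seed prefix are all assumptions chosen for readability, not for performance.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen_users = BloomFilter()
for user_id in (123, 456):
    seen_users.add(user_id)
```

The trade-off is fixed memory and no stored keys in exchange for an occasional wrong "yes", which is why it suits unbounded streams better than a table you can already index.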
SELECT EXISTS (SELECT 1 FROM user_actions WHERE user_id = 123);
lead to a better execution plan?
We actually support a HyperLogLog backed COUNT DISTINCT aggregate too: http://docs.pipelinedb.com/aggregates.html#general-aggregate...
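To give a feel for what a HyperLogLog backed distinct count does, here is a textbook-style sketch in Python. It is not PipelineDB's or Redis's actual implementation; the class name, the precision of 12 bits (4096 registers), and the SHA-256 based hash are assumptions for illustration only.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog estimator (sketch, not production code)."""

    def __init__(self, precision=12):
        self.p = precision            # bits used to pick a register
        self.m = 1 << precision       # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)                       # top p bits
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Bias constant; this approximation is meant for m >= 128.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# 60,000 events from 20,000 distinct (hypothetical) users.
hll = HyperLogLog()
for i in range(60000):
    hll.add(f"user-{i % 20000}")
estimate = hll.count()
```

With 4096 registers the expected relative error is around 1.04 / sqrt(4096), or roughly 1.6%, while the state stays at a fixed 4096 small integers no matter how many events arrive.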
Continuous views are consumers for streams. You can think of them as high throughput real-time materialized views. The source of data for the stream can be practically anything. Logical decoding on the other hand is a producer of streaming data--it's basically a human readable replication log. So you could potentially stream the logically decoded log into PipelineDB and build some continuous views in front of it.
Thanks!