Making Postgres Bloom (opens in new tab)

(pipelinedb.com)

69 pointsusman-m10y ago23 comments

23 comments

21 comments · 7 top-level

pbnjay10y ago· 5 in thread

Don't get me wrong, I love Postgres and use it in pretty much all of my projects... but for this kind of application it's not very well suited. Leave your relational data for the database and use something more efficient!

Redis comes with both bitfields (see http://redis.io/commands/bitcount) and hyperloglog counters (see http://redis.io/commands/pfcount), requires almost no setup and has very minimal overhead.

Zikes10y ago

Foreign Data Wrappers make that even easier: http://www.craigkerstiens.com/2012/10/18/connecting_to_redis...

danneu10y ago

Classic premature optimization.

"Just add another database!"

mycelium10y ago

From the article:

> Some of you, who have ventured deep into the bowels of databases, will probably point out that doing something like this in a real setup is committing concurrency suicide. All updates to the same row will essentially be executed serially which is no bueno if you're trying to build a performant data pipeline.

pbnjay10y ago

LOL. Versus implementing bloom filters in a relational database?

Gigablah10y ago

"Let's not over-engineer though."

Implements bloom filters

gleb10y ago· 3 in thread

The example doesn't quite make sense:

  SELECT user_id IN (SELECT DISTINCT user_id FROM user_actions);

is not valid SQL. You may mean something like:

  SELECT 123 IN (SELECT DISTINCT user_id FROM user_actions);

which is a strange query, as it's equivalent to:

  SELECT 123 IN (SELECT user_id FROM user_actions);

usman-mOP10y ago

Oops, it was meant to be:

  SELECT <user_id> FROM (SELECT DISTINCT user_id FROM user_actions);

You're absolutely right that both those queries will give the same result. I guess I was trying to motivate the basic problem of finding whether some user exists in a set of users, and `SELECT DISTINCT` is the SQL way of representing a set.

Fixed the post, thanks!

gleb10y ago

I'd put more effort into setting up a believable problem in these kind of posts, before presenting a solution. Much like in a company pitch, it's hard to understand the value of product if you don't understand what problem it is trying to solve.

It doesn't help that using unnecessary DISTINCTs is subqueries is a common performance problem in novice SQL. Why people do that I don't really understand, but they do.

That's the thing about probabilistic data structures - I've never seen a real-world performance problem in SQL where they would have been helpful. I really would like to have an "aha" moment where somebody shows me one.

Probabilistic data structures do seem like a natural match for streaming databases, but that's different.

teddyh10y ago

Wouldn’t

  SELECT EXISTS (SELECT 1 FROM user_actions WHERE user_id = 123);

lead to a better execution plan?

matsur10y ago· 2 in thread

Semi-related in the land of Postgres and probabilistic data structures -- Redshift supports APPROXIMATE COUNT. Much, much faster than a raw COUNT, and their stated error is +-2%

http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html

striking10y ago

I'd eat my hat if it didn't use a Bloom filter in some way.

usman-mOP10y ago

It probably uses a HyperLogLog--the 2% error rate kind of gives it away. Bloom filters approximate set membership queries, HyperLogLogs approximate set cardinality queries. COUNT DISTINCT is a set cardinality query.

We actually support a HyperLogLog backed COUNT DISTINCT aggregate too: http://docs.pipelinedb.com/aggregates.html#general-aggregate...

1 more reply

ahachete10y ago· 2 in thread

usman-m, the approach of PipelineDB seems really interesting. However, I'd like to understand how in your opinion it compares with processing the stream of data changes accessed over PostgreSQL's logical decoding (http://www.postgresql.org/docs/9.4/static/logicaldecoding.ht...) interface. Thank you

usman-mOP10y ago

ahachete, I'm not sure if I totally understand your question.

Continuous views are consumers for streams. You can think of them as high throughput real-time materialized views. The source of data for the stream can be practically anything. Logical decoding on the other hand is a producer of streaming data--it's basically a human readable replication log. So you could potentially stream the logically decoded log into PipelineDB and build some continuous views in front of it.

ahachete10y ago

I was thinking of a system where data is extracted from the source database and then the (changes) data are processed real-time by a software that consumes this stream. So other than the obvious differences (need to write the software, SQL support) what would be the real advantage of using PipelineDB over a system with PostgreSQL+logical decoding+stream processing of that data?

Thanks!

mamikonyana10y ago· 1 in thread

When you say adding more online algorithms, is that mostly going to be limited to adding more probabilistic data structures?

usman-mOP10y ago

Mostly that. I've also been thinking about how we could incorporate some machine learning algorithms, like online perceptrons.

jordibunster10y ago· 1 in thread

Internally using hashtext(), which is not a good idea for a bloom filter for a few reasons, one of which is http://www.postgresql.org/message-id/CABUevExTx2whgSpKaoMVow...

usman-mOP10y ago

Our native implementations of all probabilistic data structures use MurmurHash3, so this isn't a problem. The dumbloom implementation is in no way a good Bloom filter, as the name suggests :)

zallarak10y ago

The idea of using probabilistic data structures instead of counting every point of data (for things like customer analytics) is pretty significant -- getting caught in the weeds of managing every data point is error-prone and inefficient.

j / k navigate · click thread line to collapse

23 comments

21 comments · 7 top-level

pbnjay10y ago· 5 in thread

Redis comes with both bitfields (see http://redis.io/commands/bitcount) and hyperloglog counters (see http://redis.io/commands/pfcount), requires almost no setup and has very minimal overhead.

Zikes10y ago

Foreign Data Wrappers make that even easier: http://www.craigkerstiens.com/2012/10/18/connecting_to_redis...

danneu10y ago

Classic premature optimization.

"Just add another database!"

mycelium10y ago

From the article:

pbnjay10y ago

LOL. Versus implementing bloom filters in a relational database?

Gigablah10y ago

"Let's not over-engineer though."

Implements bloom filters

gleb10y ago· 3 in thread

The example doesn't quite make sense:

  SELECT user_id IN (SELECT DISTINCT user_id FROM user_actions);

is not valid SQL. You may mean something like:

  SELECT 123 IN (SELECT DISTINCT user_id FROM user_actions);

which is a strange query, as it's equivalent to:

  SELECT 123 IN (SELECT user_id FROM user_actions);

usman-mOP10y ago

Oops, it was meant to be:

  SELECT <user_id> FROM (SELECT DISTINCT user_id FROM user_actions);

Fixed the post, thanks!

gleb10y ago

It doesn't help that using unnecessary DISTINCTs is subqueries is a common performance problem in novice SQL. Why people do that I don't really understand, but they do.

Probabilistic data structures do seem like a natural match for streaming databases, but that's different.

teddyh10y ago

Wouldn’t

  SELECT EXISTS (SELECT 1 FROM user_actions WHERE user_id = 123);

lead to a better execution plan?

matsur10y ago· 2 in thread

Semi-related in the land of Postgres and probabilistic data structures -- Redshift supports APPROXIMATE COUNT. Much, much faster than a raw COUNT, and their stated error is +-2%

http://docs.aws.amazon.com/redshift/latest/dg/r_COUNT.html

striking10y ago

I'd eat my hat if it didn't use a Bloom filter in some way.

usman-mOP10y ago

We actually support a HyperLogLog backed COUNT DISTINCT aggregate too: http://docs.pipelinedb.com/aggregates.html#general-aggregate...

1 more reply

ahachete10y ago· 2 in thread

usman-mOP10y ago

ahachete, I'm not sure if I totally understand your question.

ahachete10y ago

Thanks!

mamikonyana10y ago· 1 in thread

When you say adding more online algorithms, is that mostly going to be limited to adding more probabilistic data structures?

usman-mOP10y ago

Mostly that. I've also been thinking about how we could incorporate some machine learning algorithms, like online perceptrons.

jordibunster10y ago· 1 in thread

Internally using hashtext(), which is not a good idea for a bloom filter for a few reasons, one of which is http://www.postgresql.org/message-id/CABUevExTx2whgSpKaoMVow...

usman-mOP10y ago

Our native implementations of all probabilistic data structures use MurmurHash3, so this isn't a problem. The dumbloom implementation is in no way a good Bloom filter, as the name suggests :)

zallarak10y ago

j / k navigate · click thread line to collapse