Pg_ClickHouse: A Postgres extension for querying ClickHouse (opens in new tab)

(clickhouse.com)

117 pointsspathak6mo ago45 comments

45 comments

29 comments · 6 top-level

oulipo26mo ago· 8 in thread

I'm using Postgres as my base business database, and thinking now about linking it to either DuckDb/DuckLake or Clickhouse...

what would you recommend and why?

I understand part of the interest of pg_clickhouse is to be able to use "pre-existing Postgres queries" on an analytical database without having to change anything, so if I am building my database now and have no legacy, would pg_clickhouse make sense, or should I do analytics differently?

Also, would you have some kind of tutorial / sample setup of a typical business application in Postgres and kind of replication in clickhouse to make analytics queries? so I can see how Clickhouse would be typically used?

exagolo6mo ago

If not having to adjust queries is a major driver for your considerations, then I would highly recommend looking at SQLGlot (https://github.com/tobymao/sqlglot), a transpiler that makes you (more) independent of query dialects. They already support 30 dialects (big vendors such as Snowflake, Databricks, BigQuery, but also loads of the specialists such as ClickHouse, SingleStore or Exasol). Repo is maintained extremely well.

Picking the best solution for your concrete workload (and your future demands) should be equally important to the implementation effort, to avoid that you run into walls later on. At least as long as data volume, query complexity or concurrency scalability can be challenges.

jascha_eng6mo ago

Depending on your workload you might also be able to use Timescale to have very fast analytical queries inside postgres directly. That avoids having to replicate the data altogether.

Note that I work for the company that built timescale (Tiger Data). Clickhouse is cool though, just throwing another option into the ring.

Tbf in terms of speed Clickhouse pulls ahead on most benchmark, unless you want to join a lot with your postgres data directly then you might benefit from having everything in one place. And of course you avoid the sync overhead.

oulipo26mo ago

I'm indeed already using Timescaledb, I was wondering if I would really gain something from adding clickhouse

2 more replies

mritchie7126mo ago

We released a meltano target for DuckLake[0]. dlt has one now too. Pretty easy to sync pg -> ducklake.

I've been really happy with DuckLake, happy to answer any questions about it.

DuckDB has always felt easier to use vs. Clickhouse for me, but both are great options. If I were you, I'd try both options for a few hours with your use case and pick the one that feels better.

0 - https://www.definite.app/blog/target-ducklake

saisrirampur6mo ago

I love DuckDB from a product perspective and appreciate the engineering excellence behind it. However, DuckDB was primarily built for seamless for in-process analytics, data science, data-preparation/ETL workloads than real-time customer facing analytics.

ClickHouse’s bread and butter is real-time analytics for customer-facing applications, which often come with demanding concurrency and latency requirements.

Ack, totally makes sense that both are amazing technologies - you could try both and test them at the scale your real-time application may reach, and then choose the technology that best fits your needs. :)

1 more reply

oulipo26mo ago

Nice, what would be your typical setup?

You keep like 1 year's worth of data in your "business database", and then archive the rest in S3 with parquet and query with DuckDB ?

And if you want to sync everything, even "current data", to do datascience/analytics, can you just write the recent data (eg the last week of data or whatever) in S3 every hours/days to get relatively up-to-date data? And doesn't that cause the S3 data to grow needlessly (eg does it replace, rather than store an additional copy of recent data each hour?)

Do you have kind of "starter project" for a Postgres + DuckLake integration that I could look at to see how it's used in practice, and how it makes some operations easier?

1 more reply

saisrirampur6mo ago

Great question! If you’re starting a greenfield application, pg_clickhouse makes a lot of sense since you’ll be using a unified query layer for your application.

Now, coming to your question about replication: you can use PeerDB (acquired by ClickHouse https://github.com/PeerDB-io/peerdb), which is laser-focused and battle-tested at scale for Postgres-to-ClickHouse replication. Once the data is replicated into ClickHouse, you can start querying those tables from within Postgres using pg_clickhouse. In ClickHouse Cloud, we offer ClickPipes for Postgres CDC/replication, which is a managed service version of PeerDB and is tightly integrated with ClickHouse. Now there could be non-transcational tables that you can directly ingest to ClickHouse and still query using pg_clickhouse.

So TL;DR: Postgres for OLTP; ClickHouse for OLAP; PeerDB/ClickPipes for data replication; pg_clickhouse as the unified query layer. We are actively working on making this entire stack tightly integrated so that building real-time apps becomes seamless. More on that soon! :)

oulipo26mo ago

Nice! Right now I'm using Timescaledb, do you think it makes sense to move to a Postgres+CH setup instead? or only if I hit the limit of timescaledb?

Also what would be the benefit for me of querying clickhouse from Postgres, rather than directly through my backend via an ORM/SDK? is that because it would allow me to do JOINs?

What would be the typical setup if I want to JOIN analytical data (eg my IoT device readings) from CH with some business data (eg the user owning the device) from my Postgres? Would I replicate that business data to CH to do the join there, or would that be typically the exact use-case for pg_clickhouse?

1 more reply

onedognight6mo ago· 6 in thread

The name of the project is a reference to P. G. Wodehouse[0] for those unaware.

[0] https://www.gutenberg.org/ebooks/author/783

sevg6mo ago

Hmm, no.

It’s just like all the other postgres extensions named “pg_foo”, and the clear and obvious choice for “foo” in this case is “clickhouse”.

Unless this is some bad joke that has flown over my head.

justtheory6mo ago

I will never un-see it now, tbh

__s6mo ago

definitely a joke, not even that bad

DetroitThrow6mo ago

"I am never wrong, sir" -onedognight

yayitswei6mo ago

That was my first impression as well.

justtheory6mo ago

LOL

tempest_6mo ago· 4 in thread

This is nice because there are a lot of clickhouse fdw implementations and none of them are well maintained from what I can tell.

saisrirampur6mo ago

Appreciate you chiming in! We evaluated almost all the FDWs and landed on clickhouse_fdw (built by Ildus) as the most mature option. However, it hadn’t been maintained since 2020. We used it as the base, and the goal is to take it to the next level.

Our main focus is comprehensive pushdown capabilities. It was very surprising to see how much the Postgres FDW framework has evolved over the years and the number and types of hooks it now provides for push down. This is why we decided to lean into FDW than build an extension bottoms up. But we may still do that within pg_clickhouse for a few features, wherever FDW framework becomes a restriction.

We’ve made notable progress over the last few months, including support for pushdown of custom aggregations and SEMI JOINs/basic subqueries. Fourteen of twenty-two TPCH queries are now fully pushdownable.

We’ll be doubling down to add pushdown support for much more complex queries, CTEs, window functions, and more. More on the future here - https://github.com/ClickHouse/pg_clickhouse?tab=readme-ov-fi... All with the goal of enabling users to build fast analytics from the Postgres layer itself but still using the power of ClickHouse!

DetroitThrow6mo ago

>All with the goal of enabling users to build fast analytics from the Postgres layer itself but still using the power of ClickHouse!

That would be incredible! So many times I want to reach for ClickHouse but whatever company I'm at has so much inertia built into PG. Pleease add CTE support.

And yes I'm aware of PeerDB or whatever that project is called. This is still or even more helpful.

2 more replies

kiwicopple6mo ago

(Note: we work closely with the clickhouse team so this is not to intended to detract from their launch, simply to point out maintained options.)

Our CH wrapper is actively maintained, with push down, parameterized views, and async streaming: https://supabase.github.io/wrappers/catalog/clickhouse/

We see a lot of companies choosing CH with PG - it’s fantastic

saisrirampur6mo ago

Thank you, Paul! Great to see Supabase wrappers evolve. I really love the async streaming feature. It helps address use cases involving (reliably) moving larger datasets from ClickHouse to Postgres for supporting (stricter) transactional workloads.

Very excited to continue working closely to further integrate these amazing open source database technologies and make it easier for users. :)

graovic6mo ago· 4 in thread

This is pretty good. It will allow us to use PostgREST as an API endpoint to query the ClickHouse database directly

saisrirampur6mo ago

Good idea! Btw, ClickHouse does provide a HTTP interface directly, too! https://clickhouse.com/docs/interfaces/http

justtheory6mo ago

Ooh, neat idea!

oulipo26mo ago

What are the typical uses of PostgREST? is it just when you want to make your database accessible to various languages over HTTP because you don't want to use an ORM and connect to your db? But besides that, for an entreprise solution, why would you use PostgREST to develop your backend rather than, say, use an ORM in your language and make direct queries? (honest question)

lillecarl6mo ago

You skip the backend entirely and query from the frontend. PostgREST and Postgres is your backend. If you want extra sauce on top you route those paths to an application that does whatever extra imperative operations you need.

2 more replies

N_Lens6mo ago· 1 in thread

At this stage it may be possible to build one's entire application stack inside of postgres extensions.

justtheory6mo ago

Yes. YES! That's the idea. :-)

Olshansky6mo ago

Added to https://github.com/Olshansk/postgres_for_everything.

j / k navigate · click thread line to collapse

45 comments

29 comments · 6 top-level

oulipo26mo ago· 8 in thread

I'm using Postgres as my base business database, and thinking now about linking it to either DuckDb/DuckLake or Clickhouse...

what would you recommend and why?

exagolo6mo ago

jascha_eng6mo ago

Depending on your workload you might also be able to use Timescale to have very fast analytical queries inside postgres directly. That avoids having to replicate the data altogether.

Note that I work for the company that built timescale (Tiger Data). Clickhouse is cool though, just throwing another option into the ring.

oulipo26mo ago

I'm indeed already using Timescaledb, I was wondering if I would really gain something from adding clickhouse

2 more replies

mritchie7126mo ago

We released a meltano target for DuckLake[0]. dlt has one now too. Pretty easy to sync pg -> ducklake.

I've been really happy with DuckLake, happy to answer any questions about it.

DuckDB has always felt easier to use vs. Clickhouse for me, but both are great options. If I were you, I'd try both options for a few hours with your use case and pick the one that feels better.

0 - https://www.definite.app/blog/target-ducklake

saisrirampur6mo ago

ClickHouse’s bread and butter is real-time analytics for customer-facing applications, which often come with demanding concurrency and latency requirements.

1 more reply

oulipo26mo ago

Nice, what would be your typical setup?

You keep like 1 year's worth of data in your "business database", and then archive the rest in S3 with parquet and query with DuckDB ?

Do you have kind of "starter project" for a Postgres + DuckLake integration that I could look at to see how it's used in practice, and how it makes some operations easier?

1 more reply

saisrirampur6mo ago

Great question! If you’re starting a greenfield application, pg_clickhouse makes a lot of sense since you’ll be using a unified query layer for your application.

oulipo26mo ago

Nice! Right now I'm using Timescaledb, do you think it makes sense to move to a Postgres+CH setup instead? or only if I hit the limit of timescaledb?

Also what would be the benefit for me of querying clickhouse from Postgres, rather than directly through my backend via an ORM/SDK? is that because it would allow me to do JOINs?

1 more reply

onedognight6mo ago· 6 in thread

The name of the project is a reference to P. G. Wodehouse[0] for those unaware.

[0] https://www.gutenberg.org/ebooks/author/783

sevg6mo ago

Hmm, no.

It’s just like all the other postgres extensions named “pg_foo”, and the clear and obvious choice for “foo” in this case is “clickhouse”.

Unless this is some bad joke that has flown over my head.

justtheory6mo ago

I will never un-see it now, tbh

__s6mo ago

definitely a joke, not even that bad

DetroitThrow6mo ago

"I am never wrong, sir" -onedognight

yayitswei6mo ago

That was my first impression as well.

justtheory6mo ago

LOL

tempest_6mo ago· 4 in thread

This is nice because there are a lot of clickhouse fdw implementations and none of them are well maintained from what I can tell.

saisrirampur6mo ago

DetroitThrow6mo ago

>All with the goal of enabling users to build fast analytics from the Postgres layer itself but still using the power of ClickHouse!

That would be incredible! So many times I want to reach for ClickHouse but whatever company I'm at has so much inertia built into PG. Pleease add CTE support.

And yes I'm aware of PeerDB or whatever that project is called. This is still or even more helpful.

2 more replies

kiwicopple6mo ago

(Note: we work closely with the clickhouse team so this is not to intended to detract from their launch, simply to point out maintained options.)

Our CH wrapper is actively maintained, with push down, parameterized views, and async streaming: https://supabase.github.io/wrappers/catalog/clickhouse/

We see a lot of companies choosing CH with PG - it’s fantastic

saisrirampur6mo ago

Very excited to continue working closely to further integrate these amazing open source database technologies and make it easier for users. :)

graovic6mo ago· 4 in thread

This is pretty good. It will allow us to use PostgREST as an API endpoint to query the ClickHouse database directly

saisrirampur6mo ago

Good idea! Btw, ClickHouse does provide a HTTP interface directly, too! https://clickhouse.com/docs/interfaces/http

justtheory6mo ago

Ooh, neat idea!

oulipo26mo ago

lillecarl6mo ago

2 more replies

N_Lens6mo ago· 1 in thread

At this stage it may be possible to build one's entire application stack inside of postgres extensions.

justtheory6mo ago

Yes. YES! That's the idea. :-)

Olshansky6mo ago

Added to https://github.com/Olshansk/postgres_for_everything.

j / k navigate · click thread line to collapse