what would you recommend and why?
I understand part of the interest of pg_clickhouse is to be able to use "pre-existing Postgres queries" on an analytical database without having to change anything, so if I am building my database now and have no legacy, would pg_clickhouse make sense, or should I do analytics differently?
Also, would you have some kind of tutorial / sample setup of a typical business application in Postgres and kind of replication in clickhouse to make analytics queries? so I can see how Clickhouse would be typically used?
Picking the best solution for your concrete workload (and your future demands) should be equally important to the implementation effort, to avoid that you run into walls later on. At least as long as data volume, query complexity or concurrency scalability can be challenges.
Note that I work for the company that built timescale (Tiger Data). Clickhouse is cool though, just throwing another option into the ring.
Tbf in terms of speed Clickhouse pulls ahead on most benchmark, unless you want to join a lot with your postgres data directly then you might benefit from having everything in one place. And of course you avoid the sync overhead.
I've been really happy with DuckLake, happy to answer any questions about it.
DuckDB has always felt easier to use vs. Clickhouse for me, but both are great options. If I were you, I'd try both options for a few hours with your use case and pick the one that feels better.
ClickHouse’s bread and butter is real-time analytics for customer-facing applications, which often come with demanding concurrency and latency requirements.
Ack, totally makes sense that both are amazing technologies - you could try both and test them at the scale your real-time application may reach, and then choose the technology that best fits your needs. :)
You keep like 1 year's worth of data in your "business database", and then archive the rest in S3 with parquet and query with DuckDB ?
And if you want to sync everything, even "current data", to do datascience/analytics, can you just write the recent data (eg the last week of data or whatever) in S3 every hours/days to get relatively up-to-date data? And doesn't that cause the S3 data to grow needlessly (eg does it replace, rather than store an additional copy of recent data each hour?)
Do you have kind of "starter project" for a Postgres + DuckLake integration that I could look at to see how it's used in practice, and how it makes some operations easier?
Now, coming to your question about replication: you can use PeerDB (acquired by ClickHouse https://github.com/PeerDB-io/peerdb), which is laser-focused and battle-tested at scale for Postgres-to-ClickHouse replication. Once the data is replicated into ClickHouse, you can start querying those tables from within Postgres using pg_clickhouse. In ClickHouse Cloud, we offer ClickPipes for Postgres CDC/replication, which is a managed service version of PeerDB and is tightly integrated with ClickHouse. Now there could be non-transcational tables that you can directly ingest to ClickHouse and still query using pg_clickhouse.
So TL;DR: Postgres for OLTP; ClickHouse for OLAP; PeerDB/ClickPipes for data replication; pg_clickhouse as the unified query layer. We are actively working on making this entire stack tightly integrated so that building real-time apps becomes seamless. More on that soon! :)
Also what would be the benefit for me of querying clickhouse from Postgres, rather than directly through my backend via an ORM/SDK? is that because it would allow me to do JOINs?
What would be the typical setup if I want to JOIN analytical data (eg my IoT device readings) from CH with some business data (eg the user owning the device) from my Postgres? Would I replicate that business data to CH to do the join there, or would that be typically the exact use-case for pg_clickhouse?
It’s just like all the other postgres extensions named “pg_foo”, and the clear and obvious choice for “foo” in this case is “clickhouse”.
Unless this is some bad joke that has flown over my head.
Our main focus is comprehensive pushdown capabilities. It was very surprising to see how much the Postgres FDW framework has evolved over the years and the number and types of hooks it now provides for push down. This is why we decided to lean into FDW than build an extension bottoms up. But we may still do that within pg_clickhouse for a few features, wherever FDW framework becomes a restriction.
We’ve made notable progress over the last few months, including support for pushdown of custom aggregations and SEMI JOINs/basic subqueries. Fourteen of twenty-two TPCH queries are now fully pushdownable.
We’ll be doubling down to add pushdown support for much more complex queries, CTEs, window functions, and more. More on the future here - https://github.com/ClickHouse/pg_clickhouse?tab=readme-ov-fi... All with the goal of enabling users to build fast analytics from the Postgres layer itself but still using the power of ClickHouse!
That would be incredible! So many times I want to reach for ClickHouse but whatever company I'm at has so much inertia built into PG. Pleease add CTE support.
And yes I'm aware of PeerDB or whatever that project is called. This is still or even more helpful.
Our CH wrapper is actively maintained, with push down, parameterized views, and async streaming: https://supabase.github.io/wrappers/catalog/clickhouse/
We see a lot of companies choosing CH with PG - it’s fantastic
Very excited to continue working closely to further integrate these amazing open source database technologies and make it easier for users. :)