Pg_lakehouse: Query Any Data Lake from Postgres

(github.com)

171 points · landingunless · 2y ago · 72 comments

kiwicopple · 2y ago · 7 in thread

Neat that you plan to support both Delta Lake and Apache Iceberg.

I'm curious about HN's position on these two formats. I'm having a hard time deciphering which might be the industry winner (or perhaps they both have a place, no "winner" necessary).

retakeming · 2y ago

This is anecdotal, but I feel that we (ParadeDB) have received more requests for Iceberg integration vs. Delta Lake. We were actually hesitant to launch pg_lakehouse without Iceberg support, but pulled the trigger on it because the iceberg-rust crate is still in its early days. We will probably be contributing to iceberg-rust to make it work with pg_lakehouse.

lukekim · 2y ago

Also anecdotal, but we (Spice AI) see more requests for Iceberg, but in practice more deployments of Delta Lake.

kiwicopple · 2y ago

> We will probably be contributing to iceberg-rust to make it work with pg_lakehouse

That's great news, thanks for your contributions to open source (here, and all the other extensions)

slap_shot · 2y ago

There isn't a winner and there likely won't be one (at least not for a long time). Tabular will likely be acquired by Snowflake, and the two industry behemoths now back their own formats; each will treat its own as a first-class citizen.

philippemnoel · 2y ago

Agreed, this is why we want to support both. Maybe even Apache Hudi down the line. But I hope the industry converges on a main standard rather than Snowflake/Databricks fighting for their own formats. They can differentiate on much more meaningful features.

kcirerick · 2y ago

I'm also building in the lakehouse space and anecdotally have seen more excitement around Iceberg than Delta Lake, just because of its completely open source origins. Iceberg has evolved faster and has had more contributions from a more diverse set of contributors than Delta Lake. Not sure if this will change with a Snowflake <> Tabular acquisition, but I'd easily bet on Iceberg if current trends continue.

philippemnoel · 2y ago

We agree. We plan to bring Iceberg support as a first-class citizen as soon as we can, but unfortunately the support in Rust these days is still limited. We and the community are working on it.

whalesalad · 2y ago · 6 in thread

How many folks here struggle to adopt tooling like this because it isn't possible to add Postgres extensions to places like RDS?

philippemnoel · 2y ago

We're working on getting our extensions approved on as many platforms as possible.

dewey · 2y ago

Yep, that usually dampens my excitement pretty quickly after seeing a new extension. You'd also not know if it's available on new versions. Sometimes you can install them "manually", like Supabase audit, but more often that's not possible.

oulu2006 · 2y ago

Me.

I both love and hate RDS - the costs are exorbitant.

I have to set up a Kafka pipeline streaming CDC events to replicate the data into other DB types to get the benefits of things like this.

pid-1 · 2y ago

Moreover, even when extensions are supported by RDS, they often make upgrading database versions a PITA.

oulu2006 · 2y ago

It's gotten significantly easier lately; I upgraded from v11 -> v15 with almost no downtime.

thenaturalist · 2y ago

Not familiar with the process, how do they make it a PITA?

nikita · 2y ago · 4 in thread

This is great work! Could you please comment on your choice of license? Most Postgres extensions that achieve wide adoption use the PostgreSQL, MIT, or Apache license.

philippemnoel · 2y ago

All ParadeDB extensions are released under AGPL-3.0. We've found that it strikes the right balance between being open-source and enabling the community to adopt for free, while also protecting us from hyperscalers and enabling us to build a sustainable business. Perhaps the topic of a blog post someday :)

ahachete · 2y ago

I applaud the decision to use AGPL-3.0.

For me, it's a license that provides forward guarantees to the Community: no proprietary forks can happen, so any fork will be an OSS fork from which the upstream project may benefit too, which benefits all users.

That's the reason we chose this license for StackGres [1], another project in the Postgres space.

[1]: https://stackgres.io

nikita · 2y ago

It looks like hyperscalers can still host it as long as they publish their changes to the source code? Am I reading the license right?

francoismassot · 2y ago

Well, MongoDB was under AGPL v3.0 :)

epsilonic · 2y ago · 4 in thread

How does this compare to Hydra? https://www.hydra.so/

philippemnoel · 2y ago

You can see a performance comparison to Hydra on ClickBench: https://benchmark.clickhouse.com/ by selecting ParadeDB and Hydra. Tl;dr: It is much faster.

From a feature-set perspective, in addition to querying local disk, we can query remote object stores (S3, GCS, etc.) and table formats (Delta Lake, soon Iceberg too).

From a code perspective, we're written in Rust on top of open-source standards like OpenDAL and DataFusion, while Hydra is their own codebase built from a fork of Citus columnar, in C.

Hydra is a cool project. Hope this helps! :)

sgt · 2y ago

And when will you have GCS storage ready? I saw on the website that it is not yet available.

epsilonic · 2y ago

Thanks for the prompt response, the support for OpenDAL is amazing!

koolba · 2y ago

That is one hell of a logo!

nathanwallace · 2y ago · 2 in thread

Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc with other tables. (Disclaimer - I'm a lead on the project.)

1 - https://github.com/turbot/steampipe

snthpy · 2y ago

I like steampipe but found the use of postgres a bit heavy for my use cases.

Could you make it run with pg_lite in wasm or DuckDB?

nathanwallace · 2y ago

In addition to Postgres FDWs, Steampipe plugins are also available as a SQLite extension [1] or a CLI export tool [2] for lighter weight use cases. (Although a local Postgres has a surprisingly small footprint!) Building plugins as DuckDB extensions would be cool too, but we haven't done that yet.

1 - https://steampipe.io/blog/2023-12-sqlite-extensions

2 - https://steampipe.io/blog/2023-12-steampipe-export

arduanika · 2y ago · 2 in thread

The name seems to be an allusion to the author P.G. Wodehouse, creator of the character Jeeves.

https://en.wikipedia.org/wiki/P._G._Wodehouse

Very clever naming!

pas · 2y ago

Sorry, what do you base that on? To me it just seems like a straightforward inspiration from the "data lake" -> "lakehouse" terminology that Databricks started (?) using.

https://www.databricks.com/product/data-lakehouse

edit: ah, but in a different comment someone noted that it's not actually a lakehouse, so who knows!? :)

arduanika · 2y ago

Based on pure speculation. I may be reaching.

My best guess is that Databricks and Pg_lakehouse both independently coined "lakehouse" from "data lake", and that for the latter team, it was partly a pun on Wodehouse. But the creators are welcome to chime in and confirm/deny!

(Or to say, like, "Sure...uh, we totally meant that...yes we are very literary.")

nikita · 2y ago · 2 in thread

I have another question. So far on the ClickBench leaderboard it's 15x slower than the baseline. The number 1 place is 1.67x slower than the baseline.

I assume that's DataFusion speed. What's the plan to improve upon it?

retakeming · 2y ago

Could you clarify which result you're referring to as the baseline and "number 1 place"?

I should clarify that our published ClickBench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because (1) there is no overhead from Postgres transactions/MVCC, since pg_analytics used the table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) it uses the latest release of DataFusion.

The performance differences that exist between DataFusion and other OLAP engines are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.
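
A rough sketch of how a relative score such as "15x the baseline" can be derived, assuming the approach ClickBench describes: the baseline is the fastest time per query across the compared systems, each ratio gets a small constant shift so that very fast queries do not dominate, and the ratios are combined with a geometric mean. The 10 ms shift, the function names, and the timings below are illustrative assumptions, not ClickBench's code or data.

```python
from math import exp, log

# Assumed constant (in seconds) added to both sides of each ratio.
SHIFT_SECONDS = 0.01


def best_per_query(timings_by_system):
    """Return the fastest time for each query across all systems (the baseline)."""
    return [min(times) for times in zip(*timings_by_system.values())]


def relative_score(system_times, baseline_times, shift=SHIFT_SECONDS):
    """Geometric mean of the shifted per-query ratios system / baseline."""
    if not system_times or len(system_times) != len(baseline_times):
        raise ValueError("need the same non-zero number of queries on both sides")
    ratios = [(s + shift) / (b + shift) for s, b in zip(system_times, baseline_times)]
    return exp(sum(log(r) for r in ratios) / len(ratios))


# Hypothetical per-query timings in seconds for two made-up systems.
timings = {
    "system_a": [0.10, 2.0, 0.5],
    "system_b": [0.20, 1.0, 0.6],
}
baseline = best_per_query(timings)
for name, times in timings.items():
    print(name, round(relative_score(times, baseline), 2))
```

Under this scheme no real system has to score exactly 1.0, because the baseline is stitched together from the best result for each query; that would explain how a first-place entry can still sit above 1x.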

riku_iki · 2y ago

Would be great to also see new pg_lakehouse and DataFusion benchmark results here: https://duckdblabs.github.io/db-benchmark/

Currently DataFusion is either much slower than DuckDB or OOMing.

mustafabal · 2y ago · 2 in thread

Very nice addition! Do you plan to support Snowflake as an object store in the near future? It's not currently in pg_lakehouse's README.

xuanwo · 2y ago

Hi, OpenDAL's maintainer here. I'm not sure what "Snowflake as an object store" means since Snowflake is a cloud data warehouse service and not intended for storage services.

philippemnoel · 2y ago

Snowflake is not in the list of supported stores on Apache OpenDAL, so likely not. It might not expose its storage APIs. I doubt users of Snowflake would want a separate query engine anyways.

sdairs · 2y ago · 2 in thread

Very cool!

Could you share the key difference between this and the previous pg_analytics, and the motivation for making it a separate plugin?

retakeming · 2y ago

Whereas pg_analytics stores the data in Postgres block storage, pg_lakehouse does not use Postgres storage at all.

This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.

It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to mess with Postgres MVCC, write ahead logs, transactions, etc.)

nitinreddy88 · 2y ago

If users already have a data-lake-style system that generates Parquet files, the case for using Postgres to query that data is questionable. I think the Postgres way of doing things should be prioritised if you want to keep your product in a unique position.

mcdonje · 2y ago · 1 in thread

Looks like pg as a replacement for Databricks SQL, which is already a query engine for data lakes. It's not a lakehouse, but it calls itself one. Seems like a cool and useful project, but the name is problematic.

retakeming · 2y ago

pg_house just wasn't as catchy!

In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.

tarasglek · 2y ago · 1 in thread

I am not up to date on the various lakes. Is this read-only? Are you able to init a lake from scratch?

What's the model to feed such a lake from some queue?

philippemnoel · 2y ago

For now it is read-only, but write support is coming soon. You can feed data via Kafka.

samber · 2y ago · 1 in thread

It seems very promising!

2 questions:

- do you distribute query processing over multiple pg nodes?

- do you store the metadata in PG, instead of a traditional metastore?

retakeming · 2y ago

Thanks!

1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get ~very~ far just by scaling up a single Postgres node.

2. The only information stored in Postgres is the options passed into the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).
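
Point 1 can be pictured as partial aggregation: split the rows into partitions, aggregate each partition on its own worker, then merge the partial results. The following is a minimal sketch of that split/merge shape only, not DataFusion's implementation. All names are hypothetical, and Python threads are used purely to show the structure (the GIL prevents a real multi-core speedup here).

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(partition):
    """Partial aggregate for one partition: (sum, row count)."""
    return sum(partition), len(partition)


def parallel_average(values, partitions=4):
    """Compute AVG(values) by merging per-partition partial aggregates."""
    if not values:
        raise ValueError("cannot average an empty input")
    size = -(-len(values) // partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    # Final merge step: combine partial sums and counts, then divide once.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(parallel_average(list(range(1, 101))))
```

Keeping the sum and the count separate until the final merge is what makes the average correct when partitions differ in size; averaging the per-partition averages would not be.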

brunoqc · 2y ago · 1 in thread

Nice. I wish TimescaleDB open-sourced their S3 storage thing.

philippemnoel · 2y ago

They've been moving more and more towards closed source over the years, which is a shame but I understand why. We don't offer time-series features today, but we're not ruling out adding support for it eventually if it is desired by our users.

hardwaresofton · 2y ago · 1 in thread

Yet another amazing postgres plugin made possible by pgrx (https://github.com/pgcentralfoundation/pgrx)

It's really crazy how some projects just instantly enable a whole generation of new possibilities.

If you are impressed by this and want to build something like it -- check out pgrx, it's a pretty great experience.

philippemnoel · 2y ago

pgrx is indeed wonderful and we would not be able to do our work without it. Big kudos to Eric, Jubilee and the rest of the team!

ahachete · 2y ago

The (internal) use of DataFusion to create new, powerful extensions for Postgres is a very clever idea. Very good work by the ParadeDB team.

I like this one very much. Very simple way to avoid having to use a different set of tools and query languages (or more limited query languages) to query lakes.

tehlike · 2y ago

ParadeDB is doing a lot of good work with Postgres. pg_analytics, and now pg_lakehouse...

jeadie · 2y ago

This looks functionally similar to using http://github.com/spiceai/spiceai with a PostgreSQL data accelerator.

yrashk · 2y ago

As somebody who writes a lot of Postgres extensions, I can say this is quite interesting!

I think I can see some parallels to Supabase's wrappers project.

Keep up the good work!

q9tE6uHb7yKq · 2y ago

looks interesting!
