I'm curious about HN's position on these two formats. I'm having a hard time deciphering which might be the industry winner (or perhaps they both have a place, no "winner" necessary).
That's great news, thanks for your contributions to open source (here, and all the other extensions)
I both love and hate RDS - the costs are exorbitant.
I have to set up a Kafka pipe streaming CDC events to replicate the data into other DB types to get the benefits of things like this.
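For anyone who hasn't built one of these, the consuming end of such a pipe might look roughly like the sketch below. It assumes Debezium-style change events (an "op" code plus "before"/"after" row images); the comment above doesn't name a CDC tool, so the event shape, the `apply_change` helper, and the `id` key column are illustrative assumptions. The Kafka consumer loop itself is left out.

```python
# Sketch only: assumes Debezium-style change events, where "op" is
# "c" (create), "u" (update), "d" (delete), or "r" (snapshot read), and
# "before"/"after" hold row images. The event shape and the "id" key
# column are assumptions, not something stated in the thread.

def apply_change(replica, event, key="id"):
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        replica[row[key]] = row  # upsert, so replaying an event is harmless
    elif op == "d":
        replica.pop(event["before"][key], None)  # tolerate repeated deletes
    else:
        raise ValueError(f"unknown op: {op!r}")
    return replica


# Hypothetical events, in the order a consumer would read them from a topic.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}},
    {"op": "d", "before": {"id": 1, "name": "b"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

In a real pipeline the replica would be the target database rather than a dict, and this simplified version does not handle an update that changes the primary key.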
For me, it's a license that provides forward guarantees to the Community: no proprietary forks can happen, so any fork will be an OSS fork from which the upstream project may benefit too, which benefits all users.
That's the reason we chose this license for StackGres [1], another project in the Postgres space.
[1]: https://stackgres.io
From a feature-set perspective, in addition to querying local disk, we can query remote object stores (S3, GCS, etc.) and table format providers (Delta Lake, soon Iceberg too).
From a code perspective, we're written in Rust on top of open-source standards like OpenDAL and DataFusion, while Hydra is their own codebase built from a fork of Citus columnar, in C.
Hydra is a cool project. Hope this helps! :)
Could you make it run with pg_lite in wasm or DuckDB?
1 - https://steampipe.io/blog/2023-12-sqlite-extensions
2 - https://steampipe.io/blog/2023-12-steampipe-export
https://en.wikipedia.org/wiki/P._G._Wodehouse
Very clever naming!
https://www.databricks.com/product/data-lakehouse
edit: ah, but in a different comment someone noted that it's not actually a lakehouse, so who knows!? :)
My best guess is that Databricks and Pg_lakehouse both independently coined "lakehouse" from "data lake", and that for the latter team, it was partly a pun on Wodehouse. But the creators are welcome to chime in and confirm/deny!
(Or to say, like, "Sure...uh, we totally meant that...yes we are very literary.")
I assume that's DataFusion speed. What's the plan to improve upon it?
I should clarify that our published Clickbench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because (1) there's no overhead from Postgres transactions/MVCC, since pg_analytics used the table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) it uses the latest release of DataFusion.
The performance differences that exist between DataFusion and other OLAP engines are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.
Currently DataFusion is much slower than DuckDB, or it OOMs.
Could you share the key differences between this and the previous pg_analytics, and the motivation for making it a separate plugin?
This makes it a much simpler (and in our opinion, more elegant) extension. We learned that many of our users already stored their Parquet files in S3, so it made sense to connect directly to S3 rather than asking them to ingest those Parquet files into Postgres.
It also accelerates the path to production readiness, since we're not touching Postgres internals (no need to mess with Postgres MVCC, write ahead logs, transactions, etc.)
In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.
What's the model to feed such a lake from some queue?
2 questions:
- do you distribute query processing over multiple pg nodes?
- do you store the metadata in PG, instead of a traditional metastore?
1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get ~very~ far just by scaling up a single Postgres node.
2. The only information stored in Postgres is the options passed into the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).
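To make point 2 concrete, here is a rough sketch of what that stored metadata amounts to: a column list plus a set of key/value options on a foreign table. The DDL shape (CREATE FOREIGN TABLE ... SERVER ... OPTIONS) is standard Postgres, but the table name, server name, column list, option keys, and bucket path below are made-up placeholders, not confirmed pg_lakehouse identifiers.

```python
# Sketch only: the table, server, column, and option names used in the
# example at the bottom are hypothetical placeholders. Only the general
# CREATE FOREIGN TABLE ... SERVER ... OPTIONS (...) shape is standard Postgres.

def quote_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def options_clause(options):
    """Render a Postgres OPTIONS (...) clause from key/value pairs."""
    rendered = ", ".join(
        f"{key} {quote_literal(val)}" for key, val in options.items()
    )
    return f"OPTIONS ({rendered})"


def foreign_table_ddl(table, columns, server, options):
    """Build a CREATE FOREIGN TABLE statement from a schema and options."""
    cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns)
    return (
        f"CREATE FOREIGN TABLE {table} (\n{cols}\n)\n"
        f"SERVER {server}\n"
        f"{options_clause(options)};"
    )


# Everything Postgres would need to remember for this hypothetical table:
# the schema and the options. The Parquet data itself stays in object storage.
example = foreign_table_ddl(
    table="trips",
    columns=[("vendor_id", "INT"), ("fare", "DOUBLE PRECISION")],
    server="example_server",
    options={"path": "s3://example-bucket/trips.parquet"},
)
print(example)
```

The helper only builds a string; identifiers are interpolated as-is, so it is meant for illustration rather than for untrusted input.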
It's really crazy how some projects just instantly enable a whole generation of new possibilities.
If you are impressed like this and want to build something like it -- check out pgrx, it's a pretty great experience.
I like this one very much. Very simple way to avoid having to use a different set of tools and query languages (or more limited query languages) to query lakes.
I think I can see some parallels to Supabase's wrappers project.
Keep up the good work!