[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o
DuckLake can do things that pg_lake cannot do with Iceberg, and DuckDB can do things Postgres absolutely can't (e.g. query data frames). On the other hand, Postgres can do a lot of things that DuckDB cannot do. For instance, it can handle >100k single row inserts/sec.
Transactions don't come for free. Embedding the engine in the catalog rather than the catalog in the engine enables transactions across analytical and operational tables. That way you can do a very high rate of writes in a heap table, and transactionally move data into an Iceberg table.
Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
There's also the interoperability aspect of Iceberg being supported by other query engines.
DuckLake already has data-inlining for the DuckDB catalog, seems this will be possible once it's supported in the pg catalog.
> Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
This is true, but it's not clear where I'd use this in practice. e.g. if I need to run a complex ETL job, I probably wouldn't do it in pg_cron.
it’s ok in dev/test and for me as the person in the team who’s enamored with duckdb, but it’s made the team experience challenging and so i’ve just kinda reverted to hive partitioned parquet files with a duckdb file that has views created on top of the parquet. attach that file as read only and query away.
i may work up a full example to submit as an issue but up until now too may other things are dominating my time.
When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”
Crunchy’s Postgres extension is by far the most ahead solution in the market.
Huge congrats to Snowflake and the Crunchy team on open sourcing this.
The trap you end up in is you have to pay snowflake to access your data, iceberg and other technology help with the walled garden.
Not just snowflake, any pay on use provider.
(Context - have spent 5+ years working with Snowflake, it's great, have built drivers for various languages, etc).
It's great that I can run this locally in a Docker container, I'd love to be able to run a managed instance on AWS billed through our existing Snowflake account
Some service writes a lot of data in parquet files stored on S3 (e.g. logs), and now you want that data to be queryable from your application as if it was in postgres (e.g. near real-time analytics dashboard). pg_lake allows you to load these parquet files into postgres and query the data. You can also join that data with existing tables in postgres.
- https://github.com/smithclay/otlp2parquet (shameless plug, based on Clickhouse's Otel schema) - https://github.com/Mooncake-Labs/moonlink (also has OTLP support) - https://github.com/open-telemetry/otel-arrow (official community project under early dev)
And this is where postgres does not cut it.
You need some more CPU and RAM than what you pay for in your postgres instance. I.e. a distributed engine where you don't have to worry about how big your database instance is today.
Your compute asks Postgres “what is the current data for these keys?” Or “what was the current data as of two weeks ago for these keys?” And your compute will then download and aggregate your analytics query directly from the parquet files.
My understanding is the opposite - PG cuts it as a compute layer for small amounts of data, and this is where it excels.
I also assume `pg_lake` was built mainly with the intention of creating/writing tables, and the ability to read comes "for free" as an extra, since Iceberg integration is already written.
Under "Everything is a file", you can read or manipulate a wide variety of information via simple, open/read/write() APIs. Linux provides APIs to modify system settings via filesystem. Get the screen brightness setting? `cat /sys/class/backlight/device0/brightness`. Update? `echo 500 > /sys/class/backlight/device0/brightness`. No need for special APIs, just generic file operations and the kernel handles everything.
FUSE (Filesystem in Userspace) provides even more flexibility by allowing user space programs to build their own drivers that handle any data operation via the filesystem. You can mount remote systems (via SSH) and google drive, and copying files is as easy as `cp /mnt/remote/data/origin /mnt/googledrive/data/`. Or using unique FUSE like pgfuse and redisfs, updating redis value by postgres DB data is just `cat /mnt/postgres/users/100/full_name > /mnt/redis/user_100_full_name`.
But filesystems are only good for hierarchical data while a lot of real world data is relational. Many FUSE software tries hard to represent inherently non-hierarchical data in a filesystem. Data lake allows to use SQL, the elegant abstraction for relational data, across different underlying data sources. They can be physically distant and have different structures. A lot of real world applications are just CRUD on relational data. You can accomplish much more much easier if those data are just a big single database.
Video of their SVP of Product talking about it here: https://youtu.be/PERZMGLhnF8?si=DjS_OgbNeDpvLA04&t=1195
Data platforms like Snowflake are built as a central place to collect your organisation's data, do governance, large scale analytics, AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server.
Pg_lake foremost targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, and accidental Postgres data warehouses where you ended up overloading your server with slow analytical queries, but you still want to keep using Postgres.
If it's anything like super base, your question the existence of God when trying to get it to work properly.
You pay them to make it work right.
The configuration mainly involves just defining the default iceberg location for new tables, pointing it to the pgduck_server, and providing the appropriate auth/secrets for your bucket access.
What are the deployment implications if one wants to integrate this in production? Surely they'd need a much larger Postgres machine at a minimum.
Is there concern re: "hot neighbour" problems if the DuckDB queries get too heavy? How is that sort of issue potentially handled? I understood from another query that DuckDB is ran in a separate process, so there is room to potentially throttle it
(1) Are there any plans to make this compatible with the ducklake specification? Meaning: Instead of using Iceberg in the background, you would use ducklake with its SQL tables? My knowledge is very limited but to me, besides leveraging duckdb, another big point of ducklake is that it's using SQL for the catalog stuff instead of a confusing mixture of files, thereby offering a bunch of advantages like not having to care about number of snapshots and better concurrent writes.
(2) Might it be possible that pg_duckdb will achieve the same thing in some time or do things not work like that?
(2) In principle, it's a bit easier for pg_duckdb to reuse the existing Ducklake implementation because DuckDB sits in every Postgres process and they can call into each other, but we feel that architecture is less appropriate in terms resource management and stability.
With DuckLake, the query frontend and query engine are DuckDB, and Postgres is used as a catalog in the background.
With pg_lake, the query frontend and catalog are Postgres, and DuckDB is used as a query engine in the background.
Of course, they also use different table formats (though similar in data layer) with different pros and cons, and the query frontends differ in significant ways.
An interesting thing about pg_lake is that it is effectively standalone, no external catalog required. You can point Spark et al. directly to Postgres with pg_lake by using the Iceberg JDBC driver.
Once those tables exist, queries against them are able to either push down entirely to the remote tables and uses a Custom Scan to execute and pull results back into postgres, or we transform/extract the pieces that can be executed remotely using a FDW and then treat it as a tuple source.
In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.
And a common permissioning/datasharing layer so I can share data to external and internal parties who can in turn bring their own compute to make their own latency choices.
That said, don't sleep on the "this is awesome" parts in this project... my personal favorite is the automatic schema detection:
``` CREATE TABLE my_iceberg_table () USING iceberg WITH (definition_from = 's3://bucket/source_data.parquet'); ```
This announcement seems huge to me, no?!
Is this really an open source Snowflake covering most use cases?
I think this is a pretty big deal, though.
Snowflake does a lot more, though, especially around sharing data across company boundaries.
https://youtu.be/HZArjlMB6W4?si=BWEfGjMaeVytW8M1
Also, nicer recording from POSETTE: https://youtu.be/tpq4nfEoioE?si=Qkmj8o990vkeRkUa
It comes down to the trade-offs made by operational and analytical query engines being fundamentally different at every level.
Additionally, the postgres extension system supports most of the current project, so wouldn't say it was forced in this case, it was a design decision. :)
I'll see if we can improve the docs or highlight that part better, if it is already documented—we did move some things around prior to release.
> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
- Separation of concerns, since with a single external process we can share object store caches without complicated locking dances between multiple processes. - Memory limits are easier to reason about with a single external process. - Postgres backends end up being more robust, as you can restart the pgduck_server process separately.
> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
We've had this discussion like a week ago about how stupid is to use filesystem for this kind of data storage and here we go again. Actually i had to implement this "idea" in practice. What a nonsense.