At first glance, Delta Lake seems inferior to a database. Most databases support multi-table transactions, while Delta Lake only supports transactions on a single table. ACID transaction support is nothing new for a database.
Delta Lake is useful for large datasets and to keep costs low.
There are organizations that ingest hundreds of terabytes, or even petabytes, of data into a Delta table every day. They're able to ingest data, perform upserts, and build realtime pipelines with this architecture.
Delta Lake is also free, so you only have to pay for storing the files in the cloud. That is usually a lot cheaper than a database.
Data warehouses are often packaged with a certain amount of shared RAM/storage. This can be a problem for a team with large workflows from many users. It's annoying to share compute with someone that's running a large experiment.
These are the main reasons enterprises shifted to data lakes and now Lakehouse storage systems. See this paper to learn more: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
Traditional RDBMSes just don’t scale as well as S3. But S3 didn’t have ACID semantics. Now it does!
Yes, it is basically just another relational database system. -but-, it's a database system that's optimized for a different purpose.
A traditional RDBMS is designed for OLTP workloads, and it does a great job of that. Ideally operations are small, discrete, and handled within milliseconds. In service of that speed, you also want to keep them small and lean, so that you can take maximum advantage of caching hot data in memory. Maybe on the megabytes-to-gigabytes scale.
A data warehouse is designed for more OLAP-style workloads, but the emphasis is still on real-time responses to relatively predictable requests. But it's at the more relaxed end of the "real-time" scale - a query might take a few seconds to run. You'll use extract-transform-load jobs to get the data organized into a structure that's optimized for those workloads before you load it into the warehouse. Data volumes still matter here, but they can be allowed to get quite a bit bigger than what's typical in OLTP databases. Think gigabytes-to-terabytes scale.
Lakehouses, on the other hand, are meant for more of a "get the data somewhere, and then figure out how to use it" mindset. So getting the data into it follows more of an extract-load-transform regime, meaning that significant processing and transformation of the data happens in the course of executing the query itself. The kinds of questions you want to ask are almost unconstrained, and that changes the performance situation again. Millisecond response times are now something that just never happens. Instead you're looking at seconds to minutes, perhaps even hours, being typical execution times for a query. The data also gets bigger again. People often suggest it's potentially on the terabytes-to-petabytes scale, but I haven't seen that myself. Mostly because I've never worked anywhere where anyone even wants to have that much data sitting around to have to manage and govern.
I would say don't get caught up too much on the scale consideration, though. That's real, but I think that the more interesting distinction, and the one that explains why OLTP systems and data warehouses are often implemented using the same RDBMS systems, while lakehouses really do merit a completely different tech stack, is the ETL vs ELT distinction.
Converting a Parquet table to a Delta table is an in-place, cheap computation. You can just add the Delta Lake metadata to an existing Parquet table and then take advantage of transactions and other features. I don't think it's a meaningless comparison.
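To illustrate why the conversion is cheap, here is a toy sketch in plain Python, not the real converter. It only writes a small log file that lists the existing Parquet files and leaves the data files untouched. The function name `convert_in_place` is made up for this example, and the JSON fields are loosely modeled on a Delta commit and heavily simplified, so don't expect a real Delta reader to accept the output.

```python
import json
import os


def convert_in_place(table_dir, schema_string):
    """Toy conversion: write a first commit that lists the existing Parquet files.

    Hypothetical helper for illustration only. No data file is read or rewritten;
    the only I/O is a directory listing and one small JSON file.
    """
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)

    # Simplified actions; a real commit carries more fields than shown here.
    actions = [
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {"schemaString": schema_string, "partitionColumns": []}},
    ]
    for name in sorted(os.listdir(table_dir)):
        if name.endswith(".parquet"):
            size = os.path.getsize(os.path.join(table_dir, name))
            actions.append({"add": {"path": name, "size": size, "dataChange": True}})

    # Commit files are zero-padded version numbers; version 0 is the first commit.
    commit_path = os.path.join(log_dir, "00000000000000000000.json")
    with open(commit_path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return commit_path
```

The cost grows with the number of files, not with the amount of data in them, which is the point being made about in-place conversion.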
Iceberg is cool too.
Hudi is nice, but they are in the middle of a big format change right now.
Iceberg is nice, but it is the most conservative and slowest format of the three.
The Delta Rust implementation is missing more table features, but we're closing the gap fast. We just added support for constraints to Delta Rust and are working on change data feed right now.
If that were true, Snowflake would not be as fast on Iceberg/Parquet as its native format. The engine makes something fast or slow, not the table format.
Disclaimer - am at Snowflake.
We tested all three of them using Spark batches that converted a stream of changes into SCD2.
• the branching and tagging DDL (https://iceberg.apache.org/docs/latest/spark-ddl/#branching-...)
• Iceberg Procedures (https://iceberg.apache.org/docs/latest/spark-procedures/)
Also
> Parquet tables are OK when data is in a single file but are hard to manage and unnecessarily slow when data is in many files
This is not true. Having worked with Spark, I've found it's much better to have multiple "small" files than only one big file.
Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes. I've worked on data lakes that require file listing operations that run for hours. Key/value object stores aren't as good at listing files as Unix filesystems are.
When Spark reads the 40,000 Parquet files it needs to figure out the schema. By default, it'll just grab the schema from one of the files and just assume that all the others have the same schema. This could be wrong.
You can set an option telling Spark to read the schemas of all 40,000 Parquet files and make sure they all have the same schema. That's expensive.
Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns?
The schema in the Parquet footer is perfect for a single file. I think storing the schema in the metadata is much better when data is spread across many Parquet files.
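The three options above can be compared with a toy model in plain Python. Nothing here touches real Parquet files or Spark: `FakeParquetFile` and the three functions are hypothetical names for this sketch, and each function returns the schema plus the number of footers it had to read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeParquetFile:
    """Stand-in for a Parquet file; footer_schema mimics the schema in its footer."""
    path: str
    footer_schema: tuple  # e.g. (("id", "long"), ("name", "string"))


def schema_from_one_file(files):
    """Trust the first footer and assume the rest match (cheap, possibly wrong)."""
    return files[0].footer_schema, 1


def schema_from_all_files(files):
    """Read every footer and fail if they disagree (safe, expensive)."""
    schemas = {f.footer_schema for f in files}
    if len(schemas) != 1:
        raise ValueError(f"found {len(schemas)} different schemas")
    return schemas.pop(), len(files)


def schema_from_table_metadata(metadata):
    """Read the schema recorded once in table-level metadata (no footers read)."""
    return metadata["schema"], 0
```

With one stray file among thousands, the first strategy returns a schema without complaint, the second catches the mismatch only after reading every footer, and the third never opens a data file at all.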
Sounds like this data lake could use a Parquet file listing the Parquet files.
Butter
Per your blog: https://mungingdata.com/devrel/virtuous-content-cycle-develo...
"This post explains how to scale developer advocacy by creating content in a way that answers current user questions and makes it easier to generate additional content in the future"
As a lead for a team of developers who have used Parquet and are considering Iceberg for our next-gen stuff, you aren't "answering current user questions" about whether we should consider Delta Lake, at least for me. You are marketing to a past world.
Pointers from anyone on Delta vs Iceberg welcome.
I respect the Iceberg team & their work.
I've been shying away from that post cause I don't wanna start a flamewar, but I will reflect on this and reconsider. Thank you.
Delta Lake stores data in Parquet files and adds a metadata layer to provide support for ACID transactions, schema enforcement, versioned data, and full DML support. Delta Lake also offers concurrency protection.
This post explains all the features offered by Delta Lake in comparison to a plain vanilla Parquet data lake.
"Apache Hudi vs Delta Lake vs Apache Iceberg - Data Lakehouse Feature Comparison" https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-ap...
* Delta Lake supports merge-on-read via deletion vectors: https://delta.io/blog/2023-07-05-deletion-vectors/
* Why doesn't Delta Lake have efficient bulk load? Lots of the biggest datasets in the world are in Delta tables.
* Delta Lake definitely supports compaction: https://delta.io/blog/2023-01-25-delta-lake-small-file-compa...
* What does CLI support mean in the context of a Lakehouse storage system? You can open up a Spark shell or Python shell to interface with your Delta table. That's like saying "CSV doesn't have a CLI". I don't get it.
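On the merge-on-read point in the list above, a toy sketch in plain Python may help show the idea behind deletion vectors. It is not the Delta Lake implementation: the function names are made up, a "file" is just a list of rows, and the deletion vector is a plain Python set of row positions.

```python
def delete_copy_on_write(rows, predicate):
    """Copy-on-write: produce a whole new file without the matching rows."""
    return [row for row in rows if not predicate(row)]


def delete_merge_on_read(rows, deleted_positions, predicate):
    """Merge-on-read: leave the file alone and record which positions are deleted."""
    newly_deleted = {i for i, row in enumerate(rows) if predicate(row)}
    return deleted_positions | newly_deleted


def read_with_deletion_vector(rows, deleted_positions):
    """Readers skip the positions marked as deleted."""
    return [row for i, row in enumerate(rows) if i not in deleted_positions]
```

Both approaches give readers the same rows. The difference is that the merge-on-read delete writes a small set of positions instead of rewriting the data file, and pushes a little extra filtering work onto each read.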
I didn't do a detailed review of the post.
I think the website is here: https://delta.io
I did some keyword research and wrote this post cause lots of folks are doing searches for Delta Lake vs Parquet. I'm just trying to share a fair summary of the tradeoffs with folks who are doing this search. It's a popular post and that's why I figured I would share it here.