Delta Lake vs. Parquet: A Comparison

(delta.io)

32 points · MrPowers · 2y ago · 54 comments

37 comments · 10 top-level

querez · 2y ago · 6 in thread

I'm not well versed in these things, but at this point, aren't you re-inventing database systems? Talking about things like ACID transactions, schema evolution, dropping columns, ... in the context of a file-format feels bizarre to me.

MrPowers (OP) · 2y ago

Yep, it is re-inventing database systems and you raise a great question.

At first glance, it seems like Delta Lake is inferior to a database. Most databases support multi-table transactions and Delta Lake only supports transactions for a single table. ACID transaction support is nothing new for a database.

Delta Lake is useful for large datasets and to keep costs low.

There are organizations that are ingesting hundreds of terabytes and petabytes of data into a Delta table every day. They're able to ingest data, perform upserts, and build realtime pipelines with this architecture.

Delta Lake is also free, so you only have to pay for storing the files in the cloud. This is a lot cheaper than a database usually.

Data warehouses are often packaged with a certain amount of shared RAM/storage. This can be a problem for a team with large workflows from many users. It's annoying to share compute with someone that's running a large experiment.

These are the main reasons enterprises shifted to data lakes and now Lakehouse storage systems. See this paper to learn more: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

DandyDev · 2y ago

It’s not so bizarre if you realize that bringing ACID semantics to files lets you use the scalability of file/blob storage like S3 combined with DB-like access.

Traditional RDBMSes just don’t scale as well as S3. But S3 didn’t have ACID semantics. Now it does!

willvarfar · 2y ago

Snowflake, BigQuery, Firebolt, even Trino, offer a more-classic-DB-like interface to files hosted on classic cloud storage.

gmt2027 · 2y ago

This is exactly right. The lakehouse is a custom data warehouse you can build out of these cloud primitives to suit the specific data needs of an organisation. Think of it as a database scaled up by several orders of magnitude. Everything from storage costs to latency can be optimised as design choices. The common core in this architecture is data held in standard file formats such as parquet, delta tables, avro etc.

bunderbunder · 2y ago

Expanding on what others have already said:

Yes, it is basically just another relational database system. -but-, it's a database system that's optimized for a different purpose.

A traditional RDBMS is designed for OLTP workloads, and it does a great job of that. Ideally operations are small, discrete, and handled within milliseconds. In service of that speed, you also want to keep them small and lean, so that you can take maximum advantage of caching hot data in memory. Maybe on the megabytes-to-gigabytes scale.

A data warehouse is designed for more OLAP-style workloads, but the emphasis is still on real-time responses to relatively predictable requests. But it's at the more relaxed end of the "real-time" scale - a query might take a few seconds to run. You'll use extract-transform-load jobs to get the data organized into a structure that's optimized for those workloads before you load it into the warehouse. Data volumes still matter here, but they can be allowed to get quite a bit bigger than what's typical in OLTP databases. Think gigabytes-to-terabytes scale.

Lakehouses, on the other hand, are meant for more of a "get the data somewhere, and then figure out how to use it" mindset. So getting the data into it follows more of an extract-load-transform regime, meaning that significant processing and transformation of the data happens in the course of executing the query itself. The kinds of questions you want to ask are almost unconstrained, and that changes the performance situation again. Millisecond response times are now something that just never happens. Instead you're looking at seconds to minutes, perhaps even hours, being typical execution times for a query. The data also gets bigger again. People often suggest it's potentially on the terabytes-to-petabytes scale, but I haven't seen that myself. Mostly because I've never worked anywhere where anyone even wants to have that much data sitting around to have to manage and govern.

I would say don't get caught up too much on the scale consideration, though. That's real, but I think that the more interesting distinction, and the one that explains why OLTP systems and data warehouses are often implemented using the same RDBMS systems, while lakehouses really do merit a completely different tech stack, is the ETL vs ELT distinction.

ed_elliott_asc · 2y ago

Yep, that is the goal: to make the files in your data lake queryable like a database.

BadHumans · 2y ago · 4 in thread

Comparing Delta Lake to Parquet is a bit nonsensical, isn't it? It's like comparing Postgres to a zip file. After trying all of the major open table formats, Iceberg is the future in my opinion. Delta is great if you use Databricks, but otherwise I don't see a compelling reason to use it over Iceberg.

MrPowers (OP) · 2y ago

Lots of organizations have Parquet data lakes and are considering switching to Delta Lake.

Converting a Parquet table to a Delta table is an in-place, cheap computation. You can just add the Delta Lake metadata to an existing Parquet table and then take advantage of transactions and other features. I don't think it's a meaningless comparison.

Iceberg is cool too.
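
A rough sketch of why the in-place conversion mentioned above can be cheap: the existing data files are left untouched and only a log entry listing them is written. This is a toy model in plain Python, not the Delta Lake implementation. The `_toy_log` directory, the `convert_in_place` helper, and the JSON layout are made up for illustration, and the data files are empty placeholders rather than real Parquet.

```python
import json
import tempfile
from pathlib import Path


def convert_in_place(table_dir: Path) -> Path:
    """Write a toy commit file that lists the data files already in table_dir.

    No data file is read or rewritten; only metadata is added.
    """
    data_files = sorted(p.name for p in table_dir.glob("*.parquet"))
    log_dir = table_dir / "_toy_log"  # hypothetical name, not the real Delta layout
    log_dir.mkdir()
    commit = log_dir / "00000000.json"
    with commit.open("w") as fh:
        for name in data_files:
            # One JSON line per existing file, recorded as "added" in version 0.
            fh.write(json.dumps({"add": {"path": name}}) + "\n")
    return commit


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp)
        # Placeholder files standing in for an existing Parquet data lake.
        for i in range(3):
            (table / f"part-{i}.parquet").write_bytes(b"")
        commit = convert_in_place(table)
        lines = commit.read_text().splitlines()
        return [json.loads(line)["add"]["path"] for line in lines]


if __name__ == "__main__":
    print(demo())
```

The cost of `convert_in_place` depends on the number of files, not on how much data they hold, which is the point being made in the comment.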

BadHumans · 2y ago

There is no Parquet table. Parquet is a compressed file format like a zip. Parquet can be read into table formats like Hive, Delta, etc. That is why this comparison makes no sense.

bunderbunder · 2y ago

I fail to see what's nonsense about comparing an extension to a format to the format it extends.

CharlesW · 2y ago

It’s like saying, “Which is better, ISOBMFF or MPEG-4”? It’s comparing a format with an application of the format.

orthoxerox · 2y ago · 4 in thread

Delta is nice, but a lot of features are missing from the FOSS version.

Hudi is nice, but they are in the middle of a big format change right now.

Iceberg is nice, but is the most conservative and slowest format of the three.

MrPowers (OP) · 2y ago

There are a few features missing from the FOSS Scala/Spark implementation of Delta Lake, but I wouldn't say a lot. The FOSS version supports all the table features in the Delta Lake protocol.

The Delta Rust implementation is missing more table features, but we're closing the gap fast. We just added support for constraints to Delta Rust and are working on change data feed right now.

orthoxerox · 2y ago

Delta Live Tables and automatic vacuuming are the two big features I'm missing.

chimerasaurus · 2y ago

I’d take issue with the “Iceberg is slow” theme that Databricks in particular has tried to push.

If that were true, Snowflake would not be as fast on Iceberg/Parquet as its native format. The engine makes something fast or slow, not the table format.

Disclaimer - am at Snowflake.

orthoxerox · 2y ago

Back when we were choosing between the three formats about 1.5 years ago, Iceberg was definitely the slowest. If the situation has changed since then, I would love to see an updated comparison.

We tested all three of them using Spark batches that converted a stream of changes into SCD2.

fractaloop · 2y ago · 3 in thread

Iceberg (https://iceberg.apache.org) is an open source alternative to Delta Lake that I cannot recommend enough. It organizes your Parquet files (or other serialization formats) in a logical structure with snapshots to allow time travel and git-like semantics for data management and Write-Audit-Publish strategies. My favorite use recently is the idempotent change data capture to ease replication in the event of failures. When your publishing job fails, you can simply replay the same diff between two snapshots and pick up where you left off.
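
The "replay the same diff between two snapshots" idea can be sketched with a toy model. This is not Iceberg's API: `Snapshot`, `diff`, and `publish` are hypothetical names, and the sketch tracks whole files rather than rows, which is a simplification of real change data capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset


def diff(old: Snapshot, new: Snapshot) -> dict:
    """Files added and removed between two snapshots."""
    return {
        "added": sorted(new.files - old.files),
        "removed": sorted(old.files - new.files),
    }


def publish(target: set, change: dict) -> set:
    """Apply a diff to a downstream copy. Applying it twice gives the same result."""
    return (target | set(change["added"])) - set(change["removed"])


s1 = Snapshot(1, frozenset({"a.parquet", "b.parquet"}))
s2 = Snapshot(2, frozenset({"b.parquet", "c.parquet"}))
change = diff(s1, s2)

replica = set(s1.files)
once = publish(replica, change)
twice = publish(once, change)  # simulated retry after a failed publishing job
```

Because snapshots are immutable, the diff between two given snapshot IDs never changes, so a retry after a failure applies exactly the same change and leaves the replica in the same state.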

reactordev · 2y ago

https://iceberg.apache.org is the correct link.

Eridrus · 2y ago

Can you share some references to the git-like semantics? I couldn't find much about merging the branched tables.

fractaloop · 2y ago

AFAIK, it’s limited to fast-forward merge strategies, but you can also create or replace branches and tags, along with cherry-picking snapshots. Additional information can be found in:

• the branching and tagging DDL (https://iceberg.apache.org/docs/latest/spark-ddl/#branching-...)

• Iceberg Procedures (https://iceberg.apache.org/docs/latest/spark-procedures/)

alexmolas · 2y ago · 3 in thread

Isn't delta lake using parquet files? I don't understand the comparison.

Also

> Parquet tables are OK when data is in a single file but are hard to manage and unnecessarily slow when data is in many files

This is not true. Having worked with Spark it's much better to have multiple "small" files than only one big file.

MrPowers (OP) · 2y ago

Yea, Spark works best with "right-sized" files.

Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes. I've worked on data lakes that require file listing operations that run for hours. Unlike Unix filesystems, key/value stores aren't good at listing files.

When Spark reads the 40,000 Parquet files it needs to figure out the schema. By default, it'll just grab the schema from one of the files and just assume that all the others have the same schema. This could be wrong.

You can set an option telling Spark to read the schemas of all 40,000 Parquet files and make sure they all have the same schema. That's expensive.

Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns?

The schema in the Parquet footer is perfect for a single file. I think storing the schema in the metadata is much better when data is spread across many Parquet files.
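
A toy comparison of the two approaches described above: checking the schema in every file's footer versus reading it once from table-level metadata. This is plain Python, not Spark or Delta Lake. The `.toy` files, `_table_meta.json`, and the helper names are invented for the sketch, and each "file" only stores a JSON stand-in for a Parquet footer.

```python
import json
import tempfile
from pathlib import Path

SCHEMA = {"id": "long", "name": "string"}


def write_toy_file(path: Path, schema: dict) -> None:
    # Stand-in for a Parquet file: we only store the "footer" schema as JSON.
    path.write_text(json.dumps({"schema": schema}))


def schema_from_every_footer(table: Path) -> tuple[dict, int]:
    """List the directory, open every file, and verify the schemas agree."""
    reads = 0
    seen = None
    for path in sorted(table.glob("*.toy")):
        footer = json.loads(path.read_text())["schema"]
        reads += 1
        if seen is None:
            seen = footer
        elif footer != seen:
            raise ValueError(f"schema mismatch in {path.name}")
    return seen, reads


def schema_from_table_metadata(table: Path) -> tuple[dict, int]:
    """Read the schema once from a single table-level metadata file."""
    meta = json.loads((table / "_table_meta.json").read_text())
    return meta["schema"], 1


def demo(num_files: int = 40) -> tuple[int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp)
        for i in range(num_files):
            write_toy_file(table / f"part-{i:05d}.toy", SCHEMA)
        (table / "_table_meta.json").write_text(json.dumps({"schema": SCHEMA}))
        footer_schema, footer_reads = schema_from_every_footer(table)
        meta_schema, meta_reads = schema_from_table_metadata(table)
        assert footer_schema == meta_schema
        return footer_reads, meta_reads
```

The footer check needs one read per file, so its cost grows with the file count, while the metadata lookup stays at a single read regardless of how many data files the table has.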

adolph · 2y ago

> a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes.

Sounds like this data lake could use a Parquet file listing the Parquet files.

Butter

jaltekruse · 2y ago

If the format is splittable you can generally get similar benefits, and Parquet files have metadata to point a given reader at a specific chunk of the file that can be read independently. In the case of Parquet, the writer decides when to finish writing a block/RowGroup, so manually creating smaller files than that can increase parallelism. But you can only go so far, as I'm pretty sure I've seen Spark combine very small files into a single-threaded read task.

gregw2 · 2y ago · 2 in thread

This is a weird comparison to make nowadays. A more relevant question is Delta Lake vs Iceberg.

gregw2 · 2y ago

MrPowers, I suspect there is a case to at least attempt to make against Iceberg, but it is strange you aren't making it.

Per your blog: https://mungingdata.com/devrel/virtuous-content-cycle-develo...

"This post explains how to scale developer advocacy by creating content in a way that answers current user questions and makes it easier to generate additional content in the future"

As a lead for a team of developers who have used Parquet and are considering Iceberg for our next-gen stuff, you aren't "answering current user questions" about whether we should consider Delta Lake, at least for me. You are marketing to a past world.

Pointers from anyone on Delta vs Iceberg welcome.

MrPowers (OP) · 2y ago

Yea, it is fair feedback.

I respect the Iceberg team & their work.

I've been shying away from that post cause I don't wanna start a flamewar, but I will reflect on this and reconsider. Thank you.

MrPowers (OP) · 2y ago · 2 in thread

Data Lakes (i.e. Parquet files in storage without a metadata layer) don't support transactions, require expensive file listing operations, and don't support basic DML operations like deleting rows.

Delta Lake stores data in Parquet files and adds a metadata layer to provide support for ACID transactions, schema enforcement, versioned data, and full DML support. Delta Lake also offers concurrency protection.

This post explains all the features offered by Delta Lake in comparison to a plain vanilla Parquet data lake.
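
One way a metadata layer can give concurrency protection is optimistic commits: each writer tries to create the next numbered log entry, and only one writer can win a given version. The following is a minimal sketch of that idea in plain Python, not Delta Lake's actual commit code. `try_commit`, `commit_with_retry`, and the log layout are hypothetical, and the exclusive-create file mode on a local filesystem stands in for whatever atomic "create if absent" primitive the storage system offers.

```python
import json
import tempfile
from pathlib import Path


def try_commit(log_dir: Path, version: int, actions: list) -> bool:
    """Attempt to publish `version`; return False if another writer got there first."""
    path = log_dir / f"{version:08d}.json"
    try:
        # Mode "x" fails if the file already exists, so only one writer can win a version.
        with path.open("x") as fh:
            for action in actions:
                fh.write(json.dumps(action) + "\n")
    except FileExistsError:
        return False
    return True


def latest_version(log_dir: Path) -> int:
    versions = [int(p.stem) for p in log_dir.glob("*.json")]
    return max(versions, default=-1)


def commit_with_retry(log_dir: Path, actions: list, attempts: int = 5) -> int:
    """Optimistic loop: read the latest version, try to write the next one, retry on conflict."""
    for _ in range(attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many conflicting writers")


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        first = try_commit(log_dir, 0, [{"add": "part-0.parquet"}])
        # A second writer that also believed version 0 was free loses the race.
        second = try_commit(log_dir, 0, [{"add": "part-1.parquet"}])
        # Retrying against the current log lands on the next version instead.
        retried = commit_with_retry(log_dir, [{"add": "part-1.parquet"}])
        return first, second, retried
```

A real system would also have to decide whether the winning commit logically conflicts with the losing writer's changes before retrying; the sketch skips that check.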

alexmolas · 2y ago

Please stop using LLMs to provide post summaries. This comment is not adding value to the conversation.

MrPowers (OP) · 2y ago

I actually wrote this. I thought it was going to be part of the post description and didn't realize it was going to be a comment.

xnx · 2y ago · 1 in thread

More comparisons (from a competitor?):

"Apache Hudi vs Delta Lake vs Apache Iceberg - Data Lakehouse Feature Comparison" https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-ap...

MrPowers (OP) · 2y ago

Looking at this now.

* Delta Lake supports merge-on-read via deletion vectors: https://delta.io/blog/2023-07-05-deletion-vectors/

* Why doesn't Delta Lake have efficient bulk load? Lots of the biggest datasets in the world are in Delta tables.

* Delta Lake definitely supports compaction: https://delta.io/blog/2023-01-25-delta-lake-small-file-compa...

* What does CLI support mean in the context of a Lakehouse storage system? You can open up a Spark shell or Python shell to interface with your Delta table. That's like saying "CSV doesn't have a CLI". I don't get it.

I didn't do a detailed review of the post.

Zizizizz · 2y ago · 1 in thread

Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.

I think the website is here: https://delta.io

MrPowers (OP) · 2y ago

Yea, there is a Rust implementation of the Delta Lake protocol that lets you do upserts without Spark too. This allows pandas, Polars, DataFusion, and PyArrow users to easily do upserts as well.

lgsilver · 2y ago · 1 in thread

Databricks has been struggling to defend Delta against the fast-moving improvements and widening adoption of Iceberg, championed by two of its major competitors, AWS and Snowflake. This article seems like a bizarre, and maybe even misleading, artifact, given that no one in the industry is comparing Parquet to Delta. They’re weighing Iceberg, which, like Delta, can organize and structure groups of Parquet (or other format) files…

MrPowers (OP) · 2y ago

I work at Databricks, but am pretty much just an OSS nerd, mainly focusing on Delta Rust recently: https://github.com/delta-io/delta-rs

I did some keyword research and wrote this post cause lots of folks are doing searches for Delta Lake vs Parquet. I'm just trying to share a fair summary of the tradeoffs with folks who are doing this search. It's a popular post and that's why I figured I would share it here.
