* Companies are querying thousands or tens of thousands of Parquet files stored in the cloud via Spark
* Parquet lakes can be partitioned, which works well for queries that filter on the partition key (and slows down queries that don't)
* Parquet files contain min/max metadata for all columns. When possible, entire files are skipped, but this is relatively rare. This is called predicate pushdown filtering.
* Parquet files allow for the addition of custom metadata, but Spark doesn't expose that custom metadata for filtering
* Spark is generally bad at joining two big tables (it's good at broadcast joins, which generally require one of the tables to be 2GB or less)
* Companies like Snowflake & MemSQL ship Spark connectors that let certain parts of queries get pushed down to the underlying engine.
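The min/max file skipping mentioned above can be sketched in plain Python. This is a toy model, not Spark's or Parquet's actual implementation; the `FileStats` class and all names here are hypothetical, standing in for the per-column statistics a Parquet footer carries:

```python
# Toy model of min/max file skipping: each "file" carries per-column
# min/max stats, the way a Parquet footer does for its row groups.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    col_min: int
    col_max: int

def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range could contain rows with lo <= col <= hi."""
    return [f.path for f in files if f.col_max >= lo and f.col_min <= hi]

files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# A predicate like `WHERE col BETWEEN 120 AND 150` only needs part-1;
# the other two files are skipped without ever being opened.
print(files_to_scan(files, 120, 150))  # ['part-1.parquet']
```

As the bullet notes, skipping whole files this way only pays off when the data happens to be clustered on the filtered column; otherwise most files' min/max ranges overlap the predicate and nothing gets skipped.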
There is a huge opportunity to build a massive company on data lakes optimized for Spark. The amount of wasted compute cycles filtering over files that don't have any data relevant to the query is staggering.
I was listening to the A16Z podcast and they were discussing this in depth.
>Parquet files contain min/max metadata for all columns. When possible, entire files are skipped, but this is relatively rare. This is called predicate pushdown filtering.
A nitpick, but I wouldn't call this predicate pushdown; it's partition (or segment) elimination. A predicate that gets pushed down can then allow files to be skipped through that process, though.
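To illustrate the distinction: partition elimination can also happen purely from the directory layout, before any file footer is read. This is a toy sketch of Hive-style partition pruning, with made-up paths; real engines resolve partition values from the catalog or directory structure rather than string-matching paths like this:

```python
# Toy model of partition elimination with Hive-style paths (key=value
# directories). Whole directories are skipped before any file is opened.
def prune_partitions(paths, key, value):
    """Keep only files that live under a partition directory where key=value."""
    token = f"{key}={value}"
    return [p for p in paths if token in p.split("/")]

paths = [
    "lake/events/date=2021-01-01/part-0.parquet",
    "lake/events/date=2021-01-01/part-1.parquet",
    "lake/events/date=2021-01-02/part-0.parquet",
]

# A filter on the partition key eliminates the 2021-01-01 directory entirely.
print(prune_partitions(paths, "date", "2021-01-02"))
# ['lake/events/date=2021-01-02/part-0.parquet']
```

A pushed-down predicate on a non-partition column can't be resolved from the paths alone, which is where the min/max footer statistics come in.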
It's high level and focuses on some of the business needs that drive this sort of architecture.