* Companies are querying thousands or tens of thousands of Parquet files stored in the cloud via Spark
* Parquet lakes can be partitioned, which works well for queries that filter on the partition key (and slows down queries that don't)
* Parquet files contain min/max metadata for all columns. When possible, entire files are skipped, but this is relatively rare. This is called predicate pushdown filtering.
* Parquet files allow for the addition of custom metadata, but Spark doesn't expose that custom metadata for filtering
* Spark is generally bad at joining two big tables (it's good at broadcast joins, which generally require one of the tables to be 2GB or less)
* Companies like Snowflake & MemSQL ship Spark connectors that let certain parts of queries get pushed down to the underlying engine.
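The min/max file skipping mentioned above can be sketched in plain Python. This is a toy model, not Spark's or Parquet's actual implementation; the `FileStats` class and all names here are hypothetical, standing in for the per-column statistics a Parquet footer carries:

```python
# Toy model of min/max file skipping: each "file" carries per-column
# min/max stats, the way a Parquet footer does for its row groups.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    col_min: int
    col_max: int

def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range could contain rows with lo <= col <= hi."""
    return [f.path for f in files if f.col_max >= lo and f.col_min <= hi]

files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# A predicate like `WHERE col BETWEEN 120 AND 150` only needs part-1;
# the other two files are skipped without ever being opened.
print(files_to_scan(files, 120, 150))  # ['part-1.parquet']
```

As the bullet notes, skipping whole files this way only pays off when the data happens to be clustered on the filtered column; otherwise most files' min/max ranges overlap the predicate and nothing gets skipped.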
There is a huge opportunity to build a massive company on data lakes optimized for Spark. The amount of wasted compute cycles filtering over files that don't have any data relevant to the query is staggering.
I was listening to the A16Z podcast and they were discussing this in depth.
>Parquet files contain min/max metadata for all columns. When possible, entire files are skipped, but this is relatively rare. This is called predicate pushdown filtering.
A nitpick, but I wouldn't call this predicate pushdown; it's partition (or segment) elimination. A predicate that gets pushed down can then allow files to be skipped through that process, though.
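To illustrate the distinction: partition elimination can also happen purely from the directory layout, before any file footer is read. This is a toy sketch of Hive-style partition pruning, with made-up paths; real engines resolve partition values from the catalog or directory structure rather than string-matching paths like this:

```python
# Toy model of partition elimination with Hive-style paths (key=value
# directories). Whole directories are skipped before any file is opened.
def prune_partitions(paths, key, value):
    """Keep only files that live under a partition directory where key=value."""
    token = f"{key}={value}"
    return [p for p in paths if token in p.split("/")]

paths = [
    "lake/events/date=2021-01-01/part-0.parquet",
    "lake/events/date=2021-01-01/part-1.parquet",
    "lake/events/date=2021-01-02/part-0.parquet",
]

# A filter on the partition key eliminates the 2021-01-01 directory entirely.
print(prune_partitions(paths, "date", "2021-01-02"))
# ['lake/events/date=2021-01-02/part-0.parquet']
```

A pushed-down predicate on a non-partition column can't be resolved from the paths alone, which is where the min/max footer statistics come in.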
It's high level and focuses on some of the business needs that drive this sort of architecture.