Databricks open-sources Delta Lake to make data lakes more reliable (opens in new tab)

(techcrunch.com)

142 pointssolidangle7y ago54 comments

54 comments

50 comments · 13 top-level

I don't really understand the concept of Data Lake, and wikipedia isn't helping much... is it just a buzzword for a collection of data stores?

shrikant7y ago

It's basically a reliable, horizontally scalable object store + a collection of data storage and processing engines.

So amongst the cloud providers, AWS calls a combination of S3 + Glue + Athena (for example) a "data lake", where S3 is the object storage which can store data in various formats, and Glue and Athena are used to transform/process/query the data. See a more detailed article/guide here: https://aws.amazon.com/lake-formation/

If you didn't want to put anything into the cloud and keep all your services on-premise, a local Hadoop cluster could be a data lake, for example using HDFS + Zookeeper + YARN + Hive.

[This is a huge over-smiplification because it's late and I really should be going home :)]

mucio7y ago

To storage and compute I would also add metadata, this is why most data lake projects end up in data swamps.

When people just dump data in their storage they end having really hard time sharing them in their organization.

bob10297y ago

I am personally inclined to lump this in with the other enterprisey buzzwords that are generally safe to avoid. From my perspective, the 'data lake' is the actual collection of business systems and your ability to extract useful information from them.

Trying to come up with some unified standard or common API for the extraction, transformation and implementation of useful data from a heterogenous collection of systems sounds like a problematic task at best.

The other, more likely, interpretation of 'data lake' is that it is the staging ground between your ability to do the above stated activity and other downstream systems interested in the data. If the idea is that you are creating the actual normalization layer, I feel like this is still more in the realm of SQL/ETL, as there really isn't any other direction to move that would reduce your entropy in a valuable way (relative to your time invested).

SQLite or Postgres is usually the right choice. This simple rule can help you avoid a lot of pain. Once you have convinced everyone that Postgres is to be used, the only other real barriers are your ability to get a data transport to each business system and the authoring of some SQL scripts. The workload of building a SQL representation of any particular business system is fairly predictable once you break it down to entity-relationship abstractions. Also, using a language with powerful class/object, serialization and database support such as C# or Java can cut your workload by orders of magnitude if you choose a SQL architecture. In C# for instance, you can just write POCOs and use Entity Framework to build out all of the SQL for you. This is not the most performant option by a long shot, but it can get you going incredibly quickly on a first iteration.

mr_toad7y ago

> Once you have convinced everyone that Postgres is to be used

... the data has already disappeared.

Systems get shut down and replaced. Operational systems may discard history.

By the time you get a fully operational data warehouse set up, it may be too late to preserve the data.

based27y ago

https://martinfowler.com/bliki/DataLake.html

MR4D7y ago

Awesome link!

The key line for me:

"The data lake stores raw data, in whatever form the data source provides."

The emphasis on"raw" was his, not mine.

lichtenberger7y ago

Thanks from me, too :-)

jesseryoung7y ago

Going through the process of defining and planning to build a data lake with my team right now.

A data lake is like you said a collection of data stores, and the industry as a whole hasn't defined it very well past that.

IMHO - A (useful) data lake is a platform that can support any type of data store in any format (be it relational, flat, graph, document, etc) and offer a way to consistently query it. A (useful) data lake does much in the way of managing metadata about those data stores that makes it easy to consume.

xwkk7y ago

I have heard these promises of a single query language for different data formats many times and the results (foisted upon developers and customers) have always far more painful than just sticking to SQL.

1 more reply

wldata7y ago

Plug: We help discover, catalog and search for metadata and lineage in your data lake. We support all major hadoop distros and cloud platforms. Check us out! https://www.waterlinedata.com

galazzah7y ago

AFAIK, the data lake is the next step in the evolution of the data warehouse. Instead of storing data in a data mart/data warehouse, the concept of the datalake (as a design pattern) is that you don't schemas (support for unstructured data), better support for auditing and data governance/democratization, and schema (?) evolution

Joeri7y ago

I wouldn't say you don't have schema's, rather you have schema-on-read instead of schema-on-write, and you use an extract-load-transform pattern instead of extract-transform-load. The data is replicated as-is into the data lake and only then do you figure out what to do with it.

2 more replies

rmk27y ago

Note that a data lake does not necessarily replace the data warehouse, but rather often complements it. As such, you store your raw data from various sources in a centralised data store (Hadoop-like, NoSQL, etc.). From there, you prune, clean, select, and potentially aggregate data that you would like to provide in a quality-controlled way to your business users, in a data warehouse. This data warehouse most often will be a more traditional relational data store (usually some flavour of SQL database), which allows users to select data from a curated, pre-selected slice of the overall data stored in the data lake, and which enables easier integration with common reporting tools, whether more traditional standard reporting tools or self-service BI tools.

manigandham7y ago

Practical definitions:

Data lake = place to store unstructured raw data. Usually as files in an object store or Hadoop/HDFS cluster. Analyse with data processing/SQL frameworks. Schema may be part of the data (parquet, avro) or on-read (raw csvs or json modeled into tables).

Data warehouse = place to store processed (semi)structured data. Usually in a distributed columnstore database. Analyze with SQL. Schema is pre-defined by the tables. Usually for smaller, faster, real-time queries or as a cache in-front of a data lake.

Some cloud data warehouses like BigQuery and Snowflake can also query unstructured files and even run on top of the object store so the boundaries are getting blurred. Will probably converge at some point in the future.

cavisne7y ago

The number of definitions here suggest its not well defined, one more :)

A data lake is when you store your data in object storage (S3, GCS etc) as opposed to a filesystem (HDFS) or some indexed datastore (Redshift etc).

This potentially saves a lot of money because you can scale compute separately from storage, and object storage can be extremely cost effective compared to running a distributed filesystem.

Where the two overlap is when you store something like parquet in object storage, the file format is somewhat indexed already so you spend a bit more money preprocessing it but save a lot of money querying it.

I think whether its "raw" json or log files or preprocessed parquet doesnt really differentiate whether its a data lake or not

kakoni7y ago

But interestingly delta.io only supports HDFS currently.

StreamBright7y ago

It is a fancy name for a data swamp. :) On the more serious note data warehouses called nowadays data lakes, not sure wjy people prefer this term, maybe because some folks like to think that it is ok to just throw all data in one place and it will be ok to query it like that. I still prefer data warehouse over it.

rooam-dev7y ago

Some call it "data dump" too.

stcredzero7y ago

It kind of makes sense for a cloud to form above a lake.

On another tangent, I wonder if it would be possible to make an un-warrantable cloud? Would it be feasible to create a distributed cloud within the borders of the US or another industrialized country which doesn't actually exist at a particular address?

mmrezaie7y ago· 9 in thread

What are the other alternatives for data lakes that can be used (both open source and close)?

zjaffee7y ago

Apache Iceberg is probably the closest product to what databricks is open sourcing, but none of these products are everything that's needed for datalake management.

What these products do is make it as easy to use decoupled storage and compute as your analytics system as it would be to use a fully managed analytics DBMS system.

groodt7y ago

Yes, when I heard about Delta I thought the same. Would love to see a comparison between Delta and Iceberg. I wonder if Ryan Blue is on HN.

mobileexpert7y ago

You can see some related but not directly comparable efforts in HuDI see https://eng.uber.com/uber-big-data-platform/

Or https://iceberg.apache.org/

Which both keep track of data versioning and management of file based datasets on object stores.

Most people today have very ad how approaches to handling data versioning and lineage on Hadoop datasets.

jdoliner7y ago

Throwing my own project's hat in the ring, Pachyderm[0] is opensource, written in Go and built on Docker and Kubernetes. It versions controls your data, makes modifications atomic and tracks data lineage. You implement your pipelines in containers, so any tool you can put in a container can be used (and horizontally scaled) on Pachyderm.

[0] https://github.com/pachyderm/pachyderm

century197y ago

DIY is probably the biggest alternative. I think Databricks saw so many customers solving the same problem and decided to make a service of it. Great idea and a bit surprising that it hasn't continued as an Enterprise feature that they only offer customers.

monkeyoct7y ago

Hive ACID solves a similar set of problems with a similar design (base Parquet/ORC files + delta files in the same format). It's a bit more "managed" in that compaction is automatic and it supports features like column-level access control, which aren't possible when you're executing an end-user's Scala code against a directory of files.

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/usi...

fimbulvetr7y ago

AWS has one in preview: https://aws.amazon.com/lake-formation/

mobileexpert7y ago

I don’t think Amundsen is comparable. Amundsen is essentially a data catalog that has ambition to get into master data management.

Delta here is adding features more closely associated with RDBMS or MPP data warehouses to the Spark data pipelines with parquet data on object stores big data world.

1 more reply

lichtenberger7y ago

That's what I'd be interested in, too :-)

georgewfraser7y ago· 5 in thread

There's a lot of confusion around data lakes. One source of confusion is that "data lake" versus "data warehouse" is often presented as a choice, where you can have either:

1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).

2. A data warehouse, where all the data is organized in a highly structured tables (star schema) in a commercial database (Snowflake, Redshift, ...).

This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.

Furthermore, you do not need to curate your data into a star schema before loading it. The ideal way to set up a modern data warehouse is to establish a "staging" schema that matches the source, and then transform that data into a star schema or data marts using SQL. In this scenario, your "data lake" and "data warehouse" are just two different schemas within the same database.

There are still some scenarios where it makes sense to build a data lake in addition to a data warehouse, primarily future-proofing. I wrote a blog post where I tried to outline these scenarios: https://fivetran.com/blog/when-to-adopt-a-data-lake

nitrogen7y ago

Has anyone written about privacy implications of data lakes and data warehouses? The Extract in ETL is usually supposed to filter out private data, but if instead all of the raw native data is dumped into a data lake, what ensures that data is handled with the same care as the individual systems that normally handle the data? What stops some random business analyst from running individual or aggregated queries that would be contractually or legally forbidden?

dikei7y ago

The solution is to divide your Data Lake into different zones with access control, so that user can only access what they're allowed to. That said, it's a lot of work to do this properly, so it's often neglected.

georgewfraser7y ago

That is a great question, we have written about this too: https://fivetran.com/blog/how-fivetran-helps-you-stay-compli...

Short version, you need to identify data that absolutely must not be retained and either block it or hash it as close as possible to the source. This means you still have to do a little transformation before you load into your data lake/warehouse.

Second, you need to identify the soft constraints and enforce them with the access controls of your data warehouse. This is (another) reason why you should use a relational database like Snowflake or BigQuery as your primary data store, and treat any nonrelational data lake like Parquet-in-S3 as a backup/staging area for 1 or more relational stores.

triplee7y ago

This! And the two can coexist as well -- Data lake for everything (partitioned appropriately) and modern data warehouse for the stuff from your data lake that you want to curate (of course with partially structured intake first).

kakoni7y ago

Interesting. But what is delta.io then, as you are storing data not in its pure native format (CSV,JSON,TSV...) but in parquet files. Or does it even matter?

iblaine7y ago· 3 in thread

What's the difference between a Delta Lake and Change Data Capture? Seems like in both cases you're creating a type 2 dimension against a source table.

atwebb7y ago

It is a different technology entirely. CDC is just the log of changes on a relational table. Delta Lake appears to be providing more native administrative capabilities to a data lake implementation (schemas, transactions, cataloging).

century197y ago

CDC you are looking at the change logs of a Relational Database typically.

With a (batch) Data Lake you accept as input different file types (JSON, CSV, AVRO) from various systems. Could be Hadoop systems, could be from a COBOL system on a mainframe.

mobileexpert7y ago

The Delta here isn’t so related to CDC. This is a replacement for the ad how methods people use now to keep multiple readers and writers of datasets in parquet format in agreement when using Spark’s dataframe API.

huac7y ago· 1 in thread

Are there other ways of implementing ACID transactions on Spark tables?

StreamBright7y ago

What do you mean Spark tables? Generally speaking it is a bad idea to try to combine ACID with data warehouses.

ekzhu7y ago

We (data curation lab at Univ of Toronto) are doing research in data lake discovery problems. One of the problems we are looking at is how to efficiently discover joinable and unionable tables. For example, find all the rental listings from various sources to create a master list (union); or find tables such as rental listings and school districts that can be used to augment each other (join). The technical challenges in finding joinable and unionable tables in data lakes involve the following: (1) the data schema is often inconsistent and poorly managed, so we can’t simply rely on that schema; and (2) the scale of data lakes can be in the order of hundreds of thousands of tables, making a content based search algorithm expensive. We came up with some solutions that are based on data sketches with several published papers [1,2,3]. The python library “datasketch” was a byproduct if these work.

Many challenges remain though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges experienced. If you're interested in learning more, see what we came up with here: https://www.surveymonkey.com/r/R7MYXSJ - would love to see what the HN community thinks about the current state of data lakes.

[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf [2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf [3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf

mobileexpert7y ago

Other efforts in improving the parquet datasets on cloud storage world:

https://github.com/apache/incubator-iceberg

https://github.com/apache/incubator-hudi

Happy to see Delta go open source.

MrPowers7y ago

They tried to keep it closed and sell it as a premium service, but looks like they need help from the open source community to make the product better. Great to see. Databricks has its roots in open source (the founder created Spark) and it's great that they're still making a lot of open source code rather than making everything private.

tlrobinson7y ago

I appreciate the thought TechCrunch put into the image representing ACID-compliant data lakes.

dikei7y ago

What I don't like about these ACID storage layers is they reduce compatibility between different query engine. For example, Spark cannot read Hive ACID tables natively and Hive cannot read Spark Delta tables either. Then there's other tools such as Presto or Drill which can read neither.

When you use an ACID storage layer, you're kinda locked into one solution for both ETL and query, which is not nice.

playing_colours7y ago

Recently I was interested to learn more on Data Lakes, how to design and maintain them.

There is a lot of information in articles, blogs, but I prefer books as a solid source of structured and aggregated information.

Surprisingly, I found just a single proper book on the topic: https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/d...

FridgeSeal7y ago

Sounds cool, but then I'd have to use Spark...

5874-4b22-a4e07y ago

Cloud -> Data lake -> Data stream -> Data Ocean -> Cloud

j / k navigate · click thread line to collapse

54 comments

50 comments · 13 top-level

alexchamberlain7y ago· 19 in thread

I don't really understand the concept of Data Lake, and wikipedia isn't helping much... is it just a buzzword for a collection of data stores?

shrikant7y ago

It's basically a reliable, horizontally scalable object store + a collection of data storage and processing engines.

If you didn't want to put anything into the cloud and keep all your services on-premise, a local Hadoop cluster could be a data lake, for example using HDFS + Zookeeper + YARN + Hive.

[This is a huge over-smiplification because it's late and I really should be going home :)]

mucio7y ago

To storage and compute I would also add metadata, this is why most data lake projects end up in data swamps.

When people just dump data in their storage they end having really hard time sharing them in their organization.

bob10297y ago

mr_toad7y ago

> Once you have convinced everyone that Postgres is to be used

... the data has already disappeared.

Systems get shut down and replaced. Operational systems may discard history.

By the time you get a fully operational data warehouse set up, it may be too late to preserve the data.

based27y ago

https://martinfowler.com/bliki/DataLake.html

MR4D7y ago

Awesome link!

The key line for me:

"The data lake stores raw data, in whatever form the data source provides."

The emphasis on"raw" was his, not mine.

lichtenberger7y ago

Thanks from me, too :-)

jesseryoung7y ago

Going through the process of defining and planning to build a data lake with my team right now.

A data lake is like you said a collection of data stores, and the industry as a whole hasn't defined it very well past that.

xwkk7y ago

1 more reply

wldata7y ago

Plug: We help discover, catalog and search for metadata and lineage in your data lake. We support all major hadoop distros and cloud platforms. Check us out! https://www.waterlinedata.com

galazzah7y ago

Joeri7y ago

2 more replies

rmk27y ago

manigandham7y ago

Practical definitions:

cavisne7y ago

The number of definitions here suggest its not well defined, one more :)

A data lake is when you store your data in object storage (S3, GCS etc) as opposed to a filesystem (HDFS) or some indexed datastore (Redshift etc).

This potentially saves a lot of money because you can scale compute separately from storage, and object storage can be extremely cost effective compared to running a distributed filesystem.

I think whether its "raw" json or log files or preprocessed parquet doesnt really differentiate whether its a data lake or not

kakoni7y ago

But interestingly delta.io only supports HDFS currently.

StreamBright7y ago

rooam-dev7y ago

Some call it "data dump" too.

stcredzero7y ago

It kind of makes sense for a cloud to form above a lake.

mmrezaie7y ago· 9 in thread

What are the other alternatives for data lakes that can be used (both open source and close)?

zjaffee7y ago

Apache Iceberg is probably the closest product to what databricks is open sourcing, but none of these products are everything that's needed for datalake management.

What these products do is make it as easy to use decoupled storage and compute as your analytics system as it would be to use a fully managed analytics DBMS system.

groodt7y ago

Yes, when I heard about Delta I thought the same. Would love to see a comparison between Delta and Iceberg. I wonder if Ryan Blue is on HN.

mobileexpert7y ago

You can see some related but not directly comparable efforts in HuDI see https://eng.uber.com/uber-big-data-platform/

Or https://iceberg.apache.org/

Which both keep track of data versioning and management of file based datasets on object stores.

Most people today have very ad how approaches to handling data versioning and lineage on Hadoop datasets.

jdoliner7y ago

[0] https://github.com/pachyderm/pachyderm

century197y ago

monkeyoct7y ago

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/usi...

fimbulvetr7y ago

AWS has one in preview: https://aws.amazon.com/lake-formation/

mobileexpert7y ago

I don’t think Amundsen is comparable. Amundsen is essentially a data catalog that has ambition to get into master data management.

Delta here is adding features more closely associated with RDBMS or MPP data warehouses to the Spark data pipelines with parquet data on object stores big data world.

1 more reply

lichtenberger7y ago

That's what I'd be interested in, too :-)

georgewfraser7y ago· 5 in thread

There's a lot of confusion around data lakes. One source of confusion is that "data lake" versus "data warehouse" is often presented as a choice, where you can have either:

1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).

2. A data warehouse, where all the data is organized in a highly structured tables (star schema) in a commercial database (Snowflake, Redshift, ...).

This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.

nitrogen7y ago

dikei7y ago

georgewfraser7y ago

That is a great question, we have written about this too: https://fivetran.com/blog/how-fivetran-helps-you-stay-compli...

triplee7y ago

kakoni7y ago

Interesting. But what is delta.io then, as you are storing data not in its pure native format (CSV,JSON,TSV...) but in parquet files. Or does it even matter?

iblaine7y ago· 3 in thread

What's the difference between a Delta Lake and Change Data Capture? Seems like in both cases you're creating a type 2 dimension against a source table.

atwebb7y ago

century197y ago

CDC you are looking at the change logs of a Relational Database typically.

With a (batch) Data Lake you accept as input different file types (JSON, CSV, AVRO) from various systems. Could be Hadoop systems, could be from a COBOL system on a mainframe.

mobileexpert7y ago

huac7y ago· 1 in thread

Are there other ways of implementing ACID transactions on Spark tables?

StreamBright7y ago

What do you mean Spark tables? Generally speaking it is a bad idea to try to combine ACID with data warehouses.

ekzhu7y ago

[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf [2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf [3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf

mobileexpert7y ago

Other efforts in improving the parquet datasets on cloud storage world:

https://github.com/apache/incubator-iceberg

https://github.com/apache/incubator-hudi

Happy to see Delta go open source.

MrPowers7y ago

tlrobinson7y ago

I appreciate the thought TechCrunch put into the image representing ACID-compliant data lakes.

dikei7y ago

When you use an ACID storage layer, you're kinda locked into one solution for both ETL and query, which is not nice.

playing_colours7y ago

Recently I was interested to learn more on Data Lakes, how to design and maintain them.

There is a lot of information in articles, blogs, but I prefer books as a solid source of structured and aggregated information.

Surprisingly, I found just a single proper book on the topic: https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/d...

FridgeSeal7y ago

Sounds cool, but then I'd have to use Spark...

5874-4b22-a4e07y ago

Cloud -> Data lake -> Data stream -> Data Ocean -> Cloud

j / k navigate · click thread line to collapse