So amongst the cloud providers, AWS calls a combination of S3 + Glue + Athena (for example) a "data lake", where S3 is the object storage which can store data in various formats, and Glue and Athena are used to transform/process/query the data. See a more detailed article/guide here: https://aws.amazon.com/lake-formation/
If you didn't want to put anything into the cloud and wanted to keep all your services on-premises, a local Hadoop cluster could be a data lake, for example using HDFS + Zookeeper + YARN + Hive.
[This is a huge over-simplification because it's late and I really should be going home :)]
When people just dump data in their storage, they end up having a really hard time sharing it across their organization.
Trying to come up with some unified standard or common API for the extraction, transformation and implementation of useful data from a heterogeneous collection of systems sounds like a problematic task at best.
The other, more likely, interpretation of 'data lake' is that it is the staging ground between your ability to do the above stated activity and other downstream systems interested in the data. If the idea is that you are creating the actual normalization layer, I feel like this is still more in the realm of SQL/ETL, as there really isn't any other direction to move that would reduce your entropy in a valuable way (relative to your time invested).
SQLite or Postgres is usually the right choice. This simple rule can help you avoid a lot of pain. Once you have convinced everyone that Postgres is to be used, the only other real barriers are your ability to get a data transport to each business system and the authoring of some SQL scripts. The workload of building a SQL representation of any particular business system is fairly predictable once you break it down to entity-relationship abstractions. Also, using a language with powerful class/object, serialization and database support such as C# or Java can cut your workload by orders of magnitude if you choose a SQL architecture. In C# for instance, you can just write POCOs and use Entity Framework to build out all of the SQL for you. This is not the most performant option by a long shot, but it can get you going incredibly quickly on a first iteration.
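To illustrate the "write plain objects and let a tool build the SQL" idea, here is a minimal Python sketch. It is not Entity Framework and not how any particular ORM works; it only derives a CREATE TABLE statement from a dataclass and inserts one row into an in-memory SQLite database. The `Customer` entity, the `SQL_TYPES` mapping and both helper functions are hypothetical names made up for this example.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple

# Tiny, assumed mapping from Python annotations to SQLite column types.
SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}

@dataclass
class Customer:  # hypothetical entity, standing in for a POCO
    id: int
    name: str
    balance: float

def create_table_sql(cls):
    """Build a CREATE TABLE statement from the dataclass fields."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__.lower()} ({cols})"

def insert(conn, obj):
    """Insert one dataclass instance using positional placeholders."""
    placeholders = ", ".join("?" for _ in fields(obj))
    table = type(obj).__name__.lower()
    conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", astuple(obj))

conn = sqlite3.connect(":memory:")
ddl = create_table_sql(Customer)
conn.execute(ddl)
insert(conn, Customer(1, "alice", 12.5))
row = conn.execute("SELECT id, name, balance FROM customer").fetchone()
```

A real ORM also handles keys, relationships and migrations, which this sketch leaves out on purpose.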
... the data has already disappeared.
Systems get shut down and replaced. Operational systems may discard history.
By the time you get a fully operational data warehouse set up, it may be too late to preserve the data.
The key line for me:
"The data lake stores raw data, in whatever form the data source provides."
The emphasis on "raw" was his, not mine.
A data lake is, like you said, a collection of data stores, and the industry as a whole hasn't defined it very well past that.
IMHO - A (useful) data lake is a platform that can support any type of data store in any format (be it relational, flat, graph, document, etc) and offer a way to consistently query it. A (useful) data lake also manages metadata about those data stores in a way that makes them easy to consume.
Data lake = place to store unstructured raw data. Usually as files in an object store or Hadoop/HDFS cluster. Analyze with data processing/SQL frameworks. Schema may be part of the data (Parquet, Avro) or on-read (raw CSVs or JSON modeled into tables).
Data warehouse = place to store processed (semi)structured data. Usually in a distributed columnstore database. Analyze with SQL. Schema is pre-defined by the tables. Usually for smaller, faster, real-time queries or as a cache in front of a data lake.
Some cloud data warehouses like BigQuery and Snowflake can also query unstructured files and even run on top of the object store so the boundaries are getting blurred. Will probably converge at some point in the future.
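A rough sketch of what "schema on read" means in practice, using only the Python standard library: the stored file is plain text, and the reader decides the types when it loads the data. The sample CSV, the column names and the `read_with_schema` helper are all made up for illustration; real lakes would use something like Hive or Presto for this step.

```python
import csv
import io
from datetime import date

# Hypothetical raw file as it might land in a lake: every value is text.
RAW_CSV = """order_id,amount,ordered_on
1,19.99,2019-04-01
2,5.00,2019-04-02
"""

# The schema lives with the reader, not with the stored file.
SCHEMA = {
    "order_id": int,
    "amount": float,
    "ordered_on": date.fromisoformat,
}

def read_with_schema(text, schema):
    """Parse raw CSV text and cast each column using the given schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]

rows = read_with_schema(RAW_CSV, SCHEMA)
total = sum(r["amount"] for r in rows)
```

A different consumer could read the same file with a different schema (say, treating `amount` as a string), which is the flexibility and also the risk of deferring the schema.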
A data lake is when you store your data in object storage (S3, GCS etc) as opposed to a filesystem (HDFS) or some indexed datastore (Redshift etc).
This potentially saves a lot of money because you can scale compute separately from storage, and object storage can be extremely cost effective compared to running a distributed filesystem.
The two overlap when you store something like Parquet in object storage: the file format is already somewhat indexed, so you spend a bit more money preprocessing it but save a lot of money querying it.
I think whether it's "raw" JSON or log files or preprocessed Parquet doesn't really determine whether it's a data lake or not.
On another tangent, I wonder if it would be possible to make an un-warrantable cloud? Would it be feasible to create a distributed cloud within the borders of the US or another industrialized country which doesn't actually exist at a particular address?
What these products do is make it as easy to use decoupled storage and compute as your analytics system as it would be to use a fully managed analytics DBMS.
Or https://iceberg.apache.org/
Both keep track of data versioning and manage file-based datasets on object stores.
Most people today have very ad hoc approaches to handling data versioning and lineage on Hadoop datasets.
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/usi...
Delta here is adding features more closely associated with RDBMS or MPP data warehouses to the big data world of Spark data pipelines running over Parquet data on object stores.
1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).
2. A data warehouse, where all the data is organized in highly structured tables (star schema) in a commercial database (Snowflake, Redshift, ...).
This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.
Furthermore, you do not need to curate your data into a star schema before loading it. The ideal way to set up a modern data warehouse is to establish a "staging" schema that matches the source, and then transform that data into a star schema or data marts using SQL. In this scenario, your "data lake" and "data warehouse" are just two different schemas within the same database.
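As a small sketch of the staging-then-transform pattern described above, the following uses SQLite from Python rather than Snowflake or BigQuery. SQLite has no CREATE SCHEMA, so an attached in-memory database named `staging` stands in for the staging schema; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: an attached database plays the role of a "staging" schema.
conn.execute("ATTACH DATABASE ':memory:' AS staging")

# The staging table mirrors the (hypothetical) source system as-is.
conn.execute(
    "CREATE TABLE staging.orders (order_id INTEGER, customer_name TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO staging.orders VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 5.0), (3, "alice", 2.5)],
)

# Transform the staged rows into a small star schema using plain SQL.
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT UNIQUE
);
CREATE TABLE fact_orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount REAL
);
INSERT INTO dim_customer (customer_name)
    SELECT DISTINCT customer_name FROM staging.orders ORDER BY customer_name;
INSERT INTO fact_orders
    SELECT o.order_id, d.customer_id, o.amount
    FROM staging.orders AS o
    JOIN dim_customer AS d ON d.customer_name = o.customer_name;
""")

# Query the star schema: total order amount per customer.
totals = conn.execute("""
    SELECT d.customer_name, SUM(f.amount)
    FROM fact_orders AS f JOIN dim_customer AS d USING (customer_id)
    GROUP BY d.customer_name ORDER BY d.customer_name
""").fetchall()
```

The raw rows stay untouched in `staging.orders`, so the transformation can be rerun or changed later without going back to the source.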
There are still some scenarios where it makes sense to build a data lake in addition to a data warehouse, primarily future-proofing. I wrote a blog post where I tried to outline these scenarios: https://fivetran.com/blog/when-to-adopt-a-data-lake
Short version, you need to identify data that absolutely must not be retained and either block it or hash it as close as possible to the source. This means you still have to do a little transformation before you load into your data lake/warehouse.
Second, you need to identify the soft constraints and enforce them with the access controls of your data warehouse. This is (another) reason why you should use a relational database like Snowflake or BigQuery as your primary data store, and treat any nonrelational data lake like Parquet-in-S3 as a backup/staging area for 1 or more relational stores.
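The "block it or hash it" step could look something like the sketch below. It uses a keyed hash (HMAC-SHA256) instead of a plain hash, which is my substitution, since unkeyed hashes of low-entropy values such as emails are easy to reverse by guessing. The field names, the key and the `scrub` helper are hypothetical.

```python
import hashlib
import hmac

# Hypothetical policy: which fields to drop and which to hash before loading.
BLOCKED_FIELDS = {"ssn"}
HASHED_FIELDS = {"email"}
SECRET_KEY = b"replace-with-a-real-secret"  # placeholder; keep real keys in a secret store

def scrub(record):
    """Return a copy of record without blocked fields and with hashed fields replaced."""
    clean = {}
    for key, value in record.items():
        if key in BLOCKED_FIELDS:
            continue  # never leaves the source
        if key in HASHED_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
            clean[key] = digest.hexdigest()
        else:
            clean[key] = value
    return clean

raw = {"user_id": 42, "email": "a@example.com", "ssn": "000-00-0000"}
loaded = scrub(raw)
```

Because the hash is deterministic for a given key, the hashed column can still be used for joins and counts downstream without exposing the original value.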
With a (batch) Data Lake you accept as input different file types (JSON, CSV, AVRO) from various systems. Could be Hadoop systems, could be from a COBOL system on a mainframe.
Many challenges remain though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges experienced. If you're interested in learning more, see what we came up with here: https://www.surveymonkey.com/r/R7MYXSJ - would love to see what the HN community thinks about the current state of data lakes.
[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf
[2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf
[3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf
https://github.com/apache/incubator-iceberg
https://github.com/apache/incubator-hudi
Happy to see Delta go open source.
When you use an ACID storage layer, you're kinda locked into one solution for both ETL and query, which is not nice.
There is a lot of information in articles and blogs, but I prefer books as a solid source of structured and aggregated information.
Surprisingly, I found just a single proper book on the topic: https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/d...