ELI5 is for Reddit. Generally, here we expect you to google it and get the ELI5 explanation before giving us your hot take in a comment
> Organizations that successfully generate business value from their data, will outperform their peers.
at which point I'm like
> ok, I'm reading a covert advertisement about Fancy Cloud Technology aimed at some kind of big-spending manager, which is unlikely to tell me meaningfully what this actually is
and I'm out. I was looking for content that was in a more neutral, purely educational genre, and wondering what collection of non-cloud analogues it replaces/is composed of. Someone writing in the comments
> I used it to transform several terabytes of JSON into nice relational data for analysts without too much effort
is way, way more direct and helpful than mentioning that 'unlike data warehouses, data lakes support non-relational data'. Like great, it's a cloud thing that supports a variety of databases. But what is it?
> before giving us your hot take in a comment
I didn't give any take at all? I just really found all the sources that came up on the first page of search results to be almost in the wrong genre for me, and expected (correctly) that people on this site would be able to produce descriptions in 1-5 sentences that worked way better for me.
Pretty much all of the answers I got here were really good, and I'm glad I asked.
> A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
This may be self-explanatory for you, but what it means in practice is not as self-evident as you believe. For all it describes, it could be an FTP upload directory that loads things into a SQLite database. It's not until the scale is invoked (multiple terabytes per day) that the inadequacies of a naive solution become apparent. For those in that area of the industry, Snowflake is already known. (Seriously, if you're running into the limitations of Redshift, it behooves you to take a look at Snowflake.) For those who aren't, data warehousing is unfamiliar, never mind data lakes. For those outside the ML sphere, the finer points of training runs are also non-obvious.
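To make the "upload directory plus SQLite" reading concrete, here is a minimal sketch of that naive setup. Everything in it is made up for illustration (the `raw_files` table, the `load_directory` and `demo` helpers, the two sample files); it is not how any real data lake product works, just the smallest thing that technically fits the quoted description.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def load_directory(upload_dir: Path, conn: sqlite3.Connection) -> int:
    """Store every file in upload_dir as-is in one table; return how many were loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_files (name TEXT PRIMARY KEY, body BLOB)"
    )
    count = 0
    for path in sorted(upload_dir.iterdir()):
        if path.is_file():
            # No schema is imposed up front: the bytes go in untouched.
            conn.execute(
                "INSERT OR REPLACE INTO raw_files (name, body) VALUES (?, ?)",
                (path.name, path.read_bytes()),
            )
            count += 1
    conn.commit()
    return count


def demo() -> tuple[int, str]:
    """Drop two hypothetical files into a temp directory, load them, read one back."""
    with tempfile.TemporaryDirectory() as tmp:
        upload_dir = Path(tmp)
        (upload_dir / "events.json").write_text(json.dumps({"user": "a", "clicks": 3}))
        (upload_dir / "notes.txt").write_text("free-form text")

        conn = sqlite3.connect(":memory:")
        loaded = load_directory(upload_dir, conn)

        # Structure is applied only at read time, by whoever runs the query.
        (body,) = conn.execute(
            "SELECT body FROM raw_files WHERE name = 'events.json'"
        ).fetchone()
        user = json.loads(body)["user"]
        conn.close()
        return loaded, user
```

Calling `demo()` loads both files and parses the JSON one on the way out. This works fine for a handful of files on one machine, which is exactly why the definition only starts to mean something once the volume gets large.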
Data lakes are more modern and came about as people realized they had 30 databases and the business wanted to run queries against all of them simultaneously (e.g., join your credit card transaction history with historical rates of default in a zip code), quickly. The data warehouse solution was to use federated database queries (JOINs across databases), or force everybody to consolidate. A data lake is a single virtual entity that represents "all your data in one place".
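As a toy illustration of a JOIN across databases, here is a sketch using SQLite's `ATTACH DATABASE`, which lets one connection query two separate databases. It only stands in for the idea; real federated query engines work across servers and look nothing like this. The `transactions` and `default_rates` tables, the `risk` alias, and all of the rows are invented for the example.

```python
import sqlite3


def cross_database_join() -> list[tuple[str, float, float]]:
    """Join card transactions in one database with default rates in another."""
    conn = sqlite3.connect(":memory:")  # "main" database: card transactions
    conn.execute("ATTACH DATABASE ':memory:' AS risk")  # second, separate database

    conn.execute("CREATE TABLE main.transactions (card TEXT, zip TEXT, amount REAL)")
    conn.execute("CREATE TABLE risk.default_rates (zip TEXT PRIMARY KEY, rate REAL)")

    # Made-up sample rows, purely for illustration.
    conn.executemany(
        "INSERT INTO main.transactions VALUES (?, ?, ?)",
        [("c1", "94110", 120.0), ("c2", "10001", 45.5), ("c3", "94110", 9.99)],
    )
    conn.executemany(
        "INSERT INTO risk.default_rates VALUES (?, ?)",
        [("94110", 0.02), ("10001", 0.05)],
    )

    # One query spanning both databases, joined on zip code.
    rows = conn.execute(
        """
        SELECT t.card, t.amount, r.rate
        FROM main.transactions AS t
        JOIN risk.default_rates AS r ON r.zip = t.zip
        ORDER BY t.card
        """
    ).fetchall()
    conn.close()
    return rows
```

With two databases this is easy. The pain described above starts when there are 30 of them, owned by different teams, on different engines.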
It's based on a weak analogy where a warehouse is a place where you put stuff in very well organized locations while a lake is a place where a bunch of different waters slosh together.
Storing unstructured data in a database is dumb because databases cost about 10X storage space due to indexing, while unstructured data often can just sit around passively in a filesystem (and/or have a filesystem index built into it for fast queries).
I view this through the lens of web tech. For example, see the wars between the MapReduce and database people, and how Google evolved from MapReduce against GFS to Flume against Spanner, which shows we just live in an endless cycle of renaming old technology.
It's absolutely correct that the terminology doesn't map perfectly.
It is barely true nowadays.
I didn't enjoy working with either the datastore directly or the DBA team that ran it. An early, more old-white-dude "i just want to serve 5T"