I'm curious about the difference between "continuous MapReduce" and I guess a subgraph in a "differential dataflow" (which I have read about but never really used). https://github.com/TimelyDataflow/differential-dataflow
I think a reasonable TLDR might be to say that continuous MapReduce has a better fault-tolerance story, while timely dataflow is more efficient for things like reactive joins. They both have their purpose, though, and I imagine that both Flow and Materialize will go on to co-exist as successful products.
So keeping track of min/max/average as you add new data is now "continuous MapReduce"?
Don't get me wrong, a data platform that ingests data and computes useful user-defined aggregates from that sounds useful. But this article feels like an attempt to position that as some kind of incredible industry-leading insight that is a novel take on $buzzword, when it really isn't.
Yeah, the article is an odd spin on what they're building.
> min/max/average as you add new data
Or a 2D kernel density estimate for your dashboards, a real-time view of 3-neighbors in a graph (nodes+edges definition) sized by log1p(request frequency), .... I find it way easier to write a few custom incremental primitives to piece together into that kind of algorithm than to write such an algorithm from scratch.
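To make the "custom incremental primitive" idea concrete, here's my own toy sketch (not from Flow or any product mentioned here): a running min/max/mean you can fold new records into one at a time, then compose with fancier primitives later.

```python
import math

class RunningStats:
    """Incrementally maintained min/max/mean over a stream of values."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        # O(1) update per record; no need to retain the raw data.
        self.count += 1
        self.total += x
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")

s = RunningStats()
for x in [3.0, 1.0, 4.0, 1.0, 5.0]:
    s.add(x)
print(s.min, s.max, s.mean)  # 1.0 5.0 2.8
```

The same shape works for anything with an associative update step (counts, histograms bins for a KDE, per-node degree tallies); you just swap out `add`.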
I'm not crazy about a general-purpose framework/product that tries to allow incremental updates of AllTheThings™ -- my experience thus far suggests that getting it to do what you want (or perform reasonably) on your own data will require enough kludges that you would have been far better off writing the WholeDamnedThing™ yourself.
If they do only support min/max/average and other simple transforms then that's probably not great; they'd be competing directly with something like QuestDB, which is a phenomenal product I'm leaning toward more and more. You don't need millisecond view update times if you can query the whole db in milliseconds.
Flow's model doesn't use windows, and allows for long-distance (in time) joins and aggregations. There's no concept of "late" data in Flow: it just keeps on updating the desired aggregate.
This is if you want to use the high-level API. If you use the lower-level ProcessFunction, you have even more flexibility.
"Reduced-lock-contention sharded hash tables are this season's new casual!"
Considering that Oracle is not in fact magic, this meant that a large number of firms were spending 7-8 figures annually on Oracle licenses. MapReduce/Hadoop was the first accepted alternative that didn't involve spending outrageous license fees, and instead involved outrageous hardware expenditure.
Con: OSS is less optimized than a proprietary solution, requiring bigger hardware
Pro: OSS allows you to buy bigger hardware, use all of it without logical restrictions, and scale infinitely beyond the arbitrary point you were locked into with licensing.
And then the new-found efficiency frees up time to discover/identify $(x,)xxx,xxx+ in manual work that can also now be done with your new-found compute...
Wow. Way to prevent us from progressing beyond the industrial revolution.
($catchup_speed++)
It’s incredibly simple for the end user conceptually, but it encapsulates optimized processing across a distributed file system, fault tolerance, shuffling of key/value pairs, job stage planning, handling of intermediates, etc.
Hadoop is a big data framework that reduced the level of competence required to write data pipelines, because it was able to hide a massive amount of complexity behind the map reduce abstraction.
I'd even argue that Hive, Snowflake, and other SQL data warehouses have taken this idea further, where most SQL primitives can be implemented as map reduce derivatives. With this next level of abstraction, DBAs and non-engineers are writing map reduce computations.
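As a toy illustration (my own sketch, not how Hive or Snowflake actually compile queries): a `SELECT country, SUM(sales) ... GROUP BY country` is essentially a map (emit key/value pairs), a shuffle (group by key), and a reduce (aggregate each group).

```python
from collections import defaultdict

def map_phase(records):
    # Like the SELECT list: emit (key, value) pairs per input record.
    for rec in records:
        yield rec["country"], rec["sales"]

def shuffle(pairs):
    # Group values by key -- the expensive distributed step in real systems.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # SUM(...) GROUP BY key.
    return {k: sum(vs) for k, vs in groups.items()}

records = [
    {"country": "NO", "sales": 10},
    {"country": "SE", "sales": 5},
    {"country": "NO", "sales": 7},
]
print(reduce_phase(shuffle(map_phase(records))))  # {'NO': 17, 'SE': 5}
```

Swap `sum` for `min`, `max`, or `len` and you get the other common aggregates, which is why so much of SQL maps onto this pattern.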
I think my point is that abstractions like map reduce have had a democratizing effect on who can implement high scale data processing and their value is that they took something incredibly complex and made it simple.
I don't know what I should be proud of when I learn something new; it all seems extremely basic as soon as I learn it.
It’s pretty easy to simplify things down until they sound unimpressive.
However, in most other cases there are now far better alternatives (although tbh I'm not sure how many were around when MapReduce was introduced).
The main limitation of MapReduce is the barrier imposed by the shuffle stage, plus the one at the end of the reduce when chaining together multiple MapReduce operations. Dataflow frameworks remove these barriers to varying degrees, which often lowers latency and can improve resource utilization.
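A toy illustration of that barrier (my own sketch, nothing to do with any specific framework): a chained-MapReduce style fully materializes each stage's output before the next stage starts, while a dataflow style streams each record through all stages as it arrives.

```python
# Chained-MapReduce style: a barrier between stages --
# stage 2 cannot start until stage 1 has produced ALL of its output.
def staged(data):
    stage1 = [x * 2 for x in data]    # fully materialized intermediate
    stage2 = [x + 1 for x in stage1]  # starts only after the barrier
    return stage2

# Dataflow style: stages pipelined with generators --
# each record flows through both stages as soon as it is available.
def pipelined(data):
    stage1 = (x * 2 for x in data)    # lazy
    stage2 = (x + 1 for x in stage1)  # consumes stage1 incrementally
    return list(stage2)

assert staged(range(5)) == pipelined(range(5))  # == [1, 3, 5, 7, 9]
```

Same result either way; the difference is latency and how much intermediate state sits around between stages.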
It dramatically reduced the cost of entry for many ad-tech applications.
Imagine you want to change part of your pipeline's logic. Now either all data needs to be reprocessed (expensive; depends on your having retained past data; will your low-latency continuous pipeline keep running while the backlog is cleared? is the code really idempotent, or will a rerun leave half the records failing to be reprocessed?). Or you need to not reprocess old data (now there is inconsistency in historical records; what do you do if you make a bad release that just outputs zeros?).
In any real organisation, you'll need both approaches. And it'll end up a mess with versions of code and versions of data. Now some customer comes along and demands a GDPR deletion of their session records and you have no way to even find all the versions of all the copies of the records let alone delete them and make everything else consistent...
But most people don't have "big data" in the sense of having data that requires more than a single machine to process.
Most people who think they have "big data" still don't have big data (e.g. I've done work on datasets where people insisted on using "big data" solutions when it could all easily fit in a Postgres instance with or without a columnar store with most of the working set cached in memory for a fraction of the cost).
It "went away" in the sense that more people realised they could avoid it with a few simple steps (e.g. pre-processing during ingestion), and/or fit the data they needed on fast-growing individual servers, and so the number of people continuing to use it more closely approximated the set of people who actually work on big data.
For those who actually needed it, it of course never went away.
I'm sure it wouldn't take the form of a dedicated process; probably just a language-agnostic programming pattern
- Map reduce is not solely useful because it allows you to delete source data; it's a trivial method of parallel processing. Not the most trivial or the most modern, just a common one.