http://googleresearch.blogspot.com/2011/09/sorting-petabytes...
Using two different metrics for two comparable things seems to carry some sort of implication. Were Google's machines single-core?
Certainly it would be interesting to have an apples-to-apples comparison. But the computers aren't the only relevant factor -- we also need to know about the networking hardware.
They also tuned their code to this specific problem:
"Exploiting Cache Locality: In the sort benchmark, each record is 100 bytes, where the sort key is the first 10 bytes. As we were profiling our sort program, we noticed the cache miss rate was high, because each comparison required an object pointer lookup that was random... Combining TimSort with our new layout to exploit cache locality, the CPU time for sorting was reduced by a factor of 5."
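An illustrative sketch of the layout change the quote describes (my own reconstruction, not Spark's actual code): instead of comparing records through object pointers, pack each record's 10-byte key prefix next to its index in one flat array and sort that, so comparisons touch only compact, contiguous memory.

```python
RECORD_SIZE = 100
KEY_SIZE = 10

def sort_records(buf: bytes) -> list:
    """Sort fixed-size records by their 10-byte key prefix."""
    n = len(buf) // RECORD_SIZE
    # Build a flat array of (key-prefix, record-index) pairs. Comparisons now
    # read this compact array sequentially instead of dereferencing a pointer
    # to a full 100-byte record on every comparison.
    keys = [(buf[i * RECORD_SIZE : i * RECORD_SIZE + KEY_SIZE], i)
            for i in range(n)]
    keys.sort()  # CPython's list.sort is Timsort, like the TimSort in the quote
    # Materialize records in sorted order only once, at the end.
    return [buf[i * RECORD_SIZE : (i + 1) * RECORD_SIZE] for _, i in keys]
```

The same idea applies in any language: the win comes from making the comparison loop's working set small and cache-resident, not from the sort algorithm itself.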
I would love to see MR and Spark compete on the exact same hardware configuration.
You may find this benchmark [1] interesting to read.
It needs some updating (a lot has changed since February 2014), but it compares Shark (which uses Spark as its execution engine) to Hive (using Hadoop 1 MapReduce as its execution engine) and a number of other systems.
The benchmark is run on EC2 and is detailed in such a way that it should be independently verifiable. Hive and Shark are run on identically sized clusters, though I don't know if the other details of the configuration were identical.
Going up to 1 petabyte, the Hadoop comparison adds more nodes (3,800), while the Spark benchmark actually reduced the node count to 190.
Does Spark scale well beyond ~200 nodes, or does the network become the bottleneck?
In any case, it's an impressive result considering that they didn't use Spark's in-memory cache.
> [O]ur Spark cluster was able to sustain ... 1.1 GB/s/node network activity during the reduce phase, saturating the 10Gbps link available on these machines.
If the network is the bottleneck it makes sense to reduce the number of nodes to reduce the network communications.
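A back-of-the-envelope for an all-to-all shuffle (my own arithmetic, not from the benchmark writeup): each node's map output is split roughly evenly across all reducers, so only a 1/n slice stays local. The cross-network volume barely changes with n, but the number of point-to-point flows shrinks quadratically, and each node streams more data per link.

```python
def shuffle_stats(total_bytes: float, n: int) -> dict:
    """Rough all-to-all shuffle figures for total_bytes spread over n nodes."""
    cross = total_bytes * (n - 1) / n   # bytes that must cross the network
    flows = n * (n - 1)                 # point-to-point transfer pairs
    per_node = cross / n                # network bytes each node ships
    return {"cross_bytes": cross, "flows": flows, "per_node_bytes": per_node}

# 1 PB over 3800 nodes vs 190 nodes: cross-network volume is nearly identical,
# but the flow count drops ~400x and each node drives far more of its 10 Gbps
# link -- consistent with fewer, network-saturated nodes doing well here.
```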
Spark runs fine on thousands of nodes.
If they used reserved instances in US East, it drops to $181.
Obviously there are lots of costs involved beside the final perfect run, but it's an interesting ballpark.
We haven't tested Spark 0.8 at this scale. In general, Spark is advancing at such a rapid rate that 1.1 is very, very different from 0.8.
The TB sort benchmark is pretty useless to me -- I am much more concerned with stability and a vibrant community (which means people, the software they write, and institutions using Hadoop in production).
Last time I tinkered with Spark (over a year ago) it was so buggy as to be next to useless, but perhaps things have changed.
Still -- the idea that there is some revolutionary new approach that is paradigm-shifting and way better than anything before should be viewed with extreme skepticism.
The problem of distributed computing is not a simple one. I remember tinkering with the Linux kernel back in the mid-nineties, and 20 years later it still has a ways to go.
Twenty years from now it might or might not be Hadoop that is the tool for this sort of thing -- we don't know -- but I will not take seriously anything or anyone claiming that the "next best thing" is here in 2014.
2. Yes, Spark was/is buggy.
3. For me, Spark really is a paradigm shift: a next-generation framework compared to M/R.
If by M/R you mean Hadoop -- Cloudera has done no such thing; their largest customer base is Hadoop.
As to "paradigm shift", we're so early in this that I don't think there even is a paradigm to shift.
There is a place for arguing how effective Map/Reduce is, but it's been known for years that M/R is neither the only nor the best general-purpose approach for solving all problems. More and more tools these days do not use M/R, including Spark, and Spark certainly is not the first tool to provide an alternative to M/R. AFAIK Google abandoned M/R years ago.
I just don't understand this constant boasting about Spark, it seems very suspicious to me.
As pointed out in the article multiple times, we are comparing with MR here. We are not comparing with Hadoop as an ecosystem. Spark plays nicely with Hadoop. As a matter of fact, this experiment ran on HDFS.
In terms of vibrant community, Spark is now the largest open source Big Data project by community/contributor count. More than 300 people have contributed code to the project.
I wrote a post on this: http://www.silota.com/site-search-blog/approximate-median-co...
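One common way to approximate a median over data too large to sort (a sketch of the general sampling idea, not necessarily the method in the linked post): keep a uniform random sample in a single pass via reservoir sampling, then take the exact median of the sample.

```python
import random

def approx_median(stream, sample_size=1001, seed=0):
    """Approximate the median of an iterable in one pass, O(sample_size) memory."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < sample_size:
            sample.append(x)
        else:
            # Reservoir sampling: each element survives with equal probability.
            j = rng.randrange(i + 1)
            if j < sample_size:
                sample[j] = x
    sample.sort()
    return sample[len(sample) // 2]
```

The error shrinks with the sample size (roughly as 1/sqrt(sample_size) in rank), which is usually plenty for monitoring-style queries.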
Good job