http://googleresearch.blogspot.com/2011/09/sorting-petabytes...
Using two different metrics for two comparable things seems to carry some sort of implication. Were Google's machines single-core?
Certainly it would be interesting to have an apples-to-apples comparison. But the computers aren't the only relevant factor -- we also need to know about the networking hardware.
They also tuned their code to this specific problem:
"Exploiting Cache Locality: In the sort benchmark, each record is 100 bytes, where the sort key is the first 10 bytes. As we were profiling our sort program, we noticed the cache miss rate was high, because each comparison required an object pointer lookup that was random... Combining TimSort with our new layout to exploit cache locality, the CPU time for sorting was reduced by a factor of 5."
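An illustrative sketch of the layout change the quote describes (my own reconstruction, not Spark's actual code): instead of comparing records through object pointers, pack each record's 10-byte key prefix next to its index in one flat array and sort that, so comparisons touch only compact, contiguous memory.

```python
RECORD_SIZE = 100
KEY_SIZE = 10

def sort_records(buf: bytes) -> list:
    """Sort fixed-size records by their 10-byte key prefix."""
    n = len(buf) // RECORD_SIZE
    # Build a flat array of (key-prefix, record-index) pairs. Comparisons now
    # read this compact array sequentially instead of dereferencing a pointer
    # to a full 100-byte record on every comparison.
    keys = [(buf[i * RECORD_SIZE : i * RECORD_SIZE + KEY_SIZE], i)
            for i in range(n)]
    keys.sort()  # CPython's list.sort is Timsort, like the TimSort in the quote
    # Materialize records in sorted order only once, at the end.
    return [buf[i * RECORD_SIZE : (i + 1) * RECORD_SIZE] for _, i in keys]
```

The same idea applies in any language: the win comes from making the comparison loop's working set small and cache-resident, not from the sort algorithm itself.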
I would love to see MR and Spark compete on the exact same hardware configuration.
You may find this benchmark [1] interesting to read.
It needs some updating (a lot has changed since February 2014), but it compares Shark (which uses Spark as its execution engine) to Hive (using Hadoop 1 MapReduce as its execution engine) and a number of other systems.
The benchmark is run on EC2 and is detailed in such a way that it should be independently verifiable. Hive and Shark are run on identically sized clusters, though I don't know if the other details of the configuration were identical.
Going up to 1 petabyte, the Hadoop comparison adds more nodes (3,800), while the Spark benchmark actually reduced the node count to 190.
Does Spark scale well beyond ~200 nodes, or does the network become the bottleneck?
In any case, it's an impressive result considering that they didn't use Spark's in-memory cache.
> [O]ur Spark cluster was able to sustain ... 1.1 GB/s/node network activity during the reduce phase, saturating the 10Gbps link available on these machines.
If the network is the bottleneck it makes sense to reduce the number of nodes to reduce the network communications.
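A back-of-the-envelope for an all-to-all shuffle (my own arithmetic, not from the benchmark writeup): each node's map output is split roughly evenly across all reducers, so only a 1/n slice stays local. The cross-network volume barely changes with n, but the number of point-to-point flows shrinks quadratically, and each node streams more data per link.

```python
def shuffle_stats(total_bytes: float, n: int) -> dict:
    """Rough all-to-all shuffle figures for total_bytes spread over n nodes."""
    cross = total_bytes * (n - 1) / n   # bytes that must cross the network
    flows = n * (n - 1)                 # point-to-point transfer pairs
    per_node = cross / n                # network bytes each node ships
    return {"cross_bytes": cross, "flows": flows, "per_node_bytes": per_node}

# 1 PB over 3800 nodes vs 190 nodes: cross-network volume is nearly identical,
# but the flow count drops ~400x and each node drives far more of its 10 Gbps
# link -- consistent with fewer, network-saturated nodes doing well here.
```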
Spark runs fine on thousands of nodes.
If they used reserved instances in US East, it drops to $181.
Obviously there are lots of costs involved beside the final perfect run, but it's an interesting ballpark.
We haven't tested Spark 0.8 at this scale. In general, Spark is advancing at such a rapid rate that 1.1 is very, very different from 0.8.
The TB sort benchmark is pretty useless to me -- I am much more concerned with stability and a vibrant community (which means people, the software they write, and institutions using Hadoop in production).
Last time I tinkered with Spark (over a year ago) it was so buggy as to be next to useless, but perhaps things have changed.
Still -- the idea that there is some revolutionary new approach that is paradigm-shifting and way better than anything before should be viewed with extreme skepticism.
The problem of distributed computing is not a simple one. I remember tinkering with the Linux kernel back in the mid-nineties, and 20 years later it still has a ways to go.
Twenty years from now it might or might not be Hadoop that is the tool for this sort of thing -- we don't know -- but I will not take seriously anything or anyone claiming that the "next best thing" is here in 2014.
2. Yes, Spark was/is buggy.
3. For me, Spark really is a paradigm shift: a next-generation framework compared to M/R.
If by M/R you mean Hadoop -- Cloudera has done no such thing; their largest customer base is Hadoop.
As to "paradigm shift", we're so early in this that I don't think there even is a paradigm to shift.
There is a place for arguing how effective Map/Reduce is, but it's been known for years that M/R is neither the only nor the best general-purpose approach for solving all problems. More and more tools these days do not use M/R, including Spark, and Spark certainly is not the first tool to provide an alternative to M/R. AFAIK Google abandoned M/R years ago.
I just don't understand this constant boasting about Spark, it seems very suspicious to me.
As pointed out in the article multiple times, we are comparing with MR here. We are not comparing with Hadoop as an ecosystem. Spark plays nicely with Hadoop. As a matter of fact, this experiment ran on HDFS.
In terms of vibrant community, Spark is now the largest open source Big Data project by community/contributor count. More than 300 people have contributed code to the project.
I wrote a post on this: http://www.silota.com/site-search-blog/approximate-median-co...
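One common way to approximate a median over data too large to sort (a sketch of the general sampling idea, not necessarily the method in the linked post): keep a uniform random sample in a single pass via reservoir sampling, then take the exact median of the sample.

```python
import random

def approx_median(stream, sample_size=1001, seed=0):
    """Approximate the median of an iterable in one pass, O(sample_size) memory."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < sample_size:
            sample.append(x)
        else:
            # Reservoir sampling: each element survives with equal probability.
            j = rng.randrange(i + 1)
            if j < sample_size:
                sample[j] = x
    sample.sort()
    return sample[len(sample) // 2]
```

The error shrinks with the sample size (roughly as 1/sqrt(sample_size) in rank), which is usually plenty for monitoring-style queries.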
Good job