Except that with MR's 2100 nodes, the entire dataset fit in (aggregate) memory :)
Edit: On the other hand, this is an endorsement of the current wave of "per-node performance stinks, so let's avoid rewriting software for an extra year or two by throwing SSDs at it." Great for hardware vendors!
Also, this was primarily network bound. The old record used 2100 nodes with a 10Gbps network.
So you have 100TB of disk reads followed by 100TB of disk writes, all on HDDs. Across 2100 nodes that's about 100GB of I/O per node; and since Hadoop nodes are typically in RAID-6, each write carries an associated parity read and write on top of that.
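As a sanity check on that figure, here's the arithmetic (a sketch in Python; the only inputs are the 100TB dataset and the 2100 nodes mentioned above):

    # Checking the "about 100GB/node" figure for the 2100-node run.
    TB = 1e12
    GB = 1e9

    nodes = 2100          # node count from the old MR record
    dataset = 100 * TB    # GraySort input size

    # One full read pass plus one full write pass over the dataset:
    io_per_node = 2 * dataset / nodes
    print(f"{io_per_node / GB:.0f} GB of raw I/O per node")  # ~95 GB

Note this ignores the RAID-6 parity traffic, which only pushes the per-node number higher.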
This does not even include the intermediate files, which (depending on how the kernel parameters are set) could have been written to disk. The typical dirty_background_ratio is 10, so after ~6GB of dirty pages pdflush kicks in and starts writing to the spinning disks.
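For anyone wondering where the ~6GB comes from: dirty_background_ratio is a percentage of (roughly) the node's memory, so on a 60GB box background writeback wakes up at about 6GB of dirty pages. A minimal sketch, assuming 60GB of RAM (the r3.2 figure from the sibling comment):

    # dirty_background_ratio is a percentage; the threshold scales with RAM.
    ram_gb = 60                   # assumed node RAM (r3.2-class box)
    dirty_background_ratio = 10   # common kernel default, in percent

    threshold_gb = ram_gb * dirty_background_ratio / 100
    print(f"background writeback starts at ~{threshold_gb:.0f} GB dirty")

    # On a live Linux box you can read the actual setting:
    with open("/proc/sys/vm/dirty_background_ratio") as f:
        print("current setting:", f.read().strip())

(Strictly it's a percentage of available rather than total memory, but for a back-of-envelope the difference doesn't matter.)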
Another way of looking at this is performance per watt or per dollar. The r3.2xlarge has 60GB of RAM, so comparing against that, Spark cost the same ~$1.4K while giving a 3X speedup. (Or, on a host that charges per minute, it would be the same performance at a third of the cost.)
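The per-minute claim is just proportional billing; a tiny sketch with a made-up hourly rate and runtime (only the 3X factor and the ~$1.4K come from the comparison above):

    # With per-minute billing, cost is proportional to runtime, so a 3x
    # speedup at the same hourly rate means ~1/3 the cost per sort.
    hourly_rate = 1400.0   # assumed cluster cost in $/hour (placeholder)
    mr_runtime_h = 1.0     # placeholder MR runtime
    speedup = 3.0          # Spark's advantage, from the comparison above

    mr_cost = hourly_rate * mr_runtime_h
    spark_cost = hourly_rate * (mr_runtime_h / speedup)
    print(f"MR: ${mr_cost:.0f}  Spark: ${spark_cost:.0f}")  # $1400 vs ~$467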
This is me not knowing the space: would MR (or more modern things like Tez) perform worse on this HW setup, or is this a reminder that hardware/config tuning matters?