
0 points · Smerity · 11y ago · 0 comments

There are two levels of optimization that come into play: AWS setup and the choice of primary language. I'm always happy to speak about both as we love seeing experiments run over the data!

AWS optimizations: For that level of cost efficiency, you really need to use spot instances. A cluster of 100 m1.xlarge machines (1.5TB of RAM, 400 cores, and 168TB of magnetic disk storage) will only cost you $3 per hour using spot instances, rather than $30 on-demand. You should pay on-demand prices for the Hadoop master, however.
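To make the arithmetic concrete, here is a rough sketch of how those cluster totals and hourly costs add up. The per-node figures are simply the totals above divided by 100, the prices are the ones implied by this comment (not current AWS pricing), and the function names are made up for illustration.

```python
# Per-node figures are the comment's cluster totals divided by 100 (assumption).
RAM_GB_PER_NODE = 15
CORES_PER_NODE = 4
DISK_GB_PER_NODE = 1680

# Prices in cents per node-hour, implied by $3/h spot and $30/h on-demand
# for 100 nodes. Integers avoid floating-point drift when summing.
SPOT_CENTS_PER_HOUR = 3
ON_DEMAND_CENTS_PER_HOUR = 30


def cluster_summary(nodes):
    """Total resources and hourly cost for a cluster of `nodes` workers."""
    return {
        "ram_tb": nodes * RAM_GB_PER_NODE / 1000,
        "cores": nodes * CORES_PER_NODE,
        "disk_tb": nodes * DISK_GB_PER_NODE / 1000,
        "spot_usd_per_hour": nodes * SPOT_CENTS_PER_HOUR / 100,
        "on_demand_usd_per_hour": nodes * ON_DEMAND_CENTS_PER_HOUR / 100,
    }


def hourly_cost_with_master(workers):
    """Spot-priced workers plus one on-demand master, in USD per hour."""
    return (workers * SPOT_CENTS_PER_HOUR + ON_DEMAND_CENTS_PER_HOUR) / 100


if __name__ == "__main__":
    print(cluster_summary(100))
    print(hourly_cost_with_master(100))
```

Under these assumptions, keeping the master on-demand adds only one node's on-demand price to the hourly bill.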

You'll also want to roll your own Hadoop cluster as opposed to using Elastic MapReduce (EMR). EMR is amazing but the cost overhead when using it on spot instances is ~100%.

For the code itself, this is a situation where you'll want to stick with the programming language that has both the best performance and best ecosystem. I'm personally not a big fan of Java, but it really does win out here -- it's close to C or C++ for performance and has the advantage of the Hadoop ecosystem behind it. Other languages are certainly usable, but even if LanguageX only ran 4x slower than Java, the resulting job would be 4x more expensive due to paying by the hour.
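As a quick illustration of why runtime translates directly into cost, here is a small sketch. The 10-hour baseline is a hypothetical number picked for the example, the $3/h rate is the spot cluster price mentioned above, and rounding up to whole hours is my reading of "paying by the hour".

```python
import math


def job_cost_usd(baseline_hours, slowdown, cluster_usd_per_hour):
    """Cost of a job billed in whole hours.

    baseline_hours: runtime of the reference implementation (hypothetical).
    slowdown: how many times slower the alternative implementation runs.
    """
    billed_hours = math.ceil(baseline_hours * slowdown)
    return billed_hours * cluster_usd_per_hour


if __name__ == "__main__":
    java_cost = job_cost_usd(10, 1, 3.0)
    other_cost = job_cost_usd(10, 4, 3.0)
    print(java_cost, other_cost, other_cost / java_cost)
```

The cluster rate is fixed, so the bill scales with wall-clock time: a 4x slowdown is a 4x bill.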

Other than that, it's really just a standard MapReduce job using Hadoop. You can see three examples for the three different data formats we use at: https://github.com/commoncrawl/cc-warc-examples/


2 comments · 1 top-level

elyase · 11y ago · 1 in thread

Thanks for your comments. I see you get 400 cores for 3 USD/h. Were you optimizing for processing power per dollar? Do you know if the m1.xlarge instances are the best for this use case?

Smerity (OP) · 11y ago

It really depends on the use case. I mentioned m1.xlarge as they provide a good general cluster setup. The mix of CPU, RAM, and disk space should work well for most experiments one might want to perform. m1.xlarge instances also have 1Gbps network interfaces when others in the same price range have 500Mbps -- a vestige of the older generation of machines. Finally, they excel at disk space. If you're utilizing HDFS heavily, newer instances are usually SSD (good) but have 10 to 20 times less disk storage (bad).

For the same dollar amount, you can trade for other specs though.

If you're more interested in CPU / RAM / SSD for example, paying $3/h for r3.xlarge gets you 3TB of RAM, about 1.5 times more computing power (same number of cores but more compute units), but far less disk space -- 8TB of SSD.
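To put the trade-off in numbers, here is a rough comparison of the two clusters at the same $3/h spend. It uses only the totals quoted in this thread; the `Cluster` class and `per_dollar` helper are just names I made up for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    usd_per_hour: float
    ram_tb: float
    disk_tb: float


# Totals as quoted in this thread, both at roughly $3/h on spot.
M1 = Cluster("m1.xlarge cluster", 3.0, 1.5, 168.0)
R3 = Cluster("r3.xlarge cluster", 3.0, 3.0, 8.0)


def per_dollar(cluster):
    """RAM and disk (in TB) bought per dollar-hour."""
    return {
        "ram_tb": cluster.ram_tb / cluster.usd_per_hour,
        "disk_tb": cluster.disk_tb / cluster.usd_per_hour,
    }


ram_ratio = R3.ram_tb / M1.ram_tb     # how much more RAM r3 gives
disk_ratio = M1.disk_tb / R3.disk_tb  # how much more disk m1 gives

if __name__ == "__main__":
    for c in (M1, R3):
        print(c.name, per_dollar(c))
    print(ram_ratio, disk_ratio)
```

With these figures, r3.xlarge doubles the RAM while m1.xlarge offers about 21 times the disk, which is why heavy HDFS use pushes you toward the older instances.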

In the end, it really depends on the task at hand, but your dollar does go quite far regardless!
