AWS optimizations: To reach that level of cost efficiency, you really need to use spot instances. A cluster of 100 m1.xlarge machines (1.5TB of RAM, 400 cores, and 168TB of magnetic disk storage) will only cost you about $3 per hour using spot instances, rather than $30 on-demand. You should still pay on-demand prices for the Hadoop master, however, since losing the master to a spot termination would take the whole job down with it.
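To make the arithmetic concrete, here is a rough sketch of the hourly bill. The per-instance prices are simply the cluster totals above divided by 100 (about $0.03 spot and $0.30 on-demand); real spot prices fluctuate, so treat them as placeholders. I'm also assuming a single on-demand master of the same instance type in addition to the 100 workers, and the class and method names are made up for illustration.

```java
// Rough hourly cost model for a spot-based Hadoop cluster.
// Prices are assumptions derived from the totals quoted above
// ($3/hour and $30/hour for 100 machines), not current AWS prices.
class SpotClusterCost {
    static final double SPOT_PER_HOUR = 0.03;      // assumed spot price per m1.xlarge
    static final double ON_DEMAND_PER_HOUR = 0.30; // assumed on-demand price per m1.xlarge

    // Hourly cost of a cluster with spot workers and on-demand masters.
    static double hourlyCost(int spotWorkers, int onDemandMasters) {
        return spotWorkers * SPOT_PER_HOUR + onDemandMasters * ON_DEMAND_PER_HOUR;
    }

    public static void main(String[] args) {
        // 100 spot workers plus one on-demand master (the master count is an assumption).
        System.out.printf("Spot workers + on-demand master: $%.2f/hour%n", hourlyCost(100, 1));
        // The same 101 machines, all on-demand, for comparison.
        System.out.printf("Everything on-demand:            $%.2f/hour%n", hourlyCost(0, 101));
    }
}
```

Keeping the master on-demand adds only one machine's worth of on-demand pricing, which is small next to the savings on the workers.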
You'll also want to roll your own Hadoop cluster as opposed to using Elastic MapReduce (EMR). EMR is amazing, but on spot instances its cost overhead is ~100%.
For the code itself, this is a situation where you'll want to stick with the programming language that has both the best performance and best ecosystem. I'm personally not a big fan of Java, but it really does win out here -- it's close to C or C++ for performance and has the advantage of the Hadoop ecosystem behind it. Other languages are certainly usable, but even if LanguageX only ran 4x slower than Java, the resulting job would be 4x more expensive due to paying by the hour.
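As a sketch of why runtime maps so directly onto cost: with hourly billing, the job cost is just the cluster's hourly rate times the number of hours started. The 10-hour baseline below is a made-up figure purely for illustration, the $3 per hour rate is the spot cluster figure from earlier, and rounding up to whole hours is my assumption about how the billing works.

```java
// Illustrative job cost model: cost scales with wall-clock hours billed.
// The baseline runtime is hypothetical; only the 4x slowdown factor and
// the $3/hour cluster rate come from the discussion above.
class JobCost {
    // Cost of a job, assuming every started hour is billed in full.
    static double jobCost(double hourlyRate, double baselineHours, double slowdownFactor) {
        double billedHours = Math.ceil(baselineHours * slowdownFactor);
        return hourlyRate * billedHours;
    }

    public static void main(String[] args) {
        double rate = 3.0;       // dollars per hour for the whole cluster
        double baseline = 10.0;  // hypothetical runtime of the Java job, in hours

        double javaCost = jobCost(rate, baseline, 1.0);
        double slowCost = jobCost(rate, baseline, 4.0); // "LanguageX" at 4x slower

        System.out.printf("Java job:      $%.2f%n", javaCost);
        System.out.printf("LanguageX job: $%.2f%n", slowCost);
        System.out.printf("Ratio:         %.1fx%n", slowCost / javaCost);
    }
}
```

Whatever the real baseline turns out to be, the ratio between the two costs stays at roughly the slowdown factor.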
Other than that, it's really just a standard MapReduce job using Hadoop. You can see three examples, one for each of the three data formats we use, at: https://github.com/commoncrawl/cc-warc-examples/