We tried a bunch of technologies (Nutch, Heritrix, StormCrawler, ...) and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything.
To process the data we crawled, we are using ArchiveSpark (https://github.com/helgeho/ArchiveSpark).
Also, Mixnode defaults to Amazon S3 for storage, which was fine with us since we're using EC2 to process the results.