A couple points in no particular order:
* EMR Hive is a closed source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (which was released more than a year ago), which means EMR users aren't benefitting from the performance improvements that appeared in the 0.9 and 0.10 releases.
* It is implied (though not explicitly stated) that the Hive queries were run against gzip compressed TSV files stored in S3, while Redshift was allowed to spend 17 hours converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?
Because then the stupid headline wouldn't be so sensationalist, would it?
// I have no dog in this fight, but hate twisted claims
How does it compare against Greenplum or Aster or Vertica and is it more cost-effective? Those are important questions.
vs 1491 Hadoop
Looks to me like Hadoop is about 40 times faster...
If you do enough queries, Redshift will come out faster (assuming the numbers are correct).
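To put a rough number on "enough queries", here is a back-of-the-envelope sketch. The 17-hour load is the figure quoted in this thread; treating 1491 as seconds per Hive query is an assumption about its unit, and the Redshift per-query time is a purely hypothetical placeholder, so swap in real measurements before drawing any conclusion.

```python
import math


def break_even_queries(load_seconds, slow_query_seconds, fast_query_seconds):
    """Smallest query count at which paying an up-front load cost wins overall.

    Solves load + n * fast < n * slow for the smallest integer n.
    """
    saving_per_query = slow_query_seconds - fast_query_seconds
    if saving_per_query <= 0:
        raise ValueError("the 'fast' system must be faster per query")
    return math.floor(load_seconds / saving_per_query) + 1


LOAD_SECONDS = 17 * 3600        # the 17-hour load mentioned in the thread
HIVE_QUERY_SECONDS = 1491       # assumption: the 1491 figure is in seconds
REDSHIFT_QUERY_SECONDS = 150    # hypothetical placeholder, not a measured value

print(break_even_queries(LOAD_SECONDS, HIVE_QUERY_SECONDS, REDSHIFT_QUERY_SECONDS))
```

The point is only that the load cost is paid once while the per-query saving repeats, so the comparison depends heavily on how many queries you actually run.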
As Carl Sagan said..
"Extraordinary claims require extraordinary evidence"
> Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and parallelizing and distributing queries across multiple nodes.
Column-store databases[2] can be screamingly fast for analytics operations compared to RDBMS or other DB types (à la assorted NoSQL). See Kdb[3] or MonetDB[4] for examples of specific implementations. I'd fully expect a competent column store designed for horizontal scaling to obliterate Hive for a wide range of problems.
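A toy illustration of why the layout matters for analytics, in plain Python. This is a sketch of the general idea only, not how Redshift, Kdb or MonetDB are implemented; the sample records and the function names are made up for the example.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user": "a", "country": "US", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 20.5},
    {"user": "c", "country": "US", "amount": 5.0},
]

# Column-oriented layout: each field is stored as its own contiguous array.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def sum_row_store(rows, field):
    # Has to visit every whole record even though only one field is needed.
    return sum(row[field] for row in rows)


def sum_column_store(columns, field):
    # Reads only the one array the query asks for; other fields are never touched.
    return sum(columns[field])


print(sum_row_store(rows, "amount"), sum_column_store(columns, "amount"))
```

Both functions return the same answer; the difference is how much data has to be read to get it. On disk, the column layout also tends to compress better because similar values sit next to each other.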
The usual big-data caveat: you need to pay attention to the fit of your tools against your problem and your data. I don't expect Redshift to be any different. Still, it's pretty exciting to see a new analysis DB tech cropping up like this. And doubly interesting to see this coming from Amazon.
[1] https://aws.amazon.com/redshift/
[2] https://en.wikipedia.org/wiki/Column-oriented_DBMS
[3a] http://kx.com/kdb-plus.php
[3b] https://en.wikipedia.org/wiki/K_%28programming_language%29#K...
There is a lot of new DB tech; Redshift doesn't seem particularly competitive at the moment unless you only need to use it a portion of the time, which is where Amazon excels.
Hadoop is heavily horizontally scalable, but that's about it.
Of course, if your data set can fit in memory, then Redshift or similar technologies are probably a better choice than Hive. But it's important to remember that the performance gains here come as the result of a tradeoff.
The comparison is a complete and utter joke.
So I guess it does matter.
Nevertheless, I'll take a moment to predict that articles like this will only become more and more frequent over time. Hadoop has entered its "enterprisey" stage, with massively complex, cumbersome code, arcane performance tuning, bullshit consulting business built around it (complete with books and "certificates")...
The more agile competitors will be snapping at its flanks (and ankles), sometimes without merit, and sometimes with.
It turns out usage based billing can be cheaper if you don't use a resource.