Amazon Redshift is 10x faster and cheaper than Hadoop and Hive

(slideshare.net)

138 points · fujibee · 13y ago · 42 comments

meritt13y ago· 4 in thread

Comparing a column-oriented RDBMS with parallel query execution versus hadoop is a joke in the first place. Hadoop is extremely slow. That's nothing new. This is not an apples-to-apples comparison whatsoever.

How does it compare against Greenplum or Aster or Vertica and is it more cost-effective? Those are important questions.

nieksand13y ago

Comparing Redshift against Hadoop+Hive is reasonable. As you pointed out... the technologies are very different. However, there is a large overlap in use cases.

annnnd13y ago

I strongly disagree. I am (or was) using both Hadoop and HBase and they are useful for very different purposes (huge amounts of nonstructured data, possibly with difficult-to-predict use cases versus structured data). Also note that Hive is just a layer over Hadoop with DB-like syntax, it doesn't make Hadoop a DB. It is still running MR queries beneath it.

verily13y ago

Especially given that Redshift, Greenplum and Aster directly build on or incorporate technology from PostgreSQL.

effn13y ago

This is true for Vertica as well.

BrianEatWorld13y ago· 4 in thread

I am still new to large data, but isn't a solution like Redshift similar to Google's BigQuery in that it only works with data that has a schema? How might one use Redshift with a DB that's originally in Mongo?

zeeg13y ago

You won't fit that much data into Mongo anyways, so does it matter?

taligent13y ago

People have apparently been storing 3TB of data in MongoDB.

So I guess it does matter.

scotth13y ago

Impose a schema.

BrianEatWorld13y ago

Is there an easy way to go back and forth between data with schema and data without? I'd love the benefits of this for queries, but for the production side of things, imposing a schema would be costly.

cwsteinbach13y ago· 3 in thread

Disclaimer: I'm a committer on the Apache Hive project.

A couple points in no particular order:

* EMR Hive is a closed source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (which was released more than a year ago), which means EMR users aren't benefitting from the performance improvements that appeared in the 0.9 and 0.10 releases.

* It is implied (though not explicitly stated) that the Hive queries were run against gzip compressed TSV files stored in S3, while Redshift was allowed to spend 17 hours converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?

ajays13y ago

Why wasn't that used in this performance comparison?

Because then the stupid headline wouldn't be so sensationalist, would it?

// I have no dog in this fight, but hate twisted claims

viralbajaria13y ago

Great feedback. I was also skeptical of using EMR Hive because it is so far behind in versions. Also, Redshift can do the analytics part very well, but I don't think it can do the exploration part that Hadoop/Hive are so good at (but maybe I am wrong).

Evbn13y ago

Wow. A year ago a solution architect promised me they would catch up to mainline to get a bunch of critical bugfixes.

free65213y ago· 3 in thread

So Redshift took 155 seconds + 17 hours (17 * 3600 = 61200 seconds) = 61355 secs total

vs 1491 secs for Hadoop.

Looks to me like Hadoop is about 40 times faster...

MBCook13y ago

That's like saying a bicycle is faster than a car because I can buy a bike in 10 minutes while it may take me a couple of hours to get through the car's paperwork.

If you do enough queries, redshift will come out faster (assuming the numbers are correct).
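To put a rough number on "enough queries", here is a minimal sketch that uses only the figures quoted in this thread. It assumes the 155 s and 1491 s figures are per-query times for the same query and that the 17-hour import is a one-time cost; the function names are made up for illustration.

```python
import math

LOAD_SECONDS = 17 * 3600      # one-time Redshift import quoted in the thread
REDSHIFT_QUERY_SECONDS = 155  # per-query time quoted for Redshift (assumed repeatable)
HIVE_QUERY_SECONDS = 1491     # per-query time quoted for Hadoop/Hive (assumed repeatable)


def total_seconds(queries, per_query, one_time=0):
    """Total time for `queries` identical runs plus any one-time cost."""
    return one_time + per_query * queries


def break_even_queries(load, fast, slow):
    """Smallest query count at which the system paying the load cost is strictly ahead."""
    return math.floor(load / (slow - fast)) + 1


n = break_even_queries(LOAD_SECONDS, REDSHIFT_QUERY_SECONDS, HIVE_QUERY_SECONDS)
print(n)  # 46
```

Under those assumptions, Redshift pulls ahead on the 46th query; anything that speeds up the Hive side, such as a columnar file format, moves that point further out.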

TallGuyShort13y ago

If you do enough queries, you should spend the time to use RCFile for Hive, in which case Redshift won't come out _that_ much faster. The point is the 17 hours is not negligible.

seanmcdirmid13y ago

That's assuming Hive doesn't have its own special format that the data could be converted to in order to improve performance, right?

jaytaylor13y ago· 3 in thread

I haven't tried Redshift before, but coming from a MR/Hadoop/Hive background, this seems to me like quite a sensational claim. I'd be very keen to hear others' thoughts on how widely these kinds of gains would apply for BigData processing.

As Carl Sagan said..

"Extraordinary claims require extraordinary evidence"

http://en.wikipedia.org/wiki/Carl_Sagan

saidajigumi13y ago

Hive is not particularly fast in and of itself; it just has horizontal scaling and a SQL-ish front-end. Looking at AWS RedShift's homepage[1] (emphasis added):

> Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and parallelizing and distributing queries across multiple nodes.

Column-store databases[2] can be screamingly fast for analytics operations compared to row-oriented RDBMSs or other DB types (à la assorted NoSQL). See Kdb[3] or MonetDB[4] for examples of specific implementations. I'd fully expect a competent column store designed for horizontal scaling to obliterate Hive for a wide range of problems.

The usual big-data caveat: you need to pay attention to the fit of your tools against your problem and your data. I don't expect RedShift to be any different. Still, it's pretty exciting to see a new analysis DB tech cropping up like this. And doubly interesting to see this coming from Amazon.

[1] https://aws.amazon.com/redshift/

[2] https://en.wikipedia.org/wiki/Column-oriented_DBMS

[3a] http://kx.com/kdb-plus.php

[3b] https://en.wikipedia.org/wiki/K_%28programming_language%29#K...

[4] http://www.monetdb.org/Home
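A toy illustration of the column-store point above, not how Redshift, Kdb, or MonetDB are actually implemented: the same records are held in a row layout and a column layout, and an aggregate over one field only has to touch that field's column. The data and function names are invented for the example.

```python
# Toy data: the same three-field records in two layouts.
rows = [
    {"user_id": 1, "country": "US", "amount": 10.0},
    {"user_id": 2, "country": "JP", "amount": 25.5},
    {"user_id": 3, "country": "US", "amount": 4.5},
]

# Column layout: one contiguous list per field.
columns = {name: [record[name] for record in rows] for name in rows[0]}


def sum_amount_row_store(records):
    """Row layout: every whole record is visited just to read one field."""
    return sum(record["amount"] for record in records)


def sum_amount_column_store(cols):
    """Column layout: only the 'amount' column is read."""
    return sum(cols["amount"])


print(sum_amount_row_store(rows), sum_amount_column_store(columns))
```

Both functions return the same total; the difference is how much unrelated data has to be scanned, which is where the I/O savings for analytics queries would come from on real, disk-backed tables.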

AndyNemmity13y ago

SAP HANA has a column store, and a row store, and does OLAP (Analytics) and OLTP.

There is a lot of new DB tech; Redshift doesn't seem particularly competitive at the moment unless you only need to use it a portion of the time, where Amazon excels.

ryanpers13y ago

Given the legendary performance issues of Hadoop I am not really surprised.

Hadoop is heavily horizontally scalable, but that's about it.

z_13y ago· 2 in thread

Slide 2&6, one query every 30 minutes.

It turns out usage based billing can be cheaper if you don't use a resource.

danudey13y ago

I'm willing to bet that's a not-uncommon scenario for a lot of organizations, however. If you're doing continuous querying of large amounts of data, then it's probably worth building your own hadoop cluster (physically or via Amazon), but a lot of people are just going to accumulate data and then make queries against it. Lots of 'active users per day', 'traffic by hour', 'purchases by popularity', etc. only get run to create data for the CEO every morning, or by the marketing manager every afternoon, that sort of thing.

zwily13y ago

I don't understand that part of the slides. Redshift isn't billed per-query, it's billed by instance-hour.
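A rough sketch of the two billing shapes being argued about here. All prices and cluster sizes below are hypothetical placeholders, not real AWS rates, and the "cluster started per query" model is only one assumed way of running Hadoop on demand.

```python
def always_on_cost(nodes, price_per_node_hour, hours):
    """Instance-hour billing: cost depends on cluster size and uptime only."""
    return nodes * price_per_node_hour * hours


def per_run_cost(queries, nodes, price_per_node_hour, hours_per_run):
    """Transient cluster started for each query: cost grows with query count."""
    return queries * nodes * price_per_node_hour * hours_per_run


HOURS_PER_DAY = 24
for queries_per_day in (1, 48):  # 48 per day is one query every 30 minutes
    # Hypothetical numbers: 2 always-on nodes vs. a 10-node cluster per query.
    fixed = always_on_cost(nodes=2, price_per_node_hour=1.0, hours=HOURS_PER_DAY)
    variable = per_run_cost(queries_per_day, nodes=10,
                            price_per_node_hour=0.5, hours_per_run=1.0)
    print(queries_per_day, fixed, variable)
```

The always-on figure stays flat no matter how many queries are run, so which side looks cheaper depends heavily on the query rate chosen for the comparison.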

jeremyjh13y ago· 1 in thread

1.2 TB really is not very much data in the context of "Big Data". The supposed advantage of Hadoop is that it can scale horizontally with linear performance.

pjscott13y ago

They note on slide 9 that this is only biggish data -- but sometimes, that's what you need to work with.

ryanbrush13y ago· 1 in thread

I wish the post had gone into depth on _why_ Redshift was significantly faster, but I'm betting it uses in-memory joins (hence the size limitations it mentions), whereas Hive joins are just MapReduce jobs that keep only minimal subsets of data in memory at a given point. The upshot is that the Hive/MapReduce strategy isn't limited by physical memory.

Of course, if your data set can fit in memory, then Redshift or similar technologies are probably a better choice than Hive. But it's important to remember that the performance gains here come as the result of a tradeoff.

taligent13y ago

It was significantly faster because, as was mentioned above, the graph ignores the 17 HOURS it took for Redshift to import the data.

The comparison is a complete and utter joke.

pytrin13y ago· 1 in thread

Worth noting this presentation was made by Hapyrus, a Hadoop specialized startup from 500startups. They know quite a bit about running Hadoop. Following the results of their tests they are now adding Redshift support to their services.

Uchikoma13y ago

... and want to sell their Redshift services starting with a bang.

iblaine13y ago

There's a motive here. hapyrus.com is pushing themselves as Redshift consultants. Oh you're using Hadoop? Redshift is better, cheaper and you can pay us to help you use it.

tonfa13y ago

I wonder how it compares to BigQuery: https://developers.google.com/bigquery/docs/pricing

ameyamk13y ago

They should compare Redshift with Hadoop + Impala, or HBase with Phoenix from Salesforce. Comparing with Hadoop + Hive is not a correct comparison.

Radim13y ago

Indeed, this comparison seems fishy.

Nevertheless, I'll take a moment to predict that articles like this will only become more frequent over time. Hadoop has entered its "enterprisey" stage, with massively complex, cumbersome code, arcane performance tuning, bullshit consulting business built around it (complete with books and "certificates")...

The more agile competitors will be snapping at its flanks (and ankles), sometimes without merit, and sometimes with.

AndyNemmity13y ago

I'd be interested in seeing a comparison between Redshift and SAP HANA, but a fairer comparison than this one, by someone who isn't partisan.

fujibeeOP13y ago

We wrote the blog post about this benchmark. http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-...

kushti13y ago

Seems like stupid marketing shit

hobbyist13y ago

Are they benchmarking hash join on hadoop and redshift?

cmccabe13y ago

Anyone interested in SQL queries on Hadoop should be checking out Cloudera Impala. It's open source.
