Amazon Redshift - What You Need To Know (opens in new tab)

(dailytechnology.net)

62 pointskungfooey12y ago19 comments

19 comments

18 comments · 5 top-level

stevoski12y ago· 5 in thread

select count( * ) from dummy_table;

takes 10 minutes on PostgreSQL to return the result 21454134?

With H2, an open source embedded SQL database I use daily, a "select count( * ) from table_name" query returns the result instantly. I assumed therefore that this was the norm...

saurik12y ago

As mentioned by sargun (who was irritatingly downvoted by someone) the difference is that PostgreSQL uses the MVCC concurrency model, which allows you to do quite complex queries concurrently with other people making writes to the database, but comes at the cost that there is no longer any objective count of the rows in a given table: the number of rows is related to which transactions are considered live at any given moment.

http://wiki.postgresql.org/wiki/Slow_Counting

kungfooeyOP12y ago

In my case, this was inadvertently a good comparison. Most of the "ad-hoc" queries we're doing are AVG, SUM, or similar operations that require scanning the entire table.

I may try to edit this post to compare those operations, as it may be more meaningful.

Thanks for posting the link. TIL.

scragg12y ago

I thought this query was instant in most dbs because it reads an internal row count of the table since this particular query has no conditions or joins.

sargun12y ago

Does H2 do MVCC?

ogrisel12y ago

AFAIK not by default. Furthermore the MVCC mode is still considered experimental: http://www.h2database.com/html/advanced.html#mvcc

monstrado12y ago· 2 in thread

These are very different databases, PostgreSQL is a transactional database, while Redshift (aka: ParAccel) is an analytical database. Each of these databases have implemented much different design decisions, which improve queries on certain type of workloads.

PostgreSQL is optimized for blazing fast record mutations, inserts, while maintaining adequate query response times on medium to large size data. Redshift (ParAccel), or your other analytical databases like Greenplum, Teradata, or Netezza make optimizations that make sense for queries that pertain to the majority of a tables data (full table scans).

For example, Redshift stores stores a tables columns in separate locations, allowing you to not only skip reading columns which don't pertain to the query, but also allows for easier disk parallelization. Keeping your columns separate from each other slows down things like record reconstruction, since you're performing n disk seeks, where n = # of columns...this is bad for databases where you need a single record. Databases like Redshift are meant to compliment your MySQL and PostgreSQL databases, not replace them.

Shameless plug (I work at Cloudera): For reasons unknown, the article dismissed Hadoop without listing any reasons. If you're interested in having a secondary system designed for doing ad-hoc queries over your large datasets (billions of rows), I suggest trying out Impala. You can run it across a few servers you have sitting around... http://rideimpala.com/

kungfooeyOP12y ago

> These are very different databases..

Sure, they are. The point is that the API is similar, so developers already familiar with SQL, or libraries that speak SQL (SQLAlchemy in our case) can speak to Redshift. I attempted to make that point in the post, hopefully I can make that more clear.

> For reasons unknown, the article dismissed Hadoop without listing any reasons.

I didn't feel that it was the place for it, but the basics are that we didn't have the skills in place to restructure our system to use Hadoop. Second, our data structures are constantly changing, which Hadoop didn't accommodate very well. Third, we were not able to accommodate ad-hoc queries as every reduce operation needed to be figured out prior to running. Last, we were looking for something to handle "large" data sets, but not "big data." Hadoop works great for "big data" (petabytes) but it was overkill for us (working with terabytes).

I'm sure all of these problems could have been resolved, but the point of my write-up is not to kick dirt at Hadoop so much as to explain how Redshift fit our needs. Other companies, I'm sure, have different needs and should of course make their own evaluations.

monstrado12y ago

Thanks for the clarifications, I wasn't not meaning to discredit your article in anyway. I am just trying to help people understand that these databases are very different from eachother, and were created to solve different use cases.

> Hadoop works great for "big data" (petabytes) but it was overkill for us (working with terabytes).

This is a very common misunderstanding about Hadoop, terabytes of data is still very large when you want to apply complex transformations, or allow someone to run queries over it with low latency expectations (a few seconds).

Impala does not use MapReduce, it was designed with low latency goals in mind...speeds are comparable to RedShift (https://amplab.cs.berkeley.edu/benchmark/), although they are citing an older version.

1 more reply

nemothekid12y ago· 2 in thread

In the COUNT(*) example, looks like you annotated the wrong snippet.

"5 seconds! That’s an improvement. Note that I didn’t make any adjustements to the data: no indexes, no differences in table structure." should be 1.5s.

jmduke12y ago

Looks like he typo'd and put a space in between the "1." and the "5", which got interpreted as an ordered list.

kungfooeyOP12y ago

Whoops, thanks for catching that. I didn't actually have a space there, but it did get interpreted that way.

superails12y ago· 2 in thread

> Redshift smelled enough like PostgreSQL

I'm curious. How does something smell like Postgres? Does it use some of the same PG-specific datatypes?

kungfooeyOP12y ago

It uses similar data types and has a similar API. The differences between the two are summarized in the documentation.

http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and...

tmarthal12y ago

Not only does it use the same datatypes, but I know that it uses the same pg wire format such that the java (and I'm sure python) postgres drivers work out of the box.

ceyhunkazel12y ago· 2 in thread

I think beside familiar PostgreSQL enviroment there is not much advantage over SAP HANA. I listed advantages of HANA on my posting https://news.ycombinator.com/item?id=6466222

mtam12y ago

Redshift is a SQL relational database for large structured data volumes, whereas Amazon EMR (Elastic MapReduce) is for non-structured data. The Redshift FAQ has more details on this topic: http://aws.amazon.com/redshift/faqs/#0120

Other than the marketing buzz from SAP, I have not seen any reputable (neutral) comparison between Hana and Redshift that includes price. I also have never met anyone that uses/run Hana that was not already an SAP ERP client, albeit the fact that SAP keeps saying Hana is the fastest-growing product in the company's history.

tonywe6412y ago

I don't think it makes sense to compare Redshift to HANA. There is no way that HANA could scale the way ParAccel/Vertica scales within the same price range.

Besides all the marketing buzz SAP has been throwing it out there, how many companies actually use it with a decent amount of data (at least 10TB+ range)?

j / k navigate · click thread line to collapse

19 comments

18 comments · 5 top-level

stevoski12y ago· 5 in thread

select count( * ) from dummy_table;

takes 10 minutes on PostgreSQL to return the result 21454134?

With H2, an open source embedded SQL database I use daily, a "select count( * ) from table_name" query returns the result instantly. I assumed therefore that this was the norm...

saurik12y ago

http://wiki.postgresql.org/wiki/Slow_Counting

kungfooeyOP12y ago

In my case, this was inadvertently a good comparison. Most of the "ad-hoc" queries we're doing are AVG, SUM, or similar operations that require scanning the entire table.

I may try to edit this post to compare those operations, as it may be more meaningful.

Thanks for posting the link. TIL.

scragg12y ago

I thought this query was instant in most dbs because it reads an internal row count of the table since this particular query has no conditions or joins.

sargun12y ago

Does H2 do MVCC?

ogrisel12y ago

AFAIK not by default. Furthermore the MVCC mode is still considered experimental: http://www.h2database.com/html/advanced.html#mvcc

monstrado12y ago· 2 in thread

kungfooeyOP12y ago

> These are very different databases..

> For reasons unknown, the article dismissed Hadoop without listing any reasons.

monstrado12y ago

> Hadoop works great for "big data" (petabytes) but it was overkill for us (working with terabytes).

1 more reply

nemothekid12y ago· 2 in thread

In the COUNT(*) example, looks like you annotated the wrong snippet.

"5 seconds! That’s an improvement. Note that I didn’t make any adjustements to the data: no indexes, no differences in table structure." should be 1.5s.

jmduke12y ago

Looks like he typo'd and put a space in between the "1." and the "5", which got interpreted as an ordered list.

kungfooeyOP12y ago

Whoops, thanks for catching that. I didn't actually have a space there, but it did get interpreted that way.

superails12y ago· 2 in thread

> Redshift smelled enough like PostgreSQL

I'm curious. How does something smell like Postgres? Does it use some of the same PG-specific datatypes?

kungfooeyOP12y ago

It uses similar data types and has a similar API. The differences between the two are summarized in the documentation.

http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and...

tmarthal12y ago

Not only does it use the same datatypes, but I know that it uses the same pg wire format such that the java (and I'm sure python) postgres drivers work out of the box.

ceyhunkazel12y ago· 2 in thread

I think beside familiar PostgreSQL enviroment there is not much advantage over SAP HANA. I listed advantages of HANA on my posting https://news.ycombinator.com/item?id=6466222

mtam12y ago

tonywe6412y ago

I don't think it makes sense to compare Redshift to HANA. There is no way that HANA could scale the way ParAccel/Vertica scales within the same price range.

Besides all the marketing buzz SAP has been throwing it out there, how many companies actually use it with a decent amount of data (at least 10TB+ range)?

j / k navigate · click thread line to collapse