takes 10 minutes on PostgreSQL to return the result 21454134?
With H2, an open source embedded SQL database I use daily, a "select count( * ) from table_name" query returns the result instantly. I assumed therefore that this was the norm...
I may try to edit this post to compare those operations, as it may be more meaningful.
Thanks for posting the link. TIL.
PostgreSQL is optimized for blazing fast record mutations, inserts, while maintaining adequate query response times on medium to large size data. Redshift (ParAccel), or your other analytical databases like Greenplum, Teradata, or Netezza make optimizations that make sense for queries that pertain to the majority of a tables data (full table scans).
For example, Redshift stores stores a tables columns in separate locations, allowing you to not only skip reading columns which don't pertain to the query, but also allows for easier disk parallelization. Keeping your columns separate from each other slows down things like record reconstruction, since you're performing n disk seeks, where n = # of columns...this is bad for databases where you need a single record. Databases like Redshift are meant to compliment your MySQL and PostgreSQL databases, not replace them.
Shameless plug (I work at Cloudera): For reasons unknown, the article dismissed Hadoop without listing any reasons. If you're interested in having a secondary system designed for doing ad-hoc queries over your large datasets (billions of rows), I suggest trying out Impala. You can run it across a few servers you have sitting around... http://rideimpala.com/
Sure, they are. The point is that the API is similar, so developers already familiar with SQL, or libraries that speak SQL (SQLAlchemy in our case) can speak to Redshift. I attempted to make that point in the post, hopefully I can make that more clear.
> For reasons unknown, the article dismissed Hadoop without listing any reasons.
I didn't feel that it was the place for it, but the basics are that we didn't have the skills in place to restructure our system to use Hadoop. Second, our data structures are constantly changing, which Hadoop didn't accommodate very well. Third, we were not able to accommodate ad-hoc queries as every reduce operation needed to be figured out prior to running. Last, we were looking for something to handle "large" data sets, but not "big data." Hadoop works great for "big data" (petabytes) but it was overkill for us (working with terabytes).
I'm sure all of these problems could have been resolved, but the point of my write-up is not to kick dirt at Hadoop so much as to explain how Redshift fit our needs. Other companies, I'm sure, have different needs and should of course make their own evaluations.
> Hadoop works great for "big data" (petabytes) but it was overkill for us (working with terabytes).
This is a very common misunderstanding about Hadoop, terabytes of data is still very large when you want to apply complex transformations, or allow someone to run queries over it with low latency expectations (a few seconds).
Impala does not use MapReduce, it was designed with low latency goals in mind...speeds are comparable to RedShift (https://amplab.cs.berkeley.edu/benchmark/), although they are citing an older version.
"5 seconds! That’s an improvement. Note that I didn’t make any adjustements to the data: no indexes, no differences in table structure." should be 1.5s.
I'm curious. How does something smell like Postgres? Does it use some of the same PG-specific datatypes?
http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and...
Other than the marketing buzz from SAP, I have not seen any reputable (neutral) comparison between Hana and Redshift that includes price. I also have never met anyone that uses/run Hana that was not already an SAP ERP client, albeit the fact that SAP keeps saying Hana is the fastest-growing product in the company's history.
Besides all the marketing buzz SAP has been throwing it out there, how many companies actually use it with a decent amount of data (at least 10TB+ range)?