But what useful conclusions can be drawn from it?
A few useful conclusions:
- Just because a relatively well-optimized PostgreSQL database on a regular workstation takes 5 minutes to run a query doesn't mean you can't get special hardware to run that query faster than you can type.
- Spark + S3 + Amazon Elastic MapReduce looks like an ideal stack for working with large data, but it's pretty slow compared to better tools, and even compared to plain PostgreSQL.
- HDFS really is a lot faster than S3.
- Performance of a Xeon Phi 64-core CPU is within an order of magnitude of an NVIDIA Titan X.
- Loading 104 GB of compressed data into Q/kdb+ expands it to 125 GB and takes about 30 minutes, but on Redshift it expands to 2 TB and takes many hours to upload on a normal connection, plus 4 hours to actually import!
- It might cost $5000 to custom-build a GPU-based supercomputer that can do these queries in under a second, but you can run similar queries if you're willing to wait for 5 minutes each by spinning up instances for a few dollars an hour plus a few more dollars an hour for storage, or by just running PostgreSQL on your workstation.
Also, not a conclusion, but it's incredibly useful to have a simple example of exactly how to configure the tool and import some CSV data.
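The cost comparison in the list above invites a back-of-envelope sanity check. A minimal sketch, assuming an instance rate of $3/hour and storage at $2/hour (both made-up stand-ins for "a few dollars an hour" — not quoted prices):

```python
# Back-of-envelope break-even for the $5000 custom GPU box vs. renting.
# All rates below are assumptions, not measured cloud prices.
build_cost = 5000          # one-time cost of the custom-built GPU machine ($)
instance_rate = 3.0        # assumed instance cost ($/hour)
storage_rate = 2.0         # assumed storage cost ($/hour)
hourly = instance_rate + storage_rate

breakeven_hours = build_cost / hourly
print(f"Break-even after about {breakeven_hours:.0f} hours of rented time")
```

Under those assumptions you'd need on the order of a thousand hours of rented time before the custom build pays for itself, which is the thrust of the original point.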
> Just because a relatively well-optimized PostgreSQL database on a regular workstation takes 5 minutes to run a query doesn't mean you can't get special hardware to run that query faster than you can type.
Already well established for years with systems like Redis, and more recently with GPU databases and other techniques posted on HN regularly.
> Spark + S3 + Amazon Elastic Map Reduce... is pretty slow compared to better tools, and even compared to plain PostgreSQL.
Not valid, because it doesn't generalize. It depends so much on the type of work being done, the system architecture, etc., that you can only say it may or may not be true.
> HDFS really is a lot faster than S3.
This is already well established; Amazon states as much right in the docs: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-pl...
> Performance of a Xeon Phi 64-core CPU is within an order of magnitude of an NVIDIA Titan X.
Not precise enough to matter, because being within a 10x difference is not close to being competitive.
> Loading 104 GB of compressed data into Q/kdb+ expands it to 125 GB and takes about 30 minutes, but on Redshift it expands to 2 TB and takes many hours to upload on a normal connection, plus 4 hours to actually import!
I don't see how it's possible for 104 GB of compressed CSV text data to decompress into only 125 GB. For CSV to compress by only ~20% doesn't make sense.
> It might cost $5000 to custom-build a GPU-based supercomputer that can do these queries in under a second
No, there are two problems here. The hardware in question could have used one cheap CPU instead of two expensive Xeons and been much less expensive. The bigger problem: the MapD software itself will be $50,000.
> I don't see how it's possible for 104 GB of CSV text data to decompress into only 125 GB. For CSV to compress only ~20% doesn't make sense.
The CSV file itself is around 500 GB. The internal representation, which might use binary formats for numbers or compress text, uses 125 GB. Redshift expands it to 2 TB for all the indexing and mapping.
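The confusion resolves once the ratios are taken against the right baseline. A quick sketch, assuming the sizes quoted in this thread (500 GB raw CSV, 104 GB compressed download, 125 GB in kdb+, 2 TB in Redshift):

```python
# Size arithmetic from the thread (GB figures as quoted by the commenters).
raw_csv = 500        # approximate raw CSV size
compressed = 104     # compressed download
kdb = 125            # Q/kdb+ internal representation
redshift = 2000      # Redshift on-disk size (2 TB)

print(f"compression of raw CSV:        {raw_csv / compressed:.1f}x")
print(f"kdb+ size vs compressed input: {kdb / compressed:.2f}x")
print(f"Redshift blow-up vs raw CSV:   {redshift / raw_csv:.1f}x")
```

So the CSV compresses by roughly 4.8x, as expected; kdb+'s 125 GB is only ~20% larger than the *compressed* input, not the raw text, which is what made the original bullet look impossible.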
> Bigger problem: The MapD software itself will be $50,000.
Ouch. That's a rather large oversight. Is the author affiliated with MapD, perhaps?
This is actually my biggest complaint with the article. He used cstore_fdw with Postgres, which doesn't allow much real indexing, and as far as I can tell (knowing only a little bit about it) he didn't really use any of the benefits of cstore_fdw.
I'd be interested to see how plain Postgres, possibly on a compressed filesystem, with properly-indexed tables stacks up.