The JSON support (SUPER type) is kind of cool, and they are moving towards more “automatic” sorting + partitioning, but it’s just all a bit shit to be honest.
We encountered major bugs with data sharing; our clusters keep insisting that zstd is the best compression format for all our data (but then never actually use it); materialised views often fail to update, and understanding why is a nightmare; terrible performance if your strings are varchar(max) (guess what Glue sets them to…); Redshift often just dies (4-hour downtime recently, no status page) and has some really weird semantics around listing queries; before the Data API you couldn't run async queries, and its EventBridge integration straight up doesn't work; nightmare bugs in the Java connection library that don't show up using psql; a tiny set of types (no arrays, no UUIDs); unkillable queries; AQUA actually causing everything to slow down hugely; critical release notes posted only in a fucking random forum; etc. etc.
Snowflake has apparently sorted this, as well as including ingestion tools (snowpipe) that you’d otherwise have to stitch together with AWS Glue or something (a cursed service if ever there was one).
That being said, in some cases Redshift absolutely flies. But the real world isn’t filled with ideal schemas and natural sort keys. It’s messy. And Snowflake deals with messy better.
Snowflake gives you visibility into clustering [1], and in the query profile view you can see how pruning is working (or not working).
Can you give an example of what visibility you would like to see in terms of partitioning?
1. https://docs.snowflake.com/en/sql-reference/functions/system...
Edit for future readers: the original comment was "I haven’t really been impressed with Redshift, it seems like too little and too late".
basically distributing compute down to the actual storage nodes
What they emit is the dance routine of a sugar-coated cheerleader squad.
I am of the view that everything which is not a strength is obfuscated.
I have zero faith, confidence, or trust in any information AWS emits.
I approach press releases and the docs on the basis that they cover up the actual implementation, and so my task is to find out what is actually going on under the hood, so I can actually make sense of what's been provided and operate it correctly (or avoid it completely, as it may be!)
I believe the real advantage AWS has here is in cost. Snowflake has positioned itself as price-competitive with Redshift, but this is primarily due to Snowflake's ability to scale on demand, whereas prior Redshift versions required you to size for peak usage (RA3 helped with this). In my experience Snowflake is an order of magnitude more expensive if you compare similar workloads and do not account for idle time. We will need to see the performance of a "Redshift Processing Unit" to be sure of the advantage, but even so AWS will be able to provide significant downward cost pressure through this offering.
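To make the idle-time point concrete, here's a back-of-envelope sketch. All the prices and sizes below are assumptions for illustration, not current list prices for either service: a fixed-size Redshift cluster bills around the clock, while a Snowflake warehouse that auto-suspends only bills for busy hours. Run the same comparison with a 24/7 workload and the picture flips.

```python
# Hypothetical rates -- NOT real list prices; adjust to your own quotes.
hours_per_month = 730

redshift_node_hourly = 3.26   # assumed per-node on-demand rate
redshift_nodes = 4
# A provisioned cluster sized for peak bills every hour of the month.
redshift_monthly = redshift_node_hourly * redshift_nodes * hours_per_month

snowflake_credit_price = 3.00  # assumed USD per credit
credits_per_hour = 16          # assumed warehouse consumption rate
busy_hours = 80                # warehouse auto-suspends when idle

# Billed only while queries actually run.
snowflake_monthly = snowflake_credit_price * credits_per_hour * busy_hours
# Same warehouse running flat out, 24/7 -- no idle time to save on.
snowflake_flat_out = snowflake_credit_price * credits_per_hour * hours_per_month

print(f"Redshift (always on):   ${redshift_monthly:,.0f}/mo")
print(f"Snowflake (busy only):  ${snowflake_monthly:,.0f}/mo")
print(f"Snowflake (24/7 load):  ${snowflake_flat_out:,.0f}/mo")
```

With these made-up numbers the mostly idle Snowflake warehouse is far cheaper, while the flat-out one costs several times the fixed cluster — which is the "don't account for idle time" caveat in a nutshell.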
Cost is also why I'm most bullish about Databricks's FOSS https://delta.io
My experience with support/account managers is that they always tell you "yes, Redshift can do this", and the only way to actually get a "no" out of them is to already know Redshift cannot do something, and to explain to them why.
They won't deny reality, but you would never have got that answer from them in any other way.
I suspect the problem is the training AWS gives its staff. The material they are taught is relentlessly positive, and I suspect AWS staff actually have no idea what Redshift is no good for.
(Indeed, if you read the official docs for RS, which I strongly advise you never to do, you will come out the other end under the impression there is literally nothing Redshift cannot do; the docs describe everything using positive terms only.)
The advantage is the flexibility to easily change compute resource. A disadvantage is that your data is now in S3 or something very like it, and I think this alters the write-performance characteristics of the cluster; I've not yet looked into this, but it's on the list.
You absolutely should beware of falling into the trap of imagining that serverless simply gives you flexible compute and that's the only change to behaviour.
AWS in their press releases and docs are relentlessly positive - anything which is not a strength is obfuscated - so only actual experimentation and investigation throws light on what you're really getting.
Technically, Athena is based on modified Presto while Redshift is (very) heavily modified Postgres.
Athena = Lambda + S3 (what I would call true serverless)
Redshift Serverless = Auto AWS Managed EC2 instances with local storage + S3
Although I could be wrong as I just had a quick 5 minute look at it...
From their FAQ [1]:
Q: What is the difference between Amazon Athena, Amazon EMR, and Amazon Redshift?
[...] Amazon Redshift provides the fastest query performance for enterprise reporting and business intelligence workloads, particularly those involving extremely complex SQL with multiple joins and sub-queries.
[...] Amazon Athena provides the easiest way to run ad-hoc queries for data in S3 without the need to setup or manage any servers.
Elsewhere they also recommend Athena for unstructured data. This and a relational database like Postgres are fundamentally different, and you shouldn't use one when your use case is primarily meant for the other.

Having used both, I do think BigQuery is better in a lot of ways (although it's easier to make it a lot more expensive too), but I'm really excited to see Redshift catch up. Adding the serverless option is really great too, since my biggest complaint with Redshift was managing the quantity and type of the underlying instances.
See more details here:
BigQuery is a full database. It is significantly faster than running anything from Athena. The closest comparison on AWS is Redshift.
Getting data into Athena isn't something that is just done for you. Athena just takes what you've put on S3 and queries over it - and leaves getting it onto S3 (and into an efficient format) as an exercise for the reader.
Athena's speed varies a lot depending on what format you put things in. Querying over CSVs will mean that you're slow and reading a lot of data. Querying over ORC (column-store) files is pretty quick.
The big thing is Athena's pricing. They price it on how much data you actually read, not how much data would be read if things weren't optimized. BigQuery charges you based on how much data would be read if it weren't optimized. With BigQuery, an integer is always 8 bytes. It doesn't matter if they're able to optimize it down to nothing using RLE (run-length encoding): you still pay the full 8 bytes. If your ORC files make that integer column tiny, you get the benefit of that.
BigQuery is great, but Athena's pricing is a lot cheaper given that you get to benefit from any storage optimization you do.
Out of curiosity, how have you used Athena that you're seeing it be so much slower? In my experience, BigQuery is faster (maybe 2x faster), but I've been using column-oriented data with Athena. If you're using CSVs with Athena, it will be way slower than BigQuery.
I'm always a little surprised that AWS doesn't build Athena out more, but I guess if they did they'd want money and margin for the value add. Still, Athena is a pretty decent serverless Presto and Presto can work pretty well over data in column formats.
There are too many simple problems which should have been caught in testing, and the problems which have over time been found absolutely imply unprofessional, even amateurish, software development standards.
For example, recently, the format of the version string was changed. This broke a lot of existing software, which had hard coded parsing - SQL Alchemy stopped working - so did AWS's own JDBC driver.
This on the face of it indicates the RS test suite does not include any connections over JDBC.
It then turned out the version string had in any case been inaccurate for months, because RS had moved from GCC 3.4.2 (I think it was) to 7.3, but the version string kept reporting the old numbers.
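The fragility here is hard-coded parsing of a free-form version string. A sketch of more tolerant parsing — the example string below is hypothetical (Redshift reports a Postgres-style banner, but the exact contents vary, which is exactly why the format change broke clients):

```python
import re

# Hypothetical version() output in the Postgres-style banner format.
version_string = (
    "PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) "
    "3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.28965"
)

def parse_versions(s):
    """Pull out whichever version numbers are present, instead of
    assuming fixed positions in the string."""
    pg = re.search(r"PostgreSQL (\d+)\.(\d+)(?:\.(\d+))?", s)
    rs = re.search(r"Redshift (\d+)\.(\d+)\.(\d+)", s)
    return (
        tuple(int(g) for g in pg.groups() if g is not None) if pg else None,
        tuple(int(g) for g in rs.groups() if g is not None) if rs else None,
    )

pg_ver, rs_ver = parse_versions(version_string)
print(pg_ver, rs_ver)  # (8, 0, 2) (1, 0, 28965)
```

Searching for labelled patterns and degrading to `None` when a component is absent survives reordering and extra text, where positional string-splitting does not.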
I can't even begin to describe how many issues there are in the official docs - flat factual errors, and profound gobbledygook.
The whole thing just feels too much like amateur hour.
Load my data where? This is "serverless".
More seriously, "serverless" usually just means you aren't supposed to worry about server/cluster management, not that there are no servers anywhere. So it really means "load your data to Redshift, wherever that lives".