Amazon Athena: Query S3 Using SQL

(aws.amazon.com)

176 points | polmolea | 9y ago | 44 comments

danso 9y ago

This looks very neat. I'm someone who deals with a lot of plaintext data from a variety of sources, and so I find using ack/grep and csvkit to be efficient enough for my purposes of exploration. I love using SQL and SQLite but rarely do it for "fun" -- that is, I'll use it when I've committed to building a project, but not for exploration. This seems like it could lighten the friction quite a bit.

If anyone from AWS is here: how is this used internally at Amazon?

ktamura 9y ago

The real question to ask is, will Amazon contribute back to open source? Presto itself is plenty proven and scalable: after all, it was created at Facebook.

kermatt 9y ago

"Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, and Parquet."

I wonder if this is essentially a Presto SaaS product?

maslam 9y ago

Yes

spullara 9y ago

It looks really interesting but I'm surprised they launched it with the create table flow broken. The query you see here was generated by their wizard...

https://www.dropbox.com/s/s4cw5x7yyrdl3ch/Screenshot%202016-...

jakozaur 9y ago

Looks very similar to Google BigQuery.

Even the pricing is the same: $5 / TB of data scanned.

estefan 9y ago

When I tried it, it was slower than BigQuery. Plus you've got to mess about creating Hive schemas.

spullara 9y ago

I don't know why you are getting downvoted. For all those data formats you have to painstakingly make table schemas for them before you can query them. Not like Snowflake or BigQuery. One of the biggest strikes against Presto IMHO.

guywithabike 9y ago

TFA states: "Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, and Parquet."

jackmaney 9y ago

> Q: What data formats does Amazon Athena support?

> Amazon Athena supports a wide variety of data formats like CSV, TSV, JSON, or Textfiles and also supports open source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, and GZIP formats. By compressing, partitioning, and using columnar formats you can improve performance and reduce your costs.

https://aws.amazon.com/athena/faqs/

buremba 9y ago

I hope that Amazon contributes back to the Presto community.

nimrody 9y ago

Would be useful if Avro files were supported. This way the data can also be imported into Redshift if needed (Redshift does support Avro).

Other formats are schema-less (JSON, CSV, etc.) or not supported by Redshift (ORC, Parquet). Perhaps less efficient for some queries (Avro is not a columnar format) but still useful.

dhananjayc 9y ago

Is it possible to connect Athena to an existing Hive Metastore?

nodesocket 9y ago

Anybody have an example of storing NGINX access logs and using Athena to search them?

neximo64 9y ago

Any examples of queries and what this can do? S3 was file storage, as far as I thought?

raghavsethi 9y ago

Athena (Presto) supports standard ANSI SQL - you can query data that's stored in S3.

neximo64 9y ago

How does that work, though? Say my bucket has 10,000 JSON files in it and I want all of them where the name attribute is like '%john'. Is that possible?
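To make the question concrete, here is a minimal local sketch of the idea: each JSON document becomes a row in a table with a declared schema, and the `LIKE '%john'` filter runs over those rows. It uses Python's built-in `sqlite3` as a stand-in, not Athena itself; the table name `people`, the function name, and the one-object-per-document layout are all assumptions made for illustration. Note that SQLite's `LIKE` is case-insensitive for ASCII by default, which may differ from other engines.

```python
import json
import sqlite3


def names_ending_in_john(json_docs):
    """Return the 'name' values ending in 'john' from a list of JSON strings.

    Local stand-in only: sqlite3 plays the role of the query engine and
    json_docs plays the role of the files in the bucket.
    """
    conn = sqlite3.connect(":memory:")
    # The schema has to be declared before anything can be queried.
    conn.execute("CREATE TABLE people (name TEXT)")
    # Documents without a "name" attribute become NULL rows.
    rows = [(json.loads(doc).get("name"),) for doc in json_docs]
    conn.executemany("INSERT INTO people (name) VALUES (?)", rows)
    cursor = conn.execute(
        "SELECT name FROM people WHERE name LIKE '%john' ORDER BY name"
    )
    result = [row[0] for row in cursor]
    conn.close()
    return result
```

Rows whose name is missing are skipped, because a NULL never matches a `LIKE` pattern.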

cdevs 9y ago

$5 a terabyte jeebus...don't f that query up

bsg75 9y ago

Just like with BigQuery, a carefully thought out partitioning scheme is critical, or your queries need to be carefully locked down to prevent excessive table scanning. I burned through my BigQuery trial credit fast, by not using partitions during a quick-and-dirty test.
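A rough sketch of why partitioning matters when billing is per byte scanned. It assumes a Hive-style key layout (`dt=YYYY-MM-DD/` path segments) and models pruning as a simple match on the key path; the function name and the sample keys in the usage note are hypothetical, and a real engine would prune using table metadata rather than string matching.

```python
def bytes_scanned(objects, partition=None):
    """Sum the sizes of the objects a query would have to read.

    objects:   dict mapping an object key to its size in bytes
    partition: a Hive-style path segment such as "dt=2016-11-30";
               None means no partition filter, i.e. a full scan.
    """
    total = 0
    for key, size in objects.items():
        # Match the partition only as a whole path segment of the key.
        if partition is None or f"/{partition}/" in f"/{key}":
            total += size
    return total
```

With three objects of 400, 600 and 100 bytes, where only the last two sit under `dt=2016-11-30/`, an unfiltered query reads all 1100 bytes while a query restricted to that partition reads 700.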

nulagrithom 9y ago

Wondering if I could use this like SQLite for Lambdas. I'd like to build some serverless apps, but the commitment to a monthly fee from DynamoDB puts me off. Could I use Athena to drive down my cost to zero as long as the app is unused?

asteadman 9y ago

Note the 10MB minimum "charge" per query. For small datasets under 10MB, you'd only get up to 200 queries for the minimum billable $0.01. That would be a fairly small number of queries, so probably not that useful. Plus you'd have all kinds of issues regarding consistency if your data was dynamic (S3 is a blob store, not a database; normal S3 consistency guarantees still apply).
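The 200-query figure can be checked with a short calculation. This sketch uses the $5 per TB price and the 10MB per-query minimum mentioned in the thread, and assumes decimal units (1 TB = 10^12 bytes); with binary units the count comes out slightly higher. The constant and function names are made up for this example.

```python
PRICE_PER_TB_USD = 5.0            # price quoted in the thread
BYTES_PER_TB = 10**12             # assumption: decimal terabytes
MIN_BILLED_BYTES = 10 * 10**6     # assumption: 10MB minimum, decimal megabytes


def query_cost_usd(bytes_scanned):
    """Cost of one query, applying the per-query minimum."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return PRICE_PER_TB_USD * billed / BYTES_PER_TB


def queries_for_budget(budget_usd, bytes_scanned):
    """How many identical queries a budget covers."""
    return budget_usd / query_cost_usd(bytes_scanned)
```

Any query that scans 10MB or less is billed as 10MB, which is $0.00005, so $0.01 covers about 200 of them.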

I'm confused though. The monthly fees for DynamoDB only apply after you exceed the free tier, and for someone who is unable to commit to a monthly fee because they envision low usage, shouldn't the free tier be sufficient? (Honest question, I'm looking at using DynamoDB, but comments like this make me think I'm missing something.)

nulagrithom 9y ago

Only thing you're missing is that I didn't realize DynamoDB's free tier was a non-expiring free tier.

brilliantcode 9y ago

DynamoDB is like $5 or $10 a month? But I understand the need to keep it to a minimum.

Athena is really interesting and if it can be as it is advertised "Serverless SQL" then they've got a killer product in the pipes: A future where developers no longer need to spend time on scaling, configuring, maintaining, strategizing deployments but upload code and instantly begin reaping the benefits of serverless tech.

The only missing component that would be a killer feature is something that answers to Azure's Active Directory. It would be nice if we had serverless plug-and-play user authentication and access control that integrated with Lambda and Athena.

I'd imagine some sort of "RoR on Serverless" type of framework that will scaffold out CRUD, User Management & REST Api is going to be in the works as well.

The only potential downside I see at the moment for serverless is the uncertainty surrounding cold boots, which directly affect user experience. It's fine when you've got enough traffic to keep things in the "warm" state, but there should be no dead zone where a call to the API Gateway takes many seconds waiting for the Lambda function to fire.

sciurus 9y ago

Just because you can query it with SQL doesn't make it a relational database suitable for use for OLTP. Athena is built on Presto, so see https://prestodb.io/docs/current/overview/use-cases.html for an explanation.

asteadman 9y ago

Re: user auth. Isn't that what Cognito is supposed to be? I mean, I don't fully understand it, but I think so.

As for the cold boot issue, I thought the standing solution was to have a "fast-exit" ping-like code-path within the lambda. Query it on a regular basis (you can even do it with a lambda scheduled-event). That way your lambda should be kept warm.
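A minimal sketch of that "fast-exit" code path, written as a Python Lambda handler. The `warmup` key in the event and the `do_real_work` helper are hypothetical; the scheduled rule would have to be configured to send a payload containing that key.

```python
def do_real_work(event):
    """Hypothetical placeholder for the function's actual logic."""
    return {"handled": True, "input": event}


def handler(event, context):
    """Lambda-style entry point with a cheap keep-warm path."""
    # Assumption: the scheduled ping sends {"warmup": true} as its payload.
    if isinstance(event, dict) and event.get("warmup"):
        # Return immediately so the ping costs as little as possible.
        return {"warmed": True}
    return do_real_work(event)
```

The ping only exercises the early return, so it keeps the function loaded without running the real code path.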

balls187 9y ago

Tried it twice, and it crashed big time.

balls187 9y ago

Also gives me a 500 on US-WEST-2

asafm 9y ago

I wonder why they didn't choose Apache Drill over Presto. Does anyone know?

intrasight 9y ago

What does "point your data in S3" mean?

justinsaccount 9y ago

Are you talking about this? You left out a word.

> Simply point to your data in Amazon S3

intrasight 9y ago

Still makes no sense. Please explain if you understand.

mrwnmonm 9y ago

John Forstrom: Amazon Athena - welcome to 2010! https://twitter.com/jforstrom/status/804007642246938624
