Show HN: s3-lambda – Lambda functions over S3 objects: each, map, reduce, filter (opens in new tab)

(github.com)

177 pointswellsjohnston9y ago74 comments

74 comments

51 comments · 12 top-level

hoodoof9y ago· 14 in thread

Its weird how S3 seems to be the unwanted stepchild of AWS.

So many obvious innovations just aren't turning up.

For example, strangely, AWS introduced tagging for S3 resources, but you can't search/filter by tag, nor is the tag even returned when you get a list of objects, you can only get the tag with an object request. The word "pointless" springs to mind.

In fact it's strange that there is NO useful filtering at all apart from the very useful folder/hierarchy/prefix filtering. But apart from that you can't do wildcard searches or filters or date filters or tag filters.

I'm building an application right now that needs to get a list of all the jpg files - the only way to do that is get every single object in the bucket and manually filter out the unwanted ones - feels like its 1988 again.

It seems like it would also be valuable for there to be alternate interfaces to S3 such as the ability to send data via ftp or SMTP or sftp or whatever, but there are no such interfaces.

Hopefully Google will goad AWS into action on S3 innovation by implementing such features.

lobster_johnson9y ago

S3's API is so rudimentary that I prefer to think of it as a non-enumerable key/value store.

I learned this the hard way: We had an application where made the mistake of storing about a billion files in a nearly flat structure — one level of nesting, probably 100m "folders" in the root. Then one day we needed to go through it to prune stuff that was no longer in use. Unfortunately, if you don't have a "shardable" prefix, list requests are impossible to parallelize efficiently (because you can't subdivide the work), and our scripts took weeks to run to completion. Hard-earned experience: If you're storing large quantities of stuff in S3, always pick a shardable prefix. The upload date is a good choice. A random string will also do.

After this, my solution for any non-trivially-sized storage use case is to store an inventory of objects separately in a performant PostgreSQL database, and make sure all writes go through a service layer that shields the consumer from the details of S3. This has some benefits over a hypothetical centralized approach (but some downsides, like the possibility that things get out of sync if you sidestep the inventory). Overall, I wish S3 would store its metadata in something like BigQuery.

Anyone know if Google Cloud Platform's S3 equivalent, Cloud Storage, improves on these issues?

lobster_johnson9y ago

Replying to myself: Disappointingly, it seems GCP's Cloud Storage is pretty much a carbon clone of S3 as far as the API is concerned, down to the prefix/delimiter-based search.

jessaustin9y ago

I wonder if "bucket notifications" are reliable enough that one could keep such an index DB populated automatically?

2 more replies

otterley9y ago

Have you looked into the inventory functionality? It was just added last November. http://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inven...

1 more reply

philliphaydon9y ago

I did something similar storing the information in PostgreSQL but made the inserts/updates/deletes based on the events of s3. If an object was stored it would insert into the database. If it was deleted it would soft delete in the database. Worked out well for me.

afandian9y ago

As someone heading down a similar path (and I'm fairly sure I've got sensible prefixes) can you share an example of a prefix that caused you trouble. Is it something like

    /path/to/big-dir/«lots-of-sequential-filenames»

?

2 more replies

kot-behemoth9y ago

While great points, I think it might then go beyond the "Simple" in the S3 name itself. Wasn't the original purpose of the service to have it as a dumb storage, and you'll layer metadata as required? I.e. storing indices separately with whatever functionality is needed (be it date/path filtering).

hoodoof9y ago

Perhaps true.

I'll never do that though because I'd have to use DynamoDB, which is a technology that is high on my list of "technologies that I am least enthused about".

Also, I really shouldn't have to go to all the work of creating and maintaining a metadata database and implementing a query API just because I want to do searches more powerful than "list all objects" - that's Amazon's job.

1 more reply

edblarney9y ago

Surely some extra functionality would not obfuscate the inherent simplicity of S3.

A an S3Query module would not, I think, make things harder for S3 users.

And frankly - it would be awesome.

I used s3 a lot, and loathe to switch to a DB if I can avoid it.

Some querying and indexing features I think would be taken up by a large number of devs.

xtracto9y ago

This, we love S3. What we did is add a SQL tier for some of the data we are storing there in case we want to do some more structured operations.

1 more reply

illumin89y ago

First, I wanted to say, you bring up some very good points. S3 wasn't really designed to be a searchable key/value store, as you have to pay for lookups, and pagination kills your ability to effectively search anything greater than a few thousand objects in a hierarchy, within a reasonable amount of time.

There are, however, ways to solve this: you could fire a Lambda function whenever an object is put into your S3 bucket that simply adds a single row to a DynamoDB table with the object name, along with any additional metadata you might like to capture to assure data provenance. Then, to search, you can simply query the DynamoDB table.

As always, there are many basic building blocks at AWS, but you have to connect them together (like legos) before they become useful for most applications.

hoodoof9y ago

As mentioned elsewhere in this thread, an external metadata database of S3 object immediately introduces synching and validity issues.

DOS is smarter than S3.