That said, there are a few areas where your system glosses over the needs of a data pipeline:
* "immutable by convention" is not a data preservation strategy; the system should enforce immutability
* what about deserialization? it's not enough to store and move bits. there are plenty of "serdes" headaches: pickling (yes, pickle is a horrible format) across Python 2 and Python 3 is one example, and that's before you get to performance. my point is not that scripts can't do serdes, but that serdes information should travel with the data, so that it's (mostly) transparent to the consumer.
* multiple writers (e.g. suppose you are generating training data in a distributed manner) require write atomicity at the bucket level, i.e. across many objects at once; S3 only gives you atomic writes per object
* deduplication of data fragments - I can see how one might do this with a "scripts over S3" strategy, but it's complicated enough that it's far easier to rely on a third-party app that just works in this regard
* fine-grained permissions - what if each data package has a different audience? sure, you can roll this with S3, but is that the best use of developer time?
* change history and access auditing
* querying and filtering - in many cases there is an enormous data corpus which needs to be sliced a different way by each user, e.g. Google Open Images. it is much more robust to have a single query mechanism that understands data layout than to write a fresh script for each slice.
* indexing data so they are searchable, etc.
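
To make the serdes point concrete, here is a minimal sketch of what "serdes information travels with the data" could look like. It is not how Quilt or any particular tool does it; the `pack`/`unpack` helpers, the `CODECS` registry, and the one-line JSON header are all made up for illustration, and it only uses the Python standard library.

```python
import csv
import io
import json


def _dump_csv(rows):
    # Serialize a list of rows (lists of strings) to UTF-8 CSV bytes.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().encode("utf-8")


def _load_csv(raw):
    # Parse UTF-8 CSV bytes back into a list of rows.
    return [row for row in csv.reader(io.StringIO(raw.decode("utf-8")))]


# Hypothetical registry: format name -> (serializer, deserializer).
CODECS = {
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda raw: json.loads(raw.decode("utf-8")),
    ),
    "csv": (_dump_csv, _load_csv),
}


def pack(obj, fmt):
    """Serialize obj and prepend a one-line JSON header naming the format."""
    dump, _ = CODECS[fmt]
    header = json.dumps({"format": fmt, "version": 1}).encode("utf-8")
    return header + b"\n" + dump(obj)


def unpack(blob):
    """Read the header, pick the matching deserializer, return the object."""
    header, _, payload = blob.partition(b"\n")
    fmt = json.loads(header.decode("utf-8"))["format"]
    _, load = CODECS[fmt]
    return load(payload)


if __name__ == "__main__":
    blob = pack([["id", "label"], ["1", "cat"]], "csv")
    # The consumer never says "csv"; the blob carries that information.
    print(unpack(blob))
```

The consumer only ever calls `unpack`, so switching a package from CSV to some other format is a producer-side change. A real system would also want schema and version handling, which this sketch skips.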
PS - I am a contributor to Quilt.