Full disclosure: I run Paperspace (https://www.paperspace.com) and am working with the Quilt team to integrate their tools into our platform.
These are some other people working in roughly the same space: http://datproject.org/ http://www.pachyderm.io/
But it does seem like Quilt is the go-to if you are looking for a "GitHub for data" host.
- Girder: http://girder.readthedocs.io
- Intake: https://github.com/ContinuumIO/intake
I haven't used any of these, but I agree that the idea is quite compelling.
My team has found this drastically easier than Quilt, and we do a ton of stuff with reproducible environments in Docker, creating Makefiles to reproduce exact model training with the exact same data, etc. We probably hit just about every case there is (huge models, small models, models where we'd like to train separately or collectively on a bunch of different benchmark data sets, in-house data sets, models that need to be refreshed with new data in pipelines, etc.). So far, Quilt has not been competitive with a simple repo of shell scripts for us, in terms of ease of use or effectiveness in maintaining different packages of data.
The other super nice thing is that when people start out on new models or experiments, we already have our in-house maintained copies of a bunch of academic data sets, private data sets, etc., and you can throw together an incredibly simple Dockerfile or Makefile that uses the appropriate script. It's just one or two lines of shell code and voila, you have an environment with the dataset you want. Check that into git and now your experiment is immediately reproducible from day one. We've found this to dramatically increase the amount of code review that researchers engage in for checking their statistical methodology and sanity checking their intended models or experiments. With Quilt, you have the extra issue of versioning, whereas we harshly enforce that all data sets are immutable: even just adding one more training example to the data set means you must provide a new shell script that downloads the old data, injects your lone additional sample, and has a documentation entry about exactly what it is doing. Quilt also adds the overhead of using yet another tool instead of super standard shell scripts.
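To make the "immutable data set, one script per version" convention concrete, here is a minimal sketch of the check such a fetch step could run after downloading. It is written in Python rather than shell purely for illustration, and the names `sha256_of` and `verify_dataset` are hypothetical, not part of the setup described above. The assumption is that each data set version has a SHA-256 digest pinned next to its script in git.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the pinned digest.

    Assumption: expected_sha256 is checked into git alongside the fetch
    script, so a changed data set forces a new script and a new digest.
    """
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(
            f"{path} has sha256 {actual}, expected {expected_sha256}"
        )
```

A Makefile or Dockerfile step could call `verify_dataset` right after the download, so a silently modified archive fails the build instead of changing the experiment.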
For me, any of the tools that pop up attempting to be like conda-forge but for data packages is sort of like taking a Gatling gun to a problem that can be solved with a hammer.
That said, there are a few areas where your system glosses over the needs of a data pipeline:
* "immutable by convention" is not a data preservation strategy; the system should enforce immutability
* what about deserialization? it's not enough to store and move bits. there are so many examples of "serdes" headaches. pickling (yes, pickle is a horrible format) in python 2 vs python 3 is one example. not to mention performance. my point is not that scripts can't do serdes, but that serdes information should travel with the data, so it's (mostly) transparent to the consumer.
* multiple writers (e.g. suppose you are generating training data in a distributed manner) require write atomicity at the bucket level, which S3 doesn't provide
* deduplication of data fragments - I can see how one might do this with a "scripts over S3" strategy, but it's complicated enough that it's far easier to rely on a third-party app that just works in this regard
* fine-grained permissions - what if each data package has a different audience? sure, you can roll this with S3, but is that the best use of developer time?
* change history and access auditing
* querying and filtering - in many cases there is an enormous data corpus which needs to be sliced a different way by each user, e.g. Google Open Images. it is much more robust to have a single query mechanism that understands data layout than to write a fresh script for each slice.
* indexing data so they are searchable, etc.
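To illustrate the immutability and deduplication bullets, here is a minimal sketch of a content-addressed store on a local filesystem. It is not how Quilt is implemented; `ContentStore` and its methods are hypothetical names, and it assumes a POSIX-like filesystem where `os.replace` is an atomic rename. Blobs are keyed by their SHA-256, so identical fragments are stored once and an existing key can never point at different bytes.

```python
import hashlib
import os
import stat
import tempfile
from pathlib import Path


class ContentStore:
    """Hypothetical local store: each blob lives at <root>/<sha256 of bytes>."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        target = self.root / key
        if target.exists():
            # Same bytes already stored: deduplicated, nothing is rewritten.
            return key
        # Write to a temp file in the same directory, then rename into place,
        # so readers never observe a partially written blob.
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Drop write permission; only a local guard, not real access control.
        os.chmod(tmp, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        os.replace(tmp, target)
        return key

    def get(self, key: str) -> bytes:
        data = (self.root / key).read_bytes()
        # Re-hash on read so silent corruption or tampering is detected.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"blob {key} failed its integrity check")
        return data
```

Because the key is derived from the content, "changing" a blob necessarily produces a new key, which is what enforced (rather than conventional) immutability looks like at the storage layer. Permissions, auditing, and querying from the list above are outside what this sketch covers.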
PS - I am a contributor to Quilt.
Not to mention every one of those implementations packages its preprocessed version into a different data format, and then creates a different data pipeline (and I only looked at TensorFlow implementations).
That's two more implementations that I haven't considered. I'm sure most of the processing steps under the hood are the same or similar, but as I'm not an audio processing expert, I can't tell which method is better (and why).
And it's hard to tell if it "works well" because of or despite the way I processed the files.
$ apt-cache show quilt
Package: quilt
[..]
Description-en: Tool to work with series of patches
 Quilt manages a series of patches by keeping track of the changes each of them makes. They are logically organized as a stack, and you can apply, un-apply, refresh them easily by traveling into the stack (push/pop).
 .
 Quilt is good for managing additional patches applied to a package received as a tarball or maintained in another version control system. The stacked organization is proven to be efficient for the management of very large patch sets (more than hundred patches). As matter of fact, it was designed by and for Linux kernel hackers (Andrew Morton, from the -mm branch, is the original author), and its main use by the current upstream maintainer is to manage the (hundreds of) patches against the kernel made for the SUSE distribution.
 .
 This package provides seamless integration into Debhelper or CDBS, allowing maintainers to easily add a quilt-based patch management system in their packages. The package also provides some basic support for those not using those tools. See README.Debian for more information.
$ zcat /usr/share/doc/quilt/changelog.gz | tail -n3
Version 0.26 (Tue Oct 21 2003) - Change summary not available
Inference example: https://www.paperspace.com/console/jobs/js4mqzm91fj2lg
Disclosure: I work on Paperspace