Reproducible machine learning with PyTorch and Quilt

(blog.paperspace.com)

135 points · akarve · 7y ago · 26 comments

23 comments · 7 top-level

jononor · 7y ago · 12 in thread

Was not aware of Quilt for hosting datasets. Is it the go-to in this area? What other alternatives are there?

DTE · 7y ago

I have been researching data orchestration/versioning tools for a long time and have been following the Quilt guys closely. It is definitely one of the more powerful tools in the ML/AI engineer's toolbox and solves a huge problem that almost everyone runs into right out of the gate. It's still early days in this space, but Quilt gets a lot of things right and I'm super excited to see this product develop.

Full disclosure: I run Paperspace (https://www.paperspace.com) and am working with the Quilt team to integrate their tools into our platform.

eindiran · 7y ago

You can use AWS to host open datasets: https://aws.amazon.com/opendata/public-datasets/

These are some other people working in roughly the same space: http://datproject.org/ http://www.pachyderm.io/

But it does seem like Quilt is a go-to, if you are looking for a "GitHub for data" host.

jmaxfield · 7y ago

I use Quilt pretty much daily, and while I like AWS open datasets, I don't think it is as actively developed as Quilt is. The Dat project, on the other hand, I really do like as a way to simply transfer large amounts of data between contributors. That said, if you are just trying to get data out there and have people use it freely for their own work, I think Quilt is the solution, thanks to its searchable and easily understood Python (and I think an R repo) usage of datasets.

jdoliner · 7y ago

Pachyderm founder here. We're not really a data hosting provider, although we may offer that in the future. Right now Pachyderm is more like intranet data hosting for companies. You have to spin up your own Kubernetes cluster and deploy Pachyderm on it. It's also not normally used to download data onto your local machine for processing because it has its own computation layer which allows you to run code at scale and tracks the provenance of the data to keep things reproducible.

guenp · 7y ago

I've tried dat (https://datproject.org/) and Git LFS (https://git-lfs.github.com/), but so far I have found Quilt to be the easiest to use and the best fit for my use case (experimental physics characterization experiments).

shoyer · 7y ago

A couple of publicly available alternatives I'm aware of include:

- Girder: http://girder.readthedocs.io

- Intake: https://github.com/ContinuumIO/intake

I haven't used any of these, but I agree that the idea is quite compelling.

mlthoughts2018 · 7y ago

Just use AWS S3 (or similar) and shell scripts. My team uses a git repository named something like "data-packages", which is nothing but a collection of shell scripts with the name <dataset>.sh that perform the necessary download and extraction steps to get a dataset from S3. Data sets are immutable by convention, so any change to a data set requires you to provide a totally new shell script. That script could download an older data set and then mutate it if you don't want to maintain large copies of big data sets, but the older data set itself is not permitted to be mutated on S3.

My team has found this drastically easier than Quilt, and we do a ton of stuff with reproducible environments in Docker, creating Makefiles to reproduce exact model training with the exact same data, etc. We probably hit just about every case there is (huge models, small models, models where we'd like to train separately or collectively on a bunch of different benchmark data sets, in-house data sets, models that need to be refreshed with new data in pipelines, etc.) So far, Quilt has not been competitive with a simple repo of shell scripts for us, in terms of ease of use or effectiveness in maintaining different packages of data.

The other super nice thing is that when people start out on new models or experiments, we already have our in-house maintained copies of a bunch of academic data sets, private data sets, etc., and you can throw together an incredibly simple Dockerfile or Makefile that uses the appropriate script. It's just one or two lines of shell code and voila, you have an environment with the dataset you want. Check that into git and now your experiment is immediately reproducible from day one. We've found this to dramatically increase the amount of code review that researchers engage in for checking their statistical methodology and sanity checking their intended models or experiments. With Quilt, you have the extra issue of versioning (rather than harshly enforcing all data sets to be immutable ... even just adding one more training example to the data set means you must provide a new shell script that downloads the old data, injects your lone additional sample, and has a documentation entry about exactly what it is doing), as well as the overhead of using yet another tool instead of super standard shell scripts.

For me, any of the tools that pop up attempting to be like conda-forge but for data packages are sort of like taking a Gatling gun to a problem that can be solved with a hammer.
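The "immutable by convention" scheme described in this comment can be backed by a checksum pin. The following is a minimal sketch in Python rather than the shell scripts the commenter uses, and it is not that team's actual tooling: the function names `sha256_of` and `verify_dataset` are hypothetical, and the download step from S3 is left out on the assumption that the archive is already on local disk.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_sha256):
    """Return the path if its content matches the pinned hash, else raise.

    Hypothetical helper: the pinned hash would live next to the per-dataset
    script in the repository, so a mutated archive is detected, not trusted.
    """
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
    return path
```

A changed data set then needs a new script and a new pinned hash, which matches the rule that the old data set is never mutated in place.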

akarve (OP) · 7y ago

Interesting thoughts. Quilt has a ways to grow. You correctly point out that, in some cases, S3 is lighter weight. You'll see future versions of Quilt get lighter, and offer more S3-like "just store this" functionality. In its next minor revision, Quilt simplifies point updates (i.e. it will be possible to update a single training example without materializing the entire package).

That said, there are a few areas where your system glosses over the needs of a data pipeline:

* "immutable by convention" is not a data preservation strategy; the system should enforce immutability

* what about deserialization? It's not enough to store and move bits. There are so many examples of "serdes" headaches: pickling (yes, pickle is a horrible format) in Python 2 vs. Python 3 is one example, not to mention performance. My point is not that scripts can't do serdes, but that serdes information should travel with the data, so it's (mostly) transparent to the consumer.

* multiple writers (e.g. suppose you are generating training data in a distributed manner) requires write atomicity at the bucket level, which S3 doesn't provide

* deduplication of data fragments - I can see how one might do this with a "scripts over S3" strategy, but it's complicated enough that it's far easier to rely on a third-party app that just works in this regard

* fine-grained permissions - what if each data package has a different audience? sure, you can roll this with S3, but is that the best use of developer time?

* change history and access auditing

* querying and filtering - in many cases there is an enormous data corpus which needs to be sliced a different way by each user, e.g. Google Open Images. it is much more robust to have a single query mechanism that understands data layout than to write a fresh script for each slice.

* indexing data so they are searchable, etc.

PS - I am a contributor to Quilt.
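Two of the points in this list, enforced immutability and deduplication of data fragments, are commonly handled with content addressing. The sketch below is a toy in-memory illustration of that general technique, not Quilt's implementation or API; `FragmentStore` and `build_package` are hypothetical names.

```python
import hashlib


class FragmentStore:
    """Toy content-addressed store: each blob is keyed by its SHA-256 hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once,
        # and an existing key is never overwritten with different bytes.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)


def build_package(store, files):
    """Return a manifest mapping logical file names to fragment hashes."""
    return {name: store.put(data) for name, data in files.items()}
```

Two package versions that share most of their files then share most of their fragments, and a manifest pins exact content because the keys are derived from the bytes themselves.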

jononor · 7y ago

Do you store the datasets as tar/zip archives on S3, or do you have some way of representing how a collection of items goes together to form a dataset?

casegold · 7y ago

Quilt has been great for CLI versioning while preparing large datasets

shauni · 7y ago

Has anyone tried out the option of self-hosting Quilt registries? I really like the idea of Quilt, although I am worried that my network bandwidth would be an issue for 10-100GB datasets...

rryan · 7y ago

dat: https://datproject.org/

p1esk · 7y ago · 2 in thread

Oh, this resonates with me so much! I'm running 4 different DeepSpeech models right now, each using a differently processed version of the LibriSpeech dataset (MFCC/fbanks/linear spectrograms, deltas? energy? padding? etc.), because the original DS papers didn't bother describing the processing, and every implementation I found uses completely different methods and libraries.

Not to mention that every one of those implementations packages its preprocessed version into a different data format and then creates a different data pipeline (and I only looked at TensorFlow implementations).

stealthcat · 7y ago

Why don't you use STFT + Conv2D like Deep Speech 2 did? It works well in my case.

p1esk · 7y ago

The DeepSpeech2 paper does not include any details about audio processing. I see an older Baidu-Research implementation of DS1 that uses "log of linear spectrogram from FFT energy". Also, there's a PyTorch implementation [1] that uses Librosa's STFT; is that what you're referring to?

That's two more implementations that I haven't considered. I'm sure most of the processing steps under the hood are the same or similar, but as I'm not an audio processing expert, I can't tell which method is better (and why).

And it's hard to tell if it "works well" because of or despite the way I processed the files.

[1] https://github.com/SeanNaren/deepspeech.pytorch
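For readers unfamiliar with the phrase "log of linear spectrogram from FFT energy" quoted above, here is a minimal sketch of one common reading of it: split the signal into overlapping frames, window each frame, take the squared magnitude of its discrete Fourier transform, and take the log. This is an illustration only, not the Baidu or deepspeech.pytorch pipeline. The frame length, hop size, Hann window, and `eps` floor are all assumptions, and a plain DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def power_spectrum(frame):
    """Squared DFT magnitude for the non-negative frequency bins of a real frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Return one list of log-power values per frame (time-major layout)."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        # eps keeps log() finite for silent bins; its value is an assumption.
        frames.append([math.log(p + eps) for p in power_spectrum(frame)])
    return frames
```

Each choice here (window type, frame length, hop, log floor, and whether deltas or energy are appended) changes the features a model sees, which is the reproducibility problem this subthread describes.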

infinity0 · 7y ago · 1 in thread

A step in the right direction for machine learning in science, but they could have done some more research into naming conflicts:

$ apt-cache show quilt

Package: quilt

[..]

Description-en: Tool to work with series of patches

Quilt manages a series of patches by keeping track of the changes each of them makes. They are logically organized as a stack, and you can apply, un-apply, refresh them easily by traveling into the stack (push/pop).

Quilt is good for managing additional patches applied to a package received as a tarball or maintained in another version control system. The stacked organization is proven to be efficient for the management of very large patch sets (more than hundred patches). As matter of fact, it was designed by and for Linux kernel hackers (Andrew Morton, from the -mm branch, is the original author), and its main use by the current upstream maintainer is to manage the (hundreds of) patches against the kernel made for the SUSE distribution.

This package provides seamless integration into Debhelper or CDBS, allowing maintainers to easily add a quilt-based patch management system in their packages. The package also provides some basic support for those not using those tools. See README.Debian for more information.

$ zcat /usr/share/doc/quilt/changelog.gz | tail -n3

Version 0.26 (Tue Oct 21 2003) - Change summary not available

akarve (OP) · 7y ago

I hear you. On PyPI the name is uncontested, so, at least in the Python ecosystem, there is only one Quilt. That said, for future revisions we'll try for a unique name, because it can indeed be confusing, e.g. in the apt-get case.

ForFreedom · 7y ago · 1 in thread

Isn't Quilt just blurring the pixels to an extent?

akarve (OP) · 7y ago

Quilt isn't doing the inference (the PyTorch model is). But, in any case, no. Super-resolution is more than blurring; it's pixel inference. https://arxiv.org/abs/1609.05158
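To make "pixel inference" concrete: in the sub-pixel convolution approach of the linked paper, the network predicts r*r values for every low-resolution pixel, and a fixed rearrangement (often called pixel shuffle) lays them out as an r-times larger image. The sketch below shows only that rearrangement on nested Python lists. The convolution layers that produce the channels are omitted, and the function name and list-based layout are illustrative assumptions rather than the article's code.

```python
def pixel_shuffle(channels, r):
    """Rearrange C*r*r channels of size HxW into C channels of size (H*r)x(W*r).

    `channels` is a list of 2-D lists (channel, row, column). Each output pixel
    is copied from a distinct predicted value, so nothing is averaged or blurred.
    """
    if len(channels) % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = len(channels) // (r * r)
    h, w = len(channels[0]), len(channels[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # The sub-pixel offset (y % r, x % r) selects the source channel;
                # (y // r, x // r) selects the low-resolution pixel.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = channels[src][y // r][x // r]
    return out
```

With r = 2, four 1x1 channels holding 1, 2, 3, and 4 become a single 2x2 image with rows [1, 2] and [3, 4].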

dkobran · 7y ago

In case you missed it, here's a link to the full training example that you can run yourself: https://www.paperspace.com/console/jobs/jvqssfqawv5zn/logs

Inference example: https://www.paperspace.com/console/jobs/js4mqzm91fj2lg

Disclosure: I work on Paperspace

cwyers · 7y ago

It seems to me like the machine learning algorithm here is mostly learning how to add JPEG compression artifacts to images.

rhacker · 7y ago

Please please please don't kill our favorite plot device. Make sure the process takes exactly 3 days.
