akarve · 7y ago

Interesting thoughts. Quilt has a ways to grow. You correctly point out that, in some cases, S3 is lighter weight. You'll see future versions of Quilt get lighter, and offer more S3-like "just store this" functionality. In its next minor revision, Quilt simplifies point updates (i.e. it will be possible to update a single training example without materializing the entire package).

That said, there are a few areas where your system glosses over the needs of a data pipeline:

* "immutable by convention" is not a data preservation strategy; the system should enforce immutability

* what about deserialization? it's not enough to store and move bits. there are so many examples of "serdes" headaches. pickling (yes, pickle is a horrible format) in python 2 vs python 3 is one example. not to mention performance. my point is not that scripts can't do serdes, but that serdes information should travel with the data, so it's (mostly) transparent to the consumer.

* multiple writers (e.g. suppose you are generating training data in a distributed manner) requires write atomicity at the bucket level, which S3 doesn't provide

* deduplication of data fragments - I can see how one might do this with a "scripts over S3" strategy, but it's complicated enough that it's far easier to rely on a third-party app that just works in this regard

* fine-grained permissions - what if each data package has a different audience? sure, you can roll this with S3, but is that the best use of developer time?

* change history and access auditing

* querying and filtering - in many cases there is an enormous data corpus which needs to be sliced a different way by each user, e.g. Google Open Images. it is much more robust to have a single query mechanism that understands data layout than to write a fresh script for each slice.

* indexing data so they are searchable, etc.
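
To make the dedup and serdes bullets concrete, here is a minimal sketch in Python. It is illustrative only and is not Quilt's implementation; `FragmentStore`, `put`, `get`, and the format tags are hypothetical names. Fragments are keyed by their SHA-256 digest, so identical bytes are stored once, and each manifest entry records the format needed to read the fragment back.

```python
import hashlib
import json


class FragmentStore:
    """Hypothetical in-memory store; names are illustrative, not Quilt's API."""

    def __init__(self):
        self.blobs = {}      # sha256 hex digest -> raw bytes
        self.manifest = {}   # logical name -> {"hash": ..., "format": ...}

    def put(self, name, payload, fmt):
        digest = hashlib.sha256(payload).hexdigest()
        # Refuse to rebind an existing name to different content (enforced immutability).
        if name in self.manifest and self.manifest[name]["hash"] != digest:
            raise ValueError(f"{name} already exists with different content")
        # Identical bytes hash to the same key, so they are stored only once.
        self.blobs.setdefault(digest, payload)
        self.manifest[name] = {"hash": digest, "format": fmt}
        return digest

    def get(self, name):
        entry = self.manifest[name]
        raw = self.blobs[entry["hash"]]
        # The format tag travels with the data, so the reader does not have to guess.
        if entry["format"] == "json":
            return json.loads(raw.decode("utf-8"))
        if entry["format"] == "bytes":
            return raw
        raise ValueError(f"unknown format: {entry['format']}")


store = FragmentStore()
rows = json.dumps([{"label": "cat"}, {"label": "dog"}]).encode("utf-8")
store.put("train/part-0", rows, "json")
store.put("train/part-0-copy", rows, "json")  # same bytes, no extra blob stored
```

Doing the same over a real bucket with several concurrent writers is where the atomicity concern above comes in; this sketch sidesteps it by living in one process.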

PS - I am a contributor to Quilt.

mlthoughts2018 · 7y ago

> "immutable by convention" is not a data preservation strategy; the system should enforce immutability

I actually disagree with this. In a Python-like “consenting adults” philosophy, I think it’s worse to spend engineering effort guaranteeing immutability than to trust people not to mutate the data and just have a reasonable system of backups.

Immutability by convention is 99.99999999% as good as enforced immutability for this particular type of task, and there’s even less risk with a good backup strategy to fall back on if there is an accident.

Change history and access auditing are super easy on S3, as is fine-grained access control. With immutability by convention, change history is just git history, and you can customize access groups on a file-by-file basis if you want. You could also instrument logging in the shell scripts themselves if you really want, and I’m not convinced that’s worse than a third party doing it, especially if your logging backend is prone to change, or you want to pipe stuff to Grafana, etc., which are quite common needs.

Querying and filtering are separate post-processing tasks. They should be expressed as source code that mutates a data set after downloading a local copy, and in fact version controlling any data cleaning, post-processing, etc., should be kept completely separate from the management of a data package. They are logically hugely different parts of the process. Slicing data ought to be up to the individual developer or researcher, to choose their tools, to optimize, etc. Version control of that source code is the right way to make that part of the work reproducible, not trying to tie custom treatments into a version of a data package.
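
A minimal sketch of what I mean, assuming the local copy is a list of records with a `label` field (the function name, field names, and sample records are all made up for illustration). The slice lives in version-controlled source code, separate from the data package:

```python
def slice_by_label(records, wanted_labels):
    """Return only the records whose label is in wanted_labels."""
    wanted = set(wanted_labels)
    return [r for r in records if r["label"] in wanted]


# Hypothetical local copy of a downloaded data set.
local_copy = [
    {"image": "0001.jpg", "label": "cat"},
    {"image": "0002.jpg", "label": "dog"},
    {"image": "0003.jpg", "label": "car"},
]

# Each researcher keeps their own slice definition in their own repo.
animals = slice_by_label(local_copy, ["cat", "dog"])
```

The downloaded copy is left untouched; only the script changes from one slice to the next.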

Your points about dedup and deserialization are good ones. I can imagine problem cases for a simple script approach, but I can also say even for gigantic in-house image data sets, creating multiple slightly different materialized copies has rarely been an issue.
