0 points · zwaps · 3y ago · 0 comments

Uber needed to expend a huge effort to make PyTorch play with Spark (i.e. Horovod/Petastorm) and, even so, it's still a complete and utter mess if you do anything other than load a preprocessed Parquet dataset that is custom-partitioned and row-optimized for your batch and cluster size. So I am not about to blame you for a lack of Spark integration, as pretty much everyone rolls their own solution at this point anyway.

What does bother me here, as with many other ML orchestration frameworks, is that the examples and even the docs don't tell me how you handle serious scale.

In fact, what you describe as "a real ML pipeline" is, in my view, not a good example of one because, for instance, it doesn't tell me how I'd scale out multi-node training when the data doesn't fit in a neat, standard map-style PyTorch Dataset that loads some CSV from the web.

I mean, maybe it's because I am dumb, and maybe the majority of people do in fact train single-node models on MNIST data, but I'd appreciate some more information on how you deal with more diverse sources in your pipeline. Will I have to squeeze cloud provider X's data solutions (which I am obligated to use by the client, say) into submission until they fit your examples? Because these days I get the feeling that claims of "easily orchestrating your ML pipeline" often amount to that. I see you started some of these topics in the "integrations" part of the docs. However, these pages do not seem to exist yet (for me). Furthermore, the "roadmap" link goes to a 404.

For me, these are the important topics. I can get a nice simple ML pipeline easily on Azure, AWS or Databricks if I am willing to conform to whatever they are doing already. It seems you are in a position to tackle more challenging problems, so that would be nice to show.

Cool product, and good luck!

1 comment · 1 top-level

neutralino1 · 3y ago

Thank you. You are right that these are very important topics, and we also had to put in a lot of work at Cruise to scale training beyond a single node. We had training jobs running across dozens of GPU nodes for many days. For example, we had a dedicated team optimizing the streaming of training data into PyTorch dataloaders. This evidently requires more infrastructure, as well as many features around fault tolerance, checkpointing, warm restarts, etc.
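To make the streaming point a bit more concrete, here is a minimal sketch of one common building block: splitting a list of data shards across nodes and dataloader workers so that no two readers see the same shard. This is not the Cruise implementation. The names `shards_for_worker`, `stream_records` and `read_shard` are hypothetical, and the sketch assumes the data is already stored as a list of independent shard files.

```python
from typing import Callable, Iterable, Iterator, List


def shards_for_worker(
    shards: List[str],
    rank: int,
    world_size: int,
    worker_id: int = 0,
    num_workers: int = 1,
) -> List[str]:
    """Return the shards that one dataloader worker on one node should read.

    Hypothetical helper: `rank`/`world_size` identify the training node,
    `worker_id`/`num_workers` identify the loader worker on that node.
    """
    # Flatten (node, worker) into a single global reader index.
    global_id = rank * num_workers + worker_id
    total_readers = world_size * num_workers
    # Strided slicing gives each reader a disjoint subset of the shards.
    return shards[global_id::total_readers]


def stream_records(
    shards: Iterable[str],
    read_shard: Callable[[str], Iterable[dict]],
) -> Iterator[dict]:
    """Lazily yield records shard by shard instead of loading everything."""
    for shard in shards:
        # `read_shard` is an assumption: any function that opens one shard
        # (Parquet file, blob, table partition, ...) and iterates its rows.
        yield from read_shard(shard)
```

In a real job this logic would typically sit inside an iterable-style dataset, with the rank and worker index taken from the distributed runtime. Fault tolerance, checkpointing and warm restarts are left out entirely.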

We are a very new framework (launched publicly July 1st :-), so there is much work to be done to cover many more example use cases.

What we have found powerful about this plain-function approach is that users can submit jobs to remote platforms (e.g. Spark, Google Dataflow, etc.) and use heterogeneous resources (e.g. standard nodes to launch third-party jobs, then GPU nodes for training, etc.). So whatever "cloud provider X's data solutions" you have to use, if it has a Python API to submit and wait for jobs, you should be fine.
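As a rough illustration of "submit and wait" inside a plain function, here is a minimal sketch. It does not use any real provider SDK or the framework's own API. `submit_and_wait`, the `submit` and `get_status` callables, and the status strings are all assumptions standing in for whatever the provider's Python client actually exposes.

```python
import time
from typing import Callable


def submit_and_wait(
    submit: Callable[[], str],
    get_status: Callable[[str], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Submit a remote job and block until it finishes.

    `submit` returns a job id and `get_status` maps a job id to a status
    string. Both are hypothetical wrappers around a provider's client.
    """
    job_id = submit()
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status == "SUCCEEDED":  # assumed status name
            return job_id
        if status in ("FAILED", "CANCELLED"):  # assumed status names
            raise RuntimeError(f"job {job_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after timeout")
        time.sleep(poll_seconds)
```

A pipeline step can then call this with two small lambdas around the provider's client, and the downstream training step only starts once it returns.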
