I see from looking at the source here, run.house is using the same approach of cloudpickling the function. That works, but one struggle we're having is that it's quite brittle. It's all gravy assuming everyone is operating in perfectly fresh environments that mirror the cluster, but this is rarely the case. Even subtle differences between the local execution environment and the server can produce segfaults on the server side, which are very hard to debug. The code here looks a lot more mature, so I'm assuming this is more robust than what we have. But I'd be curious if the developers have run into similar challenges.
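For context, here's a minimal sketch of the cloudpickle round-trip being described (not Runhouse's or the commenter's actual code). The fragility comes from deserialization: the receiving side resolves the function's dependencies against its own installed packages, so even minor version or ABI mismatches (especially in C extensions) can fail or segfault at load time.

```python
# Sketch of cloudpickling a function and restoring it elsewhere.
# cloudpickle serializes the function by value (code + closure), but any
# imported libraries it references are resolved fresh on the receiving
# side -- a mismatch there is where the brittleness shows up.
import cloudpickle

def add(a, b):
    return a + b

payload = cloudpickle.dumps(add)       # serialized on the "client"
restored = cloudpickle.loads(payload)  # deserialized on the "server"
print(restored(2, 3))                  # -> 5, when environments match
```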
In fact we totally agree, and we don't cloudpickle the function, precisely because of those package minor-version issues. We sync the code over to the destination environment and the server imports it fresh, which is much more robust. The one piece of code that does cloudpickle functions is a trap door for certain weird situations, but frankly we haven't had to use it in months.
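A minimal sketch of the "sync the code and import it fresh" idea (not Runhouse's actual implementation): ship the source file to the destination, then import it there, so the server resolves all dependencies against its own installed packages instead of unpickling bytes produced under a possibly different environment.

```python
# Stand-in sketch: "sync" a module's source to a destination directory,
# then have the receiving side import it fresh with importlib.
import importlib.util
import pathlib
import tempfile

source = "def add(a, b):\n    return a + b\n"

# Pretend this write is the code sync to the remote environment.
dest = pathlib.Path(tempfile.mkdtemp()) / "synced_module.py"
dest.write_text(source)

# The "server" imports the freshly synced code rather than unpickling it.
spec = importlib.util.spec_from_file_location("synced_module", dest)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
print(module.add(2, 3))  # -> 5
```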
Very interesting about the implementation. I admittedly did not read that closely and clearly did not grok what the actual hot path was there; I'll check it out more. May have to borrow your approach, or perhaps just adopt this wholesale :) Regardless, super cool project, will be following.
From an SRE perspective, this sounds like a nightmare. Controlled releases are really important for reliability. I definitely don't want my devs doing manual rollouts from a notebook.
We've also built a basic permissioning system to control who can actually overwrite the saved version of a resource, so there are no accidents. E.g. if the prod inference blob is saved at "mikes_pizza/nlp/bert/bert_prod", you can set it so only certain accounts can overwrite that metadata to point to a new model. Ideally we'll just inherit existing RBAC groups sometime soon.
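To illustrate the overwrite-permission idea, here's a hypothetical sketch (not Runhouse's API): each saved resource path carries a write list, and a save to that path is rejected unless the account is on it. The ACL contents and function names here are made up for the example.

```python
# Hypothetical write-ACL check for saved resource metadata.
# Only accounts listed for a path may repoint its saved version.
WRITE_ACL = {"mikes_pizza/nlp/bert/bert_prod": {"mike", "release_bot"}}

def save_resource(path, account, metadata, store):
    allowed = WRITE_ACL.get(path)
    if allowed is not None and account not in allowed:
        raise PermissionError(f"{account} may not overwrite {path}")
    store[path] = metadata

store = {}
save_resource("mikes_pizza/nlp/bert/bert_prod", "mike",
              {"model": "v2"}, store)
print(store["mikes_pizza/nlp/bert/bert_prod"]["model"])  # -> v2
```

With real RBAC groups inherited from an identity provider, `WRITE_ACL` would be populated from group membership rather than hard-coded.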
Does that make sense? Curious if you had something else in mind as far as the danger.
Thanks, I was misunderstanding the purpose of the feature.
EDIT: looks like this actually uses it under the hood: https://github.com/run-house/runhouse/blob/main/requirements...
This seems like a major limitation and pretty antithetical to the PyTorch approach.
In general, making code itself more portable is great (it's the objective of many ML compilers) and will make Runhouse even more valuable: the ability to take the same code and send it to different places really shines when those places can be different compute types.