For people who have used either of the projects, what are your opinions and are there any hidden issues that you ran into?
Ideally we'd like a platform that makes it easy to schedule runs on the desktops or on GCP depending on requirements and available resources. Kubernetes seems like the best option for that, and it doesn't look like MLflow supports it out of the box yet.
There are many feature-specific reasons, but the biggest thing is that reproducing an experiment needs to go through code review and the very same version control system you use for all your other code and projects.
This way reproducibility is a genuine constraint on deployment: deploying an experiment, whether it's training a toy model, incorporating new data, or actually launching a live experiment, is conditional on reproducibility and on code review of everything that embodies it, i.e., the code, settings, and runtime configs.
This is much better solved with containers, so that both runtime details and software details live in the same branch / change set, and a full runtime artifact such as a container image can be built from them.
Then deployment is just whatever production deployment already is: usually some CI tool that determines where a container (built from a PR of your experiment branch, for example) is deployed to run, along with whatever monitoring or probe-tracking tools you already use.
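As a minimal sketch of the idea (function name and registry URL are made up): if the runtime artifact is an image tagged with the commit that produced it, then reproducing an experiment is just redeploying that exact tag, using the same machinery as any other deployment.

```python
def image_tag(commit: str, repo: str = "registry.example.com/experiments") -> str:
    # The commit fully describes both the code and the runtime details,
    # so the image tag alone is enough to reproduce (redeploy) the run.
    return f"{repo}:{commit}"

# e.g. your CI builds and pushes image_tag("abc1234") after the PR merges
```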
You can treat experiments just like any other deployable artifact, and monitor their health or progress exactly the same.
Once you think of it this way, you realize that tools like MLflow are categorically the wrong tool for the job, almost by definition, and that they exist mostly to foster vendor lock-in or reliance on some commercial entity, in this case Databricks.
The project was a suite of neural network models providing face and object detection results in a low-latency web interface, where customers manipulate photos and want automated metadata about the people or objects in them.
To optimize for performance, we need to experiment frequently with compile-time details of the runtime environment (in our case, a container) in which the application will run in production.
So the axis of our experiments was not usually anything to do with neural network layers, data, or parameters. It was different compiler optimization flags, precision approximations, and GPU settings that needed to be rolled into a huge number of distinct underlying runtime environments; then, for each distinct runtime environment, the more mundane experiments would be carried out for layer topology, number of neurons, width of CNN filters, etc.
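A rough sketch of that experiment matrix (the specific flags and values are invented for illustration): the runtime axes multiply out into distinct container builds, and the model axes run inside each one.

```python
from itertools import product

# Runtime axes: compile-time details baked into each container image.
compiler_flags = ["-O2", "-O3 -ffast-math"]
precisions = ["fp16", "fp32"]
# Model axes: the "mundane" experiments run inside each runtime.
filter_widths = [3, 5, 7]

# 2 x 2 = 4 distinct runtime environments (each a separate image build)
runtime_envs = list(product(compiler_flags, precisions))

# Each runtime environment then hosts every model-level variant.
experiments = [
    {"flags": flags, "precision": prec, "filter_width": width}
    for (flags, prec), width in product(runtime_envs, filter_widths)
]
# 4 runtimes x 3 model variants = 12 experiment configurations
```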
We found that unless you basically build your own entire “meta” version of MLflow that wraps around MLflow, it falls apart for use cases where custom compile-time details of the runtime are themselves aspects of the experiment. Not to mention that the Projects format violates good practices, like the 12-factor approach of injecting settings from the environment, which again leads to wasted effort building special-case deployment handling for MLflow jobs.
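For contrast, the 12-factor approach mentioned above is trivial to follow without any special packaging format. A minimal sketch (the variable names are made up): the experiment code reads its settings from the environment, and whatever already deploys your containers (CI, Kubernetes, etc.) injects them.

```python
import os

def load_config(env=os.environ):
    # Settings arrive via environment variables injected at deploy time,
    # so the job needs no bespoke packaging layer to receive them.
    return {
        "learning_rate": float(env.get("LEARNING_RATE", "1e-3")),
        "batch_size": int(env.get("BATCH_SIZE", "32")),
        "precision": env.get("PRECISION", "fp32"),
    }
```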
Whatever deploys and measures your tasks should not also impose any special-case packaging structure, which is a big reason why MLflow conceptually fails. Any attempt to make anything like a DSL packaging layer for experiments that causes them to diverge from “regular deployment of any old job” is immediately a failed idea. The only thing it’s good for is creating unwitting vendor lock-in once you’re highly dependent on this bespoke, weird packaging template for Projects that makes your ML jobs needlessly different from other deployment tasks.
Basically if someone shows me a supposed ML experiment tracking system, the first question is, “If I replace the phrase ‘ML experiment’ with ‘generic computing task’, does the tool still handle everything exactly the same?”
If not, it’s a failed idea, because you’re trying to break model training or tuning jobs out of the regular deployment model and you’re not using consistent tooling to manage deployment of experiment runs and all other types of “jobs” that you can “run.”
I do agree that having things tied to a commit might not be ideal if you're running a lot of experiments in a large shared codebase.
I've been tempted to use git to version my model runs but always avoid it because it's usually just extra work.
It should force the concept of “running an experiment” to be just another instance of a deployment. Any part of running an experiment that happens outside that scope, such as with “mlflow run ...” for example, immediately violates the most basic property of the whole thing (unless, I suppose, “mlflow run ...” is hacked to perform actual production deployments of all types of programs).