Ask HN: How do you track configurations across a machine learning pipeline?
1. Some initial ETL to get most of the analytic dataset built, say in SQL.
2. The rest of the ETL, which mainly consists of creating the feature matrices to the exact specifications I need. This includes, but is not limited to: imputation of missing values, applying the rule of n for categorical predictors, raw-text-to-embedding cleaning and transformations, standardizing/centering, etc.
3. A retraining script which consists of: loading the "assets" saved from step 2, performing hyperparameter optimization, cross-validating, and saving the best model, according to some performance metric.
4. A CI/CD script to deploy the latest model on K8S.
There are certain configurations that are useful to track throughout this pipeline, especially since I'm trying to minimize CI/CD friction. For example, if I choose to impute value v for column X, not only do I need to know that in my retraining process, but I also need to know that in my scoring REST API in order to properly transform input. My goal is to be able to quickly define new ways of transforming input and test them in my core retraining framework (e.g. using simple count embeddings versus TF-IDF for text mining).
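To make the imputation example concrete, here is a minimal sketch of what I mean by sharing a fitted value between retraining and scoring. Everything in it is hypothetical: the `assets.json` file name, the `impute` key, the helper names, and the choice of the mean as the imputation value are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path


def fit_imputation(rows, column):
    """Compute the imputation value (assumed here: mean of observed values)."""
    observed = [r[column] for r in rows if r.get(column) is not None]
    return sum(observed) / len(observed)


def save_assets(path, assets):
    """Write the fitted transformation parameters to a JSON file."""
    Path(path).write_text(json.dumps(assets, indent=2))


def load_assets(path):
    """Read the fitted transformation parameters back."""
    return json.loads(Path(path).read_text())


def transform(row, assets):
    """Fill missing columns in one input row using the stored values."""
    out = dict(row)
    for column, value in assets["impute"].items():
        if out.get(column) is None:
            out[column] = value
    return out


# Step 2 (feature ETL): fit the value v for column X and persist it.
train_rows = [{"X": 1.0}, {"X": None}, {"X": 3.0}]
assets = {"impute": {"X": fit_imputation(train_rows, "X")}}
assets_path = Path(tempfile.mkdtemp()) / "assets.json"  # hypothetical location
save_assets(assets_path, assets)

# Scoring REST API: load the same file and apply the same transform.
scoring_assets = load_assets(assets_path)
scored = transform({"X": None}, scoring_assets)
```

The point is that the retraining script and the scoring API both call `transform` with the same persisted file, so neither hard-codes the value.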
Here is the main way I've thought about doing this - I'm curious to see what you all prefer personally:
A. Have a master config JSON file that is loaded in each step of the pipeline and changed depending on what command-line arguments were passed (e.g. `--embedding_type count` versus `--embedding_type tfidf`).
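A rough sketch of option A, assuming the master config is plain JSON and that command-line flags only override keys that were actually passed. The config contents and the `load_config` helper are hypothetical; the config is inlined as a string here so the snippet runs on its own, whereas in practice it would be read from a file.

```python
import argparse
import json

# Stand-in for the master config file contents (hypothetical keys and values).
DEFAULT_CONFIG_JSON = '{"embedding_type": "count", "impute": {"X": 0.0}}'


def load_config(argv, config_text=DEFAULT_CONFIG_JSON):
    """Load the master config, then apply any command-line overrides."""
    config = json.loads(config_text)

    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not passed" apart from a real value.
    parser.add_argument("--embedding_type", choices=["count", "tfidf"], default=None)
    args = parser.parse_args(argv)

    # Only keys that were explicitly passed override the file.
    overrides = {k: v for k, v in vars(args).items() if v is not None}
    config.update(overrides)
    return config


# Each pipeline step would call this with its own sys.argv[1:].
config = load_config(["--embedding_type", "tfidf"])
```

Each step calls `load_config` at startup, so a run with `--embedding_type tfidf` sees the same base settings as a run with `--embedding_type count`, differing only in the overridden key.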
Apologies if this problem isn't specific to machine learning, but just generally to SWE.
Edit: I suppose another way of doing this, without explicitly having configuration files, would be to go all-in on scikit-learn (or a similar tool) with something like its `Pipeline` object.
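A minimal sketch of what that might look like, assuming scikit-learn is installed. The `build_pipeline` helper, the toy data, and the choice of `LogisticRegression` are placeholders of mine, not part of any real setup. The idea is that the fitted transformations travel with the model in one serialized object, so the scoring API does not need a separate config.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(embedding_type):
    """Build a text pipeline; embedding_type is 'count' or 'tfidf'."""
    vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
    return Pipeline([
        ("embed", vectorizers[embedding_type]()),
        ("model", LogisticRegression()),
    ])


# Toy training data, purely illustrative.
texts = ["good product", "bad product", "good service", "bad service"]
labels = [1, 0, 1, 0]

# Retraining: fit the transformations and the model together.
pipeline = build_pipeline("tfidf").fit(texts, labels)

# Deployment: serialize the whole fitted pipeline as one artifact.
blob = pickle.dumps(pipeline)

# Scoring API: load the artifact and predict on raw text input.
restored = pickle.loads(blob)
predictions = restored.predict(["good product"])
```

One caveat I'm aware of: a pickled pipeline is generally only safe to load with the same scikit-learn version it was saved with, so the version would still need to be pinned somewhere.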