How do you handle historical backfill for new features? As in, some feature that can be updated in streaming fashion but whose initial value depends on data from the last X years, e.g., total # of courses completed since sign-up.
Also, who is responsible for keeping the Flink jobs running: the data scientists, or do you have a separate streaming platform team?
> How do you handle historical backfill for new features?
Currently, our feature store doesn't come with built-in feature backfilling; some manual work is needed to do that. We're working on a brand-new version of the feature store that hopefully addresses this need.
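The details of that manual work aren't spelled out here, but conceptually, backfilling a streaming-updatable feature like "total # of courses completed since sign-up" means seeding an initial value from a historical batch scan and then continuing to increment it from the live stream. A minimal Python sketch of that idea (all names and event shapes are hypothetical, not the actual pipeline):

```python
from collections import defaultdict

def backfill_counts(historical_events):
    """Seed the feature from a historical batch source (e.g. a warehouse export)."""
    counts = defaultdict(int)
    for event in historical_events:
        if event["type"] == "course_completed":
            counts[event["user_id"]] += 1
    return counts

def apply_stream(counts, stream_events):
    """Continue updating the seeded feature from the live event stream."""
    for event in stream_events:
        if event["type"] == "course_completed":
            counts[event["user_id"]] += 1
    return counts

# Toy data: two historical completions for u1, one for u2, then one live event.
historical = [
    {"user_id": "u1", "type": "course_completed"},
    {"user_id": "u1", "type": "course_completed"},
    {"user_id": "u2", "type": "course_completed"},
]
live = [{"user_id": "u1", "type": "course_completed"}]

features = apply_stream(backfill_counts(historical), live)
print(features["u1"])  # → 3
```

The tricky part in practice is the cut-over: making sure no events are double-counted or dropped between the end of the batch scan and the start of stream consumption.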
> Who is responsible for keeping the Flink jobs running: the data scientists, or do you have a separate streaming platform team?
We have a separate data infra team that is responsible for managing the YARN cluster for us.
Disintermediation of data pipeline creation is definitely nothing new at this point, and the technologies aren't that novel either. I'd be surprised that this is on the front page, but it takes time for the lessons in this article to be learned by a large enough number of people that they become humdrum.
Above all, it reminds me of a consultant friend telling me about two clients who built feature stores: one with an open-ended goal of enabling people, and one because they had some specific things they wanted to achieve. The outcomes they got were as dissimilar as their motives!
> This thing reads like it was written a few years ago.
Yeah, the technology here is nothing novel; Hive, Kafka, Flink, and Redis have been around for years. What I find missing on the internet is that the people who have been doing this for years are not writing about it. Uber has done a relatively good job of publishing how they built Michelangelo, but still not in enough detail for outsiders to replicate.
> it takes time for the lessons in this article to be learned by a large enough number of people that they become humdrum.
Maybe :)
> two clients who built feature stores - one with an open-ended goal of enabling people and one because they had some specific things they wanted to achieve.
Could you add more color to this part? What were their goals, and what did they end up achieving? I didn't fully get it.
In my last job I implemented a feature store from scratch, with ca. 500 hand-crafted features and ca. 2,500 generated automatically by a code generator. It didn't only serve the current value of the features: the data scientists could manually populate an 'init' table with (customer_id, reference_date, target_value) tuples, and the pipeline re-calculated the historic feature values for each given customer and reference_date. So if a data scientist came up with a new feature definition, then after implementation (5 minutes to 2 hours per feature) he, and all the other data scientists, immediately got access to the feature's history. We had so many features that I had to implement automatic feature pruning, otherwise the users got lost. We could train, test, validate, and deploy models within 24 hours (model fitting ran overnight). When I left the company, we had ca. 40 models in production, managed part-time by one person (me, 3-4 hours a week).
This was in an offline business, so we didn't have to deal with feature-serving latency and didn't have to change a feature's value during the day, so everything could run batch-based overnight.
Why didn't I write about it? Because it was implemented in PL/SQL running on Oracle Exadata, and in SAS. No one cares about feature stores implemented with tech like that. People care about models trained in Python, ported to Scala by Java devs, running in Docker on k8s, with features coming from HiveQL, Sqoop, Oozie, or Spark and stored in Cassandra, MySQL, or Elasticsearch. But do they have a feature store with built-in time-travel functionality?
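The init-table mechanism described above can be sketched in a few lines (in Python rather than PL/SQL, and with all names illustrative): for each (customer_id, reference_date) pair, the feature is computed only over events visible as of that date, which is what gives the time-travel property.

```python
from datetime import date

def backfill(init_rows, events, feature_fn):
    """Recompute a feature's historic values: for each (customer_id,
    reference_date, target_value) row of the 'init' table, apply the
    feature definition only to events visible as of reference_date."""
    result = []
    for customer_id, reference_date, target_value in init_rows:
        visible = [e for e in events
                   if e["customer_id"] == customer_id
                   and e["date"] <= reference_date]
        result.append((customer_id, reference_date,
                       feature_fn(visible), target_value))
    return result

# Hypothetical feature: number of events seen as of the reference date.
count_events = len

events = [
    {"customer_id": 1, "date": date(2020, 1, 10)},
    {"customer_id": 1, "date": date(2020, 3, 5)},
    {"customer_id": 2, "date": date(2020, 2, 1)},
]
init = [(1, date(2020, 2, 1), 0), (1, date(2020, 4, 1), 1)]

rows = backfill(init, events, count_events)
# rows[0] → (1, date(2020, 2, 1), 1, 0): only the January event is visible
```

A SQL engine does this with a point-in-time join instead of a Python loop, but the filtering-by-reference_date idea is the same.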
Is it another name for an OLAP or BI cube? I.e., a huge precomputed GROUP BY query with rollups.
The only new thing I see is that it combines both historical and recent data. Kind of like an OLAP cube with a lambda architecture.
I'm not sure how this relates to OLAP cubes, since I'm not familiar with that term.