For example, if you're building a recommendation model at Spotify, you'll transform a stream of user listens into features like the user's top genre in the last 30 days.
Featureform orchestrates the transformations on your infrastructure, manages the metadata like versioning, and allows you to serve them for training and inference.
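To make the "top genre in the last 30 days" feature concrete, here is a minimal plain-Python sketch of that transformation. It does not use the Featureform API; the function name, the `(user_id, genre, timestamp)` tuple layout, and the sample data are assumptions made purely for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_genre_last_30_days(listens, user_id, now):
    """Return the user's most-played genre in the 30 days before `now`, or None.

    `listens` is an iterable of (user_id, genre, timestamp) tuples.
    This layout is an assumption for the example, not a real schema.
    """
    cutoff = now - timedelta(days=30)
    counts = Counter(
        genre
        for uid, genre, ts in listens
        if uid == user_id and cutoff <= ts <= now
    )
    if not counts:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(counts, key=lambda g: (-counts[g], g))


now = datetime(2023, 1, 31)
listens = [
    ("u1", "jazz", datetime(2023, 1, 30)),
    ("u1", "jazz", datetime(2023, 1, 15)),
    ("u1", "rock", datetime(2023, 1, 20)),
    ("u1", "rock", datetime(2022, 11, 1)),  # outside the 30-day window
    ("u1", "rock", datetime(2022, 11, 2)),  # outside the 30-day window
    ("u2", "pop", datetime(2023, 1, 29)),
]
print(top_genre_last_30_days(listens, "u1", now))  # jazz
```

In practice this logic would more likely live in SQL or a dataframe job on your own infrastructure; the sketch only shows what the feature computes.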
I'm Simba Khadder, Co-Founder & CEO of Featureform. I'm super stoked to be sharing our open-source feature store with you all. At my last company, we were building models that served <100M MAU. Most of our time was spent feature engineering and using off-the-shelf model architectures. I remember having Google Docs with useful SQL snippets that got shared around, and digging through my file system to find untitled_128.ipynb, which had a super useful transformation. We built Featureform so no one would ever have to deal with that again.
Featureform is a virtual feature store. It enables data scientists to define, manage, and serve their ML model's features. Featureform sits atop your existing infrastructure and orchestrates it to work like a traditional feature store.
By using Featureform, a data science team can solve the following organizational problems:
- Enhance Collaboration: Featureform ensures that transformations, features, labels, and training sets are defined in a standardized form, so they can easily be shared, re-used, and understood across the team.
- Organize Experimentation: The days of untitled_128.ipynb are over. Transformations, features, and training sets can be pushed from notebooks to a centralized feature repository with metadata like name, variant, lineage, and owner.
- Facilitate Deployment: Once a feature is ready to be deployed, Featureform will orchestrate your data infrastructure to make it ready in production. Using the Featureform API, you won't have to worry about the idiosyncrasies of your heterogeneous infrastructure (beyond their transformation language).
- Increase Reliability: Featureform enforces that all features, labels, and training sets are immutable. This allows them to safely be re-used among data scientists without worrying about logic changing. Furthermore, Featureform's orchestrator will handle retry logic and attempt to resolve other common distributed system problems automatically. Finally, Featureform will monitor and notify you of infrastructure problems and data drift.
- Preserve Compliance: With built-in role-based access control, audit logs, and dynamic serving rules, your compliance logic can be enforced directly by Featureform.
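The immutability point in the list above can be illustrated with a small sketch: a registry keyed by name and variant that refuses to overwrite an existing definition, so changing logic means registering a new variant. This is not Featureform's implementation or API; `FeatureRegistry`, `register`, and `get` are hypothetical names used only for this example.

```python
class FeatureRegistry:
    """Hypothetical in-memory registry where (name, variant) pairs are write-once."""

    def __init__(self):
        self._definitions = {}

    def register(self, name, variant, definition, owner):
        key = (name, variant)
        if key in self._definitions:
            # Existing definitions are never mutated; callers must pick a new variant.
            raise ValueError(f"{name}:{variant} already exists; register a new variant")
        self._definitions[key] = {"definition": definition, "owner": owner}

    def get(self, name, variant):
        return self._definitions[(name, variant)]


registry = FeatureRegistry()
registry.register("top_genre", "30d", "most played genre, last 30 days", owner="simba")

try:
    # Re-using the same variant with different logic is rejected.
    registry.register("top_genre", "30d", "most played genre, last 7 days", owner="simba")
except ValueError as err:
    print(err)

# The changed logic goes in as a new variant; the old one is untouched.
registry.register("top_genre", "7d", "most played genre, last 7 days", owner="simba")
```

Because the old variant never changes, models already trained on `top_genre:30d` keep seeing the same logic.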
You can check out our repo: https://github.com/featureform/featureform
Our docs: https://docs.featureform.com
Our quickstart guide: https://docs.featureform.com/quickstart-local
Read more about feature stores: https://featureform.com/post/feature-stores-explained-the-th...
Feast is a literal feature store. It exclusively stores features; it does not manage the transformations used to compute them. The pros and cons of Feast are more obvious when examining the process to change a feature. It happens in three steps:
1. Write and run your new data transformation in your existing transformation pipeline. Note that this happens outside of Feast.
2. A new feature table must be created in Feast, since the old one cannot be directly overwritten. Once the new feature table is created, the transformation pipeline should be re-run to write all the features to the new table.
3. All the models that use this new feature should be updated to point at the new feature.
Feast also has other problems. For example, it can’t copy your features from the offline to the online store; you have to download the features and upload them to the online store yourself using their CLI tool. You also have to manage retries and failures yourself.
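To show what "managing it yourself" looks like, here is a rough sketch of a hand-rolled offline-to-online copy with retries. It uses neither Feast nor Featureform; `latest_values`, `materialize`, the `(entity_id, value, timestamp)` row layout, and the `write_online` callback are all hypothetical stand-ins for whatever stores you actually run.

```python
import time


def latest_values(offline_rows):
    """Keep only the most recent value per entity from (entity_id, value, ts) rows."""
    latest = {}
    for entity_id, value, ts in offline_rows:
        if entity_id not in latest or ts > latest[entity_id][1]:
            latest[entity_id] = (value, ts)
    return {entity_id: value for entity_id, (value, _) in latest.items()}


def materialize(offline_rows, write_online, max_attempts=3, base_delay=0.1):
    """Write the latest value per entity, retrying transient failures with backoff."""
    for entity_id, value in latest_values(offline_rows).items():
        for attempt in range(1, max_attempts + 1):
            try:
                write_online(entity_id, value)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                time.sleep(base_delay * 2 ** (attempt - 1))


# A dict stands in for the online store; the first write fails once on purpose.
online_store = {}
calls = {"n": 0}


def flaky_write(entity_id, value):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient failure")
    online_store[entity_id] = value


rows = [("u1", "rock", 1), ("u1", "jazz", 2), ("u2", "pop", 1)]
materialize(rows, flaky_write, base_delay=0.01)
print(online_store)  # {'u1': 'jazz', 'u2': 'pop'}
```

A real job would also need to handle partial failures, idempotency, and scheduling, which is the orchestration work being discussed here.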
Featureform treats the transformation lineage as part of the feature and orchestrates your infrastructure to create and change your features.
Feast currently supports a few kinds of transformations: on-demand transformations and streaming transformations. We’re adding batch transformations soon though (and have an RFC out already)!
I like to think that Feast’s goal is to be more of a pluggable framework that platform teams can use to build towards a platform like Tecton’s (which fully orchestrates both batch and streaming transformations while abstracting complexity away from data scientists). We’re being mindful of trying to keep things as simple as possible though, because our users have told us repeatedly they don’t want to be forced to manage a complex system.
As for moving features from the offline to the online store (we call this materialization), users today can (and often do, quite successfully) use Feast’s CLI or SDK to trigger in-memory materialization to the online store. We do have ongoing work to enable out-of-process materialization (e.g. using Ray, Spark, Bytewax, etc.) that should be ready soon. Being able to manually trigger materialization via Airflow, though, has proven to be very useful for users in integrating with their existing workflows (such as triggering this when they detect changes to their raw data sources).
Simba’s correct though in calling out that a lot of the orchestration in Feast is left to the user. It hasn’t fully emerged as a key pain point users want addressed, but if that changes… well, we’re working to make our community happy :)
Cheers, Danny
In the past I've always opted for a feature store as a library that operates over an existing database/data warehouse/data lake in the offline case, and computing features on the fly in the online case. The internal feature cache for scaling an online service is nicely implemented here using Redis. Bravo, that's probably how I would do it too.
My one bit of feedback is the API. The code just doesn't look nice; out of the gate there's a bunch of objects and methods I don't immediately understand the need for. I'm sure they're useful, but for starting out I'd expect a lot more from that interface. I'd suggest something higher level that looks pretty and is easy to understand. That would be my one hesitation.
Featureform's library allows you to define your transformations, features, and training sets. It will interface with Spark, Redis, etc. on your behalf to achieve your desired state. It'll also keep track of all the metadata for you and make it easily shareable and re-usable.
All three of these work across different infrastructure by design. We already have users who use Google Cloud services like BigQuery for experimentation and DynamoDB on AWS for serving in production.
The data mesh analogy is interesting. In a way, we're an applied form of data mesh, as opposed to a theoretical argument for it. By separating the abstraction/workflow layer from the data infra layer, you can put your data where it makes sense and access it all through Featureform's feature store abstraction.
You can read more about the virtual feature store architecture here: https://www.featureform.com/post/feature-stores-explained-th...
Also check out these articles if you are interested in learning more about feature stores in general:
- https://www.featureform.com/post/feature-stores-explained-th...
- https://redis.com/blog/building-feature-stores-with-redis-in...
- https://feast.dev/blog/feast-benchmarks/