Over the years, I've found myself building hacky solutions to serve and manage my embeddings. I’m excited to share Embeddinghub, an open-source vector database for ML embeddings. It is built with four goals in mind:
Store embeddings durably and with high availability
Allow for approximate nearest neighbor operations
Enable other operations like partitioning, sub-indices, and averaging
Manage versioning, access control, and rollbacks painlessly
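To make the third goal concrete: averaging comes up when, for example, a user embedding is derived from the items the user interacted with. A minimal numpy sketch (the data and names are hypothetical, not Embeddinghub's API):

```python
import numpy as np

# hypothetical item embeddings keyed by id
items = {
    "movie_a": np.array([0.9, 0.1, 0.0]),
    "movie_b": np.array([0.8, 0.2, 0.1]),
    "movie_c": np.array([0.0, 0.9, 0.4]),
}

# derive a user embedding by averaging the items they watched
watched = ["movie_a", "movie_b"]
user_vec = np.mean([items[m] for m in watched], axis=0)
# → array([0.85, 0.15, 0.05])
```

A vector database can do this server-side so clients don't have to fetch every item embedding just to compute one average.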
It's still in the early stages, and before we commit more dev time to it, we want to get your feedback. Let us know what you think and what you'd like to see!
Repo: https://github.com/featureform/embeddinghub
Docs: https://docs.featureform.com/
What's an Embedding? The Definitive Guide to Embeddings: https://www.featureform.com/post/the-definitive-guide-to-emb...
I see you've got examples for NLP use cases in your docs. Can't wait to read them. Embeddings are a constant source of complexity when I'm trying to move certain operations to Lambda; this looks like it would speed up initialization big time.
Faiss solves the approximate nearest neighbor problem, not the storage problem. It's not a database, it's an index. We use a lightweight alternative to Faiss (HNSWLIB) to index embeddings in Embeddinghub.
The biggest difference, as cyrusthegreat pointed out, is that we're a fully managed service. You sign up, spin up a database service with a single API call[0], and go from there. There's no infrastructure to build and keep available, even as you scale to billions of items.
Pinecone also comes with features like metadata filtering[1] for better control over results, and hybrid storage for up to 10x lower compute costs. EmbeddingHub has a few features Pinecone doesn't yet have, like versioning -- though with our architecture it's straightforward to add if someone asks.
Hope that helps! And I'm glad to see more projects in this space, especially from the feature-store side.
[0] https://www.pinecone.io/docs/api/operation/create_index/
Do you have any performance numbers? Specifically: nearest-neighbor query latency, regular get-embedding latency, and the read/write rate achieved on a single machine?
I was wondering if there was a reasonable way to store raw data next to the embeddings such that:
1. Analysts can run queries to filter down to a space they understand (the raw data).
2. Nearest neighbors can be run on top of their selection in the embedding space.
Our main use case is segmentation, so giving analysts access to the raw feature space is very important.
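The filter-then-search pattern described here can be sketched with plain numpy: keep raw attributes in arrays parallel to the embedding matrix, filter on the attributes, then run exact nearest neighbors over the filtered rows. (All data and column names below are hypothetical; a real system would push the filter into the index.)

```python
import numpy as np

# hypothetical raw attributes stored alongside the embeddings
rng = np.random.default_rng(0)
n, dim = 1000, 8
embeddings = rng.standard_normal((n, dim)).astype(np.float32)
ages = rng.integers(18, 90, size=n)
regions = rng.choice(["us", "eu", "apac"], size=n)

# 1. analysts filter down in the raw feature space
mask = (ages >= 30) & (regions == "eu")
subset_ids = np.flatnonzero(mask)

# 2. exact (brute-force) nearest neighbors within the selection
def nearest(query, k=5):
    sub = embeddings[subset_ids]
    sims = sub @ query / (
        np.linalg.norm(sub, axis=1) * np.linalg.norm(query) + 1e-9
    )  # cosine similarity
    top = np.argsort(-sims)[:k]
    return subset_ids[top], sims[top]

ids, sims = nearest(embeddings[0])
```

Brute force is fine for small selections; for large ones you'd want the database to maintain sub-indices per partition, which is one of the operations the post mentions.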
It would be interesting to see how it compares to Postgres or LevelDB for reads/writes of exact values, and to Faiss/Annoy for KNN.
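A rough harness for that kind of comparison might time the two workloads separately on the same data: exact key lookups (where Postgres/LevelDB would sit) and KNN queries (where Faiss/Annoy would sit). A sketch with in-process stand-ins, purely to show the shape of the measurement:

```python
import time
import numpy as np

rng = np.random.default_rng(1)
n, dim = 10_000, 64
vecs = rng.standard_normal((n, dim)).astype(np.float32)
store = {i: vecs[i] for i in range(n)}  # stand-in for Postgres/LevelDB

# exact-value read rate
t0 = time.perf_counter()
for i in range(1000):
    _ = store[i]
reads_per_sec = 1000 / (time.perf_counter() - t0)

# one brute-force KNN query (stand-in for Faiss/Annoy)
q = vecs[0]
t0 = time.perf_counter()
dists = np.linalg.norm(vecs - q, axis=1)
top5 = np.argsort(dists)[:5]
knn_ms = (time.perf_counter() - t0) * 1000

print(f"{reads_per_sec:,.0f} reads/s, one brute-force KNN in {knn_ms:.2f} ms")
```

Real numbers would of course depend on network hops, persistence settings, and index parameters, so any published benchmark should state those.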