Over the years, I've found myself building hacky solutions to serve and manage my embeddings. I’m excited to share Embeddinghub, an open-source vector database for ML embeddings. It is built with four goals in mind:
Store embeddings durably and with high availability
Allow for approximate nearest neighbor operations
Enable other operations like partitioning, sub-indices, and averaging
Manage versioning, access control, and rollbacks painlessly
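To make the third goal concrete: averaging comes up when, for example, a user embedding is derived from the items the user interacted with. A minimal numpy sketch (the data and names are hypothetical, not Embeddinghub's API):

```python
import numpy as np

# hypothetical item embeddings keyed by id
items = {
    "movie_a": np.array([0.9, 0.1, 0.0]),
    "movie_b": np.array([0.8, 0.2, 0.1]),
    "movie_c": np.array([0.0, 0.9, 0.4]),
}

# derive a user embedding by averaging the items they watched
watched = ["movie_a", "movie_b"]
user_vec = np.mean([items[m] for m in watched], axis=0)
# → array([0.85, 0.15, 0.05])
```

A vector database can do this server-side so clients don't have to fetch every item embedding just to compute one average.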
It's still in the early stages, and before we commit more dev time to it, we want to get your feedback. Let us know what you think and what you'd like to see!
Repo: https://github.com/featureform/embeddinghub
Docs: https://docs.featureform.com/
What's an Embedding? The Definitive Guide to Embeddings: https://www.featureform.com/post/the-definitive-guide-to-emb...
I see you've got examples for NLP use cases in your docs. Can't wait to read them. Embeddings are a constant source of complexity when I'm trying to move certain operations to Lambda; this looks like it would speed up initialization big time.
Faiss solves the approximate nearest neighbor problem, not the storage problem. It's not a database, it's an index. We use a lightweight alternative to Faiss (HNSWLIB) to index embeddings in Embeddinghub.
The biggest difference, as cyrusthegreat pointed out, is that we're a fully managed service. You sign up, spin up a database service with a single API call[0], and go from there. There's no infrastructure to build and keep available, even as you scale to billions of items.
Pinecone also comes with features like metadata filtering[1] for better control over results, and hybrid storage for up to 10x lower compute costs. EmbeddingHub has a few features Pinecone doesn't yet have, like versioning -- though with our architecture it's straightforward to add if someone asks.
Hope that helps! And I'm glad to see more projects in this space, especially from the feature-store side.
[0] https://www.pinecone.io/docs/api/operation/create_index/
Do you have any performance numbers? Specifically: nearest-neighbor query latency, regular get-embedding latency, and the read/write rate achieved on a single machine?
I was wondering if there was a reasonable way to store raw data next to the embeddings such that:
1. Analysts can run queries to filter down to a space they understand (the raw data).
2. Nearest neighbors can be run on top of their selection in the embedding space.
Our main use case is segmentation, so giving analysts access to the raw feature space is very important.
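The filter-then-search pattern described here can be sketched with plain numpy: keep raw attributes in arrays parallel to the embedding matrix, filter on the attributes, then run exact nearest neighbors over the filtered rows. (All data and column names below are hypothetical; a real system would push the filter into the index.)

```python
import numpy as np

# hypothetical raw attributes stored alongside the embeddings
rng = np.random.default_rng(0)
n, dim = 1000, 8
embeddings = rng.standard_normal((n, dim)).astype(np.float32)
ages = rng.integers(18, 90, size=n)
regions = rng.choice(["us", "eu", "apac"], size=n)

# 1. analysts filter down in the raw feature space
mask = (ages >= 30) & (regions == "eu")
subset_ids = np.flatnonzero(mask)

# 2. exact (brute-force) nearest neighbors within the selection
def nearest(query, k=5):
    sub = embeddings[subset_ids]
    sims = sub @ query / (
        np.linalg.norm(sub, axis=1) * np.linalg.norm(query) + 1e-9
    )  # cosine similarity
    top = np.argsort(-sims)[:k]
    return subset_ids[top], sims[top]

ids, sims = nearest(embeddings[0])
```

Brute force is fine for small selections; for large ones you'd want the database to maintain sub-indices per partition, which is one of the operations the post mentions.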
It would be interesting to see how it compares to Postgres or LevelDB for reads/writes of exact values, and to Faiss/Annoy for KNN.
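A rough harness for that kind of comparison might time the two workloads separately on the same data: exact key lookups (where Postgres/LevelDB would sit) and KNN queries (where Faiss/Annoy would sit). A sketch with in-process stand-ins, purely to show the shape of the measurement:

```python
import time
import numpy as np

rng = np.random.default_rng(1)
n, dim = 10_000, 64
vecs = rng.standard_normal((n, dim)).astype(np.float32)
store = {i: vecs[i] for i in range(n)}  # stand-in for Postgres/LevelDB

# exact-value read rate
t0 = time.perf_counter()
for i in range(1000):
    _ = store[i]
reads_per_sec = 1000 / (time.perf_counter() - t0)

# one brute-force KNN query (stand-in for Faiss/Annoy)
q = vecs[0]
t0 = time.perf_counter()
dists = np.linalg.norm(vecs - q, axis=1)
top5 = np.argsort(dists)[:5]
knn_ms = (time.perf_counter() - t0) * 1000

print(f"{reads_per_sec:,.0f} reads/s, one brute-force KNN in {knn_ms:.2f} ms")
```

Real numbers would of course depend on network hops, persistence settings, and index parameters, so any published benchmark should state those.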