Faiss: Facebook's open source vector search library

(github.com)

175 points · ai_ja_nai · 4y ago · 38 comments

32 comments · 12 top-level

Kydlaw · 4y ago · 5 in thread

There is a lot being done in vector search technology right now. I was less fortunate when looking at ways to store the vectors in databases. I already looked at Pinecone and Weaviate, but they are both paid products.

Does anyone have feedback on this?

lmeyerov · 4y ago

A lot of VCs & founders trying to commercialize other people's SW are wondering this too :)

AFAICT: Most of those are basically UIs, data management, & integrations around the same set of vector index libs like FAISS + same set of models like HF, and even the same set of inference server libs (triton, aws/gcp, fastapi, ...). So you'd be evaluating different commercializations of the same core OSS tech. There are useful evaluations to do at that level, but more like licensing, business model, UI, model management, etc.

Another commenter below noted that regular DBs (ES, Postgres, ...) are starting to add vector indexes. As someone doing a lot of architecting for log/event/graph correlation/investigation systems, I've been tracking whether managed neural search DB is a feature vs. a new category, and how big. Ex: on the corporate side it definitely feels like an Algolia competitor, but for the 99% case we normally do, maybe just an OSS feature/lib of whatever DB/framework you're already using? Not obvious!

kateshao0510 · 4y ago

Another option is Milvus: https://github.com/milvus-io. It's an open-source vector database.

forgotmyoldacc · 4y ago

Milvus is a layer on top of FAISS.

gk1 · 4y ago

I don’t know when you last looked but as of a few months ago Pinecone has a free tier that fits 1M items with <100ms latency.

homarp · 4y ago

Have you looked at https://github.com/jina-ai/executor-hnsw-postgres ?

jean_valje4n · 4y ago · 5 in thread

Can someone tell me if these vector search things have anything in common with postgres's text search vectors, which have been implemented in postgres for quite a while now?

gk1 · 4y ago

Not much in common. A "vector" in Postgres is a tokenized and normalized array of words. So 'a fat cat sat on a mat and ate a fat rat' becomes 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'. This just makes keyword searches a bit easier. (Source: https://www.postgresql.org/docs/9.4/datatype-textsearch.html)
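To make that concrete, here is a minimal Python sketch of what the plain tsvector cast does to that sentence: split on whitespace, drop duplicates, sort. It is only an illustration, not how Postgres implements it, and `to_lexemes` is a made-up helper name. Postgres's `to_tsvector` goes further (stemming, stop-word removal), which this sketch does not attempt.

```python
def to_lexemes(text: str) -> list[str]:
    """Split on whitespace, drop duplicates, and sort the remaining words.

    Hypothetical helper: mimics the visible result of casting a plain
    string to tsvector, nothing more.
    """
    return sorted(set(text.split()))


lexemes = to_lexemes("a fat cat sat on a mat and ate a fat rat")
print(lexemes)
# ['a', 'and', 'ate', 'cat', 'fat', 'mat', 'on', 'rat', 'sat']
```

Note that the result only records which words occur, so it can match keywords but says nothing about meaning.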

A "vector" in vector search solutions is a dense vector generated by a transformer model. A sentence like 'a fat cat sat on a mat and ate a fat rat' put through a model becomes an array of floating-point numbers like [0.183, -0.774, ...], with hundreds of values (often 768). The point is that this vector is positioned in a 768-dimensional space in close proximity to semantically similar sentences. So then searching by semantic meaning is a simple (well...) exercise in measuring the distance between your query and the surrounding vectors.
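A toy sketch of that "measure the distance" step, using cosine similarity in plain Python. The 3-dimensional vectors and the sentences attached to them are invented for illustration; real embeddings would come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings": the two cat sentences point in a similar direction,
# the finance sentence points somewhere else entirely.
corpus = {
    "a fat cat sat on a mat": [0.9, 0.1, 0.0],
    "a chubby kitten rested on a rug": [0.8, 0.2, 0.1],
    "quarterly revenue grew by 4%": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend this encodes "a plump cat lay on a carpet"

# Rank sentences from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda sentence: cosine_similarity(query, corpus[sentence]),
    reverse=True,
)
for sentence in ranked:
    print(f"{cosine_similarity(query, corpus[sentence]):.3f}  {sentence}")
```

The two cat sentences rank far above the finance one even though the query shares no vector with either of them exactly; that ranking by proximity is the whole idea.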

We have a whole course on the topic, coincidentally also on the front page of HN today: https://www.pinecone.io/learn/nlp

rm999 · 4y ago

>A "vector" in vector search solutions is a dense vector generated by a transformer model.

Just a clarification that vectors are much broader than text vectors generated by transformer models. A more common application has been recommender models built on matrix factorization and other similar approaches. Word2Vec was another popular way to generate vectors, dating back to 2013. Vectors are a very general approach with many benefits regardless of how they were generated. That is what makes these vector search libraries so exciting.

I wrote an article about the power of matrix factorization vectors for music recommendations back in 2016: https://tech.iheart.com/mapping-the-world-of-music-using-mac...

We also discussed how we used convolutional neural networks (deep networks, but not transformers) to build vectors on the acoustic content of music: https://tech.iheart.com/mapping-the-world-of-music-using-mac...

forgingahead · 4y ago

Thanks for the nice and clear summary!

At what point does it make sense to switch to using vector search solutions (compared to keyword searches)? Obviously Google et al need it, but for regular apps, maybe academic repositories and so on, is there a threshold at which we can start the discussion to switch?

Or phrased another way, when do we know we should be looking at vector search solutions to enhance the search in our application?

unbanned · 4y ago

Why 768? What are these dimensions?

Scaevolus · 4y ago

No, those are standard full-text indexing data structures: vectors representing word positions.

This is for finding nearby N-dimensional points, where N is typically greater than 50 and the point is the output of an ML or NN process.

animanoir · 4y ago · 5 in thread

Suddenly vector search is relevant. Was this orchestrated?

gk1 · 4y ago

This happens a lot. Usually starts with one popular post about a topic. Then someone explores deeper and finds another interesting article on the same topic, and submits it.

sangnoir · 4y ago

It's an old, mutually beneficial PR arrangement known as the Newton-Leibniz calculus

unhammer · 4y ago

It has happened that I read one article, found something related (following links or looking things up) that I found even more interesting, and then submitted that.

tonetheman · 4y ago

I feel the same. It is not by chance, I would think.

ladon86 · 4y ago

Not everything is a conspiracy.

Google announced their new vector search service today, and that prompted someone on HN to post a related library that they knew about. It’s certainly possible that that person works at FB, but that doesn’t mean it’s some kind of orchestrated PR maneuver.

gk1 · 4y ago · 2 in thread

Fun seeing vector search all over the front page today. :)

The official documentation on Faiss is rather light, so we made “The Missing Manual to Faiss”: https://www.pinecone.io/learn/faiss-tutorial/

Previous discussion: https://news.ycombinator.com/item?id=29291047

Radim · 4y ago

I've built a few vector search engines too, so this was an immediate red flag: "Brute force takes 12 seconds / query on 1 million vectors of 768 dim".

No, a sane brute-force search (via BLAS) at that size should take ~200 ms / query. I.e. SIXTY TIMES faster!

If they (Faiss?) got this wrong, what else did they get wrong?

I understand researchers want to showcase their "best and fastest" approach, so they fudge the baselines. Approximate search can be genuinely useful – orders of magnitude faster than (even non-fudged) brute force, and using less RAM too.

But as a user, tech stack complexity is also a consideration. Because the trade-off is not only "speed vs accuracy". Brute force is a trivial algorithm, easy to implement and maintain with no corner cases. It has completely predictable data access patterns (linear, sequential, fixed response time, 100% accuracy). It supports operations (update, range, dynamic k-NN) that complex indexes struggle with.

So if your dataset is tiny – and anything under 1 million counts as tiny – do you really need to maintain an external dependency of fancy data structures and approximate algorithms?
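For reference, a brute-force search of the kind described above fits in a few lines of NumPy. This is a sketch under stated assumptions, not Faiss's implementation and not a benchmark: `brute_force_knn` is a made-up name, the data is random, and the sizes are kept small so it runs anywhere (1 million float32 vectors of 768 dims would need about 3 GB of RAM). The heavy lifting is a single matrix multiply, which NumPy hands to whatever BLAS it was built against.

```python
import numpy as np


def brute_force_knn(queries, database, k):
    """Exact k-nearest neighbours by squared L2 distance.

    Uses the expansion ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2 so that the
    expensive part is one matrix product (queries @ database.T).
    Returns (distances, indices), each of shape (num_queries, k),
    sorted from nearest to farthest.
    """
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)      # (nq, 1)
    x_norms = (database ** 2).sum(axis=1)                    # (n,)
    dists = q_norms - 2.0 * queries @ database.T + x_norms   # (nq, n)

    # Pick the k smallest per row without fully sorting every row ...
    idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    rows = np.arange(queries.shape[0])[:, None]
    # ... then sort just those k candidates.
    order = np.argsort(dists[rows, idx], axis=1)
    idx = idx[rows, order]
    return dists[rows, idx], idx


rng = np.random.default_rng(0)
database = rng.standard_normal((10_000, 64)).astype(np.float32)
# Queries are slightly perturbed copies of the first three database rows,
# so the expected nearest neighbours are rows 0, 1 and 2.
noise = 0.01 * rng.standard_normal((3, 64)).astype(np.float32)
queries = database[:3] + noise

distances, indices = brute_force_knn(queries, database, k=5)
print(indices[:, 0])
```

There is no training step and no index to rebuild: adding or removing a vector is just a change to the `database` array, which is the maintenance argument made above.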

jhj · 4y ago

1 vector against 1 million vectors in 768 dims at k = 10 takes 259 ms for me using Faiss CPU IndexFlatL2 with Intel MKL:

https://gist.github.com/wickedfoo/165b69075cfcceba872aec1c46...

threatofrain · 4y ago · 1 in thread

Another HN post from today on vector search.

https://news.ycombinator.com/item?id=29554986

prophesi · 4y ago

And I'm wondering if these are all making the front page because of this Show HN from earlier.

https://news.ycombinator.com/item?id=29551947

s-xyz · 4y ago · 1 in thread

Funny to see this thread, as I recall Google just recently posted a similar vector search library.

amelius · 4y ago

But nobody is posting benchmarks.

sydthrowaway · 4y ago · 1 in thread

What is vector search?

gk1 · 4y ago

https://www.pinecone.io/learn/vector-search-basics/

kristjansson · 4y ago

See also: Ann-benchmarks[0], Annoy[1], ScaNN[2]

[0]: http://ann-benchmarks.com
[1]: https://github.com/spotify/annoy
[2]: https://github.com/google-research/google-research/tree/mast...

dmitriid · 4y ago

How does one update the index? All I see at a quick glance is how to create/re-create/train/re-train the dataset from scratch.

monkeybutton · 4y ago

Are Facebook's look-alike audiences implemented using faiss?

amelius · 4y ago

Show me the benchmark results!

timdaub · 4y ago

fuck facebook*

*Facebook is a bad company that puts profit over people, and it should be boycotted. People who work at Facebook are complicit in its policies, so their publications should be boycotted too.
