Does anyone have feedback on this?
AFAICT: Most of those are basically UIs, data management, & integrations around the same set of vector index libs like FAISS + same set of models like HF, and even the same set of inference server libs (triton, aws/gcp, fastapi, ...). So you'd be evaluating different commercializations of the same core OSS tech. There are useful evaluations to do at that level, but more like licensing, business model, UI, model management, etc.
Another commenter below recognized that regular DBs (ES, Postgres, ..) are starting to add vector indexes. As someone doing a lot of architecting for log/event/graph correlation/investigation systems, I've been tracking whether a managed neural search DB is a feature vs a new category, and how big. Ex: the corporate side of those definitely feels like an Algolia competitor, but for the 99% case we normally do, maybe it's just an OSS feature/lib of whatever DB/framework you're already using? Not obvious!
A "vector" in vector search solutions is a dense vector generated by a transformer model. A sentence like 'a fat cat sat on a mat and ate a fat rat' put through a model becomes an array of floating-point numbers like [0.183, -0.774, ...], with hundreds of values (often 768). The point is that this vector is positioned in a 768-dimensional space in close proximity to semantically similar sentences. So then searching by semantic meaning is a simple (well...) exercise in measuring the distance between your query and the surrounding vectors.
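To make the "measure the distance" step concrete, here is a minimal sketch in plain Python. The 4-dimensional vectors are made-up stand-ins for real model output (which would have hundreds of values), and the sentences, `cosine_similarity` and `rank` are hypothetical names chosen for illustration, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Assumption: hand-written toy "embeddings", NOT produced by a real model.
embeddings = {
    "a fat cat sat on a mat": [0.9, 0.1, 0.0, 0.2],
    "a chubby kitten rested on a rug": [0.8, 0.2, 0.1, 0.3],
    "quarterly revenue grew by 4%": [0.0, 0.9, 0.8, 0.1],
}

def rank(query_vector, corpus):
    """Return the corpus sentences ordered from most to least similar to the query."""
    return sorted(
        corpus,
        key=lambda sentence: cosine_similarity(query_vector, corpus[sentence]),
        reverse=True,
    )

# Assumption: a toy query vector that sits near the two cat sentences.
query = [0.85, 0.15, 0.05, 0.25]
ranked = rank(query, embeddings)
```

In a real system the vectors would come from an embedding model and the linear scan in `rank` would be replaced by an index once the corpus gets large.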
We have a whole course on the topic, coincidentally also on the front page of HN today: https://www.pinecone.io/learn/nlp
Just a clarification that vectors are much broader than text vectors generated by transformer models. A more common application has been recommender models built on matrix factorization and similar approaches. word2vec, dating back to 2013, was another popular way to generate vectors. Vectors are a very general approach with many benefits regardless of how they were generated. That is what makes these vector search libraries so exciting.
I wrote an article about the power of matrix factorization vectors for music recommendations back in 2016: https://tech.iheart.com/mapping-the-world-of-music-using-mac...
We also discussed how we used convolutional neural networks (deep networks, but not transformers) to build vectors on the acoustic content of music: https://tech.iheart.com/mapping-the-world-of-music-using-mac...
At what point does it make sense to switch to vector search solutions (compared to keyword search)? Obviously Google et al need it, but for regular apps, maybe academic repositories and so on, is there a threshold at which it makes sense to start discussing a switch?
Or phrased another way, when do we know we should be looking at vector search solutions to enhance the search in our application?
This is for finding nearby N-dimensional points, where N is typically greater than 50 and the point is the output of an ML or NN process.
Google announced their new vector search service today, and that prompted someone on HN to post a related library that they knew about. It’s certainly possible that that person works at FB, but that doesn’t mean it’s some kind of orchestrated PR maneuver.
The official documentation on Faiss is rather light, so we made “The Missing Manual to Faiss”: https://www.pinecone.io/learn/faiss-tutorial/
Previous discussion: https://news.ycombinator.com/item?id=29291047
No, a sane brute-force search (via BLAS) at that size should take ~200ms per query. I.e. SIXTY TIMES faster!
If they (Faiss?) got this wrong, what else did they get wrong?
I understand researchers want to showcase their "best and fastest" approach, so they fudge the baselines. Approximate search can be genuinely useful – orders of magnitude faster than (even non-fudged) brute force, and using less RAM too.
But as a user, tech stack complexity is also a consideration. Because the trade-off is not only "speed vs accuracy". Brute force is a trivial algorithm, easy to implement and maintain with no corner cases. It has completely predictable data access patterns (linear, sequential, fixed response time, 100% accuracy). It supports operations (update, range, dynamic k-NN) that complex indexes struggle with.
So if your dataset is tiny – and anything under 1 million counts as tiny – do you really need to maintain an external dependency of fancy data structures and approximate algorithms?
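For reference, a brute-force baseline along these lines fits in a few lines of NumPy. This is a sketch under assumptions, not the benchmark code being discussed: the function name `brute_force_knn`, the random data, and the sizes are all made up for illustration, and no timing is claimed. The matrix product is the part NumPy normally hands to whatever BLAS it was built against.

```python
import numpy as np

def brute_force_knn(database, queries, k):
    """Exact k-nearest neighbours by squared L2 distance (hypothetical helper).

    Uses ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2 so the heavy lifting is a
    single matrix product.
    """
    db_sq = np.einsum("ij,ij->i", database, database)    # ||x||^2 per database row
    q_sq = np.einsum("ij,ij->i", queries, queries)       # ||q||^2 per query row
    dists = q_sq[:, None] - 2.0 * (queries @ database.T) + db_sq[None, :]

    # argpartition selects the k smallest per row without a full sort...
    idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    # ...then only those k candidates are sorted by distance.
    order = np.argsort(np.take_along_axis(dists, idx, axis=1), axis=1)
    idx = np.take_along_axis(idx, order, axis=1)
    return idx, np.take_along_axis(dists, idx, axis=1)

# Assumption: random data standing in for real embeddings.
rng = np.random.default_rng(0)
database = rng.standard_normal((10_000, 64)).astype(np.float32)
# Queries are slightly perturbed copies of the first five database rows,
# so the expected nearest neighbour of query i is row i.
queries = database[:5] + 0.01 * rng.standard_normal((5, 64)).astype(np.float32)

indices, distances = brute_force_knn(database, queries, k=3)
```

Updates are just appending or overwriting rows of `database`, and changing `k` per query costs nothing, which is the operational simplicity being argued for above.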
https://gist.github.com/wickedfoo/165b69075cfcceba872aec1c46...
[0]: http://ann-benchmarks.com
[1]: https://github.com/spotify/annoy
[2]: https://github.com/google-research/google-research/tree/mast...
*Facebook is a bad company that puts profit over people and they should be boycotted. People who work at Facebook are complicit in its policies, so their publications should be boycotted too.