HNSW has been great for relatively small datasets (a website with 30K pages), but it’s a dangerous stretch for anything bigger
At the <100k scale I just full compute / inner product directly, and I don’t mess with vector stores or added complexity. No ANN algo needed — they’ll all be slower than actual exact kNN re ranking. (10k7684 =30MB, a scan over it and a sort is on the ~100us or faster). frankly, I’ve even sent at the 5k scale to client and done that client side in JS.
Often, I find i use an ANN algo / index to get me my nearest 10k then I do final re ranking with more expensive algorithms/compute in that reduced space.
The original HNSW paper was testing/benchmarking at the 5M-15M scales. That’s where it shines compared to alternatives.
When pushing to the 1B scale (I have an instance at 200M) the memory consumption does become a frustration (100GB of ram usage). Needing to vertically scale nodes that use the index. But it’s still very fast and good. I wouldn’t call it “dangerous” just “expensive”.
Interestingly though, I found that usearch package worked great and let me split and offload indexes into separate files on disk, greatly lowered ram usage and latency is still quite good on average, but has some spikes (eg. sometimes when doing nearest 10k though can be ~1-3 seconds on the 200M dataset)
[1] https://blog.pgvecto.rs/vectorchord-store-400k-vectors-for-1...
The article suggests IVF for larger datasets - this is the direction I'd certainly explore, but I've not personally had to deal with it. HNSW sharding/partitioning might actually work even for a very large - sharded/partitioned - dataset, where each query is a parallelized map/reduce operation.
However, CTOs are also terrible at assessing their technical necessities and vector databases will have customers the same way web scale databases did.
This is a great article, and all the points are there, but the truth is that most teams running million scale vectors don’t want the operational complexity of quantizing offline in some frequency. They’ll gladly outsource the costs to paying for RAM instead of some IVFPQ calculation.
However if there were a solution that “just worked” to handle PQ for shards in near real time for updates that also had sane filtering, that would be really nice.
https://planetscale.com/blog/announcing-planetscale-vectors-...
This remains an unsolved problem in cluster-based indices.
So use SSD? Are people seriously still trying to use spinning disk for database workloads in 2024?
This is solved with latency-hiding I/O schedulers, which don’t rely on cache locality for throughput.
I tried a few alternatives and found that SQLite + usearch extension wins for my use cases (< 1 million records), as measured by latency and simplicity. I believe this approach can scale up to hundreds of millions of records.
[1] https://blog.pgvecto.rs/vectorchord-store-400k-vectors-for-1...
Depending on how the wrapper is implemented you can get different numbers. With the raw libraries, on larger machines, I'd expect 200'000 requests per second for 1 Billion entries.
This isn't right, cosine distance between two vectors is definitely O(D)…
Of course replacing float multiplications with xor+popcount is a nice speedup in computation, but assuming you're memory bandwidth limited, speedup should be linear.
> Leverages concentration of measure phenomena
> Uses anisotropic vector quantization to optimize inner product accuracy by penalizing errors in directions that impact high inner products, achieving superior recall and speed.
I only skimmed the article, but the 2 words I emphasized seems to imply they apply a quadratic metric for distance, i.e. they assume the data coordinates are with respect to non-orthogonal basis vectors, resulting in off-diagonal distance metric terms.
Surprised that the article doesn't mention filtering use-case at all.
The difference in recall is also significant — what you really get with HNSW is a system made to give good cost:approximation quality. These IVFPQ based systems are ones I’ve seen people rip and replace if the use case is high value.
I really don’t understand the push to make pg do everything. It wasn’t designed for search, and trying to shove these features into the platform feels like some misguided cost optimization that puts all of your data infrastructure on the same critical path.
Many users choose PostgreSQL because they want to query their data across multiple dimensions, including leveraging time indexes, inverted indexes, geographic indexes, and more, while also being able to reuse their existing operational experiences. From my perspective, vector search in PostgreSQL does not have any disadvantages compared to specialized vector databases so fat.
Why is the graph measuring precision and not recall?
The feature dump is entirely a subset of Vespa features.
This is just an odd benchmark. I can tell you in the wild, for revenue attached use cases, I saw _zero_ companies choose pg for embedding retrieval.
Is this true for SSD storage, or does it only apply to spinning metal platters?
Splitting a centroid is a pretty complex issue.
As are clustering in an area. For example, let's assume that you hold StackOverflow questions & answers. Now you have a massive amount of additional data (> 25% of the existing dataset) that talks about Rust.
You either need to re-calculate the centroids globally, or find a good way to split.
The posting list are easy to use, but if you are unbalanced, it gets really bad.
About insertion / deletion cost. Sure, they are costly, but if instead of grabbing one of the available implementations you start from scratch, extend the paper in sensible ways, and experiment with it, I think it is possible to do much better than one could initially believe.
Graph-Based (HNSW) vs Cluster-Based (IVF)
Memory Usage: Higher for HNSW due to graph storage requirements vs Lowerfor IVF; stores cluster centroids and subsets of vectors
Accuracy: High recall and precision, close to exact search vs Recall depends on the number of clusters and visited partitions
Build Time: Slower for HNSW, Faster for IVF
Best for: Real-time, high-accuracy tasks requiring dynamic updates vs Large, static datasets with strict memory constraints
Notable Graph-Based (HNSW) Variants - Faiss HNSW or PANNG (Proximity and Navigable Neighbor Graph)
Cluster-Based (IVF) Variants - IVF-PQ for memory-constrained environments, Multi-Index IVF partitions, IVF-GPU