
Why HNSW is not the answer and disk-based alternatives might be more practical

(blog.pgvecto.rs)

138 points · kevlened · 1y ago · 64 comments


rekoros · 1y ago · 6 in thread

Yep, I believe it

HNSW has been great for relatively small datasets (a website with 30K pages), but it’s a dangerous stretch for anything bigger

bluecoconut · 1y ago

I don’t quite understand this - by 30k pages, is this the number of entries in your index? Did you mean 30M?

At the <100k scale I just compute the full inner product directly, and I don’t mess with vector stores or added complexity. No ANN algo needed; they’ll all be slower than exact kNN reranking. (10k × 768 × 4 bytes ≈ 30MB; a scan over it plus a sort takes ~100us or faster.) Frankly, I’ve even sent the vectors to the client at the 5k scale and done the search client-side in JS.
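A minimal sketch of the exact scan described above, in pure Python for clarity. The function names, the toy dimension, and the dataset size are assumptions for illustration; a real deployment at this scale would more likely use a vectorized or SIMD-backed library.

```python
import heapq
import math
import random


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def exact_top_k(query, vectors, k):
    """Score every vector against the query and keep the k best (score, index) pairs."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)


# Toy data: 1,000 random unit vectors of dimension 8 (both sizes are arbitrary).
random.seed(0)
vectors = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(1000)]
top = exact_top_k(vectors[42], vectors, k=5)
print([idx for _, idx in top])  # the query's own index, 42, should rank first
```

There is no index to build or keep in sync here: every query is a full scan, so results are exact and updates are just list appends.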

Often, I find I use an ANN algo / index to get my nearest 10k, then I do the final reranking with more expensive algorithms/compute in that reduced space.

The original HNSW paper was testing/benchmarking at the 5M-15M scales. That’s where it shines compared to alternatives.

When pushing to the 1B scale (I have an instance at 200M) the memory consumption does become a frustration (100GB of ram usage). Needing to vertically scale nodes that use the index. But it’s still very fast and good. I wouldn’t call it “dangerous” just “expensive”.

Interestingly though, I found that the usearch package worked great and let me split and offload indexes into separate files on disk. That greatly lowered RAM usage, and latency is still quite good on average, but it has some spikes (e.g. a nearest-10k query can sometimes take ~1-3 seconds on the 200M dataset).

Hi, I'm the author of the article. Please check out our vector search extension in postgres, VectorChord [1]. It's based on RabitQ (a new quantization method) + IVF. It achieves 10ms-level latency for top-10 searches on a 100M dataset and 100ms-level latency when using SSD with limited memory.

[1] https://blog.pgvecto.rs/vectorchord-store-400k-vectors-for-1...

You're dealing with much larger datasets than I have, so far - mine is only a few million vectors. I have a hard constraint on resources, so had to get things working quickly in a relatively gutless environment.

aabhay · 1y ago

What systems would you recommend for larger deployments?

I ended up breaking up/sharding HNSW across multiple tables, but I'm dealing with many distinct datasets, each one just small enough to make HNSW effective in terms of index build/rebuild performance.

The article suggests IVF for larger datasets - this is the direction I'd certainly explore, but I've not personally had to deal with it. HNSW sharding/partitioning might actually work even for a very large - sharded/partitioned - dataset, where each query is a parallelized map/reduce operation.
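The parallel map/reduce query over shards mentioned above could look roughly like this. It is a sketch under stated assumptions: each shard is a plain dict of id to vector, and a brute-force scan stands in for the per-shard HNSW index; `search_shard` and `search_all` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Map step: local top-k for one shard as (distance, id) pairs.

    A brute-force scan stands in for the per-shard HNSW index here.
    """
    return heapq.nsmallest(
        k, ((l2_sq(vec, query), vid) for vid, vec in shard.items())
    )


def search_all(shards, query, k):
    """Reduce step: merge the per-shard top-k lists into a global top-k."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    return heapq.nsmallest(k, (hit for part in partials for hit in part))


shards = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [9.0, 9.0]},
]
print(search_all(shards, [0.0, 0.0], k=2))
```

Asking each shard for its own top k is enough to recover the global top k when the per-shard search is exact; with an approximate per-shard index, the merged result inherits each shard's recall.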

redskyluan · 1y ago

Why not try Milvus? You get multiple index types (SIMD-based brute-force search, IVF, HNSW, and DiskANN), and you never have to worry about scaling.

PaulHoule · 1y ago · 5 in thread

I've been doing vector search since 2002 or so and it's amazing how far you can get keeping vectors in RAM and using primitive search algorithms, enough that I'm afraid the market for vector databases is 1% of what VCs think it is. (e.g. full scan was good enough for a search engine of international patents and non-patent literature)

montebicyclelo · 1y ago

This 100%. Vector DBs have been heavily pushed by vector DB companies and cloud providers. Companies often have mere MBs of documents to search through for their use cases, amounts that trivially fit into RAM and can be searched via dot product in milliseconds, yet managers and engineers less familiar with the space think people are doing it wrong if they don't use a vector DB. So vector DBs end up getting used when a simpler in-memory, non-approximate solution would be fine (but less monetizable).

VCs are terrible at technical due diligence so the money will keep pouring in.

However, CTOs are also terrible at assessing their technical necessities and vector databases will have customers the same way web scale databases did.

zackangelo · 1y ago

This is so true. A plain old exhaustive SIMD-optimized similarity search will do just fine in many cases and not have any of the approximation tradeoffs of HNSW.

PhilippGille · 1y ago

In chromem-go [1] I'm searching through 100,000 vectors in 40ms on a mid-range laptop CPU, even without SIMD. It's quick enough for many use cases.

[1] https://github.com/philippgille/chromem-go

raverbashing · 1y ago

This is for which vector dimension and number of vectors? (just curious)

binarymax · 1y ago · 4 in thread

HNSW became the de facto default specifically because you don’t need to precalculate the index and it updates as writes come in.

This is a great article, and all the points are there, but the truth is that most teams running million-scale vectors don’t want the operational complexity of quantizing offline at some frequency. They’ll gladly outsource the costs to paying for RAM instead of some IVFPQ calculation.

However if there were a solution that “just worked” to handle PQ for shards in near real time for updates that also had sane filtering, that would be really nice.

At PlanetScale, we have a really nice solution for this in MySQL using SPANN + SPFresh. The way we've implemented it allows for pre and post filtering, full compatibility with WHERE clause filtering, and has the same kind of ACID compliance you'd expect from a relational database. You can read about it here:

https://planetscale.com/blog/announcing-planetscale-vectors-...

Hi, I'm the author of the article. In our product, VectorChord, we use a quantization algorithm called RaBitQ, which doesn’t require a separate codebook. Unlike IVFPQ, it avoids the need to maintain and update the corresponding codebook, so the update issue you mentioned is not a problem. Regarding filtering, I’m not sure which specific scenario you’re referring to, but we currently support iterative post-filtering and are technically capable of perfectly supporting pre-filtering as well.

binarymax · 1y ago

Pre and post filtering are both not great. Some HNSW implementations in products like Vespa and Qdrant have filter-during-search.

This remains an unsolved problem in cluster-based indices.

nostrebored · 1y ago

Yes, real time updates, filtering, and multi vector support make most of these on device, in memory approaches untenable. If you really are just doing a similarity search against a fixed set of things, often you know the queries ahead of time and can just make a lookup table.

tw04 · 1y ago · 4 in thread

> Its reliance on frequent random access patterns makes it incompatible with the sequential access nature of disk storage.

So use SSD? Are people seriously still trying to use spinning disk for database workloads in 2024?

jandrewrogers · 1y ago

An SSD does not solve the problem of page fault chasing, it just makes it slightly less bad. This is fundamentally a software architecture problem.

This is solved with latency-hiding I/O schedulers, which don’t rely on cache locality for throughput.

Hi, I'm the author of the article. The sequential access pattern of IVF makes prefetching and large block sequential reads much easier, whereas it's almost impossible for HNSW to achieve efficient prefetching.

redskyluan · 1y ago

Even SSDs won't be fast enough for most indexes due to the random access nature. I've seen more than 1M IOPS on a huge NVMe disk when using a DiskANN index.

JoeAltmaier · 1y ago

Data farms are all about cost per GB. Spinning media is a fraction of the cost per GB.

intalentive · 1y ago · 3 in thread

SIMD-accelerated search over binary vectors is very fast and you can fit a lot in memory. Then rerank over f32 from disk.
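The two-stage pattern above (coarse pass over binary vectors, then rerank on the full-precision vectors) can be sketched as follows. Sign-bit quantization and all function names are assumptions for illustration; pure Python stands in for SIMD kernels, and an in-memory list stands in for the f32 vectors that would live on disk.

```python
import heapq


def to_bits(vec):
    """Pack the sign of each component into one Python int (1 bit per dimension)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits: XOR followed by a population count."""
    return bin(a ^ b).count("1")


def search(query, vectors, codes, k, shortlist):
    """Coarse pass over binary codes, then exact rerank on the float vectors."""
    q_bits = to_bits(query)
    candidates = heapq.nsmallest(
        shortlist, range(len(codes)), key=lambda i: hamming(q_bits, codes[i])
    )

    # Rerank only the shortlist; in the setup described above these float
    # vectors would be fetched from disk rather than held in a list.
    def dot(i):
        return sum(q * v for q, v in zip(query, vectors[i]))

    return heapq.nlargest(k, candidates, key=dot)


vectors = [
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, -1.0],
    [1.0, 1.0, 1.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
]
codes = [to_bits(v) for v in vectors]
print(search([1.0, 1.0, 1.0, 1.0], vectors, codes, k=1, shortlist=3))
```

The `shortlist` size is the knob that trades recall for I/O: a larger shortlist recovers more true neighbors that the 1-bit codes ranked poorly, at the cost of more full-precision reads.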

I tried a few alternatives and found that SQLite + usearch extension wins for my use cases (< 1 million records), as measured by latency and simplicity. I believe this approach can scale up to hundreds of millions of records.

I've been using DuckDB similarly and loving the simplicity. What sort of latency are you seeing with your setup? I'm hitting 300ms to query against 10M vectors and could see that becoming a bottleneck if going into the hundreds of millions.

Hi, I'm the author of the article. Please check out our vector search extension in postgres, VectorChord [1]. It achieves 10ms-level latency for top-10 searches on a 100M dataset and 60ms-level latency when using SSD with limited memory.

[1] https://blog.pgvecto.rs/vectorchord-store-400k-vectors-for-1...

ashvardanian · 1y ago

DuckDB also uses USearch and the underlying SimSIMD library for kernels, but both were originally designed for a Billion+ scale.

Depending on how the wrapper is implemented you can get different numbers. With the raw libraries, on larger machines, I'd expect 200'000 requests per second for 1 Billion entries.

tveita · 1y ago · 3 in thread

> For example, the typical distance computation complexity between two D-dimensional vectors is O(D^2), but compressing floats into bits reduces this by a factor of 1024 (32x32).

This isn't right, cosine distance between two vectors is definitely O(D)…

Of course replacing float multiplications with xor+popcount is a nice speedup in computation, but assuming you're memory bandwidth limited, speedup should be linear.
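To make the linear cost concrete, here is a small sketch of the xor+popcount replacement for a dot product between sign vectors. It assumes each float is reduced to its sign bit (one common form of 1-bit quantization, not necessarily the article's exact scheme); `pack_signs` and `sign_dot` are hypothetical names.

```python
def pack_signs(vec):
    """Pack a vector into an int: bit i is 1 when component i is positive."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def sign_dot(a_bits, b_bits, dim):
    """Inner product of two {-1, +1}^dim vectors computed from their packed bits.

    Matching positions contribute +1 and differing positions -1, so
    dot = (dim - h) - h = dim - 2 * h, where h is the Hamming distance.
    """
    h = bin(a_bits ^ b_bits).count("1")
    return dim - 2 * h


a = [1.0, -1.0, 1.0, 1.0]
b = [1.0, 1.0, -1.0, 1.0]

float_dot = sum(x * y for x, y in zip(a, b))          # D multiply-adds
packed_dot = sign_dot(pack_signs(a), pack_signs(b), len(a))

print(float_dot, packed_dot)
```

Both paths touch every dimension once, so both are O(D). The packed form stores 1 bit instead of 32 per dimension, which is where the 32x reduction in memory traffic comes from.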

Hi, I'm the author of the article. I think you're right, and we’ll update the description in the blog. Since binary operations are simpler than floating-point operations, the speedup could indeed be greater than 32x.

DoctorOetker · 1y ago

Some snippets:

> Leverages concentration of measure phenomena

> Uses anisotropic vector quantization to optimize inner product accuracy by penalizing errors in directions that impact high inner products, achieving superior recall and speed.

I only skimmed the article, but the 2 words I emphasized seem to imply they apply a quadratic metric for distance, i.e. they assume the data coordinates are with respect to non-orthogonal basis vectors, resulting in off-diagonal distance metric terms.

Hi, I'm the author of the article. We actually rely on a new quantization method called RaBitQ instead of ScaNN. You can read more about it at https://dev.to/gaoj0017/quantization-in-the-counterintuitive....

generall · 1y ago · 2 in thread

IVF, unfortunately, is barely compatible with filtered search. It has to rely on post-filtering and retrieve more and more candidates if the result set is not big enough. If the query is in some form correlated with the filter, this approach quickly degrades to brute-force.

Surprised that the article doesn't mention the filtering use case at all.
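The degradation described above can be illustrated with a small post-filtering loop. This is a sketch, not any product's implementation: a brute-force nearest-n call stands in for the ANN index, the doubling schedule is an arbitrary choice, and `filtered_search` is a hypothetical name.

```python
import heapq


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_search(query, items, predicate, k):
    """Post-filtering: fetch the nearest n, filter, and widen n until k pass.

    `items` is a list of (vector, metadata) pairs. Returns the matching
    indices and the number of candidates fetched in the final round.
    """
    n = k
    while True:
        nearest = heapq.nsmallest(
            n, range(len(items)), key=lambda i: l2_sq(items[i][0], query)
        )
        hits = [i for i in nearest if predicate(items[i][1])]
        if len(hits) >= k or n >= len(items):
            return hits[:k], n
        n = min(2 * n, len(items))


# Eight points on a line; only the two farthest from the query pass the filter.
items = [([float(x)], {"lang": "rust" if x >= 6 else "other"}) for x in range(8)]
result, fetched = filtered_search(
    [0.0], items, lambda meta: meta["lang"] == "rust", k=1
)
print(result, fetched)
```

Because the matching items sit far from the query, the loop keeps doubling until it has fetched the whole collection, which is the brute-force behavior the comment warns about.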

Hi, I'm the author of the article. I actually think the opposite of what you mentioned. IVF is more suitable for prefiltering compared to HNSW. For prefiltering in HNSW, there is a certain requirement for the filtering rate: it can't be too low, or the entire graph may become disconnected. For instance, with the commonly used parameter m=16, each node can have at most 16 neighbors. If the filtering rate is below 5%, a node may have no neighbors that meet the condition, causing the algorithm to fail entirely. This is why the academic community has proposed alternatives like ACORN [1]. On the other hand, IVF doesn't have this problem at all: you can check whether a candidate meets the condition before calculating distances.

[1] https://arxiv.org/abs/2403.04871
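A toy version of the check-before-distance idea for IVF, under stated assumptions: centroids are given rather than trained, the filter is a predicate on the vector id, and `build_ivf` / `ivf_search` are hypothetical names rather than any library's API.

```python
import heapq


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector id to the posting list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2_sq(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, allowed, k, nprobe):
    """Probe the nprobe closest lists, skipping filtered-out ids before any distance work."""
    order = sorted(range(len(centroids)), key=lambda c: l2_sq(query, centroids[c]))
    candidates = []
    for c in order[:nprobe]:
        for vid in lists[c]:
            if not allowed(vid):  # cheap predicate check comes first
                continue
            candidates.append((l2_sq(query, vectors[vid]), vid))
    return [vid for _, vid in heapq.nsmallest(k, candidates)]


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]
lists = build_ivf(vectors, centroids)

# The filter only admits ids 2 and 3, which live in the far posting list.
print(ivf_search([0.0, 0.0], vectors, centroids, lists, lambda v: v >= 2, k=1, nprobe=1))
print(ivf_search([0.0, 0.0], vectors, centroids, lists, lambda v: v >= 2, k=1, nprobe=2))
```

The first call returns nothing because the single probed list holds no admissible id; raising `nprobe` finds a match at the cost of scanning more lists.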

In IVF you can start checking conditions only in the final bucket. There are no guarantees if the bucket has any acceptable value, and there are no procedures to find the bucket which has acceptable values before scanning it.

nostrebored · 1y ago · 2 in thread

Nit: the drawback of “not working well in disk based systems” isn’t a drawback unless you’re already using disk based systems.

The difference in recall is also significant — what you really get with HNSW is a system made to give good cost:approximation quality. These IVFPQ based systems are ones I’ve seen people rip and replace if the use case is high value.

I really don’t understand the push to make pg do everything. It wasn’t designed for search, and trying to shove these features into the platform feels like some misguided cost optimization that puts all of your data infrastructure on the same critical path.

Hi, I'm the author of the article. In our actual product, VectorChord, we adopted a new quantization algorithm called RaBitQ. The accuracy has not been compromised by the quantization process. We’ve provided recall-QPS comparison curves against HNSW, which you can find in our blog: https://blog.pgvecto.rs/vectorchord-store-400k-vectors-for-1....

Many users choose PostgreSQL because they want to query their data across multiple dimensions, including leveraging time indexes, inverted indexes, geographic indexes, and more, while also being able to reuse their existing operational experience. From my perspective, vector search in PostgreSQL does not have any disadvantages compared to specialized vector databases so far.

nostrebored · 1y ago

But why are you benchmarking against pgvector HNSW, which is known to struggle with recall and performance at large numbers of vectors?

Why is the graph measuring precision and not recall?

The feature dump is entirely a subset of Vespa features.

This is just an odd benchmark. I can tell you in the wild, for revenue attached use cases, I saw _zero_ companies choose pg for embedding retrieval.

cratermoon · 1y ago · 2 in thread

"Its reliance on frequent random access patterns makes it incompatible with the sequential access nature of disk storage."

Is this true for SSD storage, or does it only apply to spinning metal platters?

Hi, I'm the author of the article. The sequential access pattern of IVF makes prefetching and large block sequential reads much easier, whereas it's almost impossible for HNSW to achieve efficient prefetching.

cratermoon · 1y ago

Yes, I get that, but does the large block sequential read pattern matter with SSDs, or do the benefits only accrue with spinning disk drives?

ayende · 1y ago · 1 in thread

The problem with IVF is that you need to find the right centroids. And that doesn't work well if your data grow and mutate over time.

Splitting a centroid is a pretty complex issue.

So is clustering in one area. For example, let's assume that you hold StackOverflow questions & answers. Now you have a massive amount of additional data (> 25% of the existing dataset) that talks about Rust.

You either need to re-calculate the centroids globally, or find a good way to split.

The posting lists are easy to use, but if they are unbalanced, it gets really bad.
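A rough way to see the imbalance problem: keep the centroids frozen, add a burst of new data concentrated in one region, and compare posting list sizes. The max-to-mean ratio used here is just one possible skew measure, and all names and numbers are assumptions for illustration.

```python
def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def posting_list_sizes(vectors, centroids):
    """Count how many vectors fall into each centroid's posting list."""
    sizes = [0] * len(centroids)
    for vec in vectors:
        nearest = min(range(len(centroids)), key=lambda c: l2_sq(vec, centroids[c]))
        sizes[nearest] += 1
    return sizes


def imbalance(sizes):
    """Largest posting list relative to the mean; 1.0 means perfectly balanced."""
    return max(sizes) / (sum(sizes) / len(sizes))


centroids = [[0.0, 0.0], [10.0, 10.0]]  # frozen after the initial build
initial = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0],
           [10.0, 9.0], [9.0, 10.0], [9.5, 9.5], [9.0, 9.0]]
drift = [[10.0 + 0.1 * i, 10.0] for i in range(6)]  # new data, all in one region

before = posting_list_sizes(initial, centroids)
after = posting_list_sizes(initial + drift, centroids)
print(before, imbalance(before))
print(after, imbalance(after))
```

A query that lands in the oversized list scans far more vectors than average, which is why splitting or recomputing centroids becomes necessary as the data drifts.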

Hi, I'm the author of the article. Meta have conducted some experiments on dynamic IVF with datasets of several hundred million records. The conclusion was that recall can be maintained through simple repartitioning and rebalancing strategies. You can find more details here: DEDRIFT: Robust Similarity Search under Content Drift https://arxiv.org/pdf/2308.02752. Additionally, with the help of GPUs, KMeans can be computed quickly, making the cost of rebuilding the entire index acceptable in many cases.

ahevenor · 1y ago · 1 in thread

HNSW can be used for large indexes when combined with an effective caching strategy. Aerospike uses a distributed cache with a query-steering approach, which allows you to have many indexes, including large ones, and load them into memory as your application needs them.

https://aerospike.com/docs/vector/learn/caching

Quantization can be applied exactly in the same way in HNSW. I'm using quantization in the implementation of Redis vector sets and it works very well. I have very big issues with the other points as well, but I prefer to reply to these concerns with the implementation I hope to have into Redis in a short time (early 2025).

About insertion / deletion cost. Sure, they are costly, but if instead of grabbing one of the available implementations you start from scratch, extend the paper in sensible ways, and experiment with it, I think it is possible to do much better than one could initially believe.

I highly recommend prompting an LLM with the following if you are trying to grasp hybrid (keyword, filter + vector) search or HNSW vs IVF topics:

Graph-Based (HNSW) vs Cluster-Based (IVF)

Memory Usage: Higher for HNSW due to graph storage requirements vs Lower for IVF; stores cluster centroids and subsets of vectors

Accuracy: High recall and precision for HNSW, close to exact search vs Recall for IVF depends on the number of clusters and visited partitions

Build Time: Slower for HNSW, Faster for IVF

Best for: Real-time, high-accuracy tasks requiring dynamic updates vs Large, static datasets with strict memory constraints

Notable Graph-Based (HNSW) Variants - Faiss HNSW or PANNG (Proximity and Navigable Neighbor Graph)

Cluster-Based (IVF) Variants - IVF-PQ for memory-constrained environments, Multi-Index IVF partitions, IVF-GPU

What about DiskANN? https://milvus.io/docs/disk_index.md
