This is an indication to me that something has gone very wrong in your code base.
I'm not sure on what planet all of these people here live that they have success with Linux swap. It's been broken for me forever and the first thing I do is disable it everywhere.
echo y >/sys/kernel/mm/lru_gen/enabled
TBH this was sloppy on my part. I tested multiple runs of the index build and early on kswapd was super busy. I assumed Linux was just caching recently read parts of the source dataset, but it's also possible it was something external to the index build since it's my daily driver machine. After I turned off swap I had no issues and didn't look into it harder.
Edit: see for instance https://insights.oetiker.ch/linux/fadvise.html
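For anyone who doesn't want to click through, here's a minimal sketch of the fadvise idea in Python: read a big file sequentially and tell the kernel it can drop the pages you've already consumed, so a one-off scan doesn't crowd everything else out of the page cache. The function name and chunk size are my own, it assumes Linux (os.posix_fadvise isn't available on every platform, hence the hasattr check), and the advice is only a hint that the kernel is free to ignore.

```python
import os

def read_without_caching(path, chunk_size=1 << 20):
    """Read `path` sequentially and advise the kernel to drop each chunk's
    pages from the page cache once it has been read. Returns bytes read.
    (Hypothetical helper for illustration.)"""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        can_advise = hasattr(os, "posix_fadvise")  # missing on some platforms
        if can_advise:
            # Length 0 means "through the end of the file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            # ... process the chunk here ...
            if can_advise:
                # Hint that the range just read will not be needed again.
                os.posix_fadvise(fd, total, len(chunk), os.POSIX_FADV_DONTNEED)
            total += len(chunk)
    finally:
        os.close(fd)
    return total
```

Whether this would have helped in the index build case is a guess on my part; it only applies if the pressure really came from page cache filled by a sequential read.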
I don't know how Linux does this in particular, but intuitively swapping can make sense if part of your allocated RAM isn't being accessed often and the disk is. The kernel isn't going to know for sure of course, and it seems in my case it guessed wrong.
I'll let my personal laptop swap, though. Especially if my wife is also logged in and has tons of idle stuff open.
I suspect they just need to pass -Xmx options to the JVM to avoid this.
If you're referring to full GC you can configure how often that happens and by default it doesn't just wait until memory is nearly full.
It's insane to me that someone, this early in the gold rush, would be mining in someone else's mine, so to speak
As a first step they are using PQ anyways. It seems natural to just assume all English docs have the same centroid and search that subspace with hnswlib.
Are there some benchmarks available that compare it with the openai model?
He isn’t, according to Wikipedia, my friend who works there, and their company website. https://www.datastax.com/our-people
That’s kind of weird
Wikipedia lists them as a founder. Perhaps their author bio is outdated, or Wikipedia is. Not sure about your friend.
> SANTA CLARA, Calif. – September 28, 2020 – DataStax today announced that DataStax Co-Founder and CTO Jonathan Ellis will deliver a keynote address at ApacheCon @Home 2020
https://www.datastax.com/press-release/datastax-co-founder-a....
As an aside, I'm an ApacheCon presenter but there was no press release about the hot excitement of my involvement. Maybe next time :)
GH Project: https://github.com/jbellis/jvector
The source to build and serve the index is at https://github.com/jbellis/coherepedia-jvector
How is a log N search over S segments O(N)?
To be more correct it’s O(N/C log C) where C is the capacity of a segment. In this case you can treat 1/C and log C as constants, so sure, you technically just have O(N). But this is not super useful, as it says that a segmented HNSW approach and a brute-force approach are the same, when this is really not the case in practice.
Also O(N log N) > O(N) so I’m not sure why we would ever do anything with segmentation according to that analysis if it were correct.
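To put rough numbers on the O(N/C log C) point, here's a back-of-the-envelope sketch in Python. It assumes each per-segment graph search costs about log2(C) comparisons and ignores every constant factor, so treat the output as orders of magnitude only. The function names and example sizes are mine, not from any library.

```python
import math

def segmented_cost(n, c):
    """Comparisons to search n vectors split into segments of capacity c,
    running one ~log2(c) graph search per segment: (n / c) * log2(c)."""
    segments = math.ceil(n / c)
    return segments * math.log2(c)

def single_index_cost(n):
    """One graph search over a single index holding all n vectors."""
    return math.log2(n)

def brute_force_cost(n):
    """Compare the query against every vector."""
    return n

c = 1_000_000  # assumed segment capacity
for n in (10_000_000, 100_000_000, 1_000_000_000):
    print(n,
          round(segmented_cost(n, c)),
          round(single_index_cost(n)),
          brute_force_cost(n))
```

With C fixed, the segmented cost grows linearly in N (doubling N doubles the number of segments), which is where the O(N) comes from, but the multiplier is log2(C)/C per vector rather than 1, which is why it behaves nothing like brute force in practice.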
What's your alternative when you can't build an index larger than C?
I have a few projects I'd like to work on. For typical web projects, I have a "go to" stack and I'd like to add something sensible for vector based search to that.
[article author, I work on JVector and Astra]
A tool that (hopefully) surfaces interesting HN discussion threads; I wanted an excuse to investigate (hybrid) full text and vector search at a substantial scale beyond toy datasets.
Sadly (well not really) I changed jobs soon after building the first version. Life caught up and I never got around to adding more features and polishing up the frontend (e.g. the broken back button).
Ideas for new features are very welcome :)
I haven't seen any news that indicates this has changed, but by all means give it a try!
Can someone explain why?
It's interesting to note that JVector accomplishes this differently than how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector will incrementally compress the vectors it is using to construct the index; whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in-memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index.
OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small RAM machine) vs. an index built with JVector using the raw vectors during construction (big RAM machine)? I'd be curious to understand what this technique's impact is on the final search results.
I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN.
Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me!
- DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520
- Anisotropic Vector Quantization (PQ Compression): https://arxiv.org/abs/1908.10396
- JVector/#168: How to support building larger-than-memory indexes https://github.com/jbellis/jvector/issues/168
- JVector/#244: Build indexes using compressed vectors https://github.com/jbellis/jvector/pull/244
One interesting property in benchmarking is that the distance comparison implementations for full-dim vectors can often be more efficient than those for PQ-compressed vectors (straight-line SIMD execution vs table lookups), so on some systems cluster-and-merge is relatively competitive in terms of build performance.
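A toy illustration of the "straight-line vs table lookups" difference, in plain Python. This is not JVector's implementation: the codebooks are hand-written rather than trained, and all the names are made up for the example. The full-dimension distance is one contiguous pass over the components, while the PQ distance precomputes one table per subspace for the query and then does one indexed lookup per subspace for each encoded vector.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def full_distance(query, vector):
    # One pass over contiguous components; this is the shape that
    # vectorizes well with SIMD in a real implementation.
    return squared_l2(query, vector)

def encode(vector, codebooks):
    """Replace each subvector with the index of its nearest centroid.
    codebooks[m] is the list of centroids for subspace m."""
    code, offset = [], 0
    for centroids in codebooks:
        sub_dim = len(centroids[0])
        sub = vector[offset:offset + sub_dim]
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_l2(sub, centroids[i]))
        code.append(nearest)
        offset += sub_dim
    return code

def build_lookup_tables(query, codebooks):
    """Per query: distance from each query subvector to every centroid."""
    tables, offset = [], 0
    for centroids in codebooks:
        sub_dim = len(centroids[0])
        sub_query = query[offset:offset + sub_dim]
        tables.append([squared_l2(sub_query, c) for c in centroids])
        offset += sub_dim
    return tables

def pq_distance(tables, code):
    # One data-dependent lookup per subspace (a gather, not a linear scan).
    return sum(table[idx] for table, idx in zip(tables, code))

# Hand-written codebooks: 2 subspaces of 2 dims, 2 centroids each.
codebooks = [[[0, 0], [1, 1]], [[0, 0], [2, 2]]]
vector = [1, 1, 2, 2]
query = [0, 0, 0, 0]

code = encode(vector, codebooks)
tables = build_lookup_tables(query, codebooks)
print(full_distance(query, vector), pq_distance(tables, code))
```

In this example the vector sits exactly on its centroids, so both distances agree. With real data the PQ distance is an approximation.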
I've tested the build-with-compression approach used here with all the datasets in JVector's Bench [1] and there's near zero loss in accuracy.
I suspect that the reason the DiskANN authors used the approach they did is that in 2019 Deep1B was about the only very large public dataset around, and since the vectors themselves are small your edge lists end up dominating your memory usage. So they came up with a clever solution, at the cost of making construction 2.5x as expensive. (Educated guess: 2x is from adding each vector to multiple partitions and the extra 50% to merge the results.)
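Rough numbers, in case anyone wants to sanity-check the "edge lists dominate" intuition with their own parameters. The assumptions are all mine: 96 dimensions for Deep1B (as I recall), float32 components, a max degree of 64, and 4-byte neighbor ids. Whether the edges dominate outright depends on the degree you pick and on whether the vectors are held compressed. The sketch only shows how the edge share of memory grows as the vectors get smaller.

```python
def index_memory_bytes(num_vectors, dims, max_degree,
                       bytes_per_component=4, bytes_per_edge=4):
    """Return (vector_bytes, edge_bytes) for an in-memory graph index.
    All parameters are illustrative assumptions, not measured values."""
    vector_bytes = num_vectors * dims * bytes_per_component
    edge_bytes = num_vectors * max_degree * bytes_per_edge
    return vector_bytes, edge_bytes

def edge_share(num_vectors, dims, max_degree,
               bytes_per_component=4, bytes_per_edge=4):
    """Fraction of total index memory taken by the edge lists."""
    v, e = index_memory_bytes(num_vectors, dims, max_degree,
                              bytes_per_component, bytes_per_edge)
    return e / (v + e)

n = 1_000_000_000
for label, dims, component_bytes in (
    ("96-dim float32", 96, 4),
    ("1024-dim float32", 1024, 4),
    ("32-byte compressed", 32, 1),
):
    print(label, round(edge_share(n, dims, 64, component_bytes), 3))
```

The edge bytes per node are the same in every row; only the vector size changes, so the smaller the vectors (raw or compressed), the more the edge lists account for.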
So JVector is just keeping edge lists in memory today. When that becomes a bottleneck we may need to do something similar to DiskANN but I'm hoping we can do better because it's frankly a little inelegant.
[1] https://github.com/jbellis/jvector/blob/main/jvector-example...
Are there laptops like that? Maybe an upgraded MacBook, but I have been looking for Windows/Linux laptops and they generally top out at 32GB. I checked Lenovo's website and everything with 64GB and up is not called a laptop but a "mobile workstation".
https://arxiv.org/abs/2004.12832
https://thenewstack.io/overcoming-the-limits-of-rag-with-col...