vegabook · 1y ago · 0 points

Just tried usearch against ol' faithful np.dot and found the latter to be 8x faster than usearch on the 10M-vector brute-force scan described in their README [1], for the top 50 matches. The output was identical. 1.74 seconds for numpy versus around 12 seconds for usearch, on an M2 Max with enough RAM to hold the vectors without swapping.

[1] https://github.com/unum-cloud/usearch?tab=readme-ov-file#exa...
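For anyone wanting to reproduce the numpy side, a brute-force top-k scan with a dot product might look roughly like the sketch below. This is not the poster's actual script: the function name `top_k_dot`, the array sizes, and the use of `np.argpartition` for selection are all assumptions made for illustration, and the data is scaled far down from 10M x 1e3.

```python
import numpy as np

def top_k_dot(vectors, query, k=50):
    """Return indices of the k rows with the largest dot product against query."""
    scores = vectors @ query                 # a single matrix-vector product
    k = min(k, scores.shape[0])
    # argpartition selects the k best without sorting everything;
    # only those k candidates are then sorted by descending score.
    candidates = np.argpartition(-scores, k - 1)[:k]
    return candidates[np.argsort(-scores[candidates])]

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 64), dtype=np.float32)  # hypothetical small dataset
query = rng.random(64, dtype=np.float32)
top = top_k_dot(vectors, query)
```

Selecting with `np.argpartition` before sorting keeps the top-k step cheap next to the scan itself, so the timing is dominated by the dot product.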

3 comments · 1 top-level

ashvardanian · 1y ago · 2 in thread

Author here :)

This might not be an apples-to-apples comparison. NumPy uses BLAS for matrix multiplication, which benefits from tiling to make better use of CPU caches.

USearch, on the other hand, computes L2 distance directly (not the dot product) and supports a variety of metrics. It doesn't use tiling, so it's expected to be slower than BLAS GEMM routines for single or double-precision vectors.
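To make the L2-versus-dot-product distinction concrete, here is a small numpy sketch. It is not USearch code; the array names and sizes are made up. It checks that squared L2 distance can be expanded into dot products, and that for unit-length vectors the nearest neighbour by L2 is the same row as the best match by dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((1_000, 32))   # hypothetical data, float64
query = rng.random(32)

# Squared L2 distance computed directly from the differences.
direct = ((vectors - query) ** 2).sum(axis=1)

# The same quantity via ||v - q||^2 = ||v||^2 + ||q||^2 - 2 * (v . q),
# which is the form that lets a BLAS matrix product do the heavy lifting.
expanded = (vectors ** 2).sum(axis=1) + query @ query - 2.0 * (vectors @ query)

# For unit-length vectors the norm terms are constant, so the two rankings agree.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
q = query / np.linalg.norm(query)
nearest_l2 = int(np.argmin(((unit - q) ** 2).sum(axis=1)))
nearest_dot = int(np.argmax(unit @ q))
```

Without normalisation the two metrics can rank rows differently, which is one reason a raw np.dot scan and an L2 scan are not automatically comparable.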

Things might get more interesting with half-precision, brain-float16, or integer representations, where the trade-offs are less straightforward. Let me know if you decide to try it with those — I'd love to hear how it performs.

PS: You may find related benchmarks here: https://github.com/ashvardanian/SimSIMD

vegabook (OP) · 1y ago

It turns out, my bad and I apologise, that although 10e6 x 1e3 FP32 fits well within 96 GB of RAM, intermediate allocations during the np.random.rand initialization phase push about 32 GB into swap. That swap only gets cleared when more RAM is demanded, which happens on the first bench run. So whichever gets run first, np or usearch, gets penalised big time. I have now re-run with sizes that never reach the swap threshold, and the results are MUCH more impressive for usearch: it is basically twice as fast. 7e6 x 1e3 scan for 1e3 top 50 is 1.32 seconds for numpy and 0.633 seconds for usearch. I also swapped the order of the benchmarks and re-ran, and the results check out. Nice work. usearch is now in my toolkit, and I apologise again for the misleading comment.
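A rough sketch of where the extra memory could come from. np.random.rand returns float64, so assuming the array was then cast to FP32 (the comment does not say exactly how it was built), the float64 intermediate and the FP32 copy would briefly coexist. The helper `size_gb` is hypothetical, and the last lines show one possible way to draw float32 directly with NumPy's `Generator.random`, which avoids the float64 intermediate.

```python
import numpy as np

def size_gb(rows, cols, dtype):
    """Size of a dense rows x cols array in decimal gigabytes."""
    return rows * cols * np.dtype(dtype).itemsize / 1e9

rows, cols = 10_000_000, 1_000
f64_gb = size_gb(rows, cols, np.float64)   # float64 intermediate
f32_gb = size_gb(rows, cols, np.float32)   # the FP32 array actually wanted
peak_gb = f64_gb + f32_gb                  # both alive during a cast

# Drawing float32 directly skips the float64 intermediate.
# A small shape is used here so the sketch runs anywhere.
rng = np.random.default_rng(0)
small = rng.random((1_000, cols), dtype=np.float32)
```

Under that assumption the peak is 80 GB + 40 GB = 120 GB, which would not fit in 96 GB even though the final 40 GB array does.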

As an aside, it's kind of amazing that it takes just over half a second to scan 7M 1032-size vectors for semantic similarity on a (beefy but not extraordinary) desktop computer. Modern hardware is so awesome. And I'm guessing I could get another order of magnitude or two of speedup if I got Metal involved.

EDIT: On Linux, on a tiny el-cheapo 100-dollar Intel N95 mini PC with 32 GB of (single-channel) RAM, and dropping the size to 3M x 1024: usearch 0.65 seconds, numpy 0.99 seconds. Amazing.
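Since the first-run penalty is what skewed the original numbers, a timing harness with an untimed warm-up call is one way to guard against it. The sketch below is only an illustration: `best_time` is a hypothetical helper, the data is tiny compared with the sizes in this thread, and only the numpy side is timed. It also converts a timing into an effective scan bandwidth. Taking the reported figures at face value and assuming each FP32 value is read once, 7e6 x 1e3 x 4 bytes is 28 GB, so 0.633 seconds works out to about 44 GB/s.

```python
import time
import numpy as np

def best_time(fn, repeats=3):
    """Best wall-clock time over several runs, after one untimed warm-up call."""
    fn()  # warm-up so the first timed run does not pay one-off costs
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

rng = np.random.default_rng(0)
vectors = rng.random((20_000, 128), dtype=np.float32)  # hypothetical small dataset
query = rng.random(128, dtype=np.float32)

seconds = best_time(lambda: vectors @ query)
gb_per_s = vectors.nbytes / seconds / 1e9   # effective scan bandwidth

# The same arithmetic applied to the figures quoted in the thread.
reported_gb_per_s = 7e6 * 1e3 * 4 / 0.633 / 1e9
```

Running each candidate through the same harness, and repeating with the order swapped as the poster did, makes it harder for a cold first run to decide the result.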

ashvardanian · 1y ago

Oh, epic! Thanks for taking the time to double check :)