My bad, and I apologise: although 10e6 x 1e3 FP32 fits comfortably within 96GB of RAM, intermediate allocations during the np.random.rand initialization phase push about 32GB into swap. That swap only gets cleared when more RAM is demanded, which happens on the first benchmark run, so whichever gets run first, numpy or usearch, gets penalised big time. I have now re-run with sizes that never reach the swap threshold, and the results are MUCH more impressive for usearch: it is basically twice as fast. A 7e6x1e3 scan for 1e3 top 50 takes 1.32 seconds with numpy and 0.633 seconds with usearch. I also swapped the order of the benchmarks and re-ran, and the results check out. Nice work. usearch is now in my toolkit, and I apologise again for the misleading comment.
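For anyone wanting to reproduce this without the swap trap, here is a minimal sketch of such a harness, not the exact script behind the numbers above. The assumptions are mine: inner product as the similarity metric, a single query vector, and much smaller sizes so it runs anywhere. The helper names (make_vectors, numpy_top_k, usearch_top_k, best_of) are made up for illustration. np.random.rand returns float64, so converting to FP32 afterwards briefly holds both copies, which is one plausible source of the intermediate allocations; Generator.random can fill FP32 directly. The usearch call follows the exact-search helper shown in its README, so check it against your installed version.

```python
import time
import numpy as np


def make_vectors(n, dim, seed=0):
    # Generator.random can emit float32 directly, so there is no float64
    # intermediate (np.random.rand always returns float64).
    rng = np.random.default_rng(seed)
    return rng.random((n, dim), dtype=np.float32)


def numpy_top_k(vectors, query, k):
    # Brute-force scan: one dot product per stored vector (assumed metric).
    scores = vectors @ query
    # argpartition finds the k best in linear time, unordered...
    top = np.argpartition(-scores, k - 1)[:k]
    # ...then only those k are sorted, best first.
    return top[np.argsort(-scores[top])]


def usearch_top_k(vectors, query, k):
    # Exact (non-indexed) search as shown in the usearch README.
    from usearch.index import search, MetricKind
    return search(vectors, query, k, MetricKind.IP, exact=True).keys


def best_of(fn, *args, repeats=3):
    # One untimed warm-up call absorbs first-run costs (page faults,
    # swap-in), so the order of the benchmarks stops mattering.
    fn(*args)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return min(times)


# Deliberately small demo sizes; scale up only within physical RAM.
vectors = make_vectors(20_000, 128)
query = make_vectors(1, 128, seed=1)[0]

print(f"numpy:   {best_of(numpy_top_k, vectors, query, 50):.4f} s")
try:
    print(f"usearch: {best_of(usearch_top_k, vectors, query, 50):.4f} s")
except ImportError:
    print("usearch not installed, skipping")
```

Taking the best of several timed runs after a warm-up is a design choice for hiding one-off costs; it will not help if the working set itself does not fit in RAM.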
As an aside, it's kind of amazing that it takes just over half a second to scan 7M vectors of dimension 1e3 for semantic similarity on a (beefy but not extraordinary) desktop computer. Modern hardware is so awesome. And I'm guessing I could get another order of magnitude or two of speedup if I got Metal involved.
EDIT: On Linux, on a tiny el-cheapo 100-dollar Intel N95 mini PC with 32GB of (single-channel) RAM, and dropping the size to 3Mx1024:
usearch: 0.65 seconds
numpy: 0.99 seconds
Amazing.
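
A back-of-envelope check on those timings, assuming each one is a single pass over the FP32 data (my assumption, and the function name is made up): dividing bytes scanned by seconds gives an effective scan rate that can be compared against each machine's memory bandwidth.

```python
def scan_rate_gb_per_s(n_vectors, dim, seconds, bytes_per_value=4):
    # Bytes touched in one full pass over the matrix, divided by wall time.
    return n_vectors * dim * bytes_per_value / seconds / 1e9


# (label, vectors, dimensions, seconds) taken from the timings quoted above
runs = [
    ("desktop numpy", 7e6, 1e3, 1.32),
    ("desktop usearch", 7e6, 1e3, 0.633),
    ("N95 numpy", 3e6, 1024, 0.99),
    ("N95 usearch", 3e6, 1024, 0.65),
]

for label, n, dim, seconds in runs:
    print(f"{label}: {scan_rate_gb_per_s(n, dim, seconds):.1f} GB/s")
```

Under that assumption usearch works out to roughly 44 GB/s on the desktop and roughly 19 GB/s on the N95.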