70M vectors searched in 48 ms on a single consumer GPU: results you won't believe
Hardware:
RTX 3090, consumer CPU, NVMe SSD
Dataset:
~70 million vectors (384 dimensions)
Performance:
~48 ms search latency for top-k results.
This corresponds to roughly 1.45 billion vector comparisons per second on a single GPU.
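For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch. It assumes every one of the ~70M vectors is compared against the query once per search, which is how the comparisons-per-second figure is derived; the variable names are just for illustration.

```python
# Back-of-the-envelope check of the throughput figure.
# Assumption: each query touches every vector exactly once.
num_vectors = 70_000_000   # ~70 million vectors in the dataset
latency_s = 0.048          # ~48 ms per query

comparisons_per_second = num_vectors / latency_s
print(f"{comparisons_per_second / 1e9:.2f} billion comparisons/s")
```

This comes out at about 1.46 billion comparisons per second, in line with the rounded figure quoted above.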
The system uses a custom GPU kernel and a two-stage search pipeline (binary filtering + floating-point reranking).
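To make the two-stage idea concrete, here is a minimal CPU-only sketch in plain Python. It is not the custom GPU kernel and not necessarily how the real system builds its binary codes: sign-based binarization, Hamming distance for the filter, dot product for the rerank, and the names `binarize`, `two_stage_search`, and `shortlist` are all assumptions made for illustration.

```python
import random

def binarize(vec):
    """Pack the sign of each dimension into one int (1 bit per dimension)."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    """Exact floating-point similarity used for reranking."""
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, vectors, codes, k=5, shortlist=50):
    # Stage 1: cheap binary filter. Keep the `shortlist` vectors whose
    # codes are closest to the query code in Hamming distance.
    q_code = binarize(query)
    candidates = sorted(range(len(codes)),
                        key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    # Stage 2: rerank only the shortlist with full-precision dot products.
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:k]

# Toy data: 1,000 random 384-dimensional vectors (far smaller than 70M).
random.seed(0)
dim, n = 384, 1000
vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
codes = [binarize(v) for v in vectors]

# Query: a slightly noisy copy of vector 123, so 123 should rank first.
query = [x + random.gauss(0, 0.1) for x in vectors[123]]
top = two_stage_search(query, vectors, codes)
print(top)
```

The point of the split is that the first stage only touches 1 bit per dimension (48 bytes per 384-dimensional vector in this sketch), so the expensive floating-point work is limited to a small shortlist.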
My goal was to explore whether large-scale vector search could run efficiently on consumer hardware instead of large datacenter clusters.
After thousands of hours of work and many failed attempts, the results finally became stable enough to benchmark.
I'm currently exploring how far this approach can scale.
I'd be very interested to hear how others approach large-scale vector search on consumer hardware.
Happy to answer questions.