ANN v3: 200ms p99 query latency over 100B vectors

(turbopuffer.com)

109 points · _peregrine_ · 5mo ago · 47 comments

23 comments · 10 top-level

jascha_eng · 5mo ago · 5 in thread

This is legitimately pretty impressive. I think the rule of thumb now is: go with Postgres (pgvector) for vector search until it breaks, then go with turbopuffer.

sa-code · 5mo ago

Qdrant is also a good default choice, since it can run in memory for development, on disk for small deployments, and also at "web scale".

As a principal eng, side-stepping a migration and having a good local dev experience is too good of a deal to pass up.

That being said, turbopuffer looks interesting. I will check it out. Hopefully their local dev experience is good.

nostrebored · 5mo ago

Qdrant is one of the few vendors I actively steer people away from. Look at the GitHub issues, look at what their CEO says, look at their fake “advancements” that they pay for publicity on…

The number of people I know who’ve had unrecoverable shard failures on Qdrant is too high to take it seriously.

benesch · 5mo ago

For local dev + testing, we recommend just hitting the production turbopuffer service directly, but with a separate test org/API key: https://turbopuffer.com/docs/testing

Works well for the vast majority of our customers (although we get the very occasional complaint about wanting a dev environment that works offline). The dataset sizes for local dev are usually so small that the cost rounds to free.

jauntywundrkind · 5mo ago

I'd love to know how they compare versus MixedBread, what relative strengths each has. https://www.mixedbread.com/

I really really enjoy & learn a lot from the mixedbread blog. And they find good stuff to open source (although the product itself is closed). https://www.mixedbread.com/blog

I feel like there's a lot of overlap but also probably a lot of distinction too. Pretty new to this space of products though.

_peregrine_ · OP · 5mo ago

seems like a good rule of thumb to me! though i would perhaps lump "cost" into the "until it breaks" equation. even with decent perf, pgvector's economics can be much worse, especially in multi-tenant scenarios where you need many small indexes (this is true of any vector db that keeps its indexes primarily in RAM or on SSD)

kgeist · 5mo ago · 4 in thread

Are there vector DBs with 100B vectors in production that work well? There was a paper showing a 12% loss in accuracy at just 1 million vectors. Maybe some kind of logical sharding is another option, to improve both accuracy and speed.

lmeyerov · 5mo ago

I don't know about these scales, but at 1M-100M vectors, we found that switching from out-of-the-box embeddings to fine-tuned embeddings took much of the sting out of the compression/recall trade-off. We saw a 10-100X win here: comparable recall with much better compression.

I'm not sure how that'd work with the binary quantization phase though. For example, we use Matryoshka embeddings, and some of the bits matter way more than others, so that might be super painful.
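
To make that concern concrete, here is a toy sketch, not anything from the post or from lmeyerov's setup. It assumes sign-based binary quantization (one bit per dimension) compared by Hamming distance, and it uses made-up 8-dimensional vectors whose leading components are larger to stand in for "bits that matter more". All names and values are hypothetical.

```python
def binarize(vec):
    """Sign-based binary quantization: 1 bit per dimension."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance on the original float vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def truncate(vec, dims):
    """Matryoshka-style truncation: keep only the leading `dims` components."""
    return vec[:dims]

# Made-up vectors: leading dimensions carry most of the magnitude.
query         = [0.9, -0.7, 0.5, -0.3, 0.05, -0.04, 0.03, -0.02]
differs_early = [-0.9, -0.7, 0.5, -0.3, 0.05, -0.04, 0.03, -0.02]  # sign flip in dim 0
differs_late  = [0.9, -0.7, 0.5, -0.3, 0.05, -0.04, 0.03, 0.02]    # sign flip in dim 7

for name, vec in [("early flip", differs_early), ("late flip", differs_late)]:
    print(
        name,
        "| squared L2:", round(squared_l2(query, vec), 4),
        "| Hamming (8 bits):", hamming(binarize(query), binarize(vec)),
        "| Hamming (first 4 bits):",
        hamming(binarize(truncate(query, 4)), binarize(truncate(vec, 4))),
    )
```

In this toy, the float distance separates the two candidates by three orders of magnitude, while the full 8-bit Hamming distance treats them as equally far from the query. Comparing only the leading bits recovers the distinction here, which suggests (but does not prove) that a bit-weighted or prefix-first comparison is one way to reconcile the two techniques.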

jasonjmcghee · 5mo ago

So many missing details...

Different vector indexes have very different recall, and even different parameters for the same index dramatically impact this.

HNSW can have very good recall even at high vector counts.

There's also the embedding model, whether you're quantizing, whether it's pure RAG vs. hybrid BM25 / static word embeddings vs. graph connections, whether you're reranking, etc.
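
As one illustration of the "hybrid" option, here is a minimal sketch of reciprocal rank fusion (RRF), a common way to merge a BM25 ranking with a vector-search ranking. Nothing in the thread says anyone here uses RRF; the document IDs are invented, and `k=60` is the conventional default constant rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from two retrievers for the same query.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF only looks at ranks, it sidesteps the problem of BM25 scores and vector distances living on different scales. The price is that it discards score magnitudes, which is one reason a reranking stage is often added afterwards.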

_peregrine_ · OP · 5mo ago

the solution described in the blog post is currently in production at 100B vectors

rahimnathwani · 5mo ago

For what/who?

vander_elst · 5mo ago · 2 in thread

> 504MiB shared L3 cache

What CPU are they using here?

benesch · 5mo ago

The exact CPU depends on the region/cloud provider, but this Granite Rapids CPU is representative: https://www.intel.com/content/www/us/en/products/sku/240777/...

vander_elst · 5mo ago

Thanks!

mmaunder · 5mo ago · 1 in thread

For those of us who operate on site, we have to add back network latency, which negates this win entirely and makes a proprietary cloud solution like this a nonstarter.

benesch · 5mo ago

Often not a dealbreaker, actually! We can spin up new tpuf regions and procure dedicated interconnects to minimize latency to the on-prem network on request (and we have done this).

When you're operating at the 100B scale, you're pushing beyond the capacity that most on-prem setups can handle. Most orgs have no choice but to put a 100B workload into the nearest public cloud. (For smaller workloads, considerations are different, for sure.)

alanwli · 5mo ago · 1 in thread

Out of curiosity, how is the 92% recall calculated? For a given query, is recall measured against the true top-k of all 100B vectors, or is it measured at each of N shards against that shard's own top-k?

nvanbenschoten · 5mo ago

(author here) The 92% mentioned in this post is showing recall@10 across all 100B vectors, calculated by comparing to the global top_k.

turbopuffer will also continuously monitor production recall at the per-shard level (or on-demand with https://turbopuffer.com/docs/recall). Perhaps counterintuitively, the global recall will actually be better than the per-shard recall if each shard is asked for its own, local top_k!
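
One way to see why that can happen is the toy simulation below. It is not turbopuffer's implementation. It assumes each shard's approximate search behaves like ranking by the true distance plus Gaussian noise, that each shard returns its local top_k with exact distances, and that a coordinator merges those candidates by exact distance. Shard counts, sizes, and the noise level are arbitrary.

```python
import random

def recall(found_ids, true_ids):
    """Fraction of the true neighbours that were returned."""
    return len(set(found_ids) & set(true_ids)) / len(true_ids)

def approx_top_k(shard, k, noise, rng):
    """Toy ANN: rank by true distance plus Gaussian noise, keep the best k."""
    noisy = sorted(shard, key=lambda item: item[1] + rng.gauss(0.0, noise))
    return noisy[:k]

def simulate(num_shards=8, shard_size=200, k=10, noise=0.005, queries=100, seed=0):
    rng = random.Random(seed)
    shard_recalls, global_recalls = [], []
    for _ in range(queries):
        # Each item is (id, true distance to the query); ids are unique across shards.
        shards = [
            [(s * shard_size + i, rng.random()) for i in range(shard_size)]
            for s in range(num_shards)
        ]
        candidates = []
        for shard in shards:
            exact_local = sorted(shard, key=lambda item: item[1])[:k]
            approx_local = approx_top_k(shard, k, noise, rng)
            shard_recalls.append(
                recall([i for i, _ in approx_local], [i for i, _ in exact_local])
            )
            candidates.extend(approx_local)
        all_items = [item for shard in shards for item in shard]
        true_global = sorted(all_items, key=lambda item: item[1])[:k]
        merged = sorted(candidates, key=lambda item: item[1])[:k]
        global_recalls.append(
            recall([i for i, _ in merged], [i for i, _ in true_global])
        )
    mean = lambda xs: sum(xs) / len(xs)
    return mean(shard_recalls), mean(global_recalls)

per_shard, global_ = simulate()
print(f"mean per-shard recall@10: {per_shard:.3f}")
print(f"mean global recall@10:    {global_:.3f}")
```

The intuition under this noise model: a shard's misses cluster around the boundary of its local top_k (ranks 9, 10, 11), but the vectors that make the global top_k are usually rank 1 or 2 within their shard, far from that boundary. Note that the effect depends on misses being skewed toward lower-ranked items; if every true neighbour were missed with the same probability, global and per-shard recall would be equal in expectation.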

lmeyerov · 5mo ago

Fun!

I was curious given the cloud discussion. A quick search suggests default AWS SSD bandwidth is 250 MB/s, and you can pay more for 1 GB/s. Similarly for S3, one HTTP connection is < 100 MB/s, and you can pay for more parallel connections. So the hot binary quantized search index is doing a lot of work to minimize both, for the initial hot queries and for pruning later fetches. Very cool!
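
A back-of-the-envelope sketch of that point, using the bandwidth figures quoted above and an assumed 1024-dimensional float32 embedding (the thread does not state a dimensionality, so that number is purely illustrative):

```python
DIMS = 1024                  # assumed; not stated anywhere in the thread
BYTES_F32 = DIMS * 4         # full-precision vector: 4 bytes per dimension
BYTES_BINARY = DIMS // 8     # binary-quantized vector: 1 bit per dimension

def vectors_per_second(bandwidth_bytes_per_s, bytes_per_vector):
    """How many whole vectors can be streamed per second at a given bandwidth."""
    return bandwidth_bytes_per_s // bytes_per_vector

# Bandwidth figures as quoted in the comment above (decimal megabytes/gigabytes).
bandwidths = [("250 MB/s", 250_000_000), ("1 GB/s", 1_000_000_000)]

print(f"bytes per vector: f32={BYTES_F32}, binary={BYTES_BINARY}, "
      f"ratio={BYTES_F32 // BYTES_BINARY}x")
for label, bw in bandwidths:
    print(f"{label}: {vectors_per_second(bw, BYTES_F32):,} f32 vectors/s vs "
          f"{vectors_per_second(bw, BYTES_BINARY):,} binary vectors/s")
```

Whatever the real dimensionality is, the ratio between float32 and 1-bit storage is 32x, so the same disk or network budget covers 32 times as many candidates when the first pass runs over the binary index.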

montroser · 5mo ago

This is at 92% recall. Could be worse, but could definitely be much better. Quantization and hierarchical clustering are tricks that lead to awesome performance at the cost of extremely variable quality, depending on the dataset.

hwspeed · 5mo ago

The offline/local dev point is underrated. Being able to iterate without network latency or metered API costs makes a huge difference for prototyping. The challenge is making sure your local setup actually matches prod behavior. I've been burned by pgvector working fine locally then hitting performance cliffs at scale when the index doesn't fit in memory anymore.

redskyluan · 5mo ago

Using Hierarchical Clustering significantly reduces recall; this is a solution we used and abandoned three years ago.

shayonj · 5mo ago

v cool and impressive!
