
whakim 9mo ago

I do not think data stores are a bottleneck for serving embedding search. I think the raft of new-fangled vector db services (or pgvector or whatever) can be a bottleneck because they are mostly optimized around the long tail of pretty small data. Real internet-scale search systems like ES or Vespa won’t struggle with serving embedding search assuming you have the necessary scale and time/money to invest in them.


cfors 9mo ago

Sure they can handle the basic case of ANN. But ANN still doesn’t have good stories for lots of real-world problems.

* filterable ANN, which decomposes into prefiltering or postfiltering.

* dynamic updates and versioning are still very difficult

* slow building of graph indexes

* adding other signals into the search, such as query time boosting for recent docs.

I don’t disagree these systems can work but innovation is still necessary. We are not in a “data stores are solved” world.
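For concreteness, the last bullet (query-time boosting for recent docs) can be sketched roughly as below. This is a minimal illustration, assuming the ANN step has already returned `(doc_id, similarity, timestamp)` hits; the function name, `half_life_days`, and `weight` are hypothetical knobs, not anything ES or Vespa actually exposes.

```python
def recency_boosted(hits, now, half_life_days=30.0, weight=0.3):
    """Re-rank ANN hits by blending similarity with an exponential recency decay.

    hits: iterable of (doc_id, similarity, unix_timestamp) tuples.
    half_life_days and weight are illustrative assumptions, not engine defaults.
    """
    rescored = []
    for doc_id, similarity, timestamp in hits:
        age_days = max(0.0, (now - timestamp) / 86400.0)
        # Decay is 1.0 for a brand-new doc and halves every half_life_days.
        decay = 0.5 ** (age_days / half_life_days)
        score = (1.0 - weight) * similarity + weight * decay
        rescored.append((doc_id, score))
    # Highest blended score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The catch is that this only reorders whatever the ANN step returned, so a fresh document that fell just outside the top-k never gets a chance to be boosted.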

whakim (OP) 9mo ago

* Filterable ANN certainly decomposes into pre- and post-filtering, and there is definitely a lot of interesting innovation occurring around filterable ANN. But large-scale search systems currently do a pretty good job with pre-filtering, falling back to brute force search in the case of restrictive filters.

* You'd have to be a bit more exact re: dynamic updates/versioning for me to understand the challenges you're facing.

* Building graph indices can be slow, but in my experience (billions of embeddings) it is possible to build HNSW indices in tens of minutes.

* How is this any different to combining traditional keyword search with, say, recency boosting?
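To make the first bullet concrete, here is a rough sketch of pre-filtering with a brute-force fallback for restrictive filters. It is not how any particular engine implements it: `ann_search` is a hypothetical stand-in for whatever graph index you use, and `brute_force_threshold` is an assumed cutoff you would tune yourself.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_search(query, docs, predicate, k, ann_search, brute_force_threshold=0.05):
    """Pre-filter docs, then pick exact scan or ANN based on filter selectivity.

    docs: list of dicts, each with a "vec" entry plus metadata fields.
    ann_search: hypothetical callable (query, allowed_ids, k) -> [(doc_index, score)].
    brute_force_threshold: assumed cutoff; tune it for your own corpus.
    """
    # Pre-filter: keep only the indices of docs that pass the metadata predicate.
    allowed = [i for i, doc in enumerate(docs) if predicate(doc)]
    if not allowed:
        return []
    selectivity = len(allowed) / len(docs)
    if selectivity <= brute_force_threshold:
        # Restrictive filter: few survivors, so an exact scan is cheap and exact.
        scored = [(i, cosine(query, docs[i]["vec"])) for i in allowed]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    # Permissive filter: let the ANN index search within the allowed set.
    return ann_search(query, set(allowed), k)
```

The fallback exists because graph traversal degrades when most neighbors are filtered out, whereas scanning a small surviving set is both fast and exact.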

cfors 9mo ago

Might be missing my argument here - I stated that there are workable solutions to this like you have pointed out.

But ANN search is still a sledgehammer, and building out hybrid solutions that help bridge the gap between this and traditional data stores still has room for innovation.

whakim (OP) 9mo ago

Fair enough - agreed there's lots of interesting innovation here - but my point is that semantic search and its associated issues don't really differ that much from other types of search problems at scale. I therefore don't think that the current crop of vector database products adds a lot of value from a technical perspective (perhaps they do from an ease-of-use perspective, or they work great at small scale, etc.).

mdaniel 9mo ago

> Real internet-scale search systems like ES

Oh, then you must have the secret sauce that allows scaling ES vector search beyond 10,000 results without requiring infinite RAM. I know their forums would welcome it, because that question comes up a lot

Or I guess that's why you included the qualifier about money to invest

whakim (OP) 9mo ago

Would you mind putting aside the snark? I have a couple questions. How large is the corpus? I am also curious about the use-case for top-k ANN, k > 10000?

farsa 9mo ago

Not the person you asked, but at work (we are a CRM platform) we allow our clients to arbitrarily query their userbase to find matching users for marketing campaigns (email, SMS, WhatsApp). These campaigns can sometimes target a few hundred thousand people. We are on a really ancient version of ES, but it sucks at this job in terms of throughput. Some experimenting with BigQuery indicates it is so much better at mass exporting.

whakim (OP) 9mo ago

Fair; my question was mostly in the context of ANN, since that was the discussion point - I have to assume ES (as a search engine) would not necessarily be the right tool for data warehousing types of workloads.
