I've had decent results using a doc2query style approach:
1. Ask an LLM to return a list of questions answered by the document
2. Store the embeddings of the questions along with a document ID
3. On user query, get the embedding of the user query
5. Run a KNN cosine-similarity search of the user embedding against the corpus of question embeddings
6. Return the highest-ranked documents
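Steps 2–5 can be sketched in a few lines. This is a minimal in-memory version: `embed` here is a toy deterministic stand-in (hashing the text to seed a random vector), where a real system would call an embedding model or API; the `index` structure and `search` function are likewise illustrative, not a specific library's API.

```python
import zlib
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model: seed a RNG from the text
    # so identical strings get identical unit vectors.
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

# Step 2: store each question's embedding alongside its document ID.
# The questions would come from the LLM in step 1.
docs = {
    "doc-cats": ["What is a cat?", "How do cats hunt?"],
    "doc-dogs": ["What is a dog?", "How are dogs trained?"],
}
index = [(embed(q), doc_id) for doc_id, qs in docs.items() for q in qs]

def search(query: str, k: int = 2) -> list[str]:
    # Step 3: embed the user query.
    q_vec = embed(query)
    # Step 4: rank stored questions by cosine similarity (vectors are
    # unit-normalized, so the dot product is the cosine similarity).
    scored = sorted(index, key=lambda e: -float(e[0] @ q_vec))
    # Step 5: return the top-k distinct documents in ranked order.
    ranked: list[str] = []
    for _, doc_id in scored:
        if doc_id not in ranked:
            ranked.append(doc_id)
    return ranked[:k]
```

An exact-duplicate query like `search("What is a cat?")` ranks `doc-cats` first, since its stored question embeds to the identical vector.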
You can tweak this approach to your use case, so that in step 1 the LLM generates questions whose embeddings are closer to the kinds of queries you want matched in step 5. If you want the answer to "What is a cat" to be similar to "What is a dog," you'd prompt/finetune the LLM in step 1 to generate broad questions that would encompass both; if you want them to be very different, you'd do the opposite and avoid generalities.
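To make the broad-vs-narrow trade-off concrete, here are two hypothetical step-1 prompts (the wording is illustrative, not from any particular system):

```python
# Broad variant: generated questions generalize across related topics,
# so "What is a cat" and "What is a dog" land near each other.
BROAD_PROMPT = (
    "List the questions this document answers. Phrase each question "
    "in general terms, so it would also match closely related topics."
)

# Narrow variant: generated questions pin down exact entities,
# pushing near-neighbor topics apart in embedding space.
NARROW_PROMPT = (
    "List the questions this document answers. Phrase each question "
    "as specifically as possible, naming the exact entities involved."
)
```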