
0 points · minimaxir · 2y ago · 0 comments

The embeddings are just SentenceTransformers: https://www.sbert.net/

I used the bge-large-en-v1.5 model (https://huggingface.co/BAAI/bge-large-en-v1.5) because I could, but the common all-MiniLM-L6-v2 model is sufficient. The trick is to batch generate the embeddings on a GPU, which SentenceTransformers mostly does by default.
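For anyone who wants to see what that looks like, here is a minimal sketch of batched embedding generation, not my exact code. The helper name `embed_texts`, the `batch_size=64` default, and the optional `model` argument are assumptions made for illustration; `SentenceTransformer(...)` and `.encode(...)` are the library's own calls.

```python
def embed_texts(texts, model_name="all-MiniLM-L6-v2", batch_size=64, model=None):
    """Return one embedding per input text, encoded in batches.

    `embed_texts` is a hypothetical helper, not part of SentenceTransformers.
    """
    if model is None:
        # Imported lazily so the helper can also be called with a stand-in model.
        from sentence_transformers import SentenceTransformer

        # The library selects a CUDA device on its own when one is available.
        model = SentenceTransformer(model_name)

    # encode() splits `texts` into batches of `batch_size` internally.
    return model.encode(texts, batch_size=batch_size, show_progress_bar=False)
```

Swapping in `model_name="BAAI/bge-large-en-v1.5"` should give the larger model; a bigger `batch_size` generally helps on a GPU until memory runs out.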

The other libraries are the typical ones (umap for UMAP, scikit-learn for k-means/DBSCAN, chatgpt-python for ChatGPT interfacing, plotly for viz, pandas for some ETL). You don't need a bespoke AI/ML package for these workflows, and they aren't too complicated.
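A rough sketch of the reduce-then-cluster step with those libraries, assuming you already have an array of embeddings. The function names, `n_clusters=8`, and the cosine metric for UMAP are illustrative assumptions rather than the settings I actually used.

```python
import numpy as np
from sklearn.cluster import KMeans


def reduce_to_2d(embeddings, random_state=42):
    """Project embeddings down to 2D with UMAP (hypothetical helper)."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=random_state)
    return reducer.fit_transform(embeddings)


def cluster_points(points, n_clusters=8, random_state=42):
    """Assign each point a k-means cluster label (hypothetical helper)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return km.fit_predict(np.asarray(points))
```

The 2D points and labels can then go into a pandas DataFrame and be passed to a plotly scatter plot. DBSCAN from scikit-learn is a drop-in alternative if you don't want to choose the number of clusters up front.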


2 comments · 2 top-level

refulgentis · 2y ago

It's just SentenceTransformers, but the wrong model is common because no one reads the SentenceTransformers docs. MiniLM-L6-V2 is for symmetric search (the target document has the same wording as the source document); MiniLM-L6-V3 is for asymmetric search (the target document is likely to contain material matching the query in the source document).

tomthe · 2y ago

Can you share your ChatGPT prompt, please? I'm doing something similar at the moment and trying out BERTopic, but ChatGPT also seems worth a try.
