For example, mapping Llama's embeddings onto GPT-3's?
That way you could see how similarly the models “understand the world”.
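Roughly, I imagine something like fitting a linear map between paired embeddings of the same sentences - a minimal numpy sketch with made-up shapes, just to illustrate what I mean:

    import numpy as np

    # Assumed: embeddings of the same N sentences from both models,
    # computed elsewhere; the shapes here are placeholders.
    llama_emb = np.random.randn(1000, 4096)    # stand-in for Llama embeddings
    gpt3_emb = np.random.randn(1000, 12288)    # stand-in for GPT-3 embeddings

    # Fit W by least squares so that llama_emb @ W ~ gpt3_emb.
    W, *_ = np.linalg.lstsq(llama_emb, gpt3_emb, rcond=None)

    # A low residual would suggest the two spaces line up well.
    rel_err = np.linalg.norm(llama_emb @ W - gpt3_emb) / np.linalg.norm(gpt3_emb)
    print(f"relative mapping error: {rel_err:.3f}")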
>download all my tweets (about 20k) and build a semantic searcher on top?
How can I utilize third-party embeddings with OpenAI's LLM API? Am I correct in understanding from this article that this is possible?
My apologies if I'm completely mangling the vocabulary here - I have, at best, a rudimentary understanding of this stuff and am trying to hack together an education on it.
Edit: If you're at the SF meetup tomorrow, I'd happily buy you a beverage in return for this explanation :)
You first create embeddings. What are these? Together they form an n-dimensional vector space with your tweets 'embedded' in that space: each piece of text (here, each tweet) becomes an n-dimensional vector. The vectorization is supposed to preserve 'semantic distance': if two pieces of text are close in meaning or related (say, by frequently appearing near each other in the corpus), they should also be 'close' along some of those n dimensions. The result at the end is the '.bin' file, the 'semantic model' of your corpus.
https://github.com/dbasch/semantic-search-tweets/blob/main/e...
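That script does roughly the following - a minimal sketch assuming the sentence-transformers library (the model and file names here are illustrative, not necessarily what the repo uses):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Load a pretrained sentence-embedding model (illustrative choice).
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # tweets.txt: one tweet per line (assumed format).
    with open("tweets.txt") as f:
        tweets = [line.strip() for line in f if line.strip()]

    # Encode the whole corpus once; batch_size is one of the knobs mentioned below.
    embeddings = model.encode(tweets, batch_size=64, show_progress_bar=True)

    # Persist the vectors - this is the 'semantic model' of your corpus.
    np.save("tweet_embeddings.npy", np.asarray(embeddings))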
For semantic search, you run the same embedding algorithm on the query, take the resulting vector, and do a similarity search via matrix ops. That produces a ranked result set with similarity scores. Each result points back to the original source, here the tweets, and you just print the tweet(s) you select from that result set (here the top 10).
https://github.com/dbasch/semantic-search-tweets/blob/main/s...
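And the query side, continuing the same sketch (same assumed model and file names):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model

    tweets = [line.strip() for line in open("tweets.txt") if line.strip()]
    embeddings = np.load("tweet_embeddings.npy")

    # Embed the query with the same algorithm, then score by cosine similarity.
    query_vec = model.encode(["that time I fixed the build at 3am"])[0]
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = emb_norm @ (query_vec / np.linalg.norm(query_vec))

    # Print the top 10 tweets by similarity score.
    for i in np.argsort(-scores)[:10]:
        print(f"{scores[i]:.3f}  {tweets[i]}")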
Experts can chime in here, but there are knobs such as the batch size and the similarity function you use to index (cosine was used here).
So the various performance dimensions of the process should also be clear: there is a fixed, one-time cost of embedding your data, and then a per-query cost of embedding the query and running the similarity algorithm to produce the result set.
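To make that cost split concrete, a common pattern (again a sketch, reusing the assumed file names from above) is to pay the corpus cost once and cache it:

    import os
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Fixed, one-time cost: embed the whole corpus, then cache it to disk.
    if not os.path.exists("tweet_embeddings.npy"):
        tweets = [line.strip() for line in open("tweets.txt") if line.strip()]
        np.save("tweet_embeddings.npy", model.encode(tweets, batch_size=64))

    # Per-query cost: a single encode() call plus one matrix multiply
    # against the cached vectors (as in the search snippet above).
    embeddings = np.load("tweet_embeddings.npy")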
https://github.com/mayooear/gpt4-pdf-chatbot-langchain for example
No code needed :)
I'll have to dig out the notebook that I created for this, but I'll try to post it here once I find it.
The reasons the article listed, namely a) lock-in and b) cost, have given me pause about embedding our whole corpus of data. I'd much rather use an open model, but I don't have much experience evaluating these embedding models and their search performance - this is all still very new to me.
Like what you did with ada-002 vs. Instructor XL - have there been any papers or prior work evaluating the different embedding models?
Generally MiniLM is a good baseline. For faster models you want this library:
https://github.com/oborchers/Fast_Sentence_Embeddings
For higher-quality ones, just take the bigger/slower models in the SentenceTransformers library.
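For concreteness, swapping between the fast baseline and a bigger model is a one-line change in SentenceTransformers (both names are models from its zoo; treat the speed/quality tradeoff as a rule of thumb):

    from sentence_transformers import SentenceTransformer

    # Fast baseline: small MiniLM model (384-dim vectors).
    fast_model = SentenceTransformer("all-MiniLM-L6-v2")

    # Higher quality, slower: a bigger model from the same library (768-dim).
    better_model = SentenceTransformer("all-mpnet-base-v2")

    sentences = ["the cat sat on the mat", "a feline rested on the rug"]
    print(fast_model.encode(sentences).shape)    # (2, 384)
    print(better_model.encode(sentences).shape)  # (2, 768)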
The fse library uses compiled native code and averages word embeddings to generate sentence embeddings, so it should be similarly fast on Apple Silicon as on x86, or faster.
For the SentenceTransformers models I'm not sure, but I think they would run on the CPU on an M1/M2 machine.