You probably shouldn't use OpenAI's embeddings (opens in new tab)

(iamnotarobot.substack.com)

71 pointsdiego3y ago29 comments

29 comments

nico3y ago

Is someone doing embeddings<>embeddings mapping?

For example, mapping embeddings of Llama to GPT-3?

That way you can see how similar the models “understand the world”.

I'm curious about this as well. There are potentially many different versions of embedding models used in production and correlating different versions together could be very important.

sroussey3y ago

I’d be interested to see this as well. I guess you can make a test and see what happens. Ping me if you do!

KRAKRISMOTT3y ago

Isn't this the whole point behind the CLIP architecture?

nico3y ago

Could you explain a bit more what the CLIP architecture is? Any good links to a short video demoing it? Thank you!

KRAKRISMOTT3y ago

https://www.pinecone.io/learn/clip-image-search/

mustacheemperor3y ago

Could anyone point me towards a relatively beginner-friendly guide to do something like

>download all my tweets (about 20k) and build a semantic searcher on top ?

How can utilize 3rd party embeddings with OpenAI's LLM API? Am I correct to understand from this article that this is possible?

diegoOP3y ago

That's exactly what I did here. https://github.com/dbasch/semantic-search-tweets

mustacheemperor3y ago

Thank you! Comparing this and the link the other commenter posted, what handles the actual search querying? Does instructor-xl include an LLM in addition to the embeddings? The other commenter's repo uses Pinecone for the embeddings and OpenAI for the LLM.

My apologies if I am completely mangling the vocabulary here - I have an, at best, rudimentary understanding of this stuff that I am trying to hack my education on.

Edit: If you're at the SF meetup tomorrow, I'd happily buy you a beverage in return for this explanation :)

eternalban3y ago

It's in the repo:

You first create embeddings. What is this? It's an n-dimensional vector space with your tweets 'embedded' in that space. Each word is an n-dimensional vector in this space. The vectorization is supposed to maintain 'semantic distance'. Basically, if two words are very close in meaning or related (by say frequently appearing next to each other in corpus) they should be 'close' in some of those n-dimensions as well. The result at the end is the '.bin' file, the 'semantic model' of your corpus.

https://github.com/dbasch/semantic-search-tweets/blob/main/e...

For semantic search, you run the same embedding algorithm against the query and take the resultant vectors and do similarity search via matrix ops, resulting in a set of results, with probabilities. These point back to the original source, here the tweets, and you just print the tweet(s) that you select from that result set (here the top 10).

https://github.com/dbasch/semantic-search-tweets/blob/main/s...

Experts can chime in here but there are knobs such as 'batch size' and the functions you use to index. (cosine was used here.)

So the various performance dimensions of the process should also be clear. There is a fixed cost of making the embeddings of your data. There is a per-op embedding of your query, and then running the similarity algorithm to find the result set.

1 more reply

celestialcheese3y ago

langchain and llama-index are two big opensource projects which are great for buildign this type of thing.

https://github.com/mayooear/gpt4-pdf-chatbot-langchain for example

mustacheemperor3y ago

Cheers, thank you!

sroussey3y ago

I have a system that download all my data from Google, Facebook, Twitter, and others. Geo data is fun to look at, but now the text and images have some more meaning to gleam. I’m thinking about going back to it. Not sure how to package a bunch of python stuff in an app though.

devxpy3y ago

https://gooey.ai/doc-search/?example_id=8ls7dpf6

No code needed :)

jeadie3y ago

You could try https://github.com/marqo-ai/marqo

fzliu3y ago

I've done some quick-and-dirty testing with OpenAI's embedding API + Zilliz Cloud. The 1st gen embeddings leave something to be desired (https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...), but the 2nd gen embeddings are actually fairly performant relative to many open source models with MLM loss.

I'll have to dig out the notebook that I created for this, but I'll try to post it here once I find it.

shirkey3y ago

Please do and thanks in advance for any insights you can provide -- it would be great to understand any benchmarking improvement with ada-002 from your previous findings, and whether you tested the specific OpenAI text-search-*-{query,doc} models as a comparison for large document search.

celestialcheese3y ago

Very interested in this - I've been using embeddings / semantic search doing information retrieval from PDFs, using ada-002, and have been impressed by the results in testing.

The reasons the article listed, namely a) lock-in and b) cost, have given me pause with embedding our whole corpus of data. I'd much rather use an open model but don't have much experience in evaluating these embedding models and search performance - still very new to me.

Like what you did with ada-002 vs Instruct XL, has there been any papers or prior work done evaluating the different embedding models?

VHRanger3y ago

You can find some comparisons and evaluation datasets/tasks here: https://www.sbert.net/docs/pretrained_models.html

Generally MiniLM is a good baseline. For faster models you want this library:

https://github.com/oborchers/Fast_Sentence_Embeddings

For higher quality ones, just take the bigger/slower models in the SentenceTransformers library

sroussey3y ago

Is there performance comparisons for Apple Silicon machines?

VHRanger3y ago

Performance in terms of model quality would be the same.

The fast-se library uses C++ code and word embeddings being averaged to generate sentence embeddings, so would be similarly fast, or faster on apple silicon than x86.

For the SentenceTransformer library models I'm not sure, but I think it would run off the CPU for a M1/M2 computer

nomadiccoder3y ago

The heat map of availability time, 98.58 (Jan), 99.07 (Feb), and 99.71 (Mar) trends upwards.

reportgunner3y ago

I don't think 3 data points is enough to establish a trend.

breckenedge3y ago

It’s fine to use their embeddings for a proof of concept, but since you don’t own it, you probably shouldn’t rely on it because it could go away at any time.

adeelk933y ago

Couldn’t you make that argument against all SaaS?

anshumankmr3y ago

Well some SaaS purists may believe that to be true about Microsoft Office 365. Hence we have, Microsoft Office 2021.

(Although there is a lot more advantages to just having Office 2021 like the flat fee)

breckenedge3y ago

Yes and no. Sometimes you get contracts that require “source code escrow” so that companies can run your source if you ever go out of business.

j / k navigate · click thread line to collapse

29 comments

nico3y ago

Is someone doing embeddings<>embeddings mapping?

For example, mapping embeddings of Llama to GPT-3?

That way you can see how similar the models “understand the world”.

fzliu3y ago

I'm curious about this as well. There are potentially many different versions of embedding models used in production and correlating different versions together could be very important.

sroussey3y ago

I’d be interested to see this as well. I guess you can make a test and see what happens. Ping me if you do!

KRAKRISMOTT3y ago

Isn't this the whole point behind the CLIP architecture?

nico3y ago

Could you explain a bit more what the CLIP architecture is? Any good links to a short video demoing it? Thank you!

KRAKRISMOTT3y ago

https://www.pinecone.io/learn/clip-image-search/

mustacheemperor3y ago

Could anyone point me towards a relatively beginner-friendly guide to do something like

>download all my tweets (about 20k) and build a semantic searcher on top ?

How can utilize 3rd party embeddings with OpenAI's LLM API? Am I correct to understand from this article that this is possible?

diegoOP3y ago

That's exactly what I did here. https://github.com/dbasch/semantic-search-tweets

mustacheemperor3y ago

My apologies if I am completely mangling the vocabulary here - I have an, at best, rudimentary understanding of this stuff that I am trying to hack my education on.

Edit: If you're at the SF meetup tomorrow, I'd happily buy you a beverage in return for this explanation :)

eternalban3y ago

It's in the repo:

https://github.com/dbasch/semantic-search-tweets/blob/main/e...

https://github.com/dbasch/semantic-search-tweets/blob/main/s...

Experts can chime in here but there are knobs such as 'batch size' and the functions you use to index. (cosine was used here.)

1 more reply

celestialcheese3y ago

langchain and llama-index are two big opensource projects which are great for buildign this type of thing.

https://github.com/mayooear/gpt4-pdf-chatbot-langchain for example

mustacheemperor3y ago

Cheers, thank you!

sroussey3y ago

devxpy3y ago

https://gooey.ai/doc-search/?example_id=8ls7dpf6

No code needed :)

jeadie3y ago

You could try https://github.com/marqo-ai/marqo

fzliu3y ago

I'll have to dig out the notebook that I created for this, but I'll try to post it here once I find it.

shirkey3y ago

celestialcheese3y ago

Very interested in this - I've been using embeddings / semantic search doing information retrieval from PDFs, using ada-002, and have been impressed by the results in testing.

Like what you did with ada-002 vs Instruct XL, has there been any papers or prior work done evaluating the different embedding models?

VHRanger3y ago

You can find some comparisons and evaluation datasets/tasks here: https://www.sbert.net/docs/pretrained_models.html

Generally MiniLM is a good baseline. For faster models you want this library:

https://github.com/oborchers/Fast_Sentence_Embeddings

For higher quality ones, just take the bigger/slower models in the SentenceTransformers library

sroussey3y ago

Is there performance comparisons for Apple Silicon machines?

VHRanger3y ago

Performance in terms of model quality would be the same.

The fast-se library uses C++ code and word embeddings being averaged to generate sentence embeddings, so would be similarly fast, or faster on apple silicon than x86.

For the SentenceTransformer library models I'm not sure, but I think it would run off the CPU for a M1/M2 computer

nomadiccoder3y ago

The heat map of availability time, 98.58 (Jan), 99.07 (Feb), and 99.71 (Mar) trends upwards.

reportgunner3y ago

I don't think 3 data points is enough to establish a trend.

breckenedge3y ago

It’s fine to use their embeddings for a proof of concept, but since you don’t own it, you probably shouldn’t rely on it because it could go away at any time.

adeelk933y ago

Couldn’t you make that argument against all SaaS?

anshumankmr3y ago

Well some SaaS purists may believe that to be true about Microsoft Office 365. Hence we have, Microsoft Office 2021.

(Although there is a lot more advantages to just having Office 2021 like the flat fee)

breckenedge3y ago

Yes and no. Sometimes you get contracts that require “source code escrow” so that companies can run your source if you ever go out of business.

j / k navigate · click thread line to collapse