OpenAI and the Pinecone database are not really needed for this task. A simple SBERT encoding of the product texts, followed by storing the vectors in a dense numpy array or faiss index, would be more than sufficient. Especially if one is operating in batch mode, the locality and simplicity can't be beat, and you can easily scale to 100k-1M texts in your corpus on commodity hardware or a VPS (though an NVMe disk will give a nice performance gain over a regular SSD).
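A minimal sketch of the dense-array idea, assuming the texts have already been encoded into fixed-size vectors (for example by an SBERT model). The random vectors below are stand-in data and `normalize` / `top_k_cosine` are hypothetical helper names; only numpy is used, so nothing needs to be downloaded to run it.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_cosine(corpus: np.ndarray, query: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query.

    corpus: (n, d) array of unit-length embeddings.
    query:  (d,) embedding; it is normalized here before scoring.
    """
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = corpus @ q                          # brute force: one matrix-vector product
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k candidates
    top = top[np.argsort(-scores[top])]          # sort only those k by score
    return top, scores[top]


# Stand-in data: in practice each row would come from an embedding model.
rng = np.random.default_rng(0)
corpus = normalize(rng.normal(size=(1000, 384)).astype(np.float32))

# A slightly perturbed copy of row 42 should retrieve row 42 first.
query = corpus[42] + 0.01 * rng.normal(size=384).astype(np.float32)
indices, scores = top_k_cosine(corpus, query, k=3)
```

If the brute-force product ever becomes the bottleneck, the same normalized array could be loaded into a faiss inner-product index instead; that swap is not shown here.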
...but I said WTH, googled SBERT, followed my nose, and got it installed in minutes on my Mac. They kindly included a cut/paste example of semantic search.
If I have a lot of data, it's cheaper and more efficient to build my own batch inference application and just use two well-known libraries (SBERT and the FAISS indexing library). That hadn't occurred to me. I come here for insights, and I got one here.
sounds like you ended up not using GPT3, which is probably wise.
i'm curious whether you might see further savings using other, cheaper embeddings that are available on huggingface, but it's probably not material at this point.
did you also consider using pgvector instead of pinecone? https://news.ycombinator.com/item?id=34684593 any pain points with pinecone you can recall?
I totally could! I think each use case should dictate which model you use. In my case I was not super cost- or latency-sensitive, since it was a small dataset and I cared more about accuracy. But I'm planning on using something like https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... for my next project, where latency and cost will matter more :)
I have a lot of thoughts around that last question! The Supabase article came out way after I implemented this (August of last year), so I didn't even think to do that, and I'm not sure it was even supported back then. I'd probably reach for it if I were redoing the project, to reduce the number of systems I needed. I think the power of having the vector search done in the same DB as the rest of the data is that sometimes you want structured filtering before the semantic/vector ranking (i.e. only select user N's items and rank by similarity to <query>), which is trickier to do in Pinecone. They support metadata filtering, but it feels like an afterthought. For the project I'm working on now (https://pinched.io), we'd like to filter on certain parameters as well as rank by relevance, so I'm going to explore combining structured querying with semantic search (e.g. pgvector, or something similar on DuckDB if it adds support for this).
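To make the "filter first, then rank" idea concrete, here is a rough in-memory sketch. The `Item` record, the `search_user_items` helper, and the two-dimensional embeddings are all made up for illustration; a real setup would push both the filter and the ranking into the database.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    item_id: int
    user_id: int
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_user_items(items, user_id, query_embedding, limit=2):
    """Structured filter first (one user's items), then rank by similarity."""
    candidates = [it for it in items if it.user_id == user_id]
    ranked = sorted(
        candidates,
        key=lambda it: cosine(it.embedding, query_embedding),
        reverse=True,
    )
    return [it.item_id for it in ranked[:limit]]


items = [
    Item(1, user_id=7, embedding=[1.0, 0.0]),
    Item(2, user_id=7, embedding=[0.6, 0.8]),
    Item(3, user_id=9, embedding=[0.99, 0.1]),  # closest overall, but another user's item
    Item(4, user_id=7, embedding=[0.0, 1.0]),
]

result = search_user_items(items, user_id=7, query_embedding=[0.9, 0.1])
print(result)  # [1, 2]
```

Item 3 is the nearest vector overall but never reaches the ranking step, which is the behaviour you want when the filter is a hard constraint. In Postgres with pgvector the same shape would be a `WHERE` clause followed by an `ORDER BY` on a vector distance.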
requested invite! i have a moderately large twitter so it could be a good test heheh. i normally use https://www.flock.network/ for this stuff but the UX isn't that great, so hoping for better.
I tested an early version of pgvector against faiss and found faiss had much better performance.
user query -> GPT3 response -> Lookup in VectorDB -> send response based on closest embedding in VectorDB
?
The optional step 2 is used when the documents you look up sit closer to an answer in latent space than to the original query text. This approach is called HyDE (first published here: https://arxiv.org/abs/2212.10496).
The synthesis is also optional. You can essentially summarize your lookups or refine them or do whatever you want at this stage.
If you skip steps 2 and 4, it's just a semantic search engine. If you skip only step 2, you're either doing it for latency/performance reasons, or because the user query's embeddings are already more similar to the docs in the vector DB.
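A toy sketch of that flow, with the step numbering inferred from this comment (1 = user query, 2 = optional hypothetical answer, 3 = vector lookup, 4 = optional synthesis). Nothing here calls a real model: `toy_embed` is a bag-of-words stand-in for an embedding model, `fake_llm` is a hard-coded stand-in for GPT3, and all names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Tiny fixed vocabulary for the stand-in embedding.
VOCAB = ["bill", "high", "charges", "usage", "quota", "office", "street"]


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: word counts over VOCAB."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return [float(words.count(term)) for term in VOCAB]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def answer(
    query: str,
    index: List[Tuple[str, List[float]]],
    embed: Callable[[str], List[float]],
    generate_hypothetical: Optional[Callable[[str], str]] = None,
    synthesize: Optional[Callable[[str, List[str]], str]] = None,
    k: int = 1,
):
    # Step 2 (optional): search with a hypothetical answer instead of the raw query.
    search_text = generate_hypothetical(query) if generate_hypothetical else query
    query_vec = embed(search_text)
    # Step 3: rank stored documents by similarity to the search vector.
    ranked = sorted(index, key=lambda pair: dot(pair[1], query_vec), reverse=True)
    docs = [doc for doc, _ in ranked[:k]]
    # Step 4 (optional): summarize or refine the lookups; otherwise return them as-is.
    return synthesize(query, docs) if synthesize else docs


DOC_BILLING = "Charges increase when usage exceeds the plan quota."
DOC_OFFICE = "Our office is on High Street."
index = [(doc, toy_embed(doc)) for doc in (DOC_BILLING, DOC_OFFICE)]


def fake_llm(query: str) -> str:
    """Stand-in for a model call that drafts a plausible answer."""
    return "Your bill is high because usage charges exceeded your quota."


query = "Why is my bill so high?"
plain = answer(query, index, toy_embed)                                 # steps 1 and 3 only
hyde = answer(query, index, toy_embed, generate_hypothetical=fake_llm)  # adds step 2
```

With the raw query, the only overlapping word is "high", so the office document wins. The drafted answer shares vocabulary with the billing document, so the HyDE-style lookup retrieves that one instead.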
So your vector DB holds well-formed prompts, users write random stuff, and you map it to the closest well-formed prompt?
'which helped launch the movement of those opposed to endocrine disruptors, was retracted and its author found to have committed scientific misconduct'
it’s his blog either way.
Did you consider something like OpenRefine or fuzzy matching / Levenshtein distance?
Seems like a common data cleaning ask with a small amount of data.
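For reference, a small sketch of what the Levenshtein route could look like using only the standard library. The product names and the `max_distance=2` cutoff are made-up assumptions, and `closest` is a hypothetical helper, not something from the original post.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest(name: str, canonical: List[str], max_distance: int = 2) -> Optional[str]:
    """Map a messy name to the nearest canonical name, or None if nothing is close."""
    key = name.strip().lower()
    best = min(canonical, key=lambda c: levenshtein(key, c.lower()))
    return best if levenshtein(key, best.lower()) <= max_distance else None


canonical_names = ["Cheddar cheese", "Cream cheese", "Whole milk"]

print(closest("Chedar cheese", canonical_names))  # Cheddar cheese
print(closest("Dish soap", canonical_names))      # None
```

This catches typos and small spelling variants, but it has no notion of meaning: two names for the same product that are spelled very differently would still need something like the embedding approach.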
In your case, does ChatGPT3 provide output based on the data you feed it? If so, is there any training involved to get the model to use your data?
I am trying to get a sense of what is going on.
edit: don't want to rant. it's not a bad post and i'm sure there are many far more wasteful examples than this.
wait till it's thousands, millions, billions...