The second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either. If you retrieve the top K vectors according to the vector index (instead of computing all the pairwise similarities exhaustively), that set of K vectors will likely be missing documents that have a higher cosine similarity than that of the K'th vector retrieved.
All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
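To make that concrete, here's a minimal sketch (synthetic data, numpy only) of measuring recall@K against a brute-force ground truth. The noisy "index" below is a toy stand-in for a real ANN index like hnswlib or faiss, just to show the over-retrieve-and-re-rank pattern:

```python
import numpy as np

def exact_top_k(query, docs, k):
    """Ground-truth top-k by cosine similarity (brute force)."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return set(np.argsort(-(d @ q))[:k])

def ann_top_k(query, docs, k, noise=0.05, seed=1):
    """Toy stand-in for an ANN index: exact similarities perturbed
    by noise, so the returned top-k can miss true neighbors."""
    rng = np.random.default_rng(seed)
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = d @ q + rng.normal(scale=noise, size=len(docs))
    return list(np.argsort(-scores)[:k])

def recall_at_k(retrieved, truth):
    """Fraction of the true top-k that was actually returned."""
    return len(set(retrieved) & truth) / len(truth)

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

truth = exact_top_k(query, docs, k=10)
direct = ann_top_k(query, docs, k=10)

# Over-retrieve 4x, then re-rank candidates with exact cosine similarity.
cand = ann_top_k(query, docs, k=40)
q = query / np.linalg.norm(query)
sims = [(i, docs[i] @ q / np.linalg.norm(docs[i])) for i in cand]
reranked = [i for i, _ in sorted(sims, key=lambda t: -t[1])[:10]]

print("recall@10 direct:", recall_at_k(direct, truth))
print("recall@10 over-retrieve + re-rank:", recall_at_k(reranked, truth))
```

In a real pipeline the re-ranker is usually a cross-encoder rather than exact cosine, but the measurement loop looks the same.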
> second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either
It's not unstated; it's called ANN (approximate nearest neighbor) for a reason.
Are they? A learned embedding doesn't guarantee this and a positional embedding certainly doesn't. Our latent embeddings don't either unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There are no guarantees that the similarities that they learn are the same things we consider as similarities. High dimensional space is really weird.
And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms, not dimensionality reduction. Just because they can find ways to cluster the data in a lower dimensional projection doesn't mean those points are similar in the higher dimensional space. It all depends on the ability to unknot the data in the higher dimensional space.
It is extremely important to do what the OP is doing and consider the assumptions of the model, data, and measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims.

There are plenty more assumptions than the OP listed, too. But those assumptions don't mean the model won't work; they just define the constraints the model is working under. We want to understand the constraints/assumptions if we want to make better models. Large models have advantages because they can have larger latent spaces, which gives them a lot of freedom to unknot data and move it around as they please. But that doesn't mean the methods are efficient.
They are related, and we frequently assume they are close enough that it doesn’t matter, but they are different.
If I'm using vectors for question/answer, then:
"What is a cat"
and
"What is a dog"
Should be more dissimilar than the documents answering either.
If I'm using it for FAQ filtering then they should be more similar.
hence heuristic.
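As a toy illustration of the point (bag-of-words counts standing in for a real embedding model, so the numbers are only suggestive), the two questions score higher against each other than against an answer:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def bow(s: str) -> Counter:
    return Counter(s.lower().split())

q1, q2 = "What is a cat", "What is a dog"
answer = "A cat is a small domesticated carnivorous mammal"

print(cosine(bow(q1), bow(q2)))      # the two questions share 3 of 4 words
print(cosine(bow(q1), bow(answer)))  # question vs. its own answer scores lower
```

Whether that ranking is what you want depends entirely on the task, which is the heuristic part.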
code: https://github.com/jimmc414/document_intelligence/blob/main/... https://github.com/jimmc414/document_intelligence
I've interpreted transformer vector similarity as 'likelihood to be followed by the same thing' which is close to word2vec's 'sum of likelihoods of all words to be replaced by the other set' (kinda), but also very different in some contexts.
Cosine similarity is a useful compromise, and yes, a lot of authors take it for granted. At the end of the day, an LLM product probably won't be evaluated on accuracy but rather "lift" over an alternative. And the evaluation will be in units of user happiness.
> All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
This is usually a Series E problem, not a Series A problem.
- Full SQL support
- Has good tooling around migrations (e.g. dbmate)
- Good support for running in Kubernetes or in the cloud
- Well understood by operations, e.g. backups and scaling
- Supports vectors and similarity search.
- Well supported client libraries
So basically Postgres and PgVector.
As a thought-experiment for people who don't understand why you need (for example) regular relational columns alongside vector storage, consider how you would implement RAG for a set of documents where not everyone has permission to view every document. In the pgvector case it's easy - I can add one or more label columns and then when I do my search query filter to only include labels that user has permission to view. Then my vector similarity results will definitely not include anything that violates my access control. Trivial with something like pgvector - basically impossible (afaics) with special-purpose vector stores.
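A minimal in-memory sketch of that pattern (hypothetical table and column names; the SQL in the comments assumes pgvector's `<=>` cosine-distance operator and a Postgres `text[]` label column):

```python
import numpy as np

# Hypothetical rows: (doc_id, embedding, acl_labels) -- mirrors a table like
#   CREATE TABLE docs (id int, embedding vector(3), labels text[]);
docs = [
    (1, np.array([0.9, 0.1, 0.0]), {"public"}),
    (2, np.array([0.8, 0.2, 0.1]), {"finance"}),
    (3, np.array([0.1, 0.9, 0.0]), {"public", "hr"}),
]

def search(query, user_labels, k=2):
    """Filter by ACL first, then rank survivors by cosine similarity --
    the same shape as:
        SELECT id FROM docs
        WHERE labels && %(user_labels)s
        ORDER BY embedding <=> %(query)s LIMIT k;"""
    q = query / np.linalg.norm(query)
    visible = [(i, e) for i, e, labels in docs if labels & user_labels]
    ranked = sorted(visible, key=lambda t: -float(t[1] @ q / np.linalg.norm(t[1])))
    return [i for i, _ in ranked[:k]]

print(search(np.array([1.0, 0.0, 0.0]), {"public"}))  # → [1, 3]; doc 2 never leaks
```

The point is that the filter runs before ranking, so access control can't be violated by similarity alone.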
Or think about ranking. Say you want to do RAG over a space where you want to prioritise the most recent results, not just pure similarity. Or prioritise on a set of other features somehow (source credibility whatever). Easy to do if you have relational columns, no bueno if you just have a vector store.
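One common way to do that once you have relational columns is a blended score; the weight and half-life below are made-up illustrative numbers, not recommendations:

```python
from datetime import datetime, timezone

def blended_score(similarity, published, now, half_life_days=30.0, w=0.3):
    """Blend cosine similarity with an exponential recency decay."""
    age_days = (now - published).total_seconds() / 86400.0
    recency = 0.5 ** (age_days / half_life_days)
    return (1 - w) * similarity + w * recency

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = blended_score(0.70, datetime(2024, 5, 31, tzinfo=timezone.utc), now)
stale = blended_score(0.75, datetime(2024, 3, 3, tzinfo=timezone.utc), now)
print(fresh > stale)  # a fresher doc can outrank a slightly more similar stale one
```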
And that's not to mention the obvious things around ACID, availability, recovery, replication, etc.
Maybe someone could pitch in. Is knowledge really a graph (for your problem domain), or is that just some bullshit people made up when they still thought AI could be captured mathematically? It feels to me now that knowledge is much more like the way vector embeddings work: it's a cloud where things are related to each other in an analog or statistical way, not a discrete way.
But, perhaps for similar reasons, vector embeddings haven't been super useful to me in building RAG agents yet. Knowledge is either relevant or it's not, and at least for me if it's relevant it has the keywords or tags I need, and just a straight up SQL query brings it in.
The Python and TS SDKs are designed to support drop-in replacements for the bits of LangChain that don’t scale, but nothing stops you accessing Postgres directly.
Disclosure: I’m the primary author.
In conversational AI, providing search results appended to a long-memory context produces "human-like" results.
Less so IMO when I’m on my phone or in front of the computer.
For customer chatbots, it seems that structured data - from an operational database or a feature store - adds more value. If the user asks about an order they made or a product they have a question about, you use the user-id (when logged in) to retrieve all info about what the user bought recently - the LLM will figure out what the prompt is referring to.
1. Will that query look like this:
SELECT LLM("{user_question}", order_info)
FROM postgres_data.order_table
WHERE user_id = '101';
2. How will a feature store, like Hopsworks, help in this app?

Shameless self-plug: We are building EvaDB [1], a query engine for shipping fast AI-powered apps with SQL. Would love to exchange notes on such apps if you're up for it!
You can train a small LLM on your private data to map the user question to tables in your db.
Then just select with a limit (or time-bounded). The feature store is just another operational store that could have relevant data for the query.
I would assume the embedding model isn't trained on code and specific words that are industry/company specific.
It's not explained how a vector DB is going to help while incumbents like GPT-4 can already call functions and do API calls.
It doesn't make AI less of a black box; that claim is irrelevant and never explained.
There are already existing ways to fine-tune models without expensive hardware, such as using LoRA to inject small trainable layers with customized training data, which trains in a fraction of the time and resources needed to retrain the model.
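For reference, the core of the LoRA idea fits in a few lines. This numpy sketch uses illustrative dimensions (not from any particular model) just to show why the trainable-parameter count collapses:

```python
import numpy as np

d, k, r = 768, 768, 8           # layer size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))     # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01
B = np.zeros((d, r))            # B starts at zero, so training starts from W

alpha = 16
def forward(x):
    # Base layer plus the low-rank correction; only A and B are trained.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params:,} vs {full_params:,} "
      f"({lora_params / full_params:.2%})")
```

At rank 8 on a 768x768 layer, the trainable matrices A and B hold about 2% of the original layer's parameters, which is where the time and memory savings come from.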
Everything else may be missing, but not the storage layer.
Surely if you’re posting an article promoting miraculous AI tech you should human edit the article summary so that it’s not really obviously drafted by AI.
Or just use the prompt “tone your writing down and please remember that you’re not writing for a high school student who is impressed by nonsensical hyperbole”. I’ve started using this prompt and it works astonishingly well in the fast evolving landscape of directionless content creation.
I've seen the diagrams in DL papers etc. but I guess everyone invents their own conventions, and the diagrams often don't convey the complete flow of information.
Visualizations are highly context and usage dependent anyway. Generally, there is no value in showing fully connected or feed forward layers in detail outside of teaching materials.
Well, in electrical circuit diagrams it is customary to draw e.g. a signal bus as a single connection, with the number of wires in the bus written next to it (with a little strike-through line). I'm guessing something similar can be done for DL networks.