The second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either. If you retrieve the top K vectors according to the vector index (instead of computing all the pairwise similarities exhaustively), that set of K vectors will likely be missing documents that have a higher cosine similarity than that of the K'th vector retrieved.
All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
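To make that concrete, here's a minimal sketch (synthetic data, numpy only) of measuring recall@K against a brute-force ground truth. The noisy "index" below is a toy stand-in for a real ANN index like hnswlib or faiss, just to show the over-retrieve-and-re-rank pattern:

```python
import numpy as np

def exact_top_k(query, docs, k):
    """Ground-truth top-k by cosine similarity (brute force)."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return set(np.argsort(-(d @ q))[:k])

def ann_top_k(query, docs, k, noise=0.05, seed=1):
    """Toy stand-in for an ANN index: exact similarities perturbed
    by noise, so the returned top-k can miss true neighbors."""
    rng = np.random.default_rng(seed)
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = d @ q + rng.normal(scale=noise, size=len(docs))
    return list(np.argsort(-scores)[:k])

def recall_at_k(retrieved, truth):
    """Fraction of the true top-k that was actually returned."""
    return len(set(retrieved) & truth) / len(truth)

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

truth = exact_top_k(query, docs, k=10)
direct = ann_top_k(query, docs, k=10)

# Over-retrieve 4x, then re-rank candidates with exact cosine similarity.
cand = ann_top_k(query, docs, k=40)
q = query / np.linalg.norm(query)
sims = [(i, docs[i] @ q / np.linalg.norm(docs[i])) for i in cand]
reranked = [i for i, _ in sorted(sims, key=lambda t: -t[1])[:10]]

print("recall@10 direct:", recall_at_k(direct, truth))
print("recall@10 over-retrieve + re-rank:", recall_at_k(reranked, truth))
```

In a real pipeline the re-ranker is usually a cross-encoder rather than exact cosine, but the measurement loop looks the same.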
> second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either
It's not unstated; it's called ANN (approximate nearest neighbor) for a reason.
Are they? A learned embedding doesn't guarantee this and a positional embedding certainly doesn't. Our latent embeddings don't either unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There are no guarantees that the similarities that they learn are the same things we consider as similarities. High dimensional space is really weird.
And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms, not dimensionality reduction. Just because they can find ways to cluster the data in a lower dimensional projection doesn't mean those points are similar in the higher dimensional space. It all depends on the ability to unknot the data in the higher dimensional space.
It is extremely important to do what the OP is doing and consider the assumptions of the model, data, and measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims.

There are plenty more assumptions than the OP listed, too. But those assumptions don't mean the model won't work; they just define the constraints the model is working under. We want to understand the constraints/assumptions if we want to make better models. Large models have advantages because they can have larger latent spaces, which gives them a lot of freedom to unknot data and move it around as they please. But that doesn't mean the methods are efficient.
They are related, and we frequently assume they are close enough that it doesn’t matter, but they are different.
If I'm using vectors for question/answer, then:
"What is a cat"
and
"What is a dog"
Should be more dissimilar than the documents answering either.
If I'm using it for FAQ filtering then they should be more similar.
hence heuristic.
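As a toy illustration of the point (bag-of-words counts standing in for a real embedding model, so the numbers are only suggestive), the two questions score higher against each other than against an answer:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def bow(s: str) -> Counter:
    return Counter(s.lower().split())

q1, q2 = "What is a cat", "What is a dog"
answer = "A cat is a small domesticated carnivorous mammal"

print(cosine(bow(q1), bow(q2)))      # the two questions share 3 of 4 words
print(cosine(bow(q1), bow(answer)))  # question vs. its own answer scores lower
```

Whether that ranking is what you want depends entirely on the task, which is the heuristic part.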
code: https://github.com/jimmc414/document_intelligence/blob/main/... https://github.com/jimmc414/document_intelligence
I've interpreted transformer vector similarity as 'likelihood to be followed by the same thing' which is close to word2vec's 'sum of likelihoods of all words to be replaced by the other set' (kinda), but also very different in some contexts.
Cosine similarity is a useful compromise, and yes, a lot of authors take it for granted. At the end of the day, an LLM product probably won't be evaluated on accuracy but rather "lift" over an alternative. And the evaluation will be in units of user happiness.
> All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
This is usually a Series E problem, not a Series A problem.
- Full SQL support
- Has good tooling around migrations (e.g. dbmate)
- Good support for running in Kubernetes or in the cloud
- Well understood by operations, e.g. backups and scaling
- Supports vectors and similarity search.
- Well supported client libraries
So basically Postgres and PgVector.
As a thought-experiment for people who don't understand why you need (for example) regular relational columns alongside vector storage, consider how you would implement RAG for a set of documents where not everyone has permission to view every document. In the pgvector case it's easy - I can add one or more label columns and then when I do my search query filter to only include labels that user has permission to view. Then my vector similarity results will definitely not include anything that violates my access control. Trivial with something like pgvector - basically impossible (afaics) with special-purpose vector stores.
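A minimal in-memory sketch of that pattern (hypothetical table and column names; the SQL in the comments assumes pgvector's `<=>` cosine-distance operator and a Postgres `text[]` label column):

```python
import numpy as np

# Hypothetical rows: (doc_id, embedding, acl_labels) -- mirrors a table like
#   CREATE TABLE docs (id int, embedding vector(3), labels text[]);
docs = [
    (1, np.array([0.9, 0.1, 0.0]), {"public"}),
    (2, np.array([0.8, 0.2, 0.1]), {"finance"}),
    (3, np.array([0.1, 0.9, 0.0]), {"public", "hr"}),
]

def search(query, user_labels, k=2):
    """Filter by ACL first, then rank survivors by cosine similarity --
    the same shape as:
        SELECT id FROM docs
        WHERE labels && %(user_labels)s
        ORDER BY embedding <=> %(query)s LIMIT k;"""
    q = query / np.linalg.norm(query)
    visible = [(i, e) for i, e, labels in docs if labels & user_labels]
    ranked = sorted(visible, key=lambda t: -float(t[1] @ q / np.linalg.norm(t[1])))
    return [i for i, _ in ranked[:k]]

print(search(np.array([1.0, 0.0, 0.0]), {"public"}))  # → [1, 3]; doc 2 never leaks
```

The point is that the filter runs before ranking, so access control can't be violated by similarity alone.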
Or think about ranking. Say you want to do RAG over a space where you want to prioritise the most recent results, not just pure similarity. Or prioritise on a set of other features somehow (source credibility whatever). Easy to do if you have relational columns, no bueno if you just have a vector store.
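One common way to do that once you have relational columns is a blended score; the weight and half-life below are made-up illustrative numbers, not recommendations:

```python
from datetime import datetime, timezone

def blended_score(similarity, published, now, half_life_days=30.0, w=0.3):
    """Blend cosine similarity with an exponential recency decay."""
    age_days = (now - published).total_seconds() / 86400.0
    recency = 0.5 ** (age_days / half_life_days)
    return (1 - w) * similarity + w * recency

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = blended_score(0.70, datetime(2024, 5, 31, tzinfo=timezone.utc), now)
stale = blended_score(0.75, datetime(2024, 3, 3, tzinfo=timezone.utc), now)
print(fresh > stale)  # a fresher doc can outrank a slightly more similar stale one
```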
And that's not to mention the obvious things around ACID, availability, recovery, replication, etc.
Maybe someone could pitch in. Is knowledge really a graph (for your problem domain), or is that just some bullshit people made up when they still thought AI could be captured mathematically? It feels to me now that knowledge is much more like the way vector embeddings work: it's a cloud where things are related to each other in an analog or statistical way, not a discrete way.
But, perhaps for similar reasons, vector embeddings haven't been super useful to me in building RAG agents yet. Knowledge is either relevant or it's not, and at least for me if it's relevant it has the keywords or tags I need, and just a straight up SQL query brings it in.
The Python and TS SDKs are designed to support drop-in replacements for the bits of LangChain that don’t scale, but nothing stops you accessing Postgres directly.
Disclosure: I’m the primary author.
In conversational AI, providing search results appended to a long-memory context produces "human-like" results.
Less so IMO when I’m on my phone or in front of the computer.
For customer chatbots, it seems that structured data - from an operational database or a feature store - adds more value. If the user asks about an order they made or a product they have a question about, you use the user-id (when logged in) to retrieve all info about what the user bought recently - the LLM will figure out what the prompt is referring to.
1. Will that query look like this:
SELECT LLM("{user_question}", order_info)
FROM postgres_data.order_table
WHERE user_id = '101';
2. How will a feature store, like Hopsworks, help in this app?

Shameless self-plug: We are building EvaDB [1], a query engine for shipping fast AI-powered apps with SQL. Would love to exchange notes on such apps if you're up for it!
You can train a small LLM on your private data to map the user question to tables in your db.
Then just select with a limit (or time-bounded). The feature store is just another operational store that could have relevant data for the query.
I would assume the embedding model isn't trained on code and specific words that are industry/company specific.
It's not explained how a vector DB is going to help while incumbents like GPT-4 can already call functions and do API calls.
It doesn't make AI less of a black box; that claim is irrelevant and never explained.
There are already existing ways to fine-tune models without expensive hardware, such as using LoRA to inject small trainable layers with customized training data, which trains in a fraction of the time and resources needed to retrain the model.
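For reference, the core of the LoRA idea fits in a few lines. This numpy sketch uses illustrative dimensions (not from any particular model) just to show why the trainable-parameter count collapses:

```python
import numpy as np

d, k, r = 768, 768, 8           # layer size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))     # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01
B = np.zeros((d, r))            # B starts at zero, so training starts from W

alpha = 16
def forward(x):
    # Base layer plus the low-rank correction; only A and B are trained.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params:,} vs {full_params:,} "
      f"({lora_params / full_params:.2%})")
```

At rank 8 on a 768x768 layer, the trainable matrices A and B hold about 2% of the original layer's parameters, which is where the time and memory savings come from.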
Everything else may be missing, but not the storage layer.
Surely if you’re posting an article promoting miraculous AI tech you should human edit the article summary so that it’s not really obviously drafted by AI.
Or just use the prompt “tone your writing down and please remember that you’re not writing for a high school student who is impressed by nonsensical hyperbole”. I’ve started using this prompt and it works astonishingly well in the fast evolving landscape of directionless content creation.
I've seen the diagrams in DL papers etc. but I guess everyone invents their own conventions, and the diagrams often don't convey the complete flow of information.
Visualizations are highly context and usage dependent anyway. Generally, there is no value in showing fully connected or feed forward layers in detail outside of teaching materials.
Well, in electrical circuit diagrams it is customary to draw e.g. a signal bus as a single connection, with the number of wires in the bus written next to it (with a little strike-through line). I'm guessing something similar can be done for DL networks.