We believe that enabling custom dependencies and logic, as well as the ability to add/remove pipeline steps, is crucial. As of now, there is no definitive answer to the best chunk size or embedding model, so our project aims to provide the flexibility to inject and replace components and pipeline behavior.
Regarding scalability, LLM text generators and GPUs remain a limiting factor in this area as well. LLMs hold great potential for analyzing input data, and I believe the focus should be less on the speed of queues and storage and more on finding the optimal way to integrate LLMs into these pipelines.
Our current perspective has been on leveraging LLMs as part of async processes to help analyze data. This only really works when your data follows a template, so the same analysis can be applied across a vast number of documents. Otherwise it becomes too expensive to do on a per-document basis.
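To make the cost argument concrete, here is a back-of-the-envelope sketch. All numbers (pricing, token counts) are illustrative assumptions, not real quotes:

```python
# Rough cost of running one LLM analysis pass over a corpus.
# Prices and token counts below are made-up assumptions for illustration.

def analysis_cost(num_docs: int, tokens_per_doc: int,
                  price_per_1k_tokens: float) -> float:
    """Total cost of sending every document through an LLM once."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1000 * price_per_1k_tokens

# 1M templated documents, ~800 tokens each, at a hypothetical $0.001/1K tokens:
print(f"${analysis_cost(1_000_000, 800, 0.001):,.0f}")
# The same corpus as long free-form documents (~8,000 tokens each):
print(f"${analysis_cost(1_000_000, 8_000, 0.001):,.0f}")
```

A 10x difference in tokens per document is a 10x difference in spend, which is why templated data that supports a shared, reusable analysis is much cheaper than bespoke per-document processing.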
What types of analysis are you doing with LLMs? Have you started to integrate some of these into your existing solution?
Initial tests, though, are showing that summaries are affecting the quality of answers, so we'll probably remove them from the default flow and use them only for specific data types (e.g. chat logs).
There's a bunch of synthetic data scenarios we want to leverage LLMs for. Without going too much into details, sometimes "reading between the lines", and for some memory consolidation patterns (e.g. a "dream phase"), etc.
Is anyone aware of something similar but hooked into Google Cloud infra instead of Azure?
However, the recommended use is running it as a web service, so from a consumer perspective the language doesn't really matter.
The biggest surprise to me here is using Weaviate at the scale of billions — my understanding was that this would require tremendous amounts of memory (on the order of a TB of RAM), which is prohibitively expensive ($10-50k/month for that much memory).
Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.
Yeah, a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, it is imperative to be able to recover from a failure halfway through.
RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate, which has been a drawback from a cost perspective, but from a performance perspective it delivers on the requirements of our customers. (On Weaviate we explored using product quantization.)
What type of performance have you gotten with Lance, both on ingestion and retrieval? Is disk retrieval fast enough?
Were you using postgres already or migrated data into it?
You are right. Retrieval accuracy is important as well. From an accuracy perspective, any tools you have found useful in helping validate retrieval accuracy?
In our current architecture, all the different pieces within the RAG ingestion pipeline are modifiable to be able to improve loading, chunking and embedding.
As part of our development process, we have started to enable other tools that we don't talk about as much in the article, including a pre-processing and embeddings playground (https://www.neum.ai/post/pre-processing-playground) to test different combinations of modules against a piece of text. The idea is that you can establish your ideal pipeline / transformations, which can then be scaled.
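The "playground" idea can be sketched in a few lines: run several chunking configurations over the same text and compare the output. This is a toy character-based chunker for illustration, not Neum's actual implementation:

```python
# Minimal chunking "playground": try several chunk size / overlap
# combinations on the same text and compare the resulting chunks.

def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks with a sliding-window overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

sample = "Lorem ipsum dolor sit amet. " * 40   # stand-in document
for size, overlap in [(200, 0), (200, 50), (500, 100)]:
    chunks = chunk(sample, size, overlap)
    print(f"size={size} overlap={overlap} -> {len(chunks)} chunks")
```

In a real playground you would then embed each variant and eyeball (or measure) retrieval quality before committing to one configuration at scale.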
Even LLMs are dumb during training but smart during inference. So to make more useful training examples, we need to first "study" them with a model, making the implicit explicit, before training. This allows training to benefit from inference-stage smarts.
Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.
A RAG system should have an ingestion LLM step for retrieval augmentation and probably hierarchical summarisation up to a decent level. It will be adding insight into the system by processing the raw documents into a more useful form.
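The hierarchical summarisation step can be sketched as a fold: summarise groups of chunks, then groups of summaries, until one root summary remains. Here `summarise` is a stub standing in for a real LLM call; the fan-in of 4 is an arbitrary assumption:

```python
# Sketch of hierarchical summarisation over ingested chunks.
# `summarise` is a placeholder; a real system would prompt an LLM here.

def summarise(texts: list[str]) -> str:
    # Stand-in for an LLM call that condenses several texts into one.
    return f"summary({len(texts)} inputs)"

def hierarchical_summary(chunks: list[str], fan_in: int = 4) -> str:
    """Repeatedly fold groups of `fan_in` texts into one summary each."""
    level = chunks
    while len(level) > 1:
        level = [summarise(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]

print(hierarchical_summary([f"chunk {i}" for i in range(10)]))
```

Each intermediate level can also be indexed for retrieval, so queries can match at whatever level of abstraction fits best.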
You’ll probably want to start with the standard rank-based metrics like MRR, nDCG, and precision/recall@K.
Plus if you’re going to spend $$$ embedding tons of docs you’ll want to compare to a “dumb” baseline like bm25.
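Two of those metrics are simple enough to implement directly, which makes it easy to score both your embedding pipeline and a BM25 baseline on the same labeled queries. A plain-Python sketch:

```python
# Plain-Python versions of two standard retrieval metrics, handy for
# comparing an embedding pipeline against a BM25 baseline.

def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result; 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)

# One query: the system ranked d3 first, but the relevant docs are d2 and d5.
ranking = ["d3", "d2", "d9", "d5"]
print(reciprocal_rank(ranking, {"d2", "d5"}))   # 0.5 (first hit at rank 2)
print(recall_at_k(ranking, {"d2", "d5"}, k=3))  # 0.5 (only d2 in the top 3)
```

Averaging `reciprocal_rank` over a query set gives MRR; nDCG adds graded relevance and a log discount on top of the same ranked-list shape.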
Is everyone currently reinventing search from first principles?
You can use LLMs to do semantic search on top of a keyword search - by telling the LLM to come up with a good search term that includes all the synonyms. But if vector search over embeddings really gives better results than keyword search - then we should start using it in all the other search tools used by humans.
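The LLM-expands-the-keyword-query idea can be sketched like this. `ask_llm` is a stub standing in for a real chat-completion call, and the OR syntax assumes a Lucene-style keyword engine:

```python
# Sketch of LLM-driven query expansion for keyword search: ask the model
# for synonyms, then OR them together into one keyword query.

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call returning comma-separated synonyms.
    return "car, automobile, vehicle, motorcar"

def expand_query(term: str) -> str:
    prompt = (f"List synonyms of '{term}' for a keyword search engine, "
              "comma-separated, no explanations.")
    synonyms = [s.strip() for s in ask_llm(prompt).split(",")]
    return " OR ".join(f'"{s}"' for s in synonyms)

print(expand_query("car"))
```

The expanded query then goes to a plain keyword index, no embeddings required, which is what makes this an interesting baseline against vector search.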
LLMs are the more general tool - so adjusting them to the more restricted search technology should be easier and quicker to do instead of doing it the other way around.
By the way - this prompted me to create my Opinionated RAG wiki: https://github.com/zby/answerbot/wiki
Some questions require multi-hop reasoning or have to be decomposed into simpler subproblems. When you google a question, often the answer is not trivially included in the retrieved text: you have to process it (filter irrelevant information, resolve conflicting information, extrapolate to cases not covered, align the same entities referred to by two different names, etc.), formulate an answer to the original question, and maybe even predict your intent based on your history to personalize the result or customize it in the format you like (markdown, json, csv, etc.).
Researchers have developed many different techniques to solve the related problems. But as LLMs are getting hyped, many people try to tell you LLM+vector store is all you need.
It's still TBD on whether these new generations of language models will democratize search on bespoke corpuses.
There's going to be a lot of arbitrary alchemy and tribal knowledge...
[0] https://supabase.com/docs/guides/database/extensions/pgvecto...
[1] https://www.timescale.com/blog/postgresql-as-a-vector-databa...
[0]: https://www.timescale.com/blog/how-we-made-postgresql-the-be... [1]: https://github.com/timescale/python-vector [2]: https://www.timescale.com/ai/#resources
The biggest challenge - which I haven't solved as seamlessly as I'd like - is supporting updates / deletes in the source. You don't seem to discuss it in this post, does Neum handle that?
We do support updates for some sources; deletes not yet. For some sources we do polling, which is then dumped onto the queues. For others we have listeners that subscribe to changes.
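For polling-based sources, one common way to detect adds, updates, and deletes is to diff content hashes between polls. A minimal sketch (the snapshot shape is illustrative, not Neum's actual schema):

```python
# Detect adds / updates / deletes between two polls by diffing a stored
# {doc_id: content_hash} snapshot against the current source contents.

import hashlib

def diff_snapshot(previous: dict[str, str], current_docs: dict[str, str]):
    """previous: {doc_id: hash} from the last poll; current_docs: {doc_id: text}."""
    current = {doc_id: hashlib.sha256(text.encode()).hexdigest()
               for doc_id, text in current_docs.items()}
    added   = [d for d in current if d not in previous]
    deleted = [d for d in previous if d not in current]
    updated = [d for d in current
               if d in previous and current[d] != previous[d]]
    return added, updated, deleted, current

# First poll: everything is new.
added, updated, deleted, snap = diff_snapshot({}, {"a": "hello", "b": "world"})
print(added, updated, deleted)

# Second poll: "a" changed, "b" disappeared, "c" appeared.
added, updated, deleted, snap = diff_snapshot(snap, {"a": "hello!", "c": "new"})
print(added, updated, deleted)
```

Adds and updates map naturally onto re-running the ingestion pipeline for those documents; deletes are the awkward part, since they require removing the corresponding vectors downstream.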
What are the challenges you are facing in supporting this?
The OpenAI or Replicate embeddings APIs are already a managed service... You would still be self-managing it all, just against a different API.
And dealing with embeddings is the kind of fun work every engineer wants to do anyway.
Still a good article, but it's very perplexing how the company can exist.
https://python.langchain.com/docs/modules/model_io/prompts/p...
IMO the fun parts are actually prototyping and figuring out the right pattern I want to use for my solution. Once you have done that, scaling and dealing with robustness tends to be a bit less fun.
TL;DR: Queue upload events via SQS, upload files to s3, scale consumers based on queue length with keda and use haystack to turn files into embeddings.
This also works for arbitrary pipelines with your models and custom nodes (Python code snippets), and is pretty efficient.
Part1 (application&architecture): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Part2 (scaling): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Example code: https://github.com/ArzelaAscoIi/haystack-keda-indexing
We actually also started with celery, but moved to SQS to improve stability.