We believe that enabling custom dependencies and logic, as well as the ability to add/remove pipeline steps, is crucial. As of now, there is no definitive answer to the best chunk size or embedding model, so our project aims to provide the flexibility to inject and replace components and pipeline behavior.
Regarding scalability, LLM text generators and GPUs remain a limiting factor in this area as well. LLMs hold great potential for analyzing input data, and I believe the focus should be less on the speed of queues and storage and more on finding the optimal way to integrate LLMs into these pipelines.
Our current perspective has been on leveraging LLMs as part of async processes to help analyze data. This only really works when your data follows a template, so the same analysis can be applied across a vast number of documents. Otherwise it becomes too expensive to do on a per-document basis.
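To make the cost argument concrete, here is a back-of-the-envelope sketch. All numbers (pricing, token counts) are illustrative assumptions, not real quotes:

```python
# Rough cost of running one LLM analysis pass over a corpus.
# Prices and token counts below are made-up assumptions for illustration.

def analysis_cost(num_docs: int, tokens_per_doc: int,
                  price_per_1k_tokens: float) -> float:
    """Total cost of sending every document through an LLM once."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1000 * price_per_1k_tokens

# 1M templated documents, ~800 tokens each, at a hypothetical $0.001/1K tokens:
print(f"${analysis_cost(1_000_000, 800, 0.001):,.0f}")
# The same corpus as long free-form documents (~8,000 tokens each):
print(f"${analysis_cost(1_000_000, 8_000, 0.001):,.0f}")
```

A 10x difference in tokens per document is a 10x difference in spend, which is why templated data that supports a shared, reusable analysis is much cheaper than bespoke per-document processing.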
What types of analysis are you doing with LLMs? Have you started to integrate some of these into your existing solution?
Initial tests, though, are showing that summaries are affecting the quality of answers, so we'll probably remove them from the default flow and use them only for specific data types (e.g. chat logs).
There's a bunch of synthetic data scenarios we want to leverage LLMs for. Without going too much into details, sometimes "reading between the lines", and for some memory consolidation patterns (e.g. a "dream phase"), etc.
Is anyone aware of something similar but hooked into Google Cloud infra instead of Azure?
However, the recommended use is running it as a web service, so from a consumer perspective the language doesn't really matter.
The biggest surprise to me here is using Weaviate at the scale of billions — my understanding was that this would require tremendous amounts of memory (on the order of a TB of RAM), which is prohibitively expensive ($10-50k/month for that much memory).
Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.
Yeah, a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, it is imperative to be able to recover from a failure halfway through.
RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate, which has been a drawback from a cost perspective, but from a performance perspective it delivers on the requirements of our customers. (On Weaviate we explored using product quantization.)
What type of performance have you gotten with Lance, both on ingestion and retrieval? Is disk retrieval fast enough?
Were you using postgres already or migrated data into it?
You are right. Retrieval accuracy is important as well. From an accuracy perspective, any tools you have found useful in helping validate retrieval accuracy?
In our current architecture, all the different pieces within the RAG ingestion pipeline are modifiable to be able to improve loading, chunking and embedding.
As part of our development process, we have started to enable other tools that we don't talk about as much in the article, including a pre-processing and embeddings playground (https://www.neum.ai/post/pre-processing-playground) to test different combinations of modules against a piece of text. The idea is that you can establish your ideal pipeline / transformations, which can then be scaled.
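The "playground" idea can be sketched in a few lines: run several chunking configurations over the same text and compare the output. This is a toy character-based chunker for illustration, not Neum's actual implementation:

```python
# Minimal chunking "playground": try several chunk size / overlap
# combinations on the same text and compare the resulting chunks.

def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks with a sliding-window overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

sample = "Lorem ipsum dolor sit amet. " * 40   # stand-in document
for size, overlap in [(200, 0), (200, 50), (500, 100)]:
    chunks = chunk(sample, size, overlap)
    print(f"size={size} overlap={overlap} -> {len(chunks)} chunks")
```

In a real playground you would then embed each variant and eyeball (or measure) retrieval quality before committing to one configuration at scale.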
Even LLMs are dumb during training but smart during inference. So to make more useful training examples, we need to first "study" them with a model, making the implicit explicit, before training. This allows training to benefit from inference-stage smarts.
Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.
A RAG system should have an ingestion LLM step for retrieval augmentation and probably hierarchical summarisation up to a decent level. It will be adding insight into the system by processing the raw documents into a more useful form.
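The hierarchical summarisation step can be sketched as a fold: summarise groups of chunks, then groups of summaries, until one root summary remains. Here `summarise` is a stub standing in for a real LLM call; the fan-in of 4 is an arbitrary assumption:

```python
# Sketch of hierarchical summarisation over ingested chunks.
# `summarise` is a placeholder; a real system would prompt an LLM here.

def summarise(texts: list[str]) -> str:
    # Stand-in for an LLM call that condenses several texts into one.
    return f"summary({len(texts)} inputs)"

def hierarchical_summary(chunks: list[str], fan_in: int = 4) -> str:
    """Repeatedly fold groups of `fan_in` texts into one summary each."""
    level = chunks
    while len(level) > 1:
        level = [summarise(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]

print(hierarchical_summary([f"chunk {i}" for i in range(10)]))
```

Each intermediate level can also be indexed for retrieval, so queries can match at whatever level of abstraction fits best.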
You’ll probably want to start with the standard rank-based metrics like MRR, nDCG, and precision/recall@K.
Plus if you’re going to spend $$$ embedding tons of docs you’ll want to compare to a “dumb” baseline like bm25.
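Two of those metrics are simple enough to implement directly, which makes it easy to score both your embedding pipeline and a BM25 baseline on the same labeled queries. A plain-Python sketch:

```python
# Plain-Python versions of two standard retrieval metrics, handy for
# comparing an embedding pipeline against a BM25 baseline.

def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result; 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)

# One query: the system ranked d3 first, but the relevant docs are d2 and d5.
ranking = ["d3", "d2", "d9", "d5"]
print(reciprocal_rank(ranking, {"d2", "d5"}))   # 0.5 (first hit at rank 2)
print(recall_at_k(ranking, {"d2", "d5"}, k=3))  # 0.5 (only d2 in the top 3)
```

Averaging `reciprocal_rank` over a query set gives MRR; nDCG adds graded relevance and a log discount on top of the same ranked-list shape.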
Is everyone currently reinventing search from first principles?
You can use LLMs to do semantic search on top of a keyword search - by telling the LLM to come up with a good search term that includes all the synonyms. But if vector search over embeddings really gives better results than keyword search - then we should start using it in all the other search tools used by humans.
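The LLM-expands-the-keyword-query idea can be sketched like this. `ask_llm` is a stub standing in for a real chat-completion call, and the OR syntax assumes a Lucene-style keyword engine:

```python
# Sketch of LLM-driven query expansion for keyword search: ask the model
# for synonyms, then OR them together into one keyword query.

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call returning comma-separated synonyms.
    return "car, automobile, vehicle, motorcar"

def expand_query(term: str) -> str:
    prompt = (f"List synonyms of '{term}' for a keyword search engine, "
              "comma-separated, no explanations.")
    synonyms = [s.strip() for s in ask_llm(prompt).split(",")]
    return " OR ".join(f'"{s}"' for s in synonyms)

print(expand_query("car"))
```

The expanded query then goes to a plain keyword index, no embeddings required, which is what makes this an interesting baseline against vector search.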
LLMs are the more general tool - so adjusting them to the more restricted search technology should be easier and quicker to do instead of doing it the other way around.
By the way - this prompted me to create my Opinionated RAG wiki: https://github.com/zby/answerbot/wiki
Some questions require multi-hop reasoning or have to be decomposed into simpler subproblems. When you google a question, often the answer is not trivially included in the retrieved text: you have to process it (filter irrelevant information, resolve conflicting information, extrapolate to cases not covered, align the same entities referred to by two different names, etc.), formulate an answer to the original question, and maybe even predict your intent based on your history to personalize the result or customize it in the format you like (markdown, json, csv, etc.).
Researchers have developed many different techniques to solve the related problems. But as LLMs are getting hyped, many people try to tell you LLM+vector store is all you need.
It's still TBD on whether these new generations of language models will democratize search on bespoke corpuses.
There's going to be a lot of arbitrary alchemy and tribal knowledge...
[0] https://supabase.com/docs/guides/database/extensions/pgvecto...
[1] https://www.timescale.com/blog/postgresql-as-a-vector-databa...
[0]: https://www.timescale.com/blog/how-we-made-postgresql-the-be... [1]: https://github.com/timescale/python-vector [2]: https://www.timescale.com/ai/#resources
The biggest challenge - which I haven't solved as seamlessly as I'd like - is supporting updates / deletes in the source. You don't seem to discuss it in this post, does Neum handle that?
We do support updates for some sources; deletes not yet. For some sources we do polling, which is then dumped onto the queues. For others we have listeners that subscribe to changes.
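For polling-based sources, one common way to detect adds, updates, and deletes is to diff content hashes between polls. A minimal sketch (the snapshot shape is illustrative, not Neum's actual schema):

```python
# Detect adds / updates / deletes between two polls by diffing a stored
# {doc_id: content_hash} snapshot against the current source contents.

import hashlib

def diff_snapshot(previous: dict[str, str], current_docs: dict[str, str]):
    """previous: {doc_id: hash} from the last poll; current_docs: {doc_id: text}."""
    current = {doc_id: hashlib.sha256(text.encode()).hexdigest()
               for doc_id, text in current_docs.items()}
    added   = [d for d in current if d not in previous]
    deleted = [d for d in previous if d not in current]
    updated = [d for d in current
               if d in previous and current[d] != previous[d]]
    return added, updated, deleted, current

# First poll: everything is new.
added, updated, deleted, snap = diff_snapshot({}, {"a": "hello", "b": "world"})
print(added, updated, deleted)

# Second poll: "a" changed, "b" disappeared, "c" appeared.
added, updated, deleted, snap = diff_snapshot(snap, {"a": "hello!", "c": "new"})
print(added, updated, deleted)
```

Adds and updates map naturally onto re-running the ingestion pipeline for those documents; deletes are the awkward part, since they require removing the corresponding vectors downstream.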
What are the challenges you are facing in supporting this?
The OpenAI or Replicate embeddings APIs are already a managed service... You would still be self-managing it all, just against a different API.
And dealing with embeddings is the kind of fun work every engineer wants to do anyway.
Still a good article, but it's very perplexing how the company can exist.
https://python.langchain.com/docs/modules/model_io/prompts/p...
IMO the fun parts are actually prototyping and figuring out the right pattern I want to use for my solution. Once you have done that, scaling and dealing with robustness tends to be a bit less fun.
TL;DR: Queue upload events via SQS, upload files to s3, scale consumers based on queue length with keda and use haystack to turn files into embeddings.
This also works for arbitrary pipelines with your models and custom nodes (Python code snippets), and is pretty efficient.
Part1 (application&architecture): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Part2 (scaling): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Example code: https://github.com/ArzelaAscoIi/haystack-keda-indexing
We actually also started with celery, but moved to SQS to improve stability.