When I have two paragraphs from one document, say p1 and p2, should I analyze them individually or first determine if they share some common information? If they do share information, should I then evaluate the common information (p1|2), p1 excluding the common information (p1-p1|2), and p2 excluding the common information (p2-p1|2)? This approach aims to reduce redundancy, but does it make sense?
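To make the decomposition concrete, here is a rough Python sketch of what I mean. It treats each paragraph as a set of sentences and splits the pair into the common part and the two paragraph-specific remainders; the naive split on periods and the example paragraphs are only illustrative placeholders, and a real system would detect overlap with embeddings rather than exact matches:

```python
def decompose(p1: str, p2: str):
    """Split two paragraphs into common and paragraph-specific parts.

    Illustrative only: sentences are compared by exact string match;
    a real pipeline would use semantic similarity instead.
    """
    s1 = {s.strip() for s in p1.split(".") if s.strip()}
    s2 = {s.strip() for s in p2.split(".") if s.strip()}
    common = s1 & s2       # p1|2
    only1 = s1 - common    # p1 - p1|2
    only2 = s2 - common    # p2 - p1|2
    return common, only1, only2

common, only1, only2 = decompose(
    "Dogs are loyal. Dogs need exercise.",
    "Dogs are loyal. Cats are independent.",
)
# common holds the shared sentence; only1 and only2 hold the rest
```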
Additionally, I would assign a 'label' to each paragraph. For example, p1 could have the label 'llm', p2 could have the label 'RAG', and p3 could have the label 'llm'. Filtering on these labels first, before any deeper comparison, might speed the system up. The label could also be an array of relevant words representing the essence of the paragraph. By matching paragraphs through their labels, I would know that p1 and p3 are somehow related. Again, does this approach make sense?
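A minimal sketch of the label lookup I have in mind, with labels stored as sets of words so a label "array" works the same way as a single tag (the paragraph IDs and labels below are just the examples from above, not real data):

```python
# Each paragraph carries a set of label words; related paragraphs are
# found by label overlap before any heavier content comparison.
paragraphs = {
    "p1": {"labels": {"llm"}},
    "p2": {"labels": {"RAG"}},
    "p3": {"labels": {"llm"}},
}

def related(pid: str) -> list[str]:
    """Return IDs of paragraphs sharing at least one label with pid."""
    target = paragraphs[pid]["labels"]
    return [
        other
        for other, meta in paragraphs.items()
        if other != pid and target & meta["labels"]
    ]

related("p1")  # p3 shares the 'llm' label, p2 does not
```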
Furthermore, regarding re-ranking or re-chunking: should the database, whether it's a vector DB, a knowledge graph, or a hybrid, be highly dynamic, or should it remain largely static?
Another question: When comparing two paragraphs, p1 and p2, should the comparison be at the paragraph level, or should it also be word by word? For example, consider the sentences s1 = "Dog sits here", s2 = "Dog sits there", and s3 = "Dog, sit!". Without using NLP or an LLM, simply comparing the words shows that s1 and s2 are closer to each other than either is to s3. Would an additional layer of comparison like this be helpful? And how should punctuation be handled, for example?
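Here is a sketch of the plain word-level comparison I mean, with one possible answer to the punctuation question: strip punctuation and lowercase before comparing, then score pairs with Jaccard similarity on the word sets (just one option among many, and without stemming, so 'sit' and 'sits' still count as different words):

```python
import string

def words(s: str) -> set[str]:
    """Lowercase a sentence and drop punctuation, returning its word set."""
    cleaned = s.translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.lower().split())

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: |intersection| / |union| of the word sets."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

s1, s2, s3 = "Dog sits here", "Dog sits there", "Dog, sit!"
jaccard(s1, s2)  # s1 and s2 share {dog, sits}: 2 of 4 words -> 0.5
jaccard(s1, s3)  # s1 and s3 share only {dog}: 1 of 4 words -> 0.25
```

With this scoring, s1 and s2 come out closer to each other than to s3, matching the intuition in the example; a stemmer or lemmatizer would be needed for 'sit' and 'sits' to match.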