When I have two paragraphs from one document, say p1 and p2, should I analyze them individually or first determine if they share some common information? If they do share information, should I then evaluate the common information (p1|2), p1 excluding the common information (p1-p1|2), and p2 excluding the common information (p2-p1|2)? This approach aims to reduce redundancy, but does it make sense?
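To make the decomposition concrete, here is a rough Python sketch of what I mean. It treats each paragraph as a set of sentences and splits the pair into the common part and the two paragraph-specific remainders; the naive split on periods and the example paragraphs are only illustrative placeholders, and a real system would detect overlap with embeddings rather than exact matches:

```python
def decompose(p1: str, p2: str):
    """Split two paragraphs into common and paragraph-specific parts.

    Illustrative only: sentences are compared by exact string match;
    a real pipeline would use semantic similarity instead.
    """
    s1 = {s.strip() for s in p1.split(".") if s.strip()}
    s2 = {s.strip() for s in p2.split(".") if s.strip()}
    common = s1 & s2       # p1|2
    only1 = s1 - common    # p1 - p1|2
    only2 = s2 - common    # p2 - p1|2
    return common, only1, only2

common, only1, only2 = decompose(
    "Dogs are loyal. Dogs need exercise.",
    "Dogs are loyal. Cats are independent.",
)
# common holds the shared sentence; only1 and only2 hold the rest
```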
Additionally, I would assign a 'label' to each paragraph. For example, p1 could have the label 'llm', p2 could have the label 'RAG', and p3 could have the label 'llm'. Filtering on these labels first, before any deeper comparison, might speed the system up. The label could also be an array of relevant words representing the essence of the paragraph. By matching paragraphs through their labels, I would know that p1 and p3 are somehow related. Again, does this approach make sense?
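A minimal sketch of the label lookup I have in mind, with labels stored as sets of words so a label "array" works the same way as a single tag (the paragraph IDs and labels below are just the examples from above, not real data):

```python
# Each paragraph carries a set of label words; related paragraphs are
# found by label overlap before any heavier content comparison.
paragraphs = {
    "p1": {"labels": {"llm"}},
    "p2": {"labels": {"RAG"}},
    "p3": {"labels": {"llm"}},
}

def related(pid: str) -> list[str]:
    """Return IDs of paragraphs sharing at least one label with pid."""
    target = paragraphs[pid]["labels"]
    return [
        other
        for other, meta in paragraphs.items()
        if other != pid and target & meta["labels"]
    ]

related("p1")  # p3 shares the 'llm' label, p2 does not
```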
Furthermore, regarding re-ranking or re-chunking: should the database, whether it's a vector DB, a knowledge graph, or a hybrid, be highly dynamic, or should it remain largely static?
Another question: When comparing two paragraphs, p1 and p2, should the comparison be at the paragraph level, or should it also be word by word? For example, consider the sentences s1 = "Dog sits here", s2 = "Dog sits there", and s3 = "Dog, sit!". Without using NLP or an LLM, simply comparing the words shows that s1 and s2 are closer to each other than either is to s3. Would an additional layer of comparison like this be helpful? And how should punctuation be handled, for example?
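Here is a sketch of the plain word-level comparison I mean, with one possible answer to the punctuation question: strip punctuation and lowercase before comparing, then score pairs with Jaccard similarity on the word sets (just one option among many, and without stemming, so 'sit' and 'sits' still count as different words):

```python
import string

def words(s: str) -> set[str]:
    """Lowercase a sentence and drop punctuation, returning its word set."""
    cleaned = s.translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.lower().split())

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: |intersection| / |union| of the word sets."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

s1, s2, s3 = "Dog sits here", "Dog sits there", "Dog, sit!"
jaccard(s1, s2)  # s1 and s2 share {dog, sits}: 2 of 4 words -> 0.5
jaccard(s1, s3)  # s1 and s3 share only {dog}: 1 of 4 words -> 0.25
```

With this scoring, s1 and s2 come out closer to each other than to s3, matching the intuition in the example; a stemmer or lemmatizer would be needed for 'sit' and 'sits' to match.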