GRAG, in the direction of the MSR paper, adds some important pieces:
- summary indexes that can be lexical (document hierarchy) or not (topic, patient ID, etc.), especially via careful entity extraction & linking
- domain-optimized summarization templates, both automated & manual
- as mentioned, wider context around these at retrieval time
- a more generalized framework for handling different kinds of concept relations, summary indexing, and retrieval around these
Ex: the same patient over time & across docs, and separately, similar kinds of patients across documents
Note that I'm not actually a big fan of how the MSR paper indirects the work through KG extraction: that exits the semantic domain, and we don't do it that way.
Fundamentally, the summary-index approach both moves away from paltry retrieval result sets (too small, full of gaps, etc.) and enables cleaner input to the runtime query.
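To make the summary-index idea concrete, here is a rough Python sketch, not how the MSR paper or any particular library does it. It assumes entity extraction & linking has already produced a patient ID per chunk. The names (`Chunk`, `build_index`, `summarize`, `retrieve_with_context`) and the toy data are all made up for illustration, and `summarize` is a stand-in for a real domain-tuned summarization template.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str      # lexical grouping: which document the chunk came from
    patient_id: str  # non-lexical grouping: linked entity (assumed precomputed)
    text: str


def build_index(chunks, key):
    """Group chunks under one key; each group is what a summary would cover."""
    groups = defaultdict(list)
    for chunk in chunks:
        groups[key(chunk)].append(chunk)
    return dict(groups)


def summarize(group):
    """Placeholder for a summarization template (normally an LLM call)."""
    return " | ".join(c.text for c in group)


def retrieve_with_context(hit, indexes):
    """Expand one matched chunk to the summary of every group containing it."""
    return {name: summarize(index[key(hit)])
            for name, (key, index) in indexes.items()}


# Toy data, invented for this example.
chunks = [
    Chunk("note-2021", "p1", "A1C 7.9"),
    Chunk("note-2021", "p1", "start metformin"),
    Chunk("note-2023", "p1", "A1C 6.8"),
    Chunk("note-2022", "p2", "A1C 5.4"),
]

# One lexical index (by document) and one entity index (by patient).
keys = {
    "document": lambda c: c.doc_id,
    "patient": lambda c: c.patient_id,
}
indexes = {name: (key, build_index(chunks, key)) for name, key in keys.items()}

# A narrow hit on one chunk comes back with both kinds of wider context.
context = retrieve_with_context(chunks[0], indexes)
print(context)
```

The "similar kinds of patients across documents" case would be one more entry in `keys`, grouping by some cohort label instead of by patient ID; that label is another thing this sketch assumes you already have.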
I agree it is a quick win if quality can be low and you have a low budget/little time. E.g., combine a few out-of-the-box index types and do ranked retrieval. But a lot of the power gets lost. We are working on infra (+ OSSing it) because that is an unfortunate and unnecessary state of affairs. Right now LlamaIndex/LangChain and raw vector DBs feel like ad hoc and unprincipled ways to build these pipelines from both a software engineering and an AI perspective, so from an investment side, moving away from hacks and toward more semantic, composable, & scalable pipelines is important IMO.