I see a lot of these KG tools pop up, but they never solve the first problem I have, which is actually constructing the KG itself.
I have heard good things about Graphrag [1] (but what a stupid name). I did not have the time to try it properly, but it is supposed to build the knowledge graph itself somewhat transparently, using LLMs. This is a big stumbling block. At least vector stores are easy to understand and trivial to build.
It looks like KAG can do this from the summary on GitHub, but I could not really find how to do it in the documentation.
This is a common issue I've seen from LLM projects that only kind-of understand what is going on here and try and separate their vector database w/ semantic edge information into something that has a formal name.
NLP is fast but requires a model that is trained on an ontology that works with your data. Once you do, it’s a matter of simply feeling the model your bazillion CSVs and PDFs.
LLMs are slow but way easier to start as ontologies can be generated on the fly. This is a double edged sword however as LLMs have a tendency to lose fidelity and consistency on edge naming.
I work in NLP, which is the most used in practice as it’s far more consistent and explainable in very large corpora. But the difficulty in starting a fresh ontology dead ends many projects.
https://github.com/getzep/graphiti
I’m one of the authors. Happy to answer any questions.
Don't have time to scan the source code myself, but are you using the OpenAI python library, so the server URL can easily be changed? Didn't see it exposed by your library, so hoping it can at least be overridden with a env var, so we could use local LLMs instead.
This becomes a cyclical hallucination problem. The LLM hallucinates and create incorrect graph which in turn creates even more incorrect knowledge.
We are working on this issue of reducing hallucination in knowledge graphs and using LLM is not at all the right way.
So yes, there's a huge pile of tools and software for working with knowledge graphs, but to date populating the graph is still the realm of human experts.
Perhaps one needs to manually create a starting point then ask the LLM to propse links to various documents or follow an existing one.
Sufficiently loopable transversal should create a KG
I’ve noticed this too and the ironic thing is that building the KG is the most critical part of making everything work.
https://neuml.hashnode.dev/advanced-rag-with-graph-path-trav...
I’ve heard of a few very large companies using glean (https://www.glean.com/)
This is the route I’d take if I wanted to make a business around rag.
I’ve had good success with CIM for Utilities to build a network graph for modelling the distribution and transmission networks adding sensor and event data for monitoring and analysis about 15 years ago.
Anywhere there is a technology focussed consortium of vendors and users building standards you will likely find a prebuilt graph. When RDF was “hot” many of the these groups spun out some attempt to model their domain.
In summary, if you need one look for one. Maybe there’s one waiting for you and you get to do less convincing and more doing.
https://github.com/OpenSPG/KAG/blob/master/kag/builder/promp...
All you’re doing here is “front loading” AI: Imstead of running slow and expensive LLMs at query time, you run them at index time.
It’s a method for data augmentation or, in database lingo, index building. You use LLMs to add context to chunks that doesn’t exist on either the word level (searchable by BM25) or the semantic level (searchable by embeddings).
A simple version of this would be to ask an LLM:
“List all questions this chunk is answering.” [0]
But you can do the same thing for time frames, objects, styles, emotions — whatever you need a “handle” for to later retrieve via BM25 or semantic similarity.
I dreamed of doing that back in 2020, but it would’ve been prohibitively expensive. Because it requires passing your whole corpus through an LLM, possibly multiple times, once for each “angle”.
That being said, I recommend running any “Graph RAG” system you see here on HN over some 1% or so of your data. And then look inside the database. Look at all text chunks, original and synthetic, that are now in your index.
I’ve done this for a consulting client who absolutely wanted “Graph RAG”. I found the result to be an absolute mess. That is because these systems are built to cover a broad range of applications and are not adapted at all to your problem domain.
So I prefer working backwards:
What kinds of queries do I need to handle? What does the prompt to my query time LLM need to look like? What context will the LLM need? How can I have this context for each of my chunks, and be able to search by match air similarity? And now how can I make an LLM return exactly that kind of context, with as few hallucinations and as little filler as possible, for each of my chunks?
This gives you a very lean, very efficient index that can do everything you want.
[0] For a prompt, you’d add context and give the model “space to think”, especially when using a smaller model. Also, you’d instruct it to use a particular format, so you can parse out the part that you need. This “unfancy” approach lets you switch out models easily and compare them against each other without having to care about different APIs for “structured output”.
https://github.com/OpenSPG/KAG/blob/master/kag/builder/promp...
GraphRAG and a lot of the semantic indexes are simply vector database with pre-computed similarity edges which does not allow you to perform any reasoning over (the definition and intention of a knowledge graph).
This is probably worth looking at, its the first opensource project I've seen that is actually using LLMs to generate knowledge graphs. This does look pretty primitive for that task but it might be a useful reference for others going down this road.
Same findings here, re: legal text. Basic hybrid search performs better. In this use case the user knows what to look for, so the queries are specific. The advantage of graph RAG is when you need to integrate disparate sources for a holistic overview.
Just finished a call a few mins, and we came to conclusion we do natural query language, BM25 scoring with Tantivy based code first
https://github.com/quickwit-oss/tantivy
In meanwhile we collect all questions to ask LLM so we can be more consious at Hybrid Search implementation phase
If you want a transformational shift in terms of accuracy and reasoning, the answer is different. Many a times RAG accuracy suffers because the text is out of distribution, and ICL does not work well. You get away with it if all your data is in public domain in some form (ergo, llm was trained on it), else you keep seeing the gaps with no way to bridge them. I published a paper around it and how to effciently solve it, if interested. Here is a simplified blog post on the same: https://medium.com/@ankit_94177/expanding-knowledge-in-large...
Edit: Please reach out here or on email if you would like further details. I might have skipped too many things in the above comment.
But we need a theory on the differences too. Now it is kind of random how we differentiate the tools. We need ergonomics for llms.
This is realistic but hence going to be unpopular unfortunately, because people expect magic / want zero effort.
When I need to build something for an LLM to use, I ask the LLM to build it. That way, by definition, the LLM has a built in understanding of how the system should work, because the LLM itself invented it.
Similarly, when I was doing some experiments with a GPT-4 powered programmer, in the early days I had to omit most of the context (just have method stubs). During that time I noticed that most of the code written by GPT-4 was consistently the same. So I could omit its context because the LLM would already "know" (based on its mental model) what the code should be.
Thats not how an LLM works. It doesn't understand your question, nor the answer. It can only give you a statistically significant sequence of words that should follow what you gave it.
Really? I’m not sure that the word “understanding” means the same thing to you as it does to me.
That's not even correct, starring isn't going to do that. You'd need to smash that subscribe button and not forget the bell icon (metaphorically), not ~like~ star it.
At this point, the onus is on the developer to prove it's value through AB comparisons versus traditional RAG. No person/team has the bandwidth to try out this (n + 1) solution.
In fact, im wondering if thats what happened in the early noughts and we had the misfortune of Java, and still have the misfortune of Javascript.
This is actually attempting fact extraction into an ontology so you can reason over this instead of reasoning in the LLM.
> The white paper is only available for professional developers from different industries. We need to collect your name, contact information, email address, company name, industry type, position and your download purpose to verify your identity...
That's new.
,after_submitting: 'https://spg.openkg.cn/en-US/download?token=0a735e9a-72ea-11ee-b962-0242ac120002'
https://mdn.alipayobjects.com/huamei_xgb3qj/afts/file/A*6gpq...what exactly is being tokenized? RDS, OWL, Neo4j, ...?
how is the knowledge graph serialized?
We used neo4j as the graph database and used the LLM to generate parts of the spark queries.
2.2.
"The engine includes three types of operators: planning, reasoning, and retrieval, which transform natural language problems into problem solving processes that combine language and notation.
In this process, each step can use different operators, such as exact match retrieval, text retrieval, numerical calculation or semantic reasoning, so as to realize the integration of four different problem solving processes: Retrieval, Knowledge Graph reasoning, language reasoning and numerical calculation."
Retrieving one with low latency is another.