I guess it coincided with the social network phenomenon. Much more recently, geometric learning (ML on graphs and other structures) shone, until LLMs stole its thunder. I still think geometric learning has a lot of life left in it, and I would like to see it gain popularity.
Then there are "graph algorithms" such as PageRank, graph centrality, and such. In a lot of those cases there is one edge type, or a small number of edge types.
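PageRank itself is small enough to sketch in a few lines. Here is a hedged pure-Python power-iteration version (toy edge list; the damping factor and iteration count are conventional defaults, not tuned values):

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = set()
    for s, t in edges:
        nodes.update((s, t))
    out = {v: [] for v in nodes}
    for s, t in edges:
        out[s].append(t)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for s in nodes:
            targets = out[s]
            if targets:
                share = damping * rank[s] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: redistribute its rank uniformly.
                share = damping * rank[s] / n
                for t in nodes:
                    new[t] += share
        rank = new
    return rank
```

On a three-node cycle every node ends up with rank 1/3, which is a quick sanity check that the mass is conserved.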
There are some generic algorithms you can apply to graphs with many typed edges, such as the magic SPARQL pattern
?s1 ?p ?o .
?s2 ?p ?o .
which finds ?s1 and ?s2 that share a relationship ?p with some ?o, and is the basis for a similarity metric between ?s1 and ?s2. Then there are cases where you pick out nodes with some specific ?p and apply a graph algorithm to those.

The thing about graphs is that, in general, they are amorphous and could have any structure (or lack of structure) at all, which can be a disaster from a memory-latency perspective. Specific graphs usually do have some structure with some locality. There was a time I was using that magic SPARQL pattern and wrote a program that would have taken 100 years to run; after I repacked the data structures and discovered an approximation, the calculation ran in 20 minutes.
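That magic pattern is easy to sketch outside a SPARQL engine too. Here is a minimal pure-Python version that counts, for each pair of subjects, how many (predicate, object) pairs they share (the triple data is made up for illustration):

```python
from collections import defaultdict

def shared_relationship_counts(triples):
    """Given (s, p, o) triples, count shared (p, o) pairs per subject pair.

    This mirrors the ?s1 ?p ?o . ?s2 ?p ?o . pattern: two subjects are
    similar to the extent they point at the same objects via the same
    predicates.
    """
    by_po = defaultdict(set)
    for s, p, o in triples:
        by_po[(p, o)].add(s)
    counts = defaultdict(int)
    for subjects in by_po.values():
        subs = sorted(subjects)
        for i in range(len(subs)):
            for j in range(i + 1, len(subs)):
                counts[(subs[i], subs[j])] += 1
    return dict(counts)
```

A real system would normalize these counts (e.g. by each subject's degree) to get a proper similarity metric, and would care a lot about the memory layout of `by_po`, which is exactly the latency point above.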
Thus practitioners tend to be skeptical about general-purpose graph processing libraries: you may very well have a problem that I could code up a special-purpose answer to in less time than you'll spend fighting with the build system for that thing that runs 1000x faster.
----
If you really want to be fashionable though, arXiv today is just crammed with papers about "graph neural networks" that never seem to get hyped elsewhere. YOShInOn has made me a long queue of GNN papers to look at, but I've only skimmed a few. A lot of articles say they can be applied to the text-analysis problems I do, but they don't seem to really perform better than the system YOShInOn and I use, so I haven't been in a hurry to get into them.
A graph of typed pointers. As you likely know, the basic element of RDF is not “foo has a relationship with bar”, but “foo has a relationship with bar of type baz”.
Also, the types themselves can be part of relationships as in “baz has a relationship with quux of type foobar”
> The thing about graphs is, in general, they are amorphous and could have any structure (or lack of structure) at all which can be a disaster from a memory latency perspective
But that’s an implementation detail ;-)
In theory, the engine you use to store the graph could automatically optimize memory layout for both the data and the types of query that are run on it.
Practice is different.
> Thus practitioners tend to be skeptical about general purpose graph processing libraries
I am, too. I think the thing they're mostly good for is producing PhDs, both on the theory of querying them (ignoring performance) and on improving the performance of implementations.
Now that the Lotus Notes patents have long expired, I'd like to see some graph-database-based products that can do what Notes did 30 years ago, but it is lost technology, like the pyramids, Stonehenge, and how to make HTML form applications without React.
Nope, "of type baz" is not required.
2. Geometric learning is the broader category that subsumes graph neural networks.
I’ll also say I think graph algorithms are overrated. I mean, you know the diameter of some graph: who cares? Physicists (like me, back in the day) are notorious for plotting some statistic on log-log paper, seeing that the points sorta-kinda fall on a line if you squint, deciding that three of the points are really bug splats, yelling “UNIVERSALITY”, and sending it to Physical Review E, but the only thing that is universal is that none of them have ever heard of a statistical estimator or a significance test for power-law distributions. Node 7741 is the “most central” node, but does that make a difference? Maybe if you kill the top 1% most central nodes that will disrupt the terrorist network, but for most of us I don’t see high-quality insights coming out of graph algorithms most of the time.
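For what it's worth, the maximum-likelihood estimator those log-log plots ignore is a one-liner. A minimal sketch for a continuous power law p(x) ∝ x^(-α) for x ≥ x_min (toy data; a serious analysis would also search over x_min and run a goodness-of-fit test, per Clauset, Shalizi & Newman):

```python
import math

def powerlaw_alpha_mle(xs, xmin):
    """MLE exponent for a continuous power law p(x) ~ x^-alpha, x >= xmin.

    alpha_hat = 1 + n / sum(ln(x_i / xmin)) over the tail x_i >= xmin.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    return 1.0 + n / sum(math.log(x / xmin) for x in tail)
```

Compare this against the slope of a least-squares fit on log-log axes and they will often disagree badly, which is the point of the rant above.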
Integrations include:
* NetworkX -- https://networkx.org/
* DeepGraphLibrary -- https://www.dgl.ai/
* cuGraph (Rapids.ai Graph) -- https://docs.rapids.ai/api/cugraph/stable/
* PyG (PyTorch Geometric) -- https://pytorch-geometric.readthedocs.io/en/latest/
--
1: https://docs.arangodb.com/3.11/data-science/adapters/
2: https://github.com/arangodb/interactive_tutorials#machine-le...
bazel build //...
bazel test //...
bazel query //...
The last one should list all targets (from what I remember).

bazel run //in_memory/clustering:graph
ERROR: Cannot run target //in_memory/clustering:graph
I'm going to wait until someone updates the readme, I think!

If you `bazel build //...`, you should get the compiled libs under `bazel-out/*fastbuild/bin/`.
This in turn allows you to build and use only the 'package' you care about, without having to build the whole repo in other projects. Continuing the above example: if you only wanted to use the asynchronous_union_find.h header file in your project, you add the graph-mining library somewhere in your WORKSPACE file using a git_repository rule (see WORKSPACE.bazel for examples), then in a cc_library rule in a BUILD file inside your project you add `@graph-mining//in_memory/connected_components:asynchronous_union_find` to `deps`. Then you can include the header elsewhere. Building your project then only builds that package and its dependencies, not the entire graph-mining library.
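A rough sketch of what that wiring looks like, assuming the repo's own target layout (the target name `my_lib` and the pinned commit are placeholders, not from the repo):

```starlark
# In your WORKSPACE file -- pin a real commit, this one is a placeholder.
load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")

git_repository(
    name = "graph-mining",
    remote = "https://github.com/google/graph-mining.git",
    commit = "<pin-a-commit-here>",
)

# In a BUILD file inside your project.
cc_library(
    name = "my_lib",  # hypothetical target name
    srcs = ["my_lib.cc"],
    hdrs = ["my_lib.h"],
    deps = [
        "@graph-mining//in_memory/connected_components:asynchronous_union_find",
    ],
)
```

With that in place, `bazel build //:my_lib` pulls in only the union-find package and whatever it transitively depends on.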
edit: thanks to https://sdkman.io/ it's up and running. It wasn't _that_ bad after all.
Or do (better?) frameworks for the same function as this code already exist (maybe NetworkX?)?
What is Starland?
It's interesting and deceptively simple at first.
Generally I'd expect companies to open source things when it's proven itself internally and they want to reap the benefits of open source:
- Make internal engineers happy - engineers like having their code released outside the bounds of their company
- Prestige, which can help with hiring
- External contributions (not even code necessarily, just feedback from people who are using it can be amazingly useful for improving the software)
- Ability to hire people in the future who already know important parts of your technical stack, and don't need internal training on it
- Externally produced resources that help people learn how to use the software (tutorials, community discussion forums etc)
If the software is no longer used internally, open sourcing it is MORE expensive - first you have to get through the whole process of ensuring the code you are releasing can be released (lots of auditing, legal reviews etc), and then you'll have to deal with a flow of support requests which you can either deal with (expensive, especially if your internal knowledge of the tool is atrophying) or ignore (expensive to your reputation).
If your open source project/protocol is the most popular, and you have the governance over it, then you decide where it goes. Chromium is open source, but Google controls it, and everyone who depends on it has to follow. If Chromium was not open source, maybe Firefox would be more popular, and Google would not have control over that.
> or ignore (expensive to your reputation).
I don't think that anything is expensive for Google. They can do whatever they want.
FYI, I work at Google.
Open source at Google generally takes the form of libraries rather than products. Often, that's something that an individual engineer is working on, and it's easier to open source than get the copyright reassigned (since Google by default owns any code you write). There are also libraries that are open sourced for business reasons - e.g. SDKs. You can tell the difference, because most individually-driven libraries contain the copy "Not an official Google product" in the README.
Is "Graph Mining" so ubiquitous that people know what this is all about?
If you need docs just read the .h files, they have extensive comments. I’m sure they’ll add them or maybe, just maybe, you could write some to contribute.
This would have made some of my previous work much easier, it’s really nice to see google open source this.
curious if this is typical dev experience inside google..
Some code libraries had excellent docs (recordio, sstable, mapreduce). But yes, reading the header file was often the best place to start.
Reading the code, especially the header files, seems to be pretty standard for non-open-source code, as far as what I see. So it's been my typical dev experience. I'd say that if you're somewhere with gleaming, easy-to-understand docs that are actually up to date with the code, you all have too much time on your hands; but I serially work at startups that are racing to market.
Ten minutes spent on a README with some high-level details is an investment with a 100x return for the library's users.
Or just write a makefile and cut out all the bazel build optimization.
They don’t put instructions on how to start a F1 car inside the cockpit, you don’t hop into a fighter jet and look for the push to start easy button, it’s expected that when you’re at that level you bring expertise.
Either they rewrote git history, or it took about two years to get approval to make this repo public.
So probably they just started the project two years ago, had aspiration to open source, and finally just did now. Some teams might publish earlier, some like to wait until it's had enough internal usage to prove it out.
$ cat $(find . -type f | grep -vE LICENSE\|README\|BUILD\|bazel\|git\|docs) | sort -u | wc -l
8360 unique lines scattered across more than 100 files. Good luck deciphering that in a single day!
By the way, the first issue in the repo is a "Request for a more verbose README", which I agree with.
so as i see it you have like three options if you are unhappy with that:
1. close the tab
2. dig into the impl and learn as you go
3. do 2 but also write docs
i just really believe i've covered literally all the cases that any reasonable (not whiney, not entitled) person would concede.
> the first issue in the repo is a "Request for a more verbose README", which I agree with.
posted today - do you think it might have something to do with this post we find ourselves convening on? i.e. no one was so bothered about a lack of docs until now?
edit:
i forgot actually something else you could do: email the author and ask nicely for some tips.
https://github.com/google/graph-mining/blob/main/in_memory/c...
This, too:
https://github.com/google/graph-mining/blob/main/in_memory/s...
Same with most of the other files.
How is it usable? It's usable if you want to find data within lots and lots of data efficiently. That's kinda Google's thing. :-D