I guess it coincided with the social network phenomenon. Much more recently, geometric learning (ML on graphs and other structures) shone, until LLMs stole its thunder. I still think geometric learning has a lot of life left in it, and I would like to see it gain popularity.
Then there are "graph algorithms" such as PageRank, graph centrality, and such. In a lot of those cases there is one edge type, or a small number of edge types.
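PageRank itself is small enough to sketch in a few lines. Here is a hedged pure-Python power-iteration version (toy edge list; the damping factor and iteration count are conventional defaults, not tuned values):

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = set()
    for s, t in edges:
        nodes.update((s, t))
    out = {v: [] for v in nodes}
    for s, t in edges:
        out[s].append(t)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for s in nodes:
            targets = out[s]
            if targets:
                share = damping * rank[s] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: redistribute its rank uniformly.
                share = damping * rank[s] / n
                for t in nodes:
                    new[t] += share
        rank = new
    return rank
```

On a three-node cycle every node ends up with rank 1/3, which is a quick sanity check that the mass is conserved.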
There are some generic algorithms you can apply to graphs with many typed edges, such as the magic SPARQL pattern
?s1 ?p ?o .
?s2 ?p ?o .
which finds ?s1 and ?s2 that share a relationship ?p with some ?o, and is the basis for a similarity metric between ?s1 and ?s2. Then there are cases where you pick out nodes with some specific ?p and apply a graph algorithm to those.

The thing about graphs is that, in general, they are amorphous and could have any structure (or lack of structure) at all, which can be a disaster from a memory-latency perspective. Specific graphs usually do have some structure with some locality. There was a time I was using that magic SPARQL pattern and wrote a program that would have taken 100 years to run; after I repacked the data structures and discovered an approximation, the calculation ran in 20 minutes.
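That magic pattern is easy to sketch outside a SPARQL engine too. Here is a minimal pure-Python version that counts, for each pair of subjects, how many (predicate, object) pairs they share (the triple data is made up for illustration):

```python
from collections import defaultdict

def shared_relationship_counts(triples):
    """Given (s, p, o) triples, count shared (p, o) pairs per subject pair.

    This mirrors the ?s1 ?p ?o . ?s2 ?p ?o . pattern: two subjects are
    similar to the extent they point at the same objects via the same
    predicates.
    """
    by_po = defaultdict(set)
    for s, p, o in triples:
        by_po[(p, o)].add(s)
    counts = defaultdict(int)
    for subjects in by_po.values():
        subs = sorted(subjects)
        for i in range(len(subs)):
            for j in range(i + 1, len(subs)):
                counts[(subs[i], subs[j])] += 1
    return dict(counts)
```

A real system would normalize these counts (e.g. by each subject's degree) to get a proper similarity metric, and would care a lot about the memory layout of `by_po`, which is exactly the latency point above.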
Thus practitioners tend to be skeptical about general-purpose graph processing libraries: you may very well have a problem that I could code up a special-purpose answer to in less time than you'll spend fighting with the build system for that thing that runs 1000x faster.
----
If you really want to be fashionable though, arXiv today is just crammed with papers about "graph neural networks" that never seem to get hyped elsewhere. YOShInOn has made me a long queue of GNN papers to look at, but I've only skimmed a few. A lot of articles say they can be applied to the text-analysis problems I do, but they don't seem to really perform better than the system YOShInOn and I use, so I haven't been in a hurry to get into them.
A graph of typed pointers. As you likely know, the basic element of RDF is not “foo has a relationship with bar”, but “foo has a relationship with bar of type baz”.
Also, the types themselves can be part of relationships as in “baz has a relationship with quux of type foobar”
> The thing about graphs is, in general, they are amorphous and could have any structure (or lack of structure) at all which can be a disaster from a memory latency perspective
But that’s an implementation detail ;-)
In theory, the engine you use to store the graph could automatically optimize memory layout for both the data and the types of query that are run on it.
Practice is different.
> Thus practitioners tend to be skeptical about general purpose graph processing libraries
I am, too. I think the thing they're mostly good for is producing PhDs, both on the theory of querying them (ignoring performance) and on improving the performance of implementations.
Now that the Lotus Notes patents have long expired, I'd like to see some graph-database-based products that can do what Notes did 30 years ago, but it is lost technology, like the pyramids, Stonehenge, and how to make HTML form applications without React.
Nope, "of type baz" is not required.
2. Geometric learning is the broader category that subsumes graph neural networks.
I’ll also say I think graph algorithms are overrated. I mean, you know the diameter of some graph: who cares? Physicists (like me, back in the day) are notorious for plotting some statistic on log-log paper, seeing that the points sorta-kinda fall on a line if you squint, deciding that three of the points are really bug splats, yelling “UNIVERSALITY”, and sending it to Physical Review E, but the only thing that is universal is that none of them have ever heard of a statistical estimator or a significance test for power-law distributions. Node 7741 is the “most central” node, but does that make a difference? Maybe if you kill the top 1% most central nodes that will disrupt the terrorist network, but for most of us I don’t see high-quality insights coming out of graph algorithms most of the time.
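For what it's worth, the maximum-likelihood estimator those log-log plots ignore is a one-liner. A minimal sketch for a continuous power law p(x) ∝ x^(-α) for x ≥ x_min (toy data; a serious analysis would also search over x_min and run a goodness-of-fit test, per Clauset, Shalizi & Newman):

```python
import math

def powerlaw_alpha_mle(xs, xmin):
    """MLE exponent for a continuous power law p(x) ~ x^-alpha, x >= xmin.

    alpha_hat = 1 + n / sum(ln(x_i / xmin)) over the tail x_i >= xmin.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    return 1.0 + n / sum(math.log(x / xmin) for x in tail)
```

Compare this against the slope of a least-squares fit on log-log axes and they will often disagree badly, which is the point of the rant above.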
Integrations include:
* NetworkX -- https://networkx.org/
* DeepGraphLibrary -- https://www.dgl.ai/
* cuGraph (Rapids.ai Graph) -- https://docs.rapids.ai/api/cugraph/stable/
* PyG (PyTorch Geometric) -- https://pytorch-geometric.readthedocs.io/en/latest/
--
1: https://docs.arangodb.com/3.11/data-science/adapters/
2: https://github.com/arangodb/interactive_tutorials#machine-le...
bazel build //...
bazel test //...
bazel query //...
The last one should list all targets (from what I remember).

bazel run //in_memory/clustering:graph
ERROR: Cannot run target //in_memory/clustering:graph
I'm going to wait until someone updates the readme, I think!

If you `bazel build //...`, you should get the compiled libs under `bazel-out/*fastbuild/bin/`.
This in turn allows you to build and use only the 'package' you care about, without having to build the whole repo in other projects. Continuing the above example: if you only wanted to use the asynchronous_union_find.h header file in your project, you add the graph-mining library somewhere in your WORKSPACE file using a git_repository rule (see WORKSPACE.bazel for examples), then in a cc_library rule in a BUILD file inside your project you add `@graph-mining//in_memory/connected_components:asynchronous_union_find` to `deps`. Then you can include the header elsewhere. Building your project then only builds that package and its dependencies, not the entire graph-mining library.
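A rough sketch of what that wiring looks like, assuming the repo's own target layout (the target name `my_lib` and the pinned commit are placeholders, not from the repo):

```starlark
# In your WORKSPACE file -- pin a real commit, this one is a placeholder.
load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")

git_repository(
    name = "graph-mining",
    remote = "https://github.com/google/graph-mining.git",
    commit = "<pin-a-commit-here>",
)

# In a BUILD file inside your project.
cc_library(
    name = "my_lib",  # hypothetical target name
    srcs = ["my_lib.cc"],
    hdrs = ["my_lib.h"],
    deps = [
        "@graph-mining//in_memory/connected_components:asynchronous_union_find",
    ],
)
```

With that in place, `bazel build //:my_lib` pulls in only the union-find package and whatever it transitively depends on.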
edit: thanks to https://sdkman.io/ it's up and running. It wasn't _that_ bad after all.
Or do (better?) frameworks for the same function as this code already exist (maybe NetworkX?)?
What is Starland?
It's interesting and deceptively simple at first.
Generally I'd expect companies to open source things when it's proven itself internally and they want to reap the benefits of open source:
- Make internal engineers happy - engineers like having their code released outside the bounds of their company
- Prestige, which can help with hiring
- External contributions (not even code necessarily, just feedback from people who are using it can be amazingly useful for improving the software)
- Ability to hire people in the future who already know important parts of your technical stack, and don't need internal training on it
- Externally produced resources that help people learn how to use the software (tutorials, community discussion forums etc)
If the software is no longer used internally, open sourcing it is MORE expensive - first you have to get through the whole process of ensuring the code you are releasing can be released (lots of auditing, legal reviews etc), and then you'll have to deal with a flow of support requests which you can either deal with (expensive, especially if your internal knowledge of the tool is atrophying) or ignore (expensive to your reputation).
If your open source project/protocol is the most popular, and you have the governance over it, then you decide where it goes. Chromium is open source, but Google controls it, and everyone who depends on it has to follow. If Chromium was not open source, maybe Firefox would be more popular, and Google would not have control over that.
> or ignore (expensive to your reputation).
I don't think that anything is expensive for Google. They can do whatever they want.
FYI, I work at Google.
Open source at Google generally takes the form of libraries rather than products. Often, that's something that an individual engineer is working on, and it's easier to open source than get the copyright reassigned (since Google by default owns any code you write). There are also libraries that are open sourced for business reasons - e.g. SDKs. You can tell the difference, because most individually-driven libraries contain the copy "Not an official Google product" in the README.
Is "Graph Mining" so ubiquitous that people know what this is all about?
If you need docs just read the .h files, they have extensive comments. I’m sure they’ll add them or maybe, just maybe, you could write some to contribute.
This would have made some of my previous work much easier, it’s really nice to see google open source this.
curious if this is typical dev experience inside google..
Some code libraries had excellent docs (recordio, sstable, mapreduce). But yes, reading the header file was often the best place to start.
Reading the code, especially the header files, seems to be pretty standard for non-open-source code, as far as what I see. So it's been my typical dev experience. I'd say that if you're somewhere with gleaming, easy-to-understand docs that are actually up to date with the code, you all have too much time on your hands; but I serially work at startups that are racing to market.
Ten minutes spent on a README with some high-level details is an investment with a 100x return for the library's users.
Or just write a makefile and cut out all the bazel build optimization.
They don’t put instructions on how to start a F1 car inside the cockpit, you don’t hop into a fighter jet and look for the push to start easy button, it’s expected that when you’re at that level you bring expertise.
Either they rewrote git history, or it took about two years to get approval to make this repo public.
So probably they just started the project two years ago, had aspiration to open source, and finally just did now. Some teams might publish earlier, some like to wait until it's had enough internal usage to prove it out.
$ cat $(find . -type f | grep -vE LICENSE\|README\|BUILD\|bazel\|git\|docs) | sort -u | wc -l
8360 unique lines scattered across more than 100 files. Good luck deciphering that in a single day!
By the way, the first issue in the repo is a "Request for a more verbose README", which I agree with.
so as i see it you have like three options if you are unhappy with that:
1. close the tab
2. dig into the impl and learn as you go
3. do 2 but also write docs
i just really believe i've covered literally all the cases that any reasonable (not whiney, not entitled) person would concede.
> the first issue in the repo is a "Request for a more verbose README", which I agree with.
posted today - do you think it might have something to do with this post we find ourselves convening on? i.e. no one was so bothered about a lack of docs until now?
edit:
i forgot actually something else you could do: email the author and ask nicely for some tips.
https://github.com/google/graph-mining/blob/main/in_memory/c...
This, too:
https://github.com/google/graph-mining/blob/main/in_memory/s...
Same with most of the other files.
How is it usable? It's usable if you want to find data within lots and lots of data efficiently. That's kinda Google's thing. :-D