So I am considering a workaround to use a graph structure, but represent the underlying data in a reliable rdbms, in particular postgres.
The approach is: 1. Make two parent tables: node and edge. 2. Make separate tables for all "objects" (like people, places, things, etc.) which would be inherited from node, so they would all have a unique node ID. 3. Make separate tables inherited from edge which would be used to represent the many many relations. So each relation has an edge ID, and each table inherited from edge can be thought of as representing a specific kind of relation (like person lives in place, or person is friends with person).
One thing I have observed so far, is a large number of tables with few columns, but I think that lends itself to the advantage of easy indexing. There can be a large number of individual queries from the front end, but I believe I can use views to make represent tables with more comprehensive info to reduce the number of queries.
What do you guys think of an approach like this? What am I missing, what is wrong with it? I haven't come across this previously, and so am a bit nervous about the ramifications. Is someone else also doing this?
The issue is querying the data, specifically when that involves walking the graph.
The real question then, is how do you want to query the data? If you are willing to make an assumption now that adds a tight constraint on your future work... i.e. "We will only ever perform queries that are at most 1 or 2 hops from the nodes/edges being queried" then perhaps PostgreSQL will work for you.
For shallow or narrow queries of the graph this is the way to go.
But for deep and broad queries, if you want to perform some valuable analysis that involves traversing the graph some non-trivial distance from the point of origin, that's where PostgreSQL is going to let you down and you would probably be better off picking a more appropriate tool.
Same question though: How do you want to query your graph, and are you willing to limit yourself to not querying it in a certain way?
Implementation wise, it is not that different from a query with many joins anyway - which is a standard problem, and is addressed by materialized views as well as indexing. So as long as the complex queries don't need to be ad hoc and real time, I can prepare for them using materialized views.
However you are quite right in that complex, ad hoc, and realtime graph queries Will be a pain point. That is a risk I am prepared to tolerate in the short term. The thing is, I also need transactional integrity, so I am stuck with an rdbms as the primary solution. In the long term, I anticipate the need for both graph queries and transactions - thus, I think it is better to start with postgres and incorporate a graphdb as the need arises than the other way round.
Hopefully this line of reasoning make sense? I am really not experienced at this, so am trying to not bump into too many walls.
Implementing a graph on top of a RDBMS is trivial, and if the semantics are correct (the same as the ones exposed by a graph db?), then I'm not sure why would people want to use a proper graph db.
I thought that probably it'd be an issue of performance: the "right tool for the job" that "does one thing and does it well" probably is leaner and more efficient. After all, unlike a trivial implementation, getting a graph on a RDBMS to perform well might not be that simple after all (still, your idea of inheriting tables might make things more flexible and maybe more efficient)
But then, when looking up some Neo4j benchmarks, the numbers seems to not be good at all:
http://baach.de/Members/jhb/neo4j-performance-compared-to-my...
I'd like to hear from someone that used Neo4j (also, other graph databases are interesting) in production, and benchmarked it against a RDBMS prototype, finding the former as the better solution of the two.
A proper graph db is almost mandatory if there are complex, ad hoc queries that need to be made on real time data.
Using an RDBMS is great if the graph query types are going to be known in advance - so they can be prepared for using materialized views and indexes, and if they aren't too complex - so one doesn't descend into JOIN-hell.
But most applications aren't like that, and not all applications can be completely satisfied using only a graphdb. Hence the rise of the new multimodel databases like OrientDB and ArangoDB. So I think it is a question of what risk one is prepared to tolerate.
https://github.com/jhb/neo4j-experiements/blob/master/query_...
There's no way that a single HTTP request is responsible for a difference in hundreds of seconds
(his benchmark code isn't really sound, but I don't see how this can affect a difference so striking)
You could test out 3 or 4 different SPARQL solutions in the time that it would take you to develop something graph like on your own.
On the other hand, cutting edge approaches, actually take a graph representation of data and lay it out in a relational manner. http://ercim-news.ercim.eu/en96/special/monetdb-rdf-discover... giving the best of both worlds.
In short you can build something yourself. But don't expect that it will be better than something build and supported by someone else.
So investigate the competition: BlazeGraph, Virtuoso, StarDog, Oracle 12c EE/semnet, DB9, Dydra before deciding to build your own. Building your own because its fun to do is great, but unless it pays your bills not a good idea for production environment.
PS. The edge table (EAV) is the major problem, it leads to a lot of self joins and difficult exercises for the query planner.
You can improve a lot on this if you can put "different" edges into different tables or partitions.
Triple stores with support for JSON-LD framing such as Dydra can also make it easier to have a front-end on top of your DB without extensive middle layer code.
A store like StarDog and BlazeGraph on the other hand gives you a lot of flexibility by both supporting SPARQL and TinkerPop. (both cluster and scale out, although BlazeGraph has GPL option. StarDog is only Commercial)
> Node <- Entities
> Edge <- M2M
which is strange since an edge is already a M2M.
Thanks for the monetdb link.
That would be better than one big table in performance, which was a major problem in the RDF on SQL databases.
Of course those accepted any graph, if one constrains the number of possible predicates/relations then this solution could more efficient.
My understanding is that big5 and others do not use graphdb for all their stuff since they are not as good as rdbms to do queries with a single hop or JOIN. They embrace microservices and maintain a graphdb (in memory, persistent or distributed) to answer domain specific queries. That approach is similar to your schema except that graph queries run on a single node without superfluous network roundtrips.
It would be nice to be able to use a single database for all data related stuff to have atomic writes and simpler architecture. That's what multi-model databases are tackling have a look at OrientDB and ArangoDB [1].
Also, Tinkerpop, already mentioned in the thread, is a ready-to-scale graphdb with much love I recommend you have a look at Tales of the Tinkerpop [2].
Multi model db's are a good idea indeed. They were the first things I had considered, and extensively so - both OrientDB and ArangoDB. I came back to traditional solutions simply for practical reasons of maturity, and ease of finding professional support.
Someone here mentioned about a Tinkerpop implementation that uses Postgres as the storage engine, I think that might offer the best of both worlds, but I am yet to look into it.
If you have multi GB graphs and need to run intensive classic graph analysis algorithms then a lot of that can come optimised out of the box with a dedicated graph db. They will still leave you with plenty of hard problems.
However, if you have small graphs or only need conventional crud operations then using a new and unfamiliar graph db would probably be insane.
If your case is the latter and postgres is your comfort zone and you can get a working solution quickly, I would be tempted to prototype along the lines you suggest. When you start getting performance problems or need graph analysis, you can then make a better decision whether a graphdb or restructuring your relational db will solve it the best.
I definitely need acid transactions, and also some graph capability - which may or may not grow in the future. More pertinently, the kind of representation I was attempting led to a very normalized design anyway, and I think it also lends itself to all but the most sophisticated of graph functionalities! But I am only just starting, so hopefully it works out well.
* Trees in the database - Slides: http://www.slideshare.net/quipo/trees-in-the-database-advanc...
* Graphs in the database - Slides : http://www.slideshare.net/quipo/rdbms-in-the-social-networks...
I would challenge the idea that you need these supertables, though. What those allow you to do is to write graph queries which are generic (polymorphic?) over multiple kinds of node and edge. So as well as "find me all the people who are friends with my friends", you can write "find me all the people who are friends with my friends, or who have been to a place i've been, or who own a thing i own" without using a union. Do you actually need to write queries like that?
There are two downsides to the supertables. Firstly, more complexity, although it's a minor, or at least constant-factor, amount. Secondly, a loss of type safety. If your edges are defined in a supertable, then the columns which point to the ends of the edge have to be foreign keys to the node supertable. That means they can be any type of node; there's no way to constrain particular kinds of edge to connecting particular kinds of node. That seems like a considerable drawback to me.
The generic kind of queries you mention: actually, yes, I think I will need them.
Your second point on downsides is something I have/had to think about, thanks for raising it. I had some vague premonitions on those lines, but you helped make it concrete. The problem can be avoided by not making meaningless connections, but that's not a real solution. It won't be a preventive, but it could help to audit relationships - by having a script list out which tables FKs originate from.
Since I started writing this post, I have had two ideas on how to address this, it'd be great to have your opinion on them - I'm thinking out loud here.
1. Use referential integrity. Since each relationship/edge will be in its own inherited table, one could impose a constraint that each column_in_the_specific_edge_table REFERENCES another_column_in_a_specific_node_table. Like the columns in the person_lives_in_place relationship table must reference columns in the person table and the place table - each of which is also its own inherited node table.
2. More convoluted and a bit cumbersome, but if the previous/simpler approach is insufficient, one could create data types corresponding to each node type. And impose that as a type constraint on the edge tables.
But maybe the 1st option could actually work...what do you think?
For short graph traversals, recursive queries and joins perform surprisingly well, even when compared against dedicated graph databases -- at least, the open source ones I've been able to try, including neo4j. The ones I've tried perform better than Postgres mostly when you get more than two hops away -- but then at least Neo and the MySQL thing tend to suffer if the data set doesn't fit in memory.
I had at least one project which reverted from a dedicated graph database to Postgres specifically because it needed decent on-disk performance (2TB per shard).
Anyway, to your question: you could build a PostgreSQL database like that, but why would you? If you have distinct types of entities (person vs. furniture vs. town), why would you denormalize them into generic node interitance? Having a completely abstracted database with one big table called "things" and another big table called "relationships" seems really attractive before you actually do it. Then it starts to suck.
Relational databases are designed to perform well, and to be easy to query, with "normalized" databases in which different types of entities are represented by different tables and relationships are defined by foreign keys. The further you get away from that model, the harder the database gets to use.
This is one of the problems with using graph concepts for questions which are not, fundamentally, graphs. Yes, you can define people's nationality as a graph from location to nation, but that makes it pretty hard to do voter registration stats. Graphs are a great tool for answering many important questions, but they are not a great tool for answering all questions.
Now, if you are answering an actual graphing question (i.e. show me everyone who shares something in common with me via at least two links within 3 hops), then proceed. But figure out what kinds of questions you specifically want to answer before you get too far.
The list of Postgres-basd extensions mentioned in this thread turned out quite relevant actually! I'm still going through them all, seeing if I can reuse something..
Many thanks for the inputs, quite helpful.
* RDFLib Store for PostgreSQL ( https://github.com/RDFLib/rdflib-sqlalchemy )
* Sparqlify is a SPARQL-SQL rewriter that enables one to define RDF views on relational databases and query them with SPARQL. ( PostgreSQL supported http://aksw.org/Projects/Sparqlify.html )
* PostgreSQL-Neo4j Foreign Data Wrappers https://github.com/nuko-yokohama/neo4j_fdw
* PgRouting (network ) http://docs.pgrouting.org/dev/doc/src/tutorial/analytics.htm...
The nice thing about your approach is that you can shard your objects by their id and your edges by their "from" id, and have all lookups go to one box when you shard. Throw in some caching and it scales to multiple boxes really well.
Actually, Facebook's TAO was an early inspiration for me as well, but it is really customized and complex too, so probably not something for a tiny team. But the concepts from TAO are applicable to many similar scenarios.
I didn't use table inheritance, and I don't distinguish relationships at the query stage, so if that's core to your method you should think about it. I assume with that setup it'd be easier to take advantage of partial indexes though.
Recursive queries work but if the intermediate tables gets large it explodes fast. You can't use aggregates in the recursive function either (e.g. to count number of leaf nodes in a tree), you have to apply them at the end, in which case the intermediate table has to be large...
I've had decent experience so far with GIN indexes and @> operators on an ARRAY[] column adjacency list. In my case I stored "ancestors" so I could select by object id and get all ancestors, or use `ancestors @> ARRAY[parent_id]` to select all object ids as descendants.
Of course, if you don't need any "self" relationships, then none of the really complicated stuff matters...
Indeed, one of the reasons behind inheriting tables was to use partial indexes - which should help with performance. Another was ease scaling out, if needed.
Using the @> operators on array columns is something I also looked into, mainly for materialized paths - I expect them to remain static, so no expensive updates to the paths/arrays. But actually, I don't think I will need many self relationships - that was the third reason for using inherited tables, so I could split entities into separate groups.
Of course, I have no idea how any of it will work out, since this was just a semi-serious line of inquiry initially, and everything was conceptual so far, except the feedback of experienced people such as yourself, on this thread - which has led me to pursue this seriously and actually try to build it out. After reading the responses on this thread, I now think it will be worth the effort.
It is trivial to implement these APIs on top of simple SQL. I did that in a few days in both Python and go.
* usually one single query to fetch an entire tree or subtree * more work on reconstructing tree from query result * requires prefix pattern matching operator, e.g. "like" from RDBMS
By far the most productive thing I did. People will say that you need recursive queries, etc, but I found most of the time just keeping the query set in the application layer was easy enough and super fast. Scaling it was easier too because there are lots of articles on scaling the MySQL / Postgres.
For traversals, it tries its level best to let Postgres create the appropriate JOINs, which Postgres does well... most of the time anyway. In doing this, I learned really fast how many weird semantic corners of bog-standard SQL are, and the optimization choices Postgres makes based on the way you formulate the query.
Ping me on that PR! And/or join #cayley on Freenode
I too am interested in understanding the downsides of a gdb (seeing some refs to performance but nothing concrete).
So unless the solution you're building has a specific need for graph queries - in particular, complex, realtime and ad hoc graph queries, a traditional solution is a safer bet.
You can make v1 and v2 two separate nodes, if you need information of both of them. In this case you just link each version to its attributes. If you want the "latest" marked separately, you can have three nodes - one for "current", and the other two for the individual iterations. So initially "current" would link to a,b,c and later link x,y,z, and remove a. Or, you can have two nodes - current and previous iteration, and store the version number of the latest iteration as a field within the "current" node.
Each row in a triple store stores a relationship between 2 objects, with unique IDs for the subject, the relationship type, and the object. The IDs could refer to other tables that store entity and relationship details.
There are specialised triple store DBMSs, but implementing one in PostgreSQL should be pretty easy.
The obvious first question would be: why are you representing your data as a graph? Can you represent it better as sets of predicates (ie: relations)?
Using views should not reduce the number of queries: a view is just a query with a name. If you can do it by combining views you can do it by combining queries.
I do however recommend exploring and even building your own graph database, it is a great learning exercise. But don't fool yourself by doing it on top of Postgres, actually jump in and become immersed. I built http://gunDB.io/ to scratch a similar itch, I wanted an easy graph database with realtime updates like Firebase. I would love to hear your feedback, or even have you contribute.