Native multi-model can compete with pure document and graph databases

(arangodb.com)

74 points · fceller · 11y ago · 35 comments

rmrfrmrf11y ago· 5 in thread

Pardon my ignorance, but what is a graph database?

phpnode11y ago

It's a database that stores a "graph" of vertices that are connected by edges. If you were, say, creating the next LinkedIn and you wanted to find the shortest path between two users based on their connections, a graph database would be a good choice.

Let's imagine you want to see how Fred is connected to Steve. Their network looks like this:

    [Fred] <-knows-> [Bob]
    [Bob] <-isMarriedTo-> [Sally]
    [Bob] <-knows-> [Alice]
    [Alice] <-workedWith-> [John]
    [John] <-wentToSchoolWith-> [Sandra]
    [Sandra] <-knows-> [Steve]

Diagram: http://yuml.me/6ff3074e

A "traditional" database like MySQL or Mongo makes this kind of query prohibitively expensive and complicated, as it must perform a new join for every connected person in the user's graph.

Graph databases come into their own because they are designed specifically for efficient traversal of these connecting edges. They typically do this by storing "pointers" on each vertex to its connected edges, so while a normal RDBMS requires something like a hash table lookup to resolve a join, a graph database can simply "jump" to the relevant record via a pointer. This means that things like Dijkstra's algorithm [0] can be implemented efficiently.
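
To make the traversal idea concrete, here is a minimal Python sketch over the Fred/Steve network above. It is only an illustration, not how any particular database implements it: the adjacency dict stands in for the per-vertex "pointers", and since the edges carry no weights it uses a plain breadth-first search rather than Dijkstra's algorithm. The names `EDGES`, `build_adjacency` and `shortest_path` are made up for this example.

```python
from collections import deque

# Undirected edges from the example network. The relationship label is kept
# alongside each edge but is not needed to find the shortest path.
EDGES = [
    ("Fred", "Bob", "knows"),
    ("Bob", "Sally", "isMarriedTo"),
    ("Bob", "Alice", "knows"),
    ("Alice", "John", "workedWith"),
    ("John", "Sandra", "wentToSchoolWith"),
    ("Sandra", "Steve", "knows"),
]


def build_adjacency(edges):
    """Map each person to the list of people they are directly connected to."""
    adjacency = {}
    for a, b, _label in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of people on a shortest path, or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            # Walk the predecessor links back to the start.
            path = []
            while person is not None:
                path.append(person)
                person = previous[person]
            return path[::-1]
        for neighbour in adjacency.get(person, []):
            if neighbour not in previous:
                previous[neighbour] = person
                queue.append(neighbour)
    return None


ADJACENCY = build_adjacency(EDGES)
print(" -> ".join(shortest_path(ADJACENCY, "Fred", "Steve")))
```

Each hop is a direct lookup of a vertex's neighbours, which is the step a relational database would have to express as another join.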

However, "traditional" graph databases like Neo4j require everything to be structured in terms of vertices and edges. This is often quite inconvenient, so multi-model databases like ArangoDB integrate the graph approach with a document store as well. The idea is that if you can keep everything in the same db, your app gets a lot simpler, you regain things like ACIDity that you'd normally lose by using 2 separate dbs, and performance should be a lot better too.

[0] http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm

rmrfrmrf11y ago

Thank you for this detailed and informative explanation!

marvel_boy11y ago

Thanx !

mercnet11y ago

Instead of tables, data is stored in a graph as nodes and edges. Both can have stored properties.

Check out http://neo4j.com/ and live examples using it http://gist.neo4j.org/

Xophmeister11y ago

https://en.wikipedia.org/wiki/Graph_database

amelius11y ago· 3 in thread

Why the "native" adjective? Aren't all databases native?

phpnode11y ago

Some people build "graph databases" on top of storage backends that are ill suited for such workloads, e.g. you can build a "graph database" (or K/V store) on top of MySQL, but the performance is terrible - http://java.dzone.com/articles/mysql-vs-neo4j-large-scale

A "native graph database" is one that is actually designed for the task.

fcellerOP11y ago

Basically, for me it has two aspects. First, the storage engine is designed to handle all models natively. Second, you have a common query language which is supported by the database storage engine.

There are different approaches, which are used in other products and which can also work well. For example, you can restrict the database engine to a pure key/value store and add different personalities to it. Or you have a client which implements a common query language for different products.

dexterchief11y ago

I think the idea with "native multi-model" is that Arango was explicitly designed to do k/v, documents and graphs rather than it being something that is bolted on afterwards.

codewithcheese11y ago· 3 in thread

How does OrientDB compare to ArangoDB?

phpnode11y ago

OrientDB fails to deliver on its promises. It has a load of features but they are poorly thought out and/or broken.

ArangoDB is OrientDB done right, but it's a lot younger.

If you're considering using either, you owe it to yourself to investigate whether postgres's Common Table Expressions [0] can do what you want instead. If you can stick with something more mature like postgres, then you'll be saving yourself a lot of pain.

[0] http://www.postgresql.org/docs/9.1/static/queries-with.html
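
For anyone wondering what that looks like, here is a rough sketch of a recursive CTE walking a friends table. To keep it runnable without a server it uses SQLite through Python's standard `sqlite3` module as a stand-in; Postgres documents `WITH RECURSIVE` in the page linked above, but the exact query may need adjusting there. The `knows` table, the `reach` CTE and the `hops_from` helper are made-up names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (a TEXT, b TEXT)")

pairs = [
    ("Fred", "Bob"), ("Bob", "Sally"), ("Bob", "Alice"),
    ("Alice", "John"), ("John", "Sandra"), ("Sandra", "Steve"),
]
# Store every connection in both directions so the walk is undirected.
conn.executemany(
    "INSERT INTO knows VALUES (?, ?)",
    pairs + [(b, a) for a, b in pairs],
)

# The recursive step joins the people reached so far against the edge table.
# The depth limit bounds the recursion, and UNION drops duplicate rows.
QUERY = """
WITH RECURSIVE reach(person, depth) AS (
    SELECT ?, 0
    UNION
    SELECT knows.b, reach.depth + 1
    FROM reach JOIN knows ON knows.a = reach.person
    WHERE reach.depth < ?
)
SELECT person, MIN(depth) AS hops
FROM reach
GROUP BY person
ORDER BY hops, person
"""


def hops_from(start, max_depth):
    """Return (person, fewest hops) for everyone within max_depth of start."""
    return conn.execute(QUERY, (start, max_depth)).fetchall()


for person, hops in hops_from("Fred", 5):
    print(person, hops)
```

Note that every extra level of depth is another pass of the join, which is the cost the graph databases are trying to avoid.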

codewithcheese11y ago

Thanks for the suggestion, the WITH keyword looks interesting.

crudbug11y ago

"ArangoDB is OrientDB done right"

How are you backing this up? I am sure Luca from OrientDB will have some comments.

maxdemarzi11y ago· 3 in thread

Never trust a benchmark.

>>The uncompressed JSON data for the vertices need around 600 MB and the uncompressed JSON data for the edges requires around 1.832 GB.

So why use a 60GB RAM machine for so little data?

Can we get some raw numbers instead of %?

fcellerOP11y ago

Hi, this is Frank from ArangoDB again. We've not added absolute times because they depend strongly on your machines and network. For the setup we used, the baseline (ArangoDB) is as follows: shortest path 0.5 sec, neighbors 0.15 sec, single read 26 sec, single write 27 sec, aggregations 1.4 sec, memory 12.5 GByte. So a 16 GByte machine would have been enough, but we did not know that beforehand, so we selected the biggest machine from GCE available to us (thanks to Google for giving us credits to use the machines for free).

I agree: never trust a benchmark. It really all depends on your use case. If you have ideas for improvements, we would love to hear about them. Also, if you have any idea how to improve the MongoDB or Neo4j queries, please check out GitHub and let us know.

nosideeffects11y ago

I don't understand the % either. They state it is a graph of _throughput_, with higher percentages from baseline being less throughput? If I hadn't read the backwards description I would've concluded that their DB is really slow on most fronts.

ifcologne11y ago

Thanks for the hint, the description was cut out. I've updated the chart right now. (Ingo from ArangoDB)

amirouche11y ago· 2 in thread

This does not explain why arangodb is faster. Stating that «Native multi-model» is a killer feature would be more interesting if it was explained what it means outside of «arangodb=graph+k/v+document stores». What is the difference between a graph vertex without edge and a document at the storage level? arangodb is faster than wiredtiger? suspicious.

fcellerOP11y ago

Hi amirouche, I'm Frank, CTO of ArangoDB. Claudius's benchmark looked at queries occurring in a typical social network project. The tests show that WiredTiger is indeed a bit faster for reads and writes. Neighbors of neighbors is typically a question you would ask a graph database, not a document store. Therefore, you would set up two databases and ask MongoDB the document questions and Neo4j the graph questions. If you use a native multi-model approach, you only need to set up and maintain one database. The response times, for example for reads and shortest paths, are comparable to the specialized solutions.

For the technical difference at storage level: the graph and document models are in my opinion a perfect match, because a vertex (and an edge for that matter) can be stored as an ordinary document. This allows you to use any document query you have (give me all users who live in Denver) and start graph queries from the vertices found in this manner (give me their level 1 and 2 friends).
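
As a rough illustration of that combination (plain Python, not ArangoDB's storage format or query language), the sketch below keeps vertices and edges as ordinary dict "documents", runs a document-style filter first, and then starts a depth-limited friend traversal from the matches. The sample users, the `from`/`to` field names and both helper functions are assumptions made up for the example.

```python
# Vertices are ordinary documents keyed by a user id (sample data, made up).
users = {
    "alice": {"name": "Alice", "city": "Denver"},
    "bob": {"name": "Bob", "city": "Boston"},
    "carol": {"name": "Carol", "city": "Denver"},
    "dave": {"name": "Dave", "city": "Austin"},
    "erin": {"name": "Erin", "city": "Boston"},
}

# Edges are documents too; "from" and "to" are hypothetical field names.
friend_edges = [
    {"from": "alice", "to": "bob"},
    {"from": "bob", "to": "dave"},
    {"from": "carol", "to": "erin"},
]


def users_in(city):
    """Document query: ids of all users whose city matches."""
    return sorted(key for key, doc in users.items() if doc["city"] == city)


def friends_within(start, max_depth):
    """Graph query: ids reachable from start in at most max_depth friend hops."""
    seen = {start}
    frontier = {start}
    for _ in range(max_depth):
        next_frontier = set()
        for edge in friend_edges:
            # Treat friendship as undirected by checking both orientations.
            for a, b in ((edge["from"], edge["to"]), (edge["to"], edge["from"])):
                if a in frontier and b not in seen:
                    next_frontier.add(b)
        seen |= next_frontier
        frontier = next_frontier
    return sorted(seen - {start})


# Document query first, then a graph query from each vertex it found.
for user_id in users_in("Denver"):
    print(user_id, friends_within(user_id, 2))
```

The point is only that both queries read the same documents, so no second database is involved.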

phpnode11y ago

> What is the difference between a graph vertex without edge and a document at the storage level?

Nothing. Most multi-model dbs store vertices (and edges) as documents.

arthursilva11y ago· 2 in thread

There should definitely be a bigger dataset version; the test data is a very small fraction of the available memory.

dexterchief11y ago

That would be cool. I'm not sure it would change much though. I know someone working with search data who recently tried out Neo4j with a test data set of 500,000,000 nodes and apparently was really disappointed with the results.

I'm not sure that graph data (generally) is particularly amenable to being spread across multiple nodes. My understanding is that ArangoDB has implemented some clustering based on Google's Pregel framework, so I suspect it might fare a bit better in my friend's test... but in spite of my urging I don't know that he has had time to recreate the test with Arango. I'm keeping my fingers crossed.

I don't know if any database is fun to deal with at that size. My experience with Arango has been an unremarkable amount of remarkably complex data, so I would also be interested to see the results with something huge.

jexp11y ago

I'd love to hear from your friend and his experience with Neo4j, to see how we can make it easier / better to configure it correctly for the data volume.

crudbug11y ago· 1 in thread

Why was OrientDB left out? I would love to see the comparison.

dmarcelino11y ago

They've added a graph including OrientDB at https://www.arangodb.com/2015/06/performance-comparison-betw...

dexterchief11y ago

I've been using ArangoDB for a year now and I think they are definitely on to something cool.

Having stumbled upon some really complex data a few times now, I am increasingly appreciating how amazing it is to model your data any way you need, without having to deal with the complexity of running multiple data stores.

Cool to see that I apparently didn't give up any performance to get the flexibility. :)

I'd love to see them push the geospatial capabilities a little further, but they are already pretty decent.

jsteemann11y ago

Very interesting! Would like to see a comparison with relational datastores (e.g. Postgres), too.
