Show HN: We have built a benchmark platform for graph databases

(memgraph.com)

83 pointsmapleeman3y ago45 comments

mapleemanOP3y ago· 6 in thread

Hi everyone! I’m one of the co-authors of BenchGraph[1], a platform for Graph Database Performance Benchmarks. Our platform shows the results of running benchmark tests (via mgBench) on supported vendors. It shows the overall performance of each system relative to others.

Inspiration came from ClickBench, a Benchmark For Analytical DBMS.

We previously developed mgBench as in-house testing infrastructure to benchmark Memgraph, and now we are adapting it to support other graph database vendors. To test graph database performance, mgBench executes Cypher queries on a given dataset. The queries are general and represent a typical workload that would be used to analyse any graph dataset. Running the benchmark is automated, and the code used to run it is publicly available. You can run mgBench yourself to validate the results on the BenchGraph platform. The methodology is explained in detail in the GitHub repo [2].
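To make the idea of "run a fixed set of queries and measure them" concrete, here is a minimal sketch of a timing loop. It is not mgBench's code: `run_benchmark`, its parameters, and the metric names are all hypothetical, and the `execute` callable stands in for whatever client would send a Cypher query to a database.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_benchmark(execute: Callable[[str], object],
                  queries: List[str],
                  warmup: int = 1,
                  iterations: int = 5) -> Dict[str, Dict[str, float]]:
    """Time each query using a caller-supplied `execute` function.

    `execute` is an assumption: any callable that runs one query string
    against some database and blocks until the result is available.
    """
    results: Dict[str, Dict[str, float]] = {}
    for query in queries:
        # Warm-up runs fill caches and are deliberately not timed.
        for _ in range(warmup):
            execute(query)

        latencies = []
        for _ in range(iterations):
            start = time.perf_counter()
            execute(query)
            latencies.append(time.perf_counter() - start)

        total = sum(latencies)
        results[query] = {
            "mean_latency_s": statistics.mean(latencies),
            "max_latency_s": max(latencies),
            # Guard against a zero total on very coarse clocks.
            "throughput_qps": iterations / total if total > 0 else float("inf"),
        }
    return results
```

Passing a different `execute` per vendor is one way to keep the timing logic identical across systems; a real harness would also need isolation between runs, which this sketch ignores.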

As you can see, at the moment, we have two vendors on the platform. We would like to add more vendors to our platform. If you want, feel free to contribute.

Let me know if you have any questions or suggestions.

[1] https://memgraph.com/benchgraph [2] https://github.com/memgraph/memgraph/tree/master/tests/mgben...

jandrewrogers3y ago

This is nice but I have a few comments:

Measuring peak memory will be nonsensical for some implementations. Some databases do minimal dynamic allocation for performance reasons. Some will be paging to storage, which can work well for graph databases with an appropriate I/O scheduler design. This benchmark seems to assume all graph databases are in-memory and doing dynamic memory allocation.

The test data models are tiny. Even the “large” test data model falls below the noise floor of scalable graph database architectures. This has the implication that good results overfit for graph databases that scale poorly. You need something closer to a billion edges to really exercise and differentiate the performance characteristics of graph databases, and in realistic applications that still isn’t a particularly large graph (maybe medium-sized?).

It would also be useful to benchmark how long it takes to load and prepare the data. This is important operationally and, for many graph databases, unreasonably slow. Graph databases tend to skip over this part when talking about performance.

mapleemanOP3y ago

Thanks for all the comments and input. I have added the suggestions that we plan to implement, both memory usage tracking and precise data on load times, to this issue: https://github.com/memgraph/memgraph/issues/689.

Regarding scale, we are aware of the issue, listed in limitations: https://github.com/memgraph/memgraph/tree/master/tests/mgben.... Next versions will probably have a billion nodes/relationships.

Actually, Neo4j is particularly slow on writes. Import/load times were 50x faster on Memgraph, but we didn't show it. We will do it in the next version for all vendors.

Hey mapleeman, why is the benchmark infrastructure not language agnostic? Wouldn't it be more desirable to keep the infra as a scheduler and verifier that can take input in any language and process it while collecting the useful metrics? Restricting it to Cypher leaves a bunch of DBs unable to run this.

samsquire3y ago

That's a lot of work and would be a large project in itself to separate the test driver to different query formats and have it drivable from multiple programming languages.

I implemented a toy Cypher database (samsquire/hash-db) and I just use a Python test script. I have yet to benchmark it; the performance is probably poor.

I tried running standardised SQL benchmarks against MySQL but the benchmark code fell behind the MySQL client and it's work to maintain it.

I inherited a Jepsen suite to test ActiveMQ and it wasn't easy to understand

Testing can be a full time job!

I agree. Designing a generic test runner that reads workloads from raw arbitrary SQL files and then executes them against XYZ database backend would have made it much easier to add new workloads and extend the supported database backends. I have written a few such frameworks, so I know for sure it's not a big deal. Sysbench does this through Lua, which is also OK if you need more advanced scripting capabilities in the workload itself.
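A rough sketch of what such a generic runner could look like, assuming the backend exposes a Python DB-API 2.0 connection. The names `split_statements` and `run_workload` are made up for illustration, the statement splitting is deliberately naive, and sqlite3 from the standard library is only a stand-in backend so the example runs anywhere.

```python
import sqlite3
import time
from typing import List, Tuple


def split_statements(script: str) -> List[str]:
    """Naively split a SQL script on ';' (breaks on ';' inside string literals)."""
    return [part.strip() for part in script.split(";") if part.strip()]


def run_workload(connection, script: str) -> List[Tuple[str, float, int]]:
    """Run each statement through a DB-API 2.0 connection and time it.

    Returns (statement, elapsed_seconds, row_count) per statement.
    """
    timings = []
    cursor = connection.cursor()
    for statement in split_statements(script):
        start = time.perf_counter()
        cursor.execute(statement)
        # cursor.description is None for statements that return no rows.
        rows = cursor.fetchall() if cursor.description else []
        elapsed = time.perf_counter() - start
        timings.append((statement, elapsed, len(rows)))
    connection.commit()
    return timings


if __name__ == "__main__":
    workload = """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO users (name) VALUES ('ana'), ('bo');
        SELECT * FROM users;
    """
    for stmt, secs, n in run_workload(sqlite3.connect(":memory:"), workload):
        print(f"{secs * 1e3:8.3f} ms  {n:3d} rows  {stmt}")
```

In practice the workload text would be read from a file, and any other DB-API driver could be passed in place of the sqlite3 connection, which is what keeps the runner itself backend-agnostic.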

mapleemanOP3y ago

Hey, thanks for the comment. You are 100% right; this is just the initial version. Since we are compatible with Neo4j, it was the least effort to start there. Making it language agnostic will take a bit of time. If you peek at the methodology and future plans: https://github.com/memgraph/memgraph/tree/master/tests/mgben... you will see that we plan to add more database vendors and make it language agnostic. We are also keeping track of all comments regarding this. I have opened an issue: https://github.com/memgraph/memgraph/issues/689, and there is a language-agnostic note in there. Any other input would be highly appreciated.

didgetmaster3y ago· 5 in thread

The test results are impressive (assuming that they are not just cherry-picked) against the competition; but I seriously wonder if anyone really cares. People will tell you all day that performance is super important, but in my experience they rarely 'put their money where their mouth is'.

For example, I created a new kind of relational database engine that also has some very favorable performance characteristics when compared with popular engines like SQLite, MySQL, PostgreSQL, and SQL Server. I have posted some videos (https://www.youtube.com/watch?v=Va5ZqfwQXWI and https://www.youtube.com/watch?v=OVICKCkWMZE) and done a number of live demonstrations of my system showing how it is many times faster with both transactional and analytic operations across a wide variety of queries.

I certainly didn't expect mass adoption where everyone decided to dump their old system and move to mine (it is still missing some important features). But in a world where even a 50% increase in speed should be seen as significant; the reaction has instead been crickets! You would think that showing a 10x improvement at anything would result in some serious inquiry by at least a few hundred people. Instead, I seriously wonder if anyone cares about speed anymore.

fshr3y ago

Do you have an open source repo? Commercial licensing? Is there a business / pricing model? Do you have a roadmap for feature parity with other solutions? Do you have a public list of the missing features? Is there written documentation? Can the db server be interacted with aside from the Browser app?

I don't mean to downplay the potential performance benefits you've shown, but you're up against other databases that have been around for 22-33 years each. You have a bus factor of seemingly 1. You have a specific way of communicating that you've decided on that is unlike any other popular solution out there. I wouldn't assume speed is the primary factor.

didgetmaster3y ago

Didgets is currently in beta with a free download available from the website https://www.Didgets.com so anyone can try it out with their own data set. I haven't decided yet what parts I am going to open source. The engine has an API so many other programs besides the browser app could interact with it.

It is a little startup with a tiny 'bus factor' so like I said, I don't expect everyone to just jump on board; but I expected to get a little more attention from the 'curiosity crowd' who just might wonder why it is so fast. It needs more resources to support other platforms (the code is cross-platform, but I have only tested on Windows and Linux so far) and languages.

Again, the browser is a proof of concept app that demonstrates that the underlying engine is not just blowing smoke. If it can't attract some interested customers and/or investors, then it will be a dead-end project just like every other promising technology that failed to get a hold in the market.

mapleemanOP3y ago

You are 100% right about this. Performance is not the only factor, especially in an established space such as relational databases. There are a lot of things to consider before moving to or picking a database.

mapleemanOP3y ago

So these results were not cherry-picked since the same queries were run previously in our CI infrastructure. Yep, as you have said, performance is just one of the things that matter. A lot of things matter when picking a vendor, some are mentioned in the comments.

What is specific to the graph space is that it is still early days compared to the relational database space. This means performance differences are big and play a more important role.

tacone3y ago

Performance in graph databases is critical, and would need improvements of several orders of magnitude, rather than tens of percentage points. Many algorithms are impractical on real-world graphs because they are often O(n^2) or worse, and in practice we have to resort to approximate algorithms instead (like PageRank and the like).
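As an illustration of the kind of iterative approximation mentioned here, below is a minimal sketch of PageRank computed by power iteration. Each pass touches every edge once, so the cost per iteration is linear in the number of edges rather than quadratic in the number of nodes. The function name, the adjacency-dict input format, and the defaults (damping 0.85, 50 iterations) are assumptions chosen for the example, not anything taken from Memgraph or Neo4j.

```python
from typing import Dict, List


def pagerank(out_links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Approximate PageRank with a fixed number of power-iteration passes."""
    # Collect every node, including ones that only appear as targets.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        # One pass over the edges: each node shares its rank with its targets.
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

The result is only as accurate as the number of iterations allows, which is the trade-off being described: a bounded amount of work per pass in exchange for an approximate answer.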

rozularen3y ago· 2 in thread

I've been working with Neo4j daily for the past 3+ years, and at first I was surprised by the comparison until I discovered that Memgraph works in memory, in contrast to Neo4j, which works on disk. As other commenters have already said, I'd make that clear on your page to make it fair.

mbuda3y ago

Fair point, but it's not only that. There are many differences, and it's mostly impossible to put 2 different systems in a fair config state. It's more about how different systems operate with certain configs / in certain environments. E.g. apparently Neo4j limits the number of concurrent execution threads in the community edition, which means it's not possible to fully stress the system... stuff like that :)

mapleemanOP3y ago

Thanks for the input. It is a fair point. It is hard to create a benchmark that is universally fair, but both systems serve the same or a similar purpose. On top of that, Neo4j also loads a bunch of stuff into RAM, consuming more memory than Memgraph, so if your dataset fits in RAM, it becomes quite a fair comparison.

alexchantavy3y ago· 2 in thread

Looked into your product and saw this: https://memgraph.com/blog/neo4j-vs-memgraph

> There is a broad number of drivers in many different programming languages available for both solutions. While Memgraph only maintains a few in-house drivers that it develops and supports (C, C++, Python, Rust), most Neo4j drivers can also be used with Memgraph. This is due to the fact that both solutions use the Bolt protocol, labeled property graph model and Cypher query language.

Can I really create a Neo4j driver object from the official Neo4j Python driver and point it at Memgraph and my app code will work as expected?

mapleemanOP3y ago

Yep, that is possible. Actually, one of the community members did this a few days ago: https://gigi.nullneuron.net/gigilabs/using-the-neo4j-bolt-dr... You just need to tell the driver that it is not talking to Neo4j.

That covers the case where you want to continue using the Neo4j driver. We also have GQLAlchemy, which is an ORM for Python. Take a look here: https://memgraph.com/gqlalchemy

alexchantavy3y ago

Thanks for the helpful links!

The ORMs we’ve experimented with tend to be too opinionated in their structure or have leaky abstractions or don’t allow configuring important settings like connection timeouts or don’t handle things like batch writes or transaction control correctly. For my use case it’s much better to simply drop in a bolt driver and watch it go.

rkwz3y ago· 2 in thread

The numbers look impressive compared to Neo4j!

Do you know what is the reason for these results? Is it because of C++ (MemGraph) vs Java (Neo4j)?

Any other queries where MemGraph does poorly compared to Neo4j? What are the downsides to using MemGraph?

mapleemanOP3y ago

So is it because of C++ vs Java? Well, not everything can be attributed to just that difference; many things influence the results: architecture, database isolation level, etc. The C++ vs Java argument is definitely one of the reasons. Take a look at the memory usage, where you can see how memory hungry the JVM is. It also takes time to warm up the JVM and Neo4j, which is another penalty for the same reason.

So far, on this dataset and scale, we didn't encounter any, but we have plans for a bigger dataset and more complex queries. You can take a look at the limitations part of this benchmark.

As for the downsides: Memgraph and Neo4j are currently somewhat different systems. Memgraph is an in-memory database, while Neo4j is on disk. So in Memgraph's case, you use RAM exclusively as storage but gain speed; we have snapshots on disk for recovery, etc. Neo4j is on disk rather than in memory, but it loads a lot of data into RAM and consumes vast amounts of it, so it is hard to draw a clean distinction.

The memory/disk is actually a huge difference for durability and a really important factor when making a decision to use a database. I believe it would be good to mention it in the benchmark for fair comparison.

jsumrall3y ago· 1 in thread

This reads like marketing material for your product rather than an unbiased comparison.

Being in memory vs Neo4j doesn’t even seem like a fair comparison. I would hope to god your product is faster for that reason. But it’s like comparing a minivan vs an electric smart car. One is quick and fast but it has its limitations, and the other option is more versatile.

Also I think you’re limiting yourself if the positioning of your product is purely as a counterpart to Neo4j, and riding on their coattails.

Glad to see some thoughts about benchmarking, but every vendor makes such a thing and, surprise surprise, they’re always better than the rest.

mapleemanOP3y ago

We plan to add more graph database vendors to this benchmark, so this will not be just a Memgraph vs Neo4j comparison, hence the name "bench graph". You are 100% right that it is hard to compare architecturally different database systems, but they serve the same or similar purposes. We mention that in the limitations part of the methodology: https://github.com/memgraph/memgraph/tree/master/tests/mgben... We added Neo4j first since we are compatible with each other.

habibur3y ago· 1 in thread

Drop in a sqlite database in there for comparison and we get a better grasp of the magnitude of the numbers.

Of course sqlite isn't a graph database. But you can make it work like one by writing some query joins.
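For anyone wondering what "making it work like one" might look like, here is a small sketch using Python's built-in sqlite3 module. The `nodes`/`edges` schema, the sample data, and the helper names are invented for illustration; a two-hop traversal becomes a self-join on the edge table, and a variable-length traversal becomes a recursive CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    CREATE INDEX edges_src ON edges (src);
    INSERT INTO nodes VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4), (1, 3);
""")


def two_hop_neighbours(conn, start_id):
    """Names of nodes exactly two directed hops away: one self-join per hop."""
    rows = conn.execute("""
        SELECT DISTINCT n.name
        FROM edges AS e1
        JOIN edges AS e2 ON e2.src = e1.dst
        JOIN nodes AS n ON n.id = e2.dst
        WHERE e1.src = ?
        ORDER BY n.name
    """, (start_id,)).fetchall()
    return [name for (name,) in rows]


def reachable(conn, start_id):
    """Names of all nodes reachable via one or more hops (recursive CTE)."""
    rows = conn.execute("""
        WITH RECURSIVE reach(id) AS (
            SELECT dst FROM edges WHERE src = ?
            UNION
            SELECT e.dst FROM edges AS e JOIN reach AS r ON e.src = r.id
        )
        SELECT n.name FROM reach JOIN nodes AS n ON n.id = reach.id
        ORDER BY n.name
    """, (start_id,)).fetchall()
    return [name for (name,) in rows]
```

With the sample data, `two_hop_neighbours(conn, 1)` returns `['c', 'd']` and `reachable(conn, 1)` returns `['b', 'c', 'd']`. Each extra fixed hop costs another join, which is the overhead a relational reference point in the benchmark would make visible.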

mapleemanOP3y ago

This is a great idea, with one relational database as a reference point for every query on the graph database. I added it to the backlog: https://github.com/memgraph/memgraph/issues/689 Thanks for the idea! :D

wiradikusuma3y ago· 1 in thread

Looking forward to a comparison with Dgraph ( https://dgraph.io/ ). I have mentioned Dgraph in other, older posts. I'm not a shill, just a Dgraph user who's looking for an alternative.

mapleemanOP3y ago

Added Dgraph to the reported ideas/request for the next benchmark version: https://github.com/memgraph/memgraph/issues/689 :D.

What are you doing with Dgraph, and what are the requirements for your use case? That is, if you can share more info.

jcuenod3y ago· 1 in thread

Is there a way to show the queries in the overview? I found them at https://github.com/memgraph/memgraph/tree/master/tests/mgben... but it's not obvious...

mapleemanOP3y ago

Glad you found them. Thanks for the tip!

AtNightWeCode3y ago· 1 in thread

Never seen good performance from Neo. But if you scale it beyond what can be held in memory, you are supposed to still get fast results when you explore areas of a graph. Aggregate queries are not really a valid use case for a graph DB.

mapleemanOP3y ago

Yep, you are right. We will expand the quantity, complexity, and variety of queries in future versions.

jcuenod3y ago· 1 in thread

Why has the docker image for `memgraph/memgraph-platform` gone from 501.04 MB a year ago to 2.99 GB today?

mbuda3y ago

There are a few reasons:
* the platform has all the software included (memgraph, lab, mage), and most of the image size is actually mage deps
* when we started building for ARM, the image also grew in size
We are still figuring out how to optimize.

If image size is a problem, there are other images that do not include everything, available here: https://hub.docker.com/r/memgraph/memgraph-platform/tags. Probably not everything included in the platform is actually required for a given use case :D

senda3y ago· 1 in thread

give tigergraph a whack

mapleemanOP3y ago

Haha, adding to a backlog of things to do, add support for TigerGraph https://github.com/memgraph/memgraph/issues/689
