Inspiration came from ClickBench, a Benchmark For Analytical DBMS.
We previously developed mgBench as in-house testing infrastructure to benchmark Memgraph, and we are now adapting it to support other graph database vendors. To test graph database performance, mgBench executes Cypher queries on a given dataset. The queries are general and represent a typical workload for analysing any graph dataset. The benchmark runs are automated, and the code used to run them is publicly available, so you can run mgBench yourself to validate the results on the BenchGraph platform [1]. The methodology is explained in detail in the GitHub repo [2].
At the moment, we have two vendors on the platform, and we would like to add more. Feel free to contribute.
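To make the idea of "executing queries on a dataset and measuring them" concrete, here is a minimal sketch of a timing loop. It is not mgBench's actual code: `benchmark`, `run_query`, the query names, and the choice of metrics (throughput, median and maximum latency) are all assumptions made for illustration. `run_query` stands in for whatever function sends a Cypher string to the database under test.

```python
import statistics
import time


def benchmark(run_query, queries, repetitions=100):
    """Time each named query `repetitions` times.

    `run_query` is a hypothetical callable that sends one query string to
    the database under test and blocks until the result is consumed.
    """
    results = {}
    for name, query in queries.items():
        latencies = []
        started = time.perf_counter()
        for _ in range(repetitions):
            t0 = time.perf_counter()
            run_query(query)
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - started
        results[name] = {
            # Queries per second over the whole run of this query.
            "throughput_qps": repetitions / elapsed if elapsed > 0 else float("inf"),
            # Per-query latency statistics, in milliseconds.
            "p50_ms": statistics.median(latencies) * 1000,
            "max_ms": max(latencies) * 1000,
        }
    return results


if __name__ == "__main__":
    # Stand-in for a real database call, so the sketch runs on its own.
    def fake_run_query(query):
        time.sleep(0.001)

    report = benchmark(
        fake_run_query,
        {"count_nodes": "MATCH (n) RETURN count(n)"},
        repetitions=20,
    )
    for name, stats in report.items():
        print(name, stats)
```

A real harness would also need warm-up runs, concurrent clients, and isolation between vendors, none of which this sketch attempts.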
Let me know if you have any questions or suggestions.
[1] https://memgraph.com/benchgraph
[2] https://github.com/memgraph/memgraph/tree/master/tests/mgben...
Measuring peak memory will be nonsensical for some implementations. Some databases do minimal dynamic allocation for performance reasons. Some will be paging to storage, which can work well for graph databases with an appropriate I/O scheduler design. This benchmark seems to assume all graph databases are in-memory and doing dynamic memory allocation.
The test data models are tiny. Even the “large” test data model falls below the noise floor of scalable graph database architectures. This has the implication that good results overfit for graph databases that scale poorly. You need something closer to a billion edges to really exercise and differentiate the performance characteristics of graph databases, and in realistic applications that still isn’t a particularly large graph (maybe medium-sized?).
It would also be useful to benchmark how long it takes to load and prepare the data. This is important operationally and, for many graph databases, unreasonably slow. Graph databases tend to skip over this part when talking about performance.
Both on Memory usage tracking and precise data on load/input.
Regarding scale, we are aware of the issue; it is listed in the limitations: https://github.com/memgraph/memgraph/tree/master/tests/mgben.... Future versions will probably have a billion nodes/relationships.
Actually, Neo4j is particularly slow on writes. Import/load times were 50x faster on Memgraph, but we didn't show that. We will include it in the next version for all vendors.
I implemented a toy Cypher database (samsquire/hash-db), and I just use a Python test script. I have yet to benchmark it; the performance is probably poor.
I tried running standardised SQL benchmarks against MySQL, but the benchmark code had fallen behind the MySQL client, and keeping it up to date takes work.
I inherited a Jepsen suite to test ActiveMQ, and it wasn't easy to understand.
Testing can be a full time job!
For example, I created a new kind of relational database engine that also has some very favorable performance characteristics when compared with popular engines like SQLite, MySQL, PostgreSQL, and SQL Server. I have posted some videos (https://www.youtube.com/watch?v=Va5ZqfwQXWI and https://www.youtube.com/watch?v=OVICKCkWMZE) and done a number of live demonstrations of my system showing how it is many times faster with both transactional and analytic operations across a wide variety of queries.
I certainly didn't expect mass adoption where everyone decided to dump their old system and move to mine (it is still missing some important features). But in a world where even a 50% increase in speed should be seen as significant, the reaction has instead been crickets! You would think that showing a 10x improvement at anything would result in some serious inquiry by at least a few hundred people. Instead, I seriously wonder if anyone cares about speed anymore.
I don't mean to downplay the potential performance benefits you've shown, but you're up against other databases that have been around for 22-33 years each. You have a bus factor of seemingly 1. You have a specific way of communicating that you've decided on that is unlike any other popular solution out there. I wouldn't assume speed is the primary factor.
It is a little startup with a tiny 'bus factor' so like I said, I don't expect everyone to just jump on board; but I expected to get a little more attention from the 'curiosity crowd' who just might wonder why it is so fast. It needs more resources to support other platforms (the code is cross-platform, but I have only tested on Windows and Linux so far) and languages.
Again, the browser is a proof of concept app that demonstrates that the underlying engine is not just blowing smoke. If it can't attract some interested customers and/or investors, then it will be a dead-end project just like every other promising technology that failed to get a hold in the market.
What is specific to the graph space is that it is still in its early days compared to the relational database space. This means performance differences are big and play a more important role.
> There is a broad number of drivers in many different programming languages available for both solutions. While Memgraph only maintains a few in-house drivers that it develops and supports (C, C++, Python, Rust), most Neo4j drivers can also be used with Memgraph. This is due to the fact that both solutions use the Bolt protocol, labeled property graph model and Cypher query language.
Can I really create a Neo4j driver object from the official Neo4j Python driver and point it at Memgraph and my app code will work as expected?
If you want to continue using the Neo4j driver, you can. We also have GQL Alchemy, which is an ORM for Python. Take a look here: https://memgraph.com/gqlalchemy
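For anyone who wants to try this, here is a minimal sketch using the official `neo4j` Python package (`pip install neo4j`). It assumes a Memgraph instance listening for Bolt on `localhost:7687` with authentication disabled; the URI, the empty credentials, and the `count_nodes` helper are assumptions for illustration, so adjust them for your setup. Whether every feature of your app works unchanged depends on which Cypher and driver features it uses.

```python
# Assumed connection details: a local Memgraph instance with no auth configured.
MEMGRAPH_URI = "bolt://localhost:7687"
MEMGRAPH_AUTH = ("", "")


def count_nodes(uri=MEMGRAPH_URI, auth=MEMGRAPH_AUTH):
    """Connect with the Neo4j driver and run one trivial Cypher query."""
    # Imported here so the module loads even if the driver is not installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=auth) as driver:
        # Fails fast if nothing is listening on the Bolt port.
        driver.verify_connectivity()
        with driver.session() as session:
            record = session.run("MATCH (n) RETURN count(n) AS nodes").single()
            return record["nodes"]


if __name__ == "__main__":
    print("nodes:", count_nodes())
```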
The ORMs we’ve experimented with tend to be too opinionated in their structure or have leaky abstractions or don’t allow configuring important settings like connection timeouts or don’t handle things like batch writes or transaction control correctly. For my use case it’s much better to simply drop in a bolt driver and watch it go.
Do you know the reason for these results? Is it because of C++ (MemGraph) vs Java (Neo4j)?
Any other queries where MemGraph does poorly compared to Neo4j? What are the downsides to using MemGraph?
So far, on this dataset and at this scale, we haven't encountered any, but we have plans for a bigger dataset and more complex queries. You can take a look at the limitations part of this benchmark.
As for the downsides: Memgraph and Neo4j are currently somewhat different products. Memgraph is an in-memory database, while Neo4j is disk-based. With Memgraph you use RAM exclusively as storage and gain speed in return; we have snapshots on disk for recovery, etc. Neo4j is on disk rather than in memory, but it loads a lot of data into RAM and consumes vast amounts of it, so it is hard to draw a clean distinction.
Being in memory vs Neo4j doesn’t even seem like a fair comparison. I would hope to god your product is faster for that reason. But it’s like comparing a minivan vs an electric smart car. One is quick and fast but it has its limitations, and the other option is more versatile.
Also, I think you’re limiting yourself if the positioning of your product is purely as a counterpart to Neo4j, riding on their coattails.
Glad to see some thoughts about benchmarking, but every vendor makes such a thing and, surprise surprise, they’re always better than the rest.
Of course sqlite isn't a graph database. But you can make it work like one by writing some query joins.
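As a rough illustration of what "writing some query joins" can look like, here is a small sketch using Python's built-in `sqlite3` module. The `edge` table, its column names, and the helper functions are made up for this example. A two-hop lookup is a self-join on the edge table, and unbounded reachability can be done with a recursive common table expression.

```python
import sqlite3


def build_graph(edges):
    """Store a directed graph as rows of (src, dst) in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT NOT NULL, dst TEXT NOT NULL)")
    conn.execute("CREATE INDEX edge_src ON edge (src)")
    conn.executemany("INSERT INTO edge (src, dst) VALUES (?, ?)", edges)
    return conn


def two_hops(conn, start):
    """Nodes exactly two edges away from `start`, found with one self-join."""
    rows = conn.execute(
        """
        SELECT DISTINCT e2.dst
        FROM edge AS e1
        JOIN edge AS e2 ON e2.src = e1.dst
        WHERE e1.src = ? AND e2.dst <> ?
        ORDER BY e2.dst
        """,
        (start, start),
    )
    return [row[0] for row in rows]


def reachable(conn, start):
    """All nodes reachable from `start`, using a recursive CTE.

    UNION (not UNION ALL) drops rows already seen, so cycles terminate.
    """
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION
            SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
        )
        SELECT node FROM reach WHERE node <> ? ORDER BY node
        """,
        (start, start),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_graph([("a", "b"), ("b", "c"), ("c", "d"), ("b", "a")])
    print(two_hops(conn, "a"))   # ['c']
    print(reachable(conn, "a"))  # ['b', 'c', 'd']
```

Every extra hop costs another join (or another recursion step), which is where a dedicated graph engine is expected to do better.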
What are you doing with Dgraph, and what are the requirements for your use case? If you can share more info, of course.
If image size is a problem, other images that do not include everything are available here: https://hub.docker.com/r/memgraph/memgraph-platform/tags. Probably not everything included in the platform is actually required for a given use case :D