Ask HN: What database should we use for our “big data” problem?
We've identified the following requirements for a database:
- Thousands of inserts/updates per second
- Has a solid aggregation strategy: aggregating data in N different ways is really important to us, and any hindrance to our ability to aggregate data is going to slow us down.
- Store + query related data (hierarchical/recursive JSON), ideally efficiently, though efficiency isn't a hard requirement: getting the data out in a useful format is very important to us.
- Partition tolerant (easily clustered, automated replication)
- Availability (heavy reads + writes): same story as the partition tolerance. Availability requirements scale with the popularity of the video games we're storing data for.
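To make the aggregation and nested-JSON requirements a bit more concrete, here's a rough Python sketch of the kind of thing we mean. The record shape, the field names (`game`, `region`, `players`, `score`), and the helper names `flatten` and `total_score_by` are all made up for illustration, and none of it is tied to any particular database.

```python
from collections import defaultdict

# Hypothetical records: one nested JSON document per match.
MATCHES = [
    {"game": "alpha", "region": "eu",
     "players": [{"name": "a", "score": 10}, {"name": "b", "score": 5}]},
    {"game": "alpha", "region": "us",
     "players": [{"name": "c", "score": 7}]},
    {"game": "beta", "region": "eu",
     "players": [{"name": "a", "score": 3}]},
]


def flatten(matches):
    """Yield one flat row per player, carrying the parent match fields along."""
    for match in matches:
        for player in match["players"]:
            yield {
                "game": match["game"],
                "region": match["region"],
                "player": player["name"],
                "score": player["score"],
            }


def total_score_by(rows, *keys):
    """Sum `score` grouped by any combination of row keys."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["score"]
    return dict(totals)


# The same data, aggregated three different ways.
by_game = total_score_by(flatten(MATCHES), "game")
by_game_and_region = total_score_by(flatten(MATCHES), "game", "region")
by_player = total_score_by(flatten(MATCHES), "player")
```

Whatever we pick needs to make this sort of "group the nested data by arbitrary keys" query easy at our write volume, which is really what the aggregation requirement boils down to.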
With those requirements in mind, we've been mulling over a few different choices.
- Cassandra: Nails the partition tolerance and availability requirements, but it's very limited in querying capability. There's also the intriguing ScyllaDB, a Cassandra-compatible alternative, which could give us more capacity if needed. With the addition of Spark to our infrastructure, Cassandra may be a good fit for our aggregation needs as well. (We realize Riak and others are compatible with Spark too.)
- MongoDB: Native support for JSON could be a big plus. Has a built-in aggregation pipeline, but we're unsure of its capability. Not too enthused about the master-slave replication; again, we're not overly concerned with consistency.
- Postgres: Maybe we don't need NoSQL yet? With the right partitioning strategy, an RDBMS could work for our use case. Postgres has (from what I can tell) excellent support for JSON. Indexing on JSON properties could prove very useful for aggregation queries.
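Whichever option we go with, "the right partitioning strategy" presumably comes down to picking a key and spreading rows across nodes or tables by it. Here's a minimal Python sketch of that idea, assuming we partition on a hypothetical `game` field. The function names are made up, and this is not the hashing that Cassandra or Postgres use internally; it only illustrates stable hash-based routing.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() of a
    string changes between Python processes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(rows, key_field: str, num_partitions: int):
    """Group rows into partitions by hashing one field of each row."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = partition_for(row[key_field], num_partitions)
        partitions[index].append(row)
    return partitions


# Hypothetical rows; every row for the same game lands in the same partition.
rows = [
    {"game": "alpha", "score": 10},
    {"game": "beta", "score": 3},
    {"game": "alpha", "score": 7},
]
partitions = route(rows, "game", 4)
```

The trade-off we'd have to think about is that a single very popular game would make one partition hot, so the key might need to include something beyond the game itself.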