I'm not exactly a SQL fanboy, but maybe ACID is kinda useful in situations like this and having to write your own application land 1000 liners for stuff that got solved in SQL land decades ago isn't the best use of time?
I like SQL engines for moderate data sets that fit nicely on one machine and well within the normal performance envelope. But even there I will often have to try a few different incantations and cross my fingers that one of them will perform reasonably because that's easier than trying to figure out what that 1 MLOC engine is up to. And I don't know anybody who does very large MySQL setups without a lot more hassle than that.
For some things I'd much rather deal with 1KLOC that I had to write myself than the 1 MLOC that I'm scared to even start digging through.
It's a stable, well understood DB vs an immature, not well understood DB AND 1KLOC to deal with not being consistent.
To be clear, I'm not saying any given DB is the OneTrueWay, just that people seem to be a bit cavalier with regard to some of this crap, chasing the newest shiny thing while rediscovering why some of the braindamage in those 1MLOC was put there in the first place.
A project with rigorous error handling and testing will have more LOCs than a corresponding project without.
Some problems are just hard, and you'll want as much code as is necessary to make it secure and performant. Some parts of the code you will never run, but inactive code seldom hurts you.
MySQL has its issues, but none of them would be fixed just by having less code.
See ALTER.
It requires you to think about your application differently, but it enables things that you could not do before.
For example, you can now handle databases in multiple datacenters, reducing latency to the client.
This is backwards. Multi-DC capability is a feature. Eventual Consistency is an explicit tradeoff in a desired characteristic (Consistency) to allow other features.
I was using the LevelDB backend with Riak 1.1.2, as my keys are too big to fit in RAM.
I ran tests on a 5 node dedicated server cluster (fast CPU, 8GB ram, 15k RPM spinning drives), and after 10 hours Riak was only able to write 250 new objects per second.
Here's a graph showing the drop from 400/s to 300/s: http://twitpic.com/9jtjmu/full
The tests were done using Basho's own benchmarking tool, with the partitioned sequential integer key generator, and 250 byte values. I tried adjusting the ring_size (1024 and 128), and tried adjusting the LevelDB cache_size etc and it didn't help.
Be aware of the poor write throughput if you are going to use it.
For average commodity hardware I found something like 400 reqs/s/node was normalish, even sustained. Yours looks like it dies about 2 minutes in. Come to think of it, could you have your open file descriptors limited in the OS settings? That looks just like the pattern I'd expect to see from that.
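A quick way to rule out the file descriptor theory is to look at the limits the process actually runs with. This is only a minimal sketch using Python's standard `resource` module (Unix only); the 4096 threshold is an arbitrary assumption for illustration, not a number Basho recommends.

```python
import resource

def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def looks_too_low(soft, threshold=4096):
    """Heuristic check: is the soft limit below the (assumed) threshold?"""
    if soft == resource.RLIM_INFINITY:
        return False  # unlimited, so nothing to worry about here
    return soft < threshold

if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft={soft} hard={hard} too_low={looks_too_low(soft)}")
```

Run it as the same user, and from the same kind of shell or init script, that starts the database, since limits are inherited per process.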
Might be unrelated, but common pitfalls I had were:
- Using the HTTP proto. Protobuf is way faster.
- You can tweak the r and w values to get less read and write consensus when you can afford to, depending on the task and data.
- ulimit open file descriptors might be too low.
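On the r and w point, the usual Dynamo-style rule of thumb is that a read is guaranteed to touch at least one replica that saw the latest acknowledged write only when r + w > n. A rough sketch of that arithmetic, assuming n = 3 replicas (I believe that is Riak's default n_val, but treat it as an assumption); the function names are made up for illustration:

```python
def quorums_overlap(n, r, w):
    """True when every read quorum must intersect every write quorum."""
    return r + w > n

def write_failures_tolerated(n, w):
    """How many replicas can be unavailable while a write still gets w acks."""
    return n - w

if __name__ == "__main__":
    n = 3  # assumed replica count
    for r, w in [(1, 1), (2, 2), (3, 1), (1, 3)]:
        print(f"n={n} r={r} w={w} overlap={quorums_overlap(n, r, w)} "
              f"write_failures_tolerated={write_failures_tolerated(n, w)}")
```

So dropping w to 1 buys write latency and availability, but unless r goes up to n you are accepting that some reads may miss the newest value.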
In any case, if you were to do a short writeup, I'm sure the basho guys at the mailing list would be interested.
I was monitoring with iostat and a couple of other tools. It was certainly very heavy on io, with 80% util, 20% iowait, and that increased as the concurrency went up.
I was using protobuf, and a w value of 1, so I was out of things to optimize.
When I was inserting objects already in Riak's cache, it ran about 3 times faster, but of course that's not possible with new objects.
As a simple remark on this, I've gotten 1000+ ops/sec on a single machine operating as 3 nodes (equating to about 3000 ops/sec per node) when using an SSD and a measly 150 ops/sec with a spinny disc in the same setup (equating to about 450 ops/sec per node)
Search wasn't in use on the test bucket.
For my app, I'd integrated Riak using ruby.
I've heard of Bump, and used it once or twice, but I don't actually know how big or popular it is. If we're talking about a database for a few million users, only a tiny percentage of which are actively "bumping" at any time, it's really hard for me to imagine this is an interesting scaling problem.
Ex. If I just read an article about a "data migration" whose scale is something a traditional DBMS would yawn at, the newsworthiness would have to be re-evaluated.
That's a growth rate of 5 million installs a month; if they kept up that pace, they're at 90 million installs.
To put that in perspective, Instagram "only" has 50 million users. http://www.quora.com/Instagram/How-many-users-does-Instagram...
More bump data here: http://bu.mp/static/images/infographic_9-2011_6.pdf
I'm not a user, but it seems like they have serious data.
90 million rows of denormalized data isn't a big deal, and if I had to guess, their ops per second is probably no higher than what a single dedicated server, or maybe a small master-slave Postgres deployment, could handle.
Again, something a DBA would yawn at.
And I say this as someone who scaled up an API for a service that plugged into multiple ad networks concurrently for a total of billions of impressions per month with a high level of reliability. Using NoSQL and an RDBMS combined.
People who want to preach the NoSQL message should probably have some actual experience. Otherwise, it just makes very viable NoSQL solutions look really bad.
I'll happily share any other statistics you're interested in.
Edit: the Riak cluster actually contains lots of other data (communications, object metadata, etc.); we didn't need sixteen boxes for the user records.
The only other stat that I'm curious about is the total size of the DB. Certainly databases with tens of millions of records can be held completely in RAM these days... but that also depends on how big each record is.
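The back-of-the-envelope version of that question is just record count times record size. A small sketch, where the 90 million figure is the guess from upthread and the per-record sizes and the 64 GB RAM figure are pure assumptions for illustration:

```python
def dataset_bytes(records, bytes_per_record, replicas=1):
    """Raw size estimate; ignores indexes, overhead, and compression."""
    return records * bytes_per_record * replicas

def fits_in_ram(total_bytes, ram_bytes):
    """True if the raw estimate fits in the given amount of RAM."""
    return total_bytes <= ram_bytes

if __name__ == "__main__":
    records = 90_000_000   # guess from upthread, not a confirmed number
    ram = 64 * 10**9       # hypothetical 64 GB box
    for size in (250, 1_000, 10_000):  # hypothetical record sizes in bytes
        total = dataset_bytes(records, size)
        print(f"{size} B/record -> {total / 10**9:.1f} GB, "
              f"fits in RAM: {fits_in_ram(total, ram)}")
```

Which is why the record size matters so much more than the row count here.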
Imagine that...this fascination with schema-less datastores just baffles me:
http://draconianoverlord.com/2012/05/08/whats-wrong-with-a-s...
I'm sure schema-less datastores are a huge win for your MVP release when it's all greenfield development, but from my days working for enterprises, it seems like you're just begging for data inconsistencies to sneak into your data.
Although, in the enterprise, data actually lives longer than 6 months--by which time I suppose most start ups are hoping to have been bought out.
(Yeah, I'm being snarky; none of this is targeted at bu.mp, they obviously understand pros/cons of schemas, having used pbuffers and mongo, I'm more just talking about how any datastore that's not relational these days touts the lack of a schema as an obvious win.)
SQL data stores provide a way to limit certain kinds of inconsistency, but a) I rarely see a system that uses all of that power, and b) there are plenty of inconsistencies that you can't prevent with standard SQL features.
Personally, I'm ok with schema-less stores in the same way I'm ok with saving files on disk. I don't expect my filesystem to enforce application-level file format quality. I just expect it to store things and give them back when I ask. That doesn't mean I don't care about data integrity, it just means I solve the problem somewhere else in the system.
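To make "somewhere else in the system" concrete, here is a minimal sketch of checking records in the application before they ever reach the store. The field names and the plain dict standing in for the datastore are hypothetical, just for illustration:

```python
import json

# Hypothetical application-level schema: field name -> expected type.
REQUIRED_FIELDS = {"user_id": int, "email": str}

def validate(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    return errors

def put(store, key, record):
    """Validate, then write the serialized record to the (dict) store."""
    errors = validate(record)
    if errors:
        raise ValueError("; ".join(errors))
    store[key] = json.dumps(record)

if __name__ == "__main__":
    store = {}
    put(store, "u1", {"user_id": 1, "email": "a@example.com"})
    print(store)
```

The store stays dumb, and the one code path that writes to it is where the rules live.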
For example, this article mentions "With appropriate logic (set unions, timestamps, etc) it is easy to resolve these conflicts"; however, timestamps are not an adequate way to do this, because events in a distributed system are only partially ordered. The magicd may be serialising all requests to riak to mitigate this (essentially using the time reference of magicd), in which case they're losing out on the distributed nature of riak (magicd becomes a single point of failure / bottleneck).
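To illustrate the partial ordering point: with vector clocks, two updates can be neither before nor after each other, which a wall-clock timestamp cannot express. This is only a toy sketch with clocks as plain dicts of node id to counter, not how Riak or magicd actually implements it:

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter bumped by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither descends from the other: a real conflict

if __name__ == "__main__":
    base = increment({}, "x")
    left = increment(base, "x")   # update accepted by node x
    right = increment(base, "y")  # independent update accepted by node y
    print(compare(base, left), compare(left, right))
```

The "concurrent" case is exactly the one where picking the larger timestamp silently throws away one of the writes.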
Insight into how others have approached this would be awesome.
One way is to write domain-specific logic that knows how to resolve your values. For example, your models might have some state that only happens after another state, so conflicts of this nature resolve to the 'later' state.
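A rough sketch of that kind of resolver, assuming a hypothetical model whose state can only move forward through a fixed sequence (the state names are made up):

```python
# Hypothetical lifecycle: a record can only move forward through these states.
STATE_ORDER = ["created", "paid", "shipped"]
RANK = {state: i for i, state in enumerate(STATE_ORDER)}

def resolve(siblings):
    """Pick the sibling whose state is furthest along the lifecycle."""
    if not siblings:
        raise ValueError("no siblings to resolve")
    return max(siblings, key=lambda record: RANK[record["state"]])

if __name__ == "__main__":
    conflict = [
        {"id": 42, "state": "paid"},
        {"id": 42, "state": "created"},
    ]
    print(resolve(conflict))
```

This only works because the domain guarantees the ordering; it says nothing about conflicts in fields that do not have one.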
Another approach is to use data-structures or a library designed for this, like CRDTs. Some resources below:
A comprehensive study of Convergent and Commutative Replicated Data Types http://hal.archives-ouvertes.fr/inria-00555588/
https://github.com/reiddraper/knockbox https://github.com/aphyr/meangirls https://github.com/ericmoritz/crdt https://github.com/mochi/statebox
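For a feel of what a CRDT looks like, here is a minimal grow-only counter (G-Counter) sketch. It is written from the general idea in the paper, not taken from any of the libraries listed, and the node names are hypothetical:

```python
def increment(counter, node, amount=1):
    """Return a copy with this node's count raised; amount must be positive."""
    if amount <= 0:
        raise ValueError("a grow-only counter cannot decrease")
    new = dict(counter)
    new[node] = new.get(node, 0) + amount
    return new

def merge(a, b):
    """Merge two replicas by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(counter):
    """The counter's value is the sum of all per-node counts."""
    return sum(counter.values())

if __name__ == "__main__":
    replica_x = increment(increment({}, "x"), "x")  # two increments seen at x
    replica_y = increment({}, "y")                  # one increment seen at y
    merged = merge(replica_x, replica_y)
    print(merged, value(merged))
```

Because merge is commutative, associative, and idempotent, replicas can exchange state in any order, any number of times, and still converge without a timestamp in sight.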
You're right that this post is vague with regard to those details; they would be a good candidate for a future blog post, but the desired takeaway from this one is that we're quite pleased with the performance and scalability that Riak provides.
It doesn't seem fair to compare [old tech] with [new tech] when you've felt all the pitfalls with one but not the other.