Now, when trying to get stuck into a bit of NoSQL/schema-free/document store databases for the web, I am overwhelmed by the number of options, and am struggling to understand the best one for the job.
Do people genuinely believe that the world needs this many NoSQL systems, or are we just in the infancy/resurgence of schema-free, and things are yet to settle down?
A while ago, there were several different query languages for relational databases, too. In the interest of having a standard, the industry settled on SQL. There are lots of version control systems, parsing frameworks, programming languages, etc., too. This isn't really unique to databases; they just get talked about more since there's so much buzz about hot new web development stuff.
One good source about relational databases (including their history) is _An Introduction to Database Systems_ by C.J. Date. The author has an axe to grind, but he's thorough, and there are plenty of other references cited should you want to dig deeper.
The non-SQL world is still pretty young. Well, the ideas themselves are old, but the recent implementations each try to solve their own distinct set of problems.
> ... and things are yet to settle down?
Yes. IMO there will end up being 5-7 major projects, each supported by a larger community. Each of these projects will solve a particular problem.
So, instead of having 2-3 general SQL providers, we can expect many solutions for very specific problems. The issue right now is that we don't really know what these problems are. Current NoSQL implementations are probing the market, testing whether their specific features are useful to a broader audience.
I think we can guess some of these 5-7 major specializations, for example:
- memcached: Distributed K-V optimized for speed - no replication
- Distributed K-V optimized for reliability
- Distributed K-V optimized for size - like Dynamo.
- neo4j: Graph database
- redis: K-V with rich features, but limited to data size that fits in memory
- K-V framework created to allow Map-Reduce jobs - including scheduler, debugger and so on.
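To make the last item a bit more concrete, here is a rough single-process sketch of what a map-reduce pass over key-value data looks like. It is only an illustration: `map_reduce`, `count_words`, `total`, and the `pages` data are hypothetical names made up for this example, and a real framework would distribute the work across machines and add the scheduling and debugging tools mentioned above.

```python
from collections import defaultdict


def map_reduce(store, mapper, reducer):
    """Run one map-reduce pass over a dict-like key-value store (in-process only)."""
    groups = defaultdict(list)
    # Map phase: every (key, value) pair may emit any number of intermediate pairs.
    for key, value in store.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: collapse each group of intermediate values into one result.
    return {out_key: reducer(out_key, values) for out_key, values in groups.items()}


# Hypothetical data set: page id -> page text.
pages = {
    "page:1": "nosql is schema free",
    "page:2": "sql is not schema free",
}


def count_words(key, text):
    # Emit (word, 1) for every word on the page.
    for word in text.split():
        yield word, 1


def total(word, counts):
    # Sum the per-occurrence counts emitted by the mapper.
    return sum(counts)


word_counts = map_reduce(pages, count_words, total)
```

The store itself only needs to support iteration over key-value pairs, which is why this style of job fits K-V systems so naturally.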
They all solve different types of problems (e.g. document stores vs key-value stores). Even similar databases solve the same problems differently (e.g. sharding). They have different performance profiles and bottlenecks. They give you different ways to model your data and query it. Some are persistent, some are not, and some persist lazily.
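A toy sketch of the modeling difference may help here. Everything in it is hypothetical (`KeyValueStore`, `DocumentStore`, `shard_for`, and the sample users are invented for illustration, not taken from any real product): a key-value store treats the value as an opaque blob, a document store can look inside it, and hash-based placement is just one of several ways a system might shard.

```python
import hashlib
import json


class KeyValueStore:
    """Values are opaque bytes: the store can only get and put by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class DocumentStore:
    """Values are structured documents, so the store can query on their fields."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        self._docs[doc_id] = doc

    def find(self, **criteria):
        # Return ids of documents whose fields match all the given criteria.
        return [
            doc_id
            for doc_id, doc in sorted(self._docs.items())
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


def shard_for(key, shard_count):
    """Pick a shard by hashing the key (one simple strategy among many)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


user = {"name": "alice", "city": "Oslo"}

kv = KeyValueStore()
kv.put("user:1", json.dumps(user).encode("utf-8"))  # the client does the encoding

docs = DocumentStore()
docs.put("user:1", user)
docs.put("user:2", {"name": "bob", "city": "Lima"})
oslo_users = docs.find(city="Oslo")  # the store itself can answer this
```

With the key-value version, finding everyone in Oslo means fetching and decoding every value on the client; with the document version the store can do it for you. That kind of trade-off is a big part of why these systems don't collapse into one.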
Big picture though: this is the first time your average startup/small team/individual hacker has needed a very scalable database solution because of websites. A website has the ability to get you a ton of users very quickly even if you are just one man hacking on a personal pet project (I went through this).
This kind of experimentation is awesome: it lets us figure out what really works in which situations, and it is a sign of a very healthy community. I love being part of it.
CouchDB is well worth a hard look mainly because it takes advantage of several new ideas all in a very simple stack.
In a year or two I predict two or three will emerge as clear choices for a few distinct scenarios.
It's been very interesting reading the responses.