Kudos to the developers that rise above this, often working for nothing, to build the awesome tools that future generations will use to build awesome apps.
You can't spell NoSQL without the word no.
This is why I try never to use the word NoSQL. It's a flamebait word, deliberately engineered to add heat rather than light. There's no such thing as "a NoSQL database"; there are only databases. Even the relational databases that parse SQL have significant differences, and the databases that don't speak SQL are all over the map.
And have been since the punchcard days. For many people, it's hard to imagine there have been databases of all types, feature sets and performance characteristics before the dawn of SQL.
Nor can you spell it without the word os. But I don't think we're talking about a mouth or other external opening.
A lot of people now read "NoSQL" as "Not Only SQL", which seems more positive than negative.
If you go to the "Don't use MongoDB" post ( http://news.ycombinator.com/item?id=3202081 ) you will read some, IMO, extremely worrying comments from a few pro-NoSQL users including antirez (Redis).
For some reason NoSQL now apparently means "unreliable datastore for unimportant, throwaway data" and defaults are chosen accordingly. Why the hell is that?
NoSQL for me doesn't imply anything other than "no SQL", and at a stretch "no schema" - this makes a lot of sense for many of us who routinely need to create databases that are logically trivial. In many cases they are a bunch of glorified persistent hash tables that usually don't fit in memory. But this doesn't mean they aren't critical. Why would it have to? This isn't anything new either; we've had Berkeley DB for a long while. It's just a bit on the dry side and it may fall short in many cases.
What I was looking forward to, and hoped I could find in the "NoSQL scene", is an alternative to traditional DBs without the overhead that is often unnecessary (but sometimes is necessary, and I intend to continue using PostgreSQL when appropriate). Ideally, something as simple as MongoDB appears to be (I tried the interactive tutorial).
So when exactly did NoSQL stop meaning "no SQL" and start meaning "unreliable cache"? Other than the simplicity, I fail to see where it would fit in the market then (other than the amateur market). There are better, established DB caching solutions. There are persistence libraries in any moderately popular language. There are reliable databases that are fast enough when you have the budget to scale to several dedicated servers.
How about Riak?
I'm worried about the culture that's brewing as well, but I see it more as an attempt from some NoSQL supporters to keep MongoDB looking good, even in the face of serious data integrity issues. The battle lines are forming between SQL and NoSQL (relational vs. non-relational data stores, really) and there's a lot of money and reputation at stake. What we don't want is for the facts to die in a war of rhetoric about the merits of SQL vs. NoSQL. That would be dumb.
With that said, the first paragraph of the rant is worrying:
"I've kept quiet for awhile for various political reasons, but I now feel a kind of social responsibility to deter people from banking their business on MongoDB."
What the hell does "various political reasons" mean? I'm more concerned about that than any deficiencies in MongoDB's codebase. Is there a well-funded campaign to silence MongoDB/NoSQL criticism, or is this just one customer's attempt to save face for choosing the wrong data store?
First, Riak is excellent. I can only say positive things about it as well as the folks that work on it.
Re: "store for unimportant data". I'll go beyond that. Not only should new databases be suitable for reliable storage, new databases should do things that existing databases can't. I am a bit sad that NoSQL has come to mean "replacement for an improperly tuned, ad-hoc sharded MySQL setup". To be clear, having a simple setup that provides partitioning, replication and defaults more tuned to modern hardware is a fine goal -- but why not do better? If I wanted something better than MySQL, I'd use Postgres (or properly tune my MySQL installation).
For example, Dynamo-style stores allow any replica to initiate a write (something not possible with primary copy replication), which enables high-availability applications. Some systems (Voldemort, riak-core, HBase with co-processors) also allow custom code to run on the server, significantly extending the capability of a system in a way in which a stored procedure can't.
It's also sad to see NoSQL-style systems repeat many mistakes that MySQL has made. MySQL in the late 90s with MyISAM is a completely different beast from MySQL today with InnoDB: far better concurrency, durability, referential integrity, better replication. BerkeleyDB JE is also a powerful beast: log structured storage (this is why we're using it as the default storage engine in Voldemort), Paxos-based leader elections with tunable replication.
Schema-less data or (as in Voldemort) evolvable schemas is also a huge feature, but it's not impossible to replicate it on top of MySQL (e.g., Friendfeed's data model).
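To make the "schema-less on top of a relational store" idea concrete, here is a minimal sketch loosely modeled on that approach: entities are stored as opaque JSON blobs keyed by id, and a separate index table makes one field queryable. The table names, function names and the use of SQLite in place of MySQL are all assumptions made for illustration; a real deployment would differ (for example, indexes might be maintained asynchronously).

```python
import json
import sqlite3
import uuid


def make_store():
    """Create an in-memory store (SQLite stands in for MySQL in this sketch)."""
    conn = sqlite3.connect(":memory:")
    # Entities are opaque JSON blobs; the database knows nothing about their fields.
    conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    # A hypothetical index table that makes the 'user_id' field queryable.
    conn.execute(
        "CREATE TABLE index_user_id ("
        "user_id TEXT NOT NULL, entity_id TEXT NOT NULL, "
        "PRIMARY KEY (user_id, entity_id))"
    )
    return conn


def save_entity(conn, entity):
    """Insert or overwrite an entity and refresh its index row in one transaction."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        conn.execute("DELETE FROM index_user_id WHERE entity_id = ?", (entity_id,))
        if "user_id" in entity:
            conn.execute(
                "INSERT INTO index_user_id (user_id, entity_id) VALUES (?, ?)",
                (entity["user_id"], entity_id),
            )
    return entity_id


def find_by_user(conn, user_id):
    """Look up entities through the index table, then decode the JSON bodies."""
    rows = conn.execute(
        "SELECT e.body FROM index_user_id AS i "
        "JOIN entities AS e ON e.id = i.entity_id "
        "WHERE i.user_id = ?",
        (user_id,),
    ).fetchall()
    return [json.loads(body) for (body,) in rows]
```

Adding a new field to an entity needs no ALTER TABLE in this layout; only fields that must be queried need an index table of their own.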
Here are some things that I'd like to really see evolve in NoSQL space:
* Support for new and interesting distribution models. Allowing users to choose between eventual consistency, quorum protocols, primary copy replication and even transactional replication.
* Support for large, unstructured blob data: Riak is going the right way with Luwak, I believe Facebook has been using HBase as a front-end for Haystack -- it would also make a great choice for Haystack's metadata store.
* Most NoSQL systems support transactions within the scope of a single value (or document) via the use of quorums, serializing through a single master, etc... However, it'd be nice if something like MegaStore's Entity Groups (or Tablet Groups in Microsoft Azure Cloud SQL server) were supported.
* Secondary indices, whether internal or external (by shipping a changelog) to the system.
* True multi-datacenter support (local quorums if desired, async replication to the remote site) including across unreliable, high latency WAN links (disclosure: Voldemort supports this -- https://github.com/voldemort/voldemort/wiki/Multi-datacenter... )
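As a small illustration of the quorum option mentioned in the list above: with N replicas, a read quorum of size R and a write quorum of size W, every read is guaranteed to touch at least one replica that saw the latest write when R + W > N. The sketch below checks that rule and verifies it by brute force for small N. The function names are made up for this example and do not come from any particular system.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when any read quorum of size r must share a replica
    with any write quorum of size w among n replicas."""
    return r + w > n


def brute_force_overlap(n: int, r: int, w: int) -> bool:
    """Check the same property by enumerating every pair of quorums."""
    replicas = range(n)
    return all(
        set(read_q) & set(write_q)
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    )
```

For example, N=3 with R=2 and W=2 overlaps, while N=3 with R=1 and W=1 does not, which is the eventually consistent end of the trade-off.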
Having a serious discussion about NoSQL databases begs the exact same question as having a serious discussion about cancer: what kind would you like to have a serious discussion about?
I think the most important lesson we can learn from NoSQL in general is that the idea of a one-size-fits-all database is becoming dated. NoSQL databases certainly don't solve the problems the author points out, and they probably never will. In fact that's the point. By not solving one set of problems, you allow yourself to solve another set of problems.
How about we use databases to solve the problems they were meant to solve, rather than basing our choices on whatever the popular opinion is at the moment.
For programming languages, using the "right tool for the job" has little downside. Perhaps the developers need to learn an extra language, or perhaps there is some communication overhead between them. But unless the components are tightly-coupled, there's not much of a loss.
In contrast, the value of the whole data is greater than the sum of the parts. If you have a website selling products and an inventory management system and an automatic price-setting tool, it's hard to use a different DBMS for each one.
Even for data sets that seem unrelated at first, there may be a lot of value in the small connections between them. This is becoming increasingly apparent and companies are trying very hard to see these connections. Being in separate systems just makes that more difficult.
So, there are good reasons to use multiple database systems, but there is also a much higher cost. Saying "use the right tool for the job" doesn't give any guidance about when it's worth the cost and when it's not.
For production OLTP stuff, I'd argue that it's a bad idea to do the kind of processing you're talking about in the database unless you can avoid it. Beyond the performance implications, you'll likely have to alter your schema in unnatural ways that you wouldn't otherwise.
Now, I absolutely agree that you need to do a cost/benefit analysis and that there are costs associated with having multiple databases. But I don't think those costs are as high as they would appear on first intuition.
Yes, NoSQL doesn't fit the problem you're trying to solve. Perhaps there is a set of problems that are difficult to solve with NoSQL, but there exist sets of problems for which NoSQL databases are perfectly suited. So, I would modify your post to state that NoSQL isn't the solution to every problem, but don't think you're uncovering some big secret, because most people already know that.
I surely do not uncover any secret there. But I haven't stumbled upon a "what NoSQL lacks" blog post recently either. You can read my post in many ways, but the latter one is actually one possible way imho.
To me the problems you've described in your blog post are specifically application model problems. I think we shouldn't abstract the application model into the database, but the database into the application model.
I know a very innovative French developer who wrote an application server that comes integrated with the database. This way you just call the exported functions provided by your database directly. How you model your (re)caching/(re)indexing and other application needs is totally up to you. This is a) a freedom you barely find anywhere else, b) close-to-the-metal development of an application, and c) the most efficient way to develop an application (because you only implement what you need and don't use a generalized construct that serves a general purpose very well but doesn't scale with your application very well).
I would recommend implementing an application using the pattern that you know works best for the application. If you don't know it yet, then it's time to read books that broaden our horizon of available solutions until we can start developing again.
I will show you an example of what I mean.
This is how I think is the most elegant way to interact with a(n integrated) database.
I am curious what you think about this. I know I've not referred to the points in your post, but I've read it carefully. Thanks for hearing me out. I'm sorry I didn't post to your blog, but I prefer to post without subscribing to an external party. You limit the users who can answer this way imho. I'm not sure if it helps you to keep out trolls/spammers, but it sure helps to keep the response rate low.
Regarding comments at my blog: I don't understand what you mean with "subscribing". According to the settings page, you do not have to register. You are free to comment there anonymously. That being said, your comment at HN is highly appreciated. Thank you for taking your time!
Managing Highly-dimensional data and access to it: ...I'm thinking of e.g. geo/spatial data here. Where are the solutions out there?
I have been studying multi-dimensional indexing for more than 3 years now and have implemented many of the state-of-the-art indexes. None of them is sufficient; MongoDB's implementation in particular falls short, especially on scalability.
I also want a better discussion of NoSQL. It isn't fair to hate on databases without understanding the pressures of operations. I saw a friend's company where a big, fancy Oracle system lost all of its data on their main test/dev system at a crucial moment - lost over 100,000 user accounts, including those of executives of key customers. They were forced to merge with a competitor about 4 months later.
You need to take database backups, you need to stage your systems. You need to have extra hardware on hand.
Some of our customers at Citrusleaf continue to "run with scissors". I like the attitude, but we've had to talk sternly with them about the benefits of staging, bucket testing new releases (app and db), and penciling out the realistic hardware requirements.
The new crop of distributed databases provide an immense opportunity for all of us. We can write more agile applications than ever before, and as a community we all need to understand the benefits of flexibility. This includes your entire organization.
That being said, there are technology differences between the NoSQL solutions, and at Citrusleaf we've focused on operations and deployability. My co-founder ran Yahoo Mobile's engineering and ops group, so he understands the tradeoffs. We have a group in India (hi guys!) of great developers (not support guys) simply to make sure that when you've got a problem at 3am there's someone to take care of you.
Performance is important in this agile world, and Citrusleaf has it. http://bit.ly/rRlq9V
A slide I showed at HPTS (the High Performance Transaction Systems conference) showed a Zynga game on the left, and an EA Facebook game on the right. Zynga is an amazing machine in terms of getting huge, rich applications to market. Every pixel is covered with things to do, artwork, everything. And they're rolling out new games every week, and I haven't ever seen downtime (unlike Netflix Streaming, which has maintenance on a regular basis).
Zynga has been a huge proponent of NoSQL (but not Mongo) since its inception, and although I don't know what EA does internally (maybe they use the same tech but have other agility issues), NoSQL is clearly part of a high scale, rich application need.
Join or be flattened.
"The Citrusleaf server node received input from 4 client nodes, the MongoDB server node received input from 1 client node running 2 client processes, and the Redis server node received input from 2 client nodes."
I mean -- if you're cheating, that's bad. If you're not cheating, why the hell do you set things up to look like you're cheating?
Bingo, that is precisely what NotOnlySQL is all about. For example you trade some consistency guarantees for the ability to scale out.
Uninformed (has the author heard about the CAP theorem?), either-or diatribes like this article don't really serve any purpose other than sowing discord.
This is just like "C++ is better than Java is better than..." type flame wars. :)
We use Oracle and we use HBase. We would never replace Oracle with HBase for all of our data needs. At the same time we have need for a store that scales beyond what even Oracle can provide (and yes, we use RAC with multi TB caches across a database instance).
For the same reason we use Java, C++, Scala, Perl, Clojure, Bash, JavaScript, etc... The right tool for the right job.
Personally what I would like to see is:
* secondary indexes
* snapshot isolation (in lieu of global transactions, which will never scale).
Disclaimer: HBase committer here.
I tried my best to describe both what we gained and what we lost after the transition. At the end of the day, MongoDB (and other NoSQL solutions) are different tools for different jobs. Obviously it takes investment to master a new tool, and we almost aborted the migration on two different occasions simply because we didn't know enough about maximizing MongoDB performance. Now that the dust has settled and with all things considered, I am glad we didn't.