Reason #1: Devs aren't Ops.
Reason #2: Devs need something new on their resume.
Reason #3: A certain type of dev reads blogs, gets excited, skips the scientific mumbo-jumbo, and takes the blogs as _the_ source of truth.
Reason #4: It's easy to bootstrap your weekend project (schemaless, etc.). Dealing with a DB is apparently tedious for devs.
I'm sure others can add more...
Let me feel your love HN-ers ;)
What I hear you saying is, unfortunately: it's worse for ops, so no one should use it.
Those who do not study the history of databases are doomed to repeat it. Soon we'll add back row-level write locks, transaction logging, schemas, and multiple indexes, and one day we'll wake up with MongoSQL.
It's really easy to work with. This is why people keep using it.
How can a DB that loses data so frequently still call itself web-scale? It just breaks when you need scale!
Auto-sharding is also super unreliable: we tried it once and it failed, and now we're using a lib that does application-level sharding. We're also considering moving to another database that at least knows that not losing data is the first and most important job of a DB.
Someone summarized the issues of MongoDB here: http://pastebin.com/raw.php?i=FD3xe6Jt . We experienced most of the problems in that article. So, just a reminder for anyone who wants to build a serious product on MongoDB: read the article. It's not FUD; it's so true that I wish I had read it a year ago, so we wouldn't have had to migrate so much legacy data to a new database solution.
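For what it's worth, application-level sharding of the kind mentioned above can be sketched in a few lines: hash a record's key and use the hash to pick a shard. The shard names and the key format here are made up for illustration; a real setup would map these to separate database connections.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate
# database connection or cluster.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def shard_for(key: str) -> str:
    """Route a record to a shard by hashing its key.

    Hash-based routing is deterministic, so the same key always
    lands on the same shard as long as the shard count is fixed.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The obvious weakness is rebalancing: changing the number of shards remaps almost every key, which is why libraries doing this for real tend to use consistent hashing instead of a plain modulo.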
There's no other project I know of that provides schemaless JSON documents, indexing on any part of them, server-side map-reduce, connectors for lots of languages, and atomic updates on parts of a document. If there were one, and it were better than Mongo, I'd switch in a moment.
These absolutely were failures.
The author listed several instances in which the database became unavailable, the vendor-supplied client drivers refused to communicate with it, or both. Some of these scenarios included the primary database daemon crashing, secondaries failing to return from a "repairing" to an "online" state after a failure (and unable to serve operations in the cluster), and configuration servers failing to propagate shard config to the rest of the cluster -- which required taking down the entire database cluster to repair.
Each of the issues described above would result in extended application downtime (or at best highly degraded availability), the full attention of an operations team, and potential lost revenue. The data loss concern is also unnerving. In a rapidly-moving distributed system, it can be difficult to pin down and identify the root cause of data loss. However, many techniques such as implementing counters at the application level and periodically sanity-checking them against the database can at minimum indicate that data is missing or corrupted. The issues described do not appear to be related to a journal or lack thereof.
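The counter technique mentioned above can be as simple as an application-side tally of acknowledged writes that gets compared against a count taken from the database at some interval. This is a generic sketch of the idea, not any particular library's API:

```python
class WriteCounter:
    """Application-level tally of acknowledged writes, used to
    sanity-check the database's own record count later."""

    def __init__(self):
        self.expected = 0

    def record_write(self):
        # Increment only after the database acknowledges the write.
        self.expected += 1

    def check(self, db_count: int) -> int:
        """Return the apparent number of missing records.

        Zero means the counts agree; a positive number is a red flag
        that writes were acknowledged but the data is gone.
        """
        return self.expected - db_count
```

It can't tell you *which* records vanished or why, but a periodic nonzero result is enough to know something is wrong before your users do.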
Further, the fact that the database's throughput is limited to utilizing a single core of a 16-way box due to a global write lock demonstrates that even when ample IO throughput is available, writes will be stuck contending for the global lock, while all reads are blocked. Being forced to run multiple instances of the daemon behind a sharding service on the same box to achieve any reasonable level of concurrency is embarrassing.
On the "1GB / small dataset" point, keep in mind that Mongo does not permit compactions and read/write operations to occur concurrently. As documents are inserted, updated, and deleted, what may be 1GB of data will grow without bound in size, past 10GB, 16GB, 32GB, and so on until it is compacted in a write-heavy scenario. Unfortunately, compaction also requires that nodes be taken out of service. Even with small datasets, the fact that they will continue to grow without bound in write/update/delete-heavy scenarios until the node is taken out of service to be compacted further compromises the availability of the system.
What's unfortunate is that many of these issues aren't simply "bugs" that can be fixed with a JIRA ticket, a patch, and a couple rounds of code review -- instead, they reach to the core of the engine itself. Even with small datasets, there are very good reasons to pause and carefully consider whether or not your application and operations team can tolerate these tradeoffs.
Oh, this mystery is a failure all right, and even the most charitable interpretation would call it a misfeature.
MongoDB is flaky. CouchDB is a maintainability nightmare, so I hear.
Riak? Cassandra? Or does everything else have some other equally huge down-side?
For every "X sucks" article, there's a "Y is awesome" one.
In the NoSQL world the only way to choose is around the problems they solve... they're each specializing and optimizing for certain niches. Mongo is the most MySQL-esque, but doesn't do things that Redis, Couch, or Cassandra do that you may need.
There is no clear winner (fortunately or unfortunately, depending on what you were hoping for).
Yet we've also seen in the past that shedding such a reputation is not strictly required to be popular. And marketing budgets do matter.
Perhaps because both of your premises are wrong? I've used Mongo for over a year now with ~1000 writes/sec and haven't seen any of these problems. I'm not saying they don't exist (some are confirmed bugs that have been fixed), but they're not nearly as prevalent as your 'Do you still beat your wife?'-style question implies.
Then, as you use it, the system optimizes itself (or makes suggestions) based on actual access patterns. A subset of objects could be a formal, indexed table? Have it happen automatically or offer the SQL as a suggestion.
Replication of any kind won't help you with a high write load as secondaries have to apply the same number of writes as primaries.
CouchDB is much better (you're as likely to lose data as with Postgres), but is potentially less efficient (no BSON).
My focus in starting Citrusleaf wasn't features, it was operational dependability. I had worked at companies that had to take their systems offline when they had the greatest exposure - like getting massive load from the Yahoo front page (back in the day). Citrusleaf focuses on monitoring, integration with monitoring software, and operations. We call ourselves a real-time database because we've focused on predictable performance (and very high performance).
We don't have as many features as mongo. You can't do a javascript/json long running batch job. We'll get to features.
The global R/W lock does limit Mongo. Absolutely. Our testing shows a nearly 10x difference in write performance between Mongo and Citrusleaf. Frankly, if you're only doing 1,000 tps, you should probably stick with a decent MySQL implementation.
Here's a performance analysis we did: http://bit.ly/rRlq9V
This theory that "Mongo is designed to run on in-memory data sets" is, frankly, terrible, simply because Mongo doesn't give you the control to stay in memory. You don't know when you're going to spill out of memory. There's no way to "timeout" a page cache IO. There's no asynchronous interface for page IO. For all of these reasons - and our internal testing showing page IO is 5x slower than aio, which is why professional databases use aio and raw devices - we coded Citrusleaf using normal multithreaded IO strategies.
With Citrusleaf, we do it differently, and that difference is huge. We keep our indexes in memory, and they're the most efficient anywhere. You configure Citrusleaf with the amount of memory you want to use, and apply policies for when you start flowing out of memory. Like not taking writes. Like expiring the least-recently-used data.
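An expire-the-LRU policy like the one described can be sketched with an ordered map: a bounded store that evicts the least-recently-used entry once it exceeds its configured capacity. This is a generic illustration of the policy, not Citrusleaf's actual implementation:

```python
from collections import OrderedDict

class LRUStore:
    """Bounded key-value store that evicts the least-recently-used
    entry once the configured capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry
```

The alternative policy mentioned (refusing writes when full) would simply raise an error in `put` instead of calling `popitem`.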
That's an example of our focus on operations. If your application use pattern changes, you can't have your database go down, or go so slowly as to be nearly unusable.
Again, take my comments with a grain of salt, but with Citrusleaf you'll have better uptime, fewer servers, a far less complex installation. Sure, it's not free, but talk to us and we'll find a way to make it work for your project.
I'm a little surprised to see all of the MongoDB hate in this thread.
There seems to be quite a bit of misinformation out there: lots of folks seem focused on the global R/W lock and how it must lead to lousy performance.
In practice, the global R/W isn't optimal -- but it's really not a big deal.
First, MongoDB is designed to be run on a machine with sufficient primary memory to hold the working set. In this case, writes finish extremely quickly and therefore lock contention is quite low. Optimizing for this data pattern is a fundamental design decision.
Second, long-running operations (e.g., those about to trigger a page fault) cause the MongoDB kernel to yield. This prevents slow operations from screwing the pooch, so to speak. Not perfect, but it smooths over many problematic cases.
Third, the MongoDB developer community is EXTREMELY passionate about the project. Fine-grained locking and concurrency are areas of active development. The allegation that features or patches are withheld from the broader community is total bunk; the team at 10gen is dedicated, community-focused, and honest. Take a look at the Google Group, JIRA, or disqus if you don't believe me: "free" tickets and questions get resolved very, very quickly.
Other criticisms of MongoDB concerning in-place updates and durability are worth looking at a bit more closely. MongoDB is designed to scale very well for applications where a single master (and/or sharding) makes sense. Thus, the "idiomatic" way of achieving durability in MongoDB is through replication -- journaling comes at a cost that can, in a properly replicated environment, be safely factored out. This is merely a design decision.
Next, in-place updates allow for extremely fast writes provided a correctly designed schema and an aversion to document-growing updates (i.e., $push). If you meet these requirements-- or select an appropriate padding factor-- you'll enjoy high performance without having to garbage collect old versions of data or store more data than you need. Again, this is a design decision.
Finally, it is worth stressing the convenience and flexibility of a schemaless document-oriented datastore. Migrations are greatly simplified and generic models (i.e., product or profile) no longer require a zillion joins. In many regards, working with a schemaless store is a lot like working with an interpreted language: you don't have to mess with "compilation" and you enjoy a bit more flexibility (though you'll need to be more careful at runtime). It's worth noting that MongoDB provides support for dynamic querying of this schemaless data -- you're free to ask whatever you like, indices be damned. Many other schemaless stores do not provide this functionality.
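"Dynamic querying" here means you can filter on any field of a schemaless document, indexed or not. A toy matcher over plain dicts conveys the idea; the operator names mimic Mongo's query syntax, but this is only an illustrative sketch, not Mongo's implementation:

```python
def matches(doc: dict, query: dict) -> bool:
    """Return True if the document satisfies every clause in the query.

    Supports equality plus two Mongo-style comparison operators
    ($gt, $lt) for illustration.
    """
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            # Operator clause, e.g. {"$gt": 10}
            if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                return False
            if "$lt" in cond and not (value is not None and value < cond["$lt"]):
                return False
        elif value != cond:
            # Plain equality clause
            return False
    return True

docs = [
    {"name": "widget", "price": 5},
    {"name": "gadget", "price": 12, "color": "red"},
]
results = [d for d in docs if matches(d, {"price": {"$gt": 10}})]
```

Note that documents need not share fields ("color" appears in only one of them), which is exactly the flexibility a schemaless store trades against schema-enforced safety.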
Regardless of the above, if you're looking to scale writes and can tolerate data conflicts (due to outages or network partitions), you might be better served by Cassandra, CouchDB, or another master-master/NoSQL/fill-in-the-blank datastore. It's really up to the developer to select the right tool for the job and to use that tool the way it's designed to be used.
I've written a bit more than I intended to but I hope that what I've said has added to the discussion. MongoDB is a neat piece of software that's really useful for a particular set of applications. Does it always work perfectly? No. Is it the best for everything? Not at all. Do the developers care? You better believe they do.
I'm not sure it's a competitor at all. RavenDB is a CouchDB clone for .Net that requires a commercial license for proprietary software.
From this article, sounds like their data is pretty seriously relational.
MongoDB has been pushing the ops side of their product, but I agree it has failings there. To me the advantage is the querying and the JSON-style documents.
Mongo, on paper, should be an ideal candidate for this job; but, due to complications with the locking model and with its inability to do online compactions, it's failing.
I had to model data with umpteen crazy relationships, so we went with MongoDB. We didn't have the high-update issue or any locking issues. If you have a few large tables with fixed columns that easily define the data, then relational DBs probably make more sense. But to your point, 10gen won't tell you that, and the hype doesn't tell you that either.