Searching for help online results in solutions for ancient versions or incomplete wiki pages ("we'll finish this soon!" from three years ago).
If Apple is at one extreme of the user-friendliness spectrum, Hadoop is at the opposite end -- the error conditions and error messages can be downright hostile. The naming conventions are wonky too: namenode and secondary namenode, but the secondary isn't a backup, it's a copy. And don't get me started on tasktracker versus jobtracker (primarily because I can never remember the difference).
Restarting a tracker doesn't make it vanish in the namenode, so you have to restart the namenode too (at least in my CDH3 setup).
Everything is held together with duct tape shell scripts.
On the good side, I got everything Hadoop-related managed in puppet. All I need to do for a cluster upgrade is load a new CDH repo, reboot the cluster, then make sure nothing is borked.
If I didn't have to deal with isomorphic SQL<->hadoop queries, I'd start over using http://discoproject.org/
Oh well.
http://hadoop.apache.org/mapreduce/docs/current/mapred_tutor...
"The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master."
If your problem is remembering the difference, perhaps you have not spent enough time understanding the tools you are using.
I eventually got that in real-life, a task is just a subpiece of a job. But the naming is not immediately evocative of the relationship, particularly since in computing, "job" is pretty overloaded. Far better would be "job" and "subjob".
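To make the job/task relationship concrete, here is a minimal sketch in Python. It is not Hadoop code: `Task`, `run_job`, `run_task` and `max_attempts` are names made up for illustration. It only mirrors the quoted description, where one job is split into tasks and the master re-executes the ones that fail.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One subpiece of a job (hypothetical type, not a Hadoop class)."""
    task_id: int
    chunk: list
    attempts: int = 0


def run_job(records, chunk_size, run_task, max_attempts=3):
    """Split one job into tasks and re-run any task that fails.

    Plays the role of the master: it owns the whole job and retries
    individual tasks; run_task plays the role of a worker.
    """
    tasks = [
        Task(task_id, records[start:start + chunk_size])
        for task_id, start in enumerate(range(0, len(records), chunk_size))
    ]
    results = {}
    for task in tasks:
        while True:
            task.attempts += 1
            try:
                results[task.task_id] = run_task(task.chunk)
                break
            except Exception:
                # Give up on the whole job once a task exhausts its retries.
                if task.attempts >= max_attempts:
                    raise
    return [results[task.task_id] for task in tasks], tasks
```

Calling `run_job(list_of_records, 2, sum)` would run one task per pair of records; a task that raises is simply attempted again, up to `max_attempts` times.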
Hadoop's approach is not my cup of tea, but I praise them for giving away a working product that solves such a hard problem.
Playing Hadoop's side, where are the test cases, the patches or bug reports? Or even some missing documentation blurbs, like you mention.
Open source has different levels, ranging from "made in basement and released to the world" (one maintainer, patches welcome) to "throw over the fence" (you can see the code, but nobody cares about patches) to "we're an open source corporation" (Apache, patches welcome if you follow steps 1 to 22).
Hadoop falls under Apache governance, which means JIRA, contributor license agreements, and spending a week of your life trying to get your point across. It's much easier to complain like the spoiled developers we've become.
By your own comments, that's not really fair. The Apache project has become a disaster area of bureaucracy, crappy code and crappier documentation. You have to spend weeks in the muck to get to the point where you can even properly understand the deficiencies of any given Apache subproject, and by that point, you're so bound up in the project that it's hard to move away. From the build system of the code to the structure of the community, not a single thing about Apache is what I would characterize as nimble or lightweight. Can you blame people for feeling helpless?
It's gotten to the point where I'll look anywhere for a competing implementation of an idea before using Apache code to do something. It's just not worth the pain.
This is exactly what does not happen. The new process copies only the parent's address mappings (now marked read-only), which take up vastly less than 1 GB of physical memory.
I think only a 4 KB page or two will ultimately be copied, representing the chunk of the calling thread's stack used to prepare the new child before finally calling execve() or similar.
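A rough sketch of the fork-then-exec pattern being described, in Python. `fork_and_exec` is a name made up for illustration, and the 64 MB buffer is an arbitrary stand-in for a large heap. The child shares the parent's pages copy-on-write and replaces its address space at exec, so the buffer itself should never be duplicated. (CPython touches a few extra pages of its own, so treat this as an illustration of the pattern, not a measurement.)

```python
import os
import sys


def fork_and_exec(argv):
    """Fork, exec argv in the child, and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: still sharing the parent's pages copy-on-write here.
        try:
            os.execvp(argv[0], argv)  # replaces the address space on success
        finally:
            # Only reached if exec failed; exit without running parent cleanup.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    big_heap = bytearray(64 * 1024 * 1024)  # stand-in for a large heap
    print(fork_and_exec([sys.executable, "-c", "pass"]))
```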
The wrench is that Linux has a feature called memory overcommit (which I haven't been able to decipher completely). Supposedly it causes forking to not actually reserve that much space, but by default it's in a "heuristic" mode so it may or may not take effect.
These are the best resources I could find on what happens when you fork:
http://developers.sun.com/solaris/articles/subprocess/subpro...
http://lxr.linux.no/linux/Documentation/vm/overcommit-accoun...
In my case I had several virtual machines hosting Java VMs with 2+GB heaps running an app that liked to fork and run external programs for short periods of time. If the entire heap size was, say, 3GB and it forked twice, Linux would act like it needed 9GB. The Linux overcommit heuristic regularly got things wrong and wanted RAM that it would never actually use. This usually resulted in the JVM failing in new and interesting ways.
The workaround is to allocate a crapload of swap (basically more than heap size times a guesstimate of the number of concurrent forks). It will never actually USE the swap, but having it there seems to keep the overcommit heuristic happy.
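The arithmetic behind that rule of thumb, as a small Python sketch. The function names are made up, and the model (every fork is accounted as a full copy of the heap) is just the pessimistic behavior described above, not a statement of how the kernel's heuristic actually works.

```python
def worst_case_commit_gb(heap_gb, concurrent_forks):
    """Memory the kernel may account for if each fork counts as a full copy."""
    return heap_gb * (1 + concurrent_forks)


def swap_floor_gb(heap_gb, concurrent_forks):
    """Swap to exceed under the rule of thumb: heap size times concurrent forks."""
    return heap_gb * concurrent_forks


# The example from the comment: a 3GB heap that forks twice.
print(worst_case_commit_gb(3, 2))  # 9
print(swap_floor_gb(3, 2))         # 6
```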
Yay for cargo cult server tuning. I've never figured out how to get the kernel to not be so pessimistic and I can't modify the Java code in question so... eh.
The latter causes much trouble because people tend to assume that, say, a database can allocate a certain amount of memory and be sure that it will never run out of memory as long as it does not explicitly allocate more. This is not true for Linux at all: if the system is running out of memory, a process can be terminated at any time (the OOM killer sends it SIGKILL and writes a message to the system log).
I suspect the diagnostics in the article may be wrong, and there is something else causing the problem there. My polite guess would be something with the JVM (you are all having problems with the JVM... hmmm...). Merely forking a process is a tiny operation under Linux and I don't believe memory overcommitting is relevant at all here. And in any case you can simply turn off overcommits with a sysctl to verify the theory.
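For anyone who wants to check which mode a box is in before touching the sysctl, a small Python sketch. The helper names are made up; the file path and the meaning of the values 0, 1 and 2 come from the kernel's overcommit-accounting documentation.

```python
from pathlib import Path

# Values of vm.overcommit_memory per the kernel documentation.
MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw):
    """Translate the raw contents of the sysctl file into a description."""
    return MODES.get(int(raw.strip()), "unknown")


def current_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return a description of the current mode, or None if the file is absent."""
    sysctl_file = Path(path)
    if not sysctl_file.exists():
        return None  # not Linux, or /proc is not mounted
    return describe_overcommit(sysctl_file.read_text())


if __name__ == "__main__":
    # Changing the mode needs root, e.g.: sysctl vm.overcommit_memory=2
    print(current_overcommit_mode())
```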
My experience is that it doesn't just work, even when someone else maintains the infrastructure.
Edit: Real alternatives I mean; something OSS (free is not so important, but no lock-in), able to manage huge amounts of data, actively developed, active projects built on top.
While that's not exactly the open source spirit, it is how plenty of people think and also happens to be the default, easiest thing to do. It would be nice if people could share to benefit themselves and each other.
Despite those issues, the most remarkable thing about Hadoop is the out-of-the-box resilience to get the work done. The strategy of no side-effects and a write-only approach (with failed tasks discarding work in progress) ensures predictable results--even if the time it takes to get those results can't be guaranteed.
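A minimal Python sketch of the "failed tasks discard work in progress" idea. This is not Hadoop's actual commit code; `run_task_attempt` and `produce_lines` are hypothetical names. Each attempt writes to a temporary file and only renames it into place on success, so a failed attempt leaves nothing behind and can be re-run safely.

```python
import os
import tempfile


def run_task_attempt(output_path, produce_lines):
    """Run one task attempt; publish its output only if it finishes cleanly."""
    out_dir = os.path.dirname(os.path.abspath(output_path))
    # Write next to the final location so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, prefix=".attempt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in produce_lines():
                tmp.write(line + "\n")
        os.replace(tmp_path, output_path)  # atomic publish on POSIX
    except Exception:
        os.unlink(tmp_path)  # discard the partial output of the failed attempt
        raise
```

Because the final file only ever appears complete, re-running a failed task gives the same result as if it had succeeded the first time; only the elapsed time changes.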
The documentation isn't the greatest, and it's very confusing sorting out the sedimentary nature of the APIs and configuration (knowing which APIs and config match up across various versions such as 0.15, 0.17, 0.20.2, 0.21, etc., not to mention various distributions from Cloudera, Apache and Yahoo branches), but things are starting to finally converge. You're probably better off starting with one of the later, curated releases (such as the recent Cloudera distribution) where some work has been done to cherry pick features and patches from the main branches.
My general theory is that if it's an important tool for your business, you need at least one person to be an expert on it. The alternative is to pay Cloudera a significant amount per node for support. Another possible alternative is http://www.MapR.com/. They are in beta and claim to be API-compatible with Hadoop, but they are not free.
Again, there isn't much competition for the tasks you would accomplish with Hadoop (and HBase) at the scale at which it has been tested (by Yahoo, StumbleUpon and many others).
The guys working on it are over at Twitter nowadays.
Hadoop is so old school...
Even Google is moving away from Map/Reduce.
Prepare for new shiny things,
which will be better, faster and cheaper to operate!