The Dark Side of Hadoop

(tech.backtype.com)

132 points · nadahalli · 15y ago · 41 comments

37 comments · 13 top-level

seiji · 15y ago · 9 in thread

A few months ago I got my hadoop cluster stabilized (64 cores with 64 TB hdfs space). It wasn't a pleasant experience. The most memorable frustrating parts of the setup:

Searching for help online results in solutions for ancient versions or incomplete wiki pages ("we'll finish this soon!" from three years ago).

If Apple is one extreme end of the user friendly spectrum, hadoop is at the polar opposite end -- the error conditions and error messages can be downright hostile. The naming conventions are wonky too: namenode and secondary namenode, but the secondary isn't a backup, it's a copy. And don't get me started on tasktracker versus jobtracker (primarily because I can never remember the difference).

Restarting a tracker doesn't make it vanish in the namenode, so you have to restart the namenode too (at least in my CDH3 setup).

Everything is held together with duct tape shell scripts.

On the good side, I got everything hadoop related managed in puppet. All I need to do for a cluster upgrade is load a new CDH repo, reboot the cluster, then make sure nothing is borked.

If I didn't have to deal with isomorphic SQL<->hadoop queries, I'd start over using http://discoproject.org/

xal · 15y ago

Disco looks incredible. Are you aware of any big production uses of it outside of Nokia?

DavidMcLaughlin · 15y ago

I work at Nokia and I'm not aware of any big production uses of it inside Nokia. Most of the teams in my part of the company use Hadoop.

Oh well.

phren0logy · 15y ago

It seems to be a very active project:

https://github.com/tuulos/disco/commits/master

ajays · 15y ago

I think the word "job" comes from the old mainframe days. Each program was encapsulated in a "job". There was even a language on some systems (aptly named "Job Control Language", or JCL) to specify the different aspects of the job.

hvs · 15y ago

I don't think the OP was saying he doesn't know what "job" means; he was saying that they use two terms, "task" and "job", that mean the same thing, so he doesn't know what the difference between the trackers is.

yid · 15y ago

Take a look at Hive if you need SQL semantics on Hadoop.

nickik · 15y ago

BackType have their own language for querying: https://github.com/nathanmarz/cascalog

konradjg · 15y ago

I have to disagree completely with your example of horrible naming conventions regarding the jobtracker and tasktracker. The names are very descriptive and convey enough details to know what their purpose is.

http://hadoop.apache.org/mapreduce/docs/current/mapred_tutor...

"The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master."

If your problem is remembering the difference, perhaps you have not spent enough time understanding the tools you are using.

programnature · 15y ago

I think OP is saying he understands the difference in what they do, but can't keep the names straight. I felt the same way for the first few months of using hadoop.

I eventually got that, in real life, a task is just a subpiece of a job. But the naming is not immediately evocative of the relationship, particularly since in computing, "job" is pretty overloaded. Far better would be "job" and "subjob".
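To make the job/task relationship concrete, here is a minimal sketch of the idea in the tutorial quote above: one job is split into tasks, and a master-side loop re-executes the tasks that fail. The names `Job`, `Task`, `run_job`, and `max_attempts` are hypothetical and only illustrate the relationship; this is not Hadoop's API or its actual scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One unit of work: a single input split of a job (hypothetical type)."""
    job_id: str
    split: str
    attempts: int = 0


@dataclass
class Job:
    """A whole submitted computation, made of one task per input split."""
    job_id: str
    input_splits: list
    tasks: list = field(default_factory=list)

    def plan(self):
        # The job is broken into component tasks, one per split.
        self.tasks = [Task(self.job_id, s) for s in self.input_splits]
        return self.tasks


def run_job(job, run_task, max_attempts=3):
    """Master-side loop: run every task, re-executing the ones that fail."""
    results = {}
    for task in job.plan():
        while task.attempts < max_attempts:
            task.attempts += 1
            try:
                results[task.split] = run_task(task)
                break
            except Exception:
                continue  # failed attempt: try the same task again
    return results
```

In this picture the job tracker owns the `Job` and the loop in `run_job`, while each task tracker only ever sees individual `Task` objects.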

alecco · 15y ago · 4 in thread

It's good to have critical reviews. But perhaps it could have been better if the tone were more respectful towards a free and open source project.

The approach of Hadoop is not my cup of tea, but I praise them for giving away for free a working product that solves such a hard problem.

Playing Hadoop's side, where are the test cases, the patches or bug reports? Or even some missing documentation blurbs, like you mention.

seiji · 15y ago

We've become spoiled.

Open source has different levels ranging from "made in basement and released to the world" (one maintainer, patches welcome) to "throw over the fence" (you can see the code, but nobody cares about patches) to "we're an open source corporation" (apache, patches welcome if you follow steps 1 to 22).

Hadoop falls under apache governance which means jira, contributor license agreements, and spending a week of your life trying to get your point across. It's much easier to complain like the spoiled developers we've become.

timr · 15y ago

"It's much easier to complain like the spoiled developers we've become."

By your own comments, that's not really fair. The Apache project has become a disaster area of bureaucracy, crappy code and crappier documentation. You have to spend weeks in the muck to get to the point where you can even properly understand the deficiencies of any given Apache subproject, and by that point, you're so bound up in the project that it's hard to move away. From the build system of the code to the structure of the community, not a single thing about Apache is what I would characterize as nimble or lightweight. Can you blame people for feeling helpless?

It's gotten to the point where I'll look anywhere for a competing implementation of an idea before using Apache code to do something. It's just not worth the pain.

acangiano · 15y ago

Agreed, alecco. We should all strive to be kinder and excellent to each other, even when being critical.

alecco · 15y ago

Well, to be fair, it's easy to fall into that behavior. I often catch myself before hitting send/post. But sometimes the filter isn't good or fast enough. This usually happens on caffeinated bad days working on frustrating bugs. It's likely the writer was experiencing that.

forgotusername · 15y ago · 4 in thread

> This means that the Hadoop process which was using 1GB will temporarily "use" 2GB when it shells out.

This is exactly what does not happen: the new process copies only the parent's address mappings (now marked read-only), which represent vastly less than 1GB of physical memory.

I think only a single 4KB page or two will ultimately be copied, representing a chunk of the calling thread's stack used to prepare the new child before finally calling execve() or similar.
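For readers who have not written the fork-then-exec sequence by hand, here is a minimal sketch of what "shelling out" boils down to on a Unix system. The helper name `run_via_fork_exec` is hypothetical, and this is plain Python rather than what the JVM does internally; it only shows that the child calls exec almost immediately, so very little of the parent's memory is ever touched.

```python
import os
import sys


def run_via_fork_exec(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: shares the parent's pages copy-on-write, then replaces
        # its entire address space as soon as exec succeeds.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; exit without parent cleanup.
            os._exit(127)
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Calling `run_via_fork_exec([sys.executable, "-c", "pass"])` forks the current interpreter, runs a no-op Python child, and returns its exit code.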

nathanmarz · 15y ago

As far as I understand it (I could be wrong as I've pieced this together from multiple sources), forking still reserves the same amount of memory that the parent process is using without actually copying the data (since it could very well use all that memory eventually).

The wrench is that Linux has a feature called memory overcommit (which I haven't been able to decipher completely). Supposedly it causes forking to not actually reserve that much space, but by default it's in a "heuristic" mode so it may or may not take effect.

These are the best resources I could find on what happens when you fork:

http://developers.sun.com/solaris/articles/subprocess/subpro... http://lxr.linux.no/linux/Documentation/vm/overcommit-accoun...

blocke · 15y ago

I'm not anywhere near being a Java expert but I've seen similar behavior with a non-Hadoop workload.

In my case I had several virtual machines hosting Java VMs with 2+GB heaps running an app that liked to fork and run external programs for short periods of time. If the entire heapsize was say 3GB and it forked twice Linux would act like it needed 9GB. The Linux overcommit heuristic regularly got things wrong and wanted RAM that it would never actually use. This usually resulted in the JVM failing in new and interesting ways.

The workaround is to allocate a crapload of swap (mainly more than heap size times a guesstimate of the number of concurrent forks). It will never actually USE the swap, but having it there seems to keep the overcommit heuristic happy.
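The arithmetic behind that rule of thumb, as a small sketch. The function names are hypothetical, and the model (each fork is accounted as a full copy of the heap) is the simplified picture described in this comment, not a statement of how the kernel always behaves.

```python
def apparent_commit_gb(heap_gb, concurrent_forks):
    """Memory the kernel may account for: the parent plus one
    full-size reservation per forked child (simplified model)."""
    return heap_gb * (1 + concurrent_forks)


def swap_guesstimate_gb(heap_gb, concurrent_forks):
    """Lower bound on swap from the rule of thumb above:
    heap size times the number of concurrent forks."""
    return heap_gb * concurrent_forks
```

With the numbers from the comment, a 3GB heap forked twice is accounted as `apparent_commit_gb(3, 2)`, which is 9, and the rule of thumb asks for more than `swap_guesstimate_gb(3, 2)`, which is 6, gigabytes of swap.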

Yay for cargo cult server tuning. I've never figured out how to get the kernel to not be so pessimistic and I can't modify the Java code in question so... eh.

zorked · 15y ago

I think there's a confusion here between the copy-on-write system that makes fork() so fast under Unix and Linux's overcommitting of memory, which allows malloc() to return success for memory that will in fact only be allocated as necessary (e.g. you can allocate 32GB using malloc, but as long as you don't write anything there, the kernel isn't really allocating that much memory).

The latter causes much trouble because people tend to assume that, say, a database can allocate a certain amount of memory and be sure that it will never run out of memory as long as it does not explicitly allocate more. This is not true for Linux at all, and a process can be terminated at any time (if I am not mistaken, with a SIGKILL signal and an OOM-killer message sent to the system log) if the system is running out of memory.

I suspect the diagnostics in the article may be wrong, and there is something else causing the problem there. My polite guess would be something with the JVM (you are all having problems with the JVM... hmmm...). Merely forking a process is a tiny operation under Linux and I don't believe memory overcommitting is relevant at all here. And in any case you can simply turn off overcommits with a sysctl to verify the theory.
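To check which overcommit policy a machine is actually running before testing this theory, the setting can be read from `/proc/sys/vm/overcommit_memory` (the same value that `sysctl vm.overcommit_memory` reports). The sketch below is a hedged illustration; the helper names are hypothetical, and the mode descriptions are paraphrased from the kernel's overcommit-accounting documentation.

```python
from pathlib import Path

# Meaning of the three documented values of vm.overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic: obvious overcommits are refused (the default)",
    1: "always: allocations are never refused for lack of memory",
    2: "strict: commit limit is swap plus a configurable share of RAM",
}


def describe_overcommit(raw):
    """Translate the raw file contents (e.g. '0\\n') into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return a description of this machine's mode, or None off Linux."""
    p = Path(path)
    if not p.exists():
        return None
    return describe_overcommit(p.read_text())
```

Mode 2 is the "turn off overcommits" setting referred to above; mode 0 is the heuristic mode discussed earlier in the thread.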

technomancy · 15y ago

We got bitten by the same bug shelling out from the JVM. We ended up running socat in a separate process and bouncing all our external processes off it instead. But it's insane that you even have to think about this kind of thing.

earl · 15y ago · 3 in thread

Yeah, hadoop is a bear to get working. One of the benefits of working at a more established company like quantcast, google, or presumably facebook is there are internal infrastructure teams to smooth all that over. I pretty much get to type submitJob and it more or less just works...

jshen · 15y ago

I'm at a big company that uses hadoop, and someone else maintains the infrastructure. But the problem we have is that hadoop is a very leaky abstraction, and it's very easy for someone who doesn't know implementation details to break the grid. It's relatively easy to exhaust the memory of the jobtracker, we constantly have problems with mapper starvation because someone's seemingly innocent job is causing issues, etc, etc.

My experience is that it doesn't just work, even when someone else maintains the infrastructure.

technomancy · 15y ago

It would probably work better in a large company. We spent a couple weeks evaluating Hadoop in our small startup and came to the conclusion that we couldn't afford to have such an infrastructure team, and I imagine most startups considering Hadoop are in a similar boat.

harryh · 15y ago

We use hadoop at foursquare and it was set up and maintained by 1 person. It was a real bear of a job but he got it done.

jamii · 15y ago · 1 in thread

I would be interested to hear why riak, disco etc aren't viable alternatives. I've seen very few good comparisons of the various options (the Mozilla data blog being the only one that comes to mind).

reinhardt · 15y ago

"Risking" a downvote for a basically "me-too" reply, I would also be very interested in such a comparison or even a simple blog post of the experience (good or bad) with alternatives, especially disco.

api · 15y ago · 1 in thread

I never got the Hadoop craze. It performs terribly on memory and CPU, which is odd for something that prides itself on being HPC-oriented.

tluyben2 · 15y ago

Alternatives?

Edit: Real alternatives I mean; something OSS (free is not so important, but no lock-in), able to manage huge amounts of data, actively developed, active projects built on top.

ssn · 15y ago · 1 in thread

Shouldn't Hadoop users help to document it and contribute to it? Why isn't this happening?

code_duck · 15y ago

A lot of users of Hadoop are companies with serious competition. If you took the time to figure all of this out, would you go post a howto for your competitors? No, might as well let them expend their own time and money to do what you did.

While that's not exactly the open source spirit, it is how plenty of people think and also happens to be the default, easiest thing to do. It would be nice if people could share to benefit themselves and each other.

rjurney · 15y ago · 1 in thread

I'd like to know what distribution they are running, and if they've tried others.

nathanmarz · 15y ago

We used to use CDH3b2, but in the past month we switched to EMR. These and other issues existed in both distributions.

firebones · 15y ago

We've seen all of these issues, but the primary causes appear to be either problems in the configuration of the cluster (which you typically don't see until you're doing serious work, or working with an overstressed cluster at scale) or problems with the underlying code quality of the jobs (e.g., processes hanging because tasks don't terminate when they can't handle unexpected input gracefully: infinite loops or non-terminating operations). If you're working with big data, particularly data that isn't the cleanest, you'll start to see some of these issues and others arise.

Despite those issues, the most remarkable thing about Hadoop is the out-of-the-box resilience to get the work done. The strategy of no side-effects and a write-only approach (with failed tasks discarding work in progress) ensures predictable results--even if the time it takes to get those results can't be guaranteed.
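The "failed tasks discard work in progress" idea can be sketched with a generic write-to-temp-then-rename pattern. This is an illustration of the general technique only, not Hadoop's actual output-commit code; `run_task_atomically` and the `.inprogress` suffix are hypothetical names.

```python
import os
import tempfile


def run_task_atomically(output_path, produce):
    """Write a task's output to a temp file and publish it only on success.

    `produce` is a callable returning an iterable of output lines. If it
    raises, the partial output is deleted and nothing is published.
    """
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".inprogress")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in produce():
                tmp.write(line + "\n")
        # Rename is atomic within one filesystem, so readers see either
        # the complete output or none at all.
        os.replace(tmp_path, output_path)
        return True
    except Exception:
        os.unlink(tmp_path)  # discard work in progress
        return False
```

Because a failed attempt leaves nothing behind, the same task can simply be run again, which is what makes the final result predictable even when the running time is not.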

The documentation isn't the greatest, and it's very confusing sorting out the sedimentary nature of the APIs and configuration (knowing which APIs and config match up across various versions such as 0.15, 0.17, 0.20.2, 0.21, etc., not to mention various distributions from Cloudera, Apache and Yahoo branches), but things are starting to finally converge. You're probably better off starting with one of the later, curated releases (such as the recent Cloudera distribution) where some work has been done to cherry pick features and patches from the main branches.

dougb · 15y ago

I've been using Hadoop in production for about 3 years on a small 48 node cluster. I've seen many issues, but none of the ones mentioned in the blog post.

My general theory is that if it's an important tool for your business, you need at least 1 person to be an expert on it. The alternative is to pay Cloudera a significant amount per node for support. Another possible alternative is to use http://www.MapR.com/; they are in beta and claim to be API-compatible with Hadoop, but they are not free.

tluyben2 · 15y ago

Memory; it's Java. Java is great, but is a memory hog. Doesn't matter so much these days (well it does, but what are you going to do... There isn't much OSS competition for Hadoop). The documentation is horrible though. I'm not sure if this is a 'new' OSS trend, but now that I'm working a lot with Rails and gems I notice that on that side of OSS it's pretty normal to produce no documentation or completely horrible documentation; "use the force, read the source" and that kind of hilarious tripe. The Hadoop-related Apache projects (Hadoop, Pig, Hbase) all suffer from this; you are hard pressed to find anything remotely helpful and not incredibly outdated. At least for Rails you can find tons of examples (no docs though) of how to achieve things; for Hadoop/Hbase everything is outdated or not functional, and requires you to jump into tons of code to get stuff done.

Again; there is not so much competition for tasks you would accomplish with Hadoop (and Hbase) on the scale it has been tested (by Yahoo/Stumbleupon and many others).

kordless · 15y ago

If you want general distributed process management, check out Mesos: http://www.mesosproject.org/

The guys working on it are over at Twitter nowadays.

nivertech · 15y ago

  Hadoop is so old school...
  Even Google moving away from Map/Reduce.
  Prepare for new shiny things, 
  which will be better, faster and cheaper to operate!
