
0 points | earl | 14y ago | 0 comments

Obviously the difficulty of programming any given task in MapReduce (MR) corresponds, to invent a word, with how cross-talky the task is. If it is no-communication parallelizable, then it's very easy to do. The more communication or the more data striping you do, the more interesting it becomes.
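To make the cross-talk point concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `map_only` and `map_reduce` are hypothetical helper names invented for illustration, and the in-memory dictionary only stands in for the shuffle a real cluster does over the network.

```python
from collections import defaultdict


def map_only(records, fn):
    # No cross-talk: each record is handled independently, so any worker
    # could take any slice of the input without talking to the others.
    return [fn(record) for record in records]


def map_reduce(records, mapper, reducer):
    # Cross-talk: all values for a given key must end up in one place
    # before they can be reduced. On a cluster this grouping step is the
    # shuffle, where data moves between machines.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


lines = ["a b a", "b c"]

# Embarrassingly parallel: uppercase every line.
upper = map_only(lines, str.upper)

# Needs grouping by key: word count.
counts = map_reduce(
    lines,
    lambda line: [(word, 1) for word in line.split()],
    lambda key, values: sum(values),
)
```

The first job never needs the grouping step. The second one does, and once a task needs several rounds of that, the programming gets interesting.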

In any case, tons of people would like a simple SQL-like language for Hadoop. There already exist some examples: Pig, Sawzall, etc. Unfortunately, the inefficiencies hurt. Say Pig takes 2x as much data processing as hand-coding Java. At large scale, that can eat you alive. To get some intuition about scale, review the slides from Ron Bodkin, former VP Eng at Quantcast: [1], page 9 and on. Obviously if a 2x penalty means going from 4 to 8 machines in your cluster, it's not such a big deal. But if you buy clusters a datacenter's cage worth at a time or more, it's painful. We haven't escaped the tradeoff between programmer time and computer cost.
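A back-of-the-envelope sketch of that tradeoff, in Python. The 2x factor is the "say" figure from above. The 1000-machine cluster and the function name `extra_machines` are made up for illustration, not real numbers from anyone's cluster.

```python
import math


def extra_machines(baseline_machines, overhead_factor):
    # Machines needed on top of the hand-coded baseline to do the same
    # work in the same time, assuming throughput scales linearly with
    # machine count (a simplifying assumption).
    needed = math.ceil(baseline_machines * overhead_factor)
    return needed - baseline_machines


# Small cluster: a 2x penalty costs 4 more machines.
small = extra_machines(4, 2.0)

# Hypothetical large cluster: the same 2x penalty costs 1000 more machines.
large = extra_machines(1000, 2.0)
```

The ratio is the same in both cases. The absolute bill is what changes, and that is what gets weighed against programmer time.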

People would also love something as easy to program as R or Matlab that magically scales to large data. Nobody has written such a thing despite quite a lot of demand, which makes me think it's even harder than I thought it was, and I already believed it to be quite a hard task.

For the tools: I'm not a qc spokesperson, none of this represents the opinion of my employer, and it wasn't endorsed by them. If you want qc's position on anything, ask our spokesperson.

[1] http://qconsf.com/dl/qcon-sanfran-2010/slides/RonBodkin_Larg...
