Apache Mahout: Scalable machine learning for everyone (opens in new tab)

(ibm.com)

95 pointsbpuvanathasan14y ago25 comments

25 comments

21 comments · 8 top-level

law14y ago· 6 in thread

Honestly, frameworks like Mahout and Weka have their place, and that's typically for exploratory data analysis. My belief is that for large-scale, extremely intensive machine learning, your best bet is to implement algorithms tailored to the job at hand. Algorithms like logistic regression work fine if your data is linearly separable, but it's not a panacea. None of the algorithms are.

If you're interested in machine learning and artificial intelligence, I very strongly consider "enrolling" in Tom Mitchell's machine learning class at http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml -- the lectures are long and the mid-term and final are extremely difficult, but the material covered is an outstanding primer for these types of analyses.

After going through all of the lectures, you will look at things like Mahout and Weka as mere toys, and will be equipped to write your own implementations for whatever task you and your company are working on. It's a lot of front-loading for rewards that may at first glance seem illusory, but investing the time now will pay dividends later.

tensor14y ago

Libraries like Weka and Mahout are no more toys than any other library that implements standard and widely applicable algorithms. Yes, you need to do a lot of extra work to properly model your problems, choose features, and combine different algorithms into a final product. But it's not often that you need to tweak the core algorithms that these libraries provide.

If you really understand enough to implement new classifiers or other types of learning algorithms, these libraries are still useful to you. For one, they provide a solid framework for allowing your new algorithm to easily interact with other algorithms. Two, it's not unlikely that your new algorithm is a variation on an existing one. Don't re-implement it. These libraries are open, so copy the source and modify it. And three, mahout uses hadoop. Distributed processing systems are another topic altogether. If you are proposing to write your own, I would hope that you have good reasons for spending the time. Hadoop is certainly no toy.

In summary, don't waste time reimplementing core algorithms unless you are doing it for a learning exercise. But do still take a good course on machine learning, because using the provided algorithms in these packages and others correctly is highly non-trivial.

dvcat14y ago

Dunno about weka but my last experience ( 5 months back) with Mahout was not good. There still are quite a few bugs and the fact that entire code base is in Java makes it extremely unpleasant for someone who wants to hack and modify the code to jump right in and start tweaking stuff. However, in its defense, it is open source is probably the only hadoopified ml library out there and has given me a ton of good ideas on how to write custom code.

1 more reply

law14y ago

You're correct to identify the point of libraries like Weka and Mahout, which are both written in Java, as providing a solid framework for interaction between and among your program and other algorithms. However, Java isn't the right solution for everyone. Moreover, in Weka's case, the GPL licensing may not comport with everyone's requirements. Mahout's license is more friendly to proprietary software, so it's admittedly a non-issue there.

I agree that hadoop is certainly not a toy, but using Mahout on hadoop clusters works better for analyzing large data sets that you've already collected and pre-processed. If you're doing any kind of active learning, or are designing software to run on a client's computer based on feedback that they provide, mahout probably isn't the best choice.

In the end, it requires understanding your problem completely enough to justify your decision.

1 more reply

_delirium14y ago

There aren't very many statisticians/MLers who suggest (or practice) reimplementing your own algorithms, except for quite simple things, because the risk of getting something wrong is pretty high, and the work to make things efficient is non-trivial. If anything, the current push is in the other direction, towards encouraging more people to share their code, and more people to use well-tested code, through initiatives like http://jmlr.csail.mit.edu/mloss/ , http://www.jstatsoft.org/ , and CRAN.

For example, you could reimplement your own SVM instead of using http://svmlight.joachims.org/ , but your chance of producing something correct and as efficient is pretty low...

fauigerzigerk14y ago

I think the choice between using existing libraries and implementing your own mostly depends on how central particular algorithms are to your product. If a better algorithm makes a great difference for my customers then it's insane for me to use an existing library.

I don't even find much value in looking at existing code as a starting point because it's bound to be either obscured by lots of optimizations or naive or it's university code left behind by someone finishing their thesis in a hurry. For code beyond a certain level of complexity I prefer to either use it as a black box or implement it myself.

Obviously, if the algorithm is not a core component of my product it's insane to waste time on reimplementing it, provided there is a good quality implementation that has the right license.

law14y ago

It's a fine line to walk. On the one hand, community-vetted code is a spectacular idea for the core algorithms, but on the other, overly-restrictive licenses (like [L]GPL) effectively preclude the maximum utility being derived from them.

1 more reply

reuser14y ago· 4 in thread

That's cool and stuff, but why do I have to write Java?

jwr14y ago

You don't have to. I use mahout from Clojure, not a single line of Java needs to be written.

ericmoritz14y ago

Then you have two problems... I kid.

moondowner14y ago

Because you want to write in the language the software was written.

reuser14y ago

It's software for everyone, as long as they are people who want to write Java.

dmk2314y ago· 1 in thread

Mahout is a great platform, but the real challenge is defining your learning problems, preparing data sets and choosing right algorithms.

Once you are clear as to what you actually want to accomplish chances are you are going to need some kind of significantly modified or hybrid algorithm. Packages like Mahout could help get started, but it is kinda funny that even quite a few examples in this article do not demonstrate actually good algorithm performance, like this one -

  Correctly Classified Instances : 41523 61.9219%
  Incorrectly Classified Instances : 25534 38.0781%
  Total Classified Instances : 67057
  =======================================================
  Confusion Matrix
  -------------------------------------------------------
  a b c d e f ><--Classified as
  190440 12 1069 0 0 | 20125 a= cocoon_apache_org_dev
  2066 0 1 477 0 0 | 2544 b= cocoon_apache_org_docs
  165480 2370 704 0 0 | 19622 c= cocoon_apache_org_users
  58 0 0 201090 0 | 20167 d= commons_apache_org_dev
  147 0 1 4451 0 0 | 4599 e= commons_apache_org_user

Radim14y ago

There are decimal dots missing in the confusion matrix numbers (i.e., 190440 should read 19044.0, in case anyone else was wondering why the numbers don't add up).

If anything, the article convinced me not to use Mahout. So, the author decided to use the simplest algorithm, Naive Bayes, and got miserable results (from the article: "This is possibly due to a bug in Mahout that the community is still investigating."). He then changed to problem formulation in order to get better results, and concluded by saying the outcome is still likely a bug, but he's happy with it anyway?

This would be probably fine if we were talking about a small, nimble project that you could go into and hack/fix yourself. But we're talking about a massive, Java codebase. The thought of customizing it makes me shudder.

EDIT: forgot to mention I agree with the parent comment completely, except I would add "... and choosing the right evaluation process" to the initial sentence.

mahmud14y ago· 1 in thread

I prefer Weka, mostly because it has excellent literature and has academic leanings, unburdened real-world issues of performance or scalability so it can afford to focus on accuracy.

paraschopra14y ago

The real value proposition of Hadoop isn't the algorithms but using Hadoop to massively parallelize the machine learning algorithms. Do you know any port of Weka that can be scaled in such a manner? Just curious.

tel14y ago· 1 in thread

Table 1 reminds me why even if these algorithms are available it's a big step to being able to understand and apply them. It's clear the author doesn't have a lot of familiarity with them.

mwexler14y ago

He is co-founder of the Mahout project with a pretty extensive background in text analysis. I suspect he's familiar with the algorithms. In fact, he may be showing the reader that they _aren't_ as magical as one may believe, by showing that they don't work perfectly oob.

Unless you are being sarcastic, in which case, forgive me for missing it.

srowen14y ago

Hey all, I'm one of main devs of Mahout and saw this article and commentary. I think it's basically right. I'd like to add my own perspective.

I think Mahout has one key problem, and that's its purported scope. The committers' attitude for a long while, which I didn't like myself, was to ingest as many different algorithms that had anything to do with large-scale machine learning.

The result is an impressive-looking array of algorithms. It creates a certain level of expectation about coverage. If there were no clustering algorithms, you wouldn't notice the lack of algorithm X or Y. But there are a few, so, people complain it's not supporting what they're looking for.

But there's also large variation in quality. Some pieces of the project are quite literally a code dump from someone 2 years ago. Now, some is quite excellent. But because there's a certain level of interest and hype and usage, finding anything a bit stale or buggy leaves a negative impression.

I do think Mahout is much, much better than nothing, at least. There is really only one game in town for "mainstream" distributed ML. If it is only a source of good ideas, and a framework to build on, then it's added a lot of value.

I also think that some corners of the project are quite excellent. The recommender portions are more mature as they predate Mahout and have more active support. Naive Bayes, for example, in contrast, I don't think has been touched in a while.

And I can tell you that Mahout is certainly really used by real companies to do real work! I doubt it solves everyone's problems, but it sure solves some problems better than they'd have solved them from scratch.

I strongly agree with here is that you're never likely to find an ML system that works well out-of-the-box. It's always a matter of tuning, customizing for your domain, preparing input, etc. properly. If that's true, then something like Mahout is never going to be satisfying, because any one system is going to be suboptimal as-is for any given system.

And for the specialist, no system, including Mahout, is ever going to look as smart or sophisticated as what you know and have done. There are infinite variations, specializations, optimizations possible for any algorithm.

So I do see a lot of feedback from smart people that, hmm, I don't think this all that great, and it's valid. For example, I wrote the recommender bits (mostly) and I think the ML implemented there is quite basic. But you see there's somehow a lot of enthusiasm for it, if only because it's managed to roughly bring together, simplify, and make practical the basic ML that people here take for granted. That's good!

mark_l_watson14y ago

Another good article by Grant Ingersoll on Mahout. I used Mahout on a customer project last year when it was not yet a complete machine learning system layered on Hadoop. Looking at Table 1. in this article, many of the previous gaps have been implemented. BTW, the book Mahout in Action is a good guide but the new MEAP released last week does not cover some of the new features, which is OK. Also, Grant has been working on "Taming Text" for a while, but a new MEAP has not been released in a year or two - I would bet that his energies have been focused on extending and using Mahout.

zgoldberg14y ago

The Google Prediction API (code.google.com/apis/predict) will help you get started with machine learning without the need to write any additional code (other than API calls)!

j / k navigate · click thread line to collapse

25 comments

21 comments · 8 top-level

law14y ago· 6 in thread

tensor14y ago

dvcat14y ago

1 more reply

law14y ago

In the end, it requires understanding your problem completely enough to justify your decision.

1 more reply

_delirium14y ago

For example, you could reimplement your own SVM instead of using http://svmlight.joachims.org/ , but your chance of producing something correct and as efficient is pretty low...

fauigerzigerk14y ago

Obviously, if the algorithm is not a core component of my product it's insane to waste time on reimplementing it, provided there is a good quality implementation that has the right license.

law14y ago

1 more reply

reuser14y ago· 4 in thread

That's cool and stuff, but why do I have to write Java?

jwr14y ago

You don't have to. I use mahout from Clojure, not a single line of Java needs to be written.

ericmoritz14y ago

Then you have two problems... I kid.

moondowner14y ago

Because you want to write in the language the software was written.

reuser14y ago

It's software for everyone, as long as they are people who want to write Java.

dmk2314y ago· 1 in thread

Mahout is a great platform, but the real challenge is defining your learning problems, preparing data sets and choosing right algorithms.

  Correctly Classified Instances : 41523 61.9219%
  Incorrectly Classified Instances : 25534 38.0781%
  Total Classified Instances : 67057
  =======================================================
  Confusion Matrix
  -------------------------------------------------------
  a b c d e f ><--Classified as
  190440 12 1069 0 0 | 20125 a= cocoon_apache_org_dev
  2066 0 1 477 0 0 | 2544 b= cocoon_apache_org_docs
  165480 2370 704 0 0 | 19622 c= cocoon_apache_org_users
  58 0 0 201090 0 | 20167 d= commons_apache_org_dev
  147 0 1 4451 0 0 | 4599 e= commons_apache_org_user

Radim14y ago

There are decimal dots missing in the confusion matrix numbers (i.e., 190440 should read 19044.0, in case anyone else was wondering why the numbers don't add up).

EDIT: forgot to mention I agree with the parent comment completely, except I would add "... and choosing the right evaluation process" to the initial sentence.

mahmud14y ago· 1 in thread

I prefer Weka, mostly because it has excellent literature and has academic leanings, unburdened real-world issues of performance or scalability so it can afford to focus on accuracy.

paraschopra14y ago

tel14y ago· 1 in thread

Table 1 reminds me why even if these algorithms are available it's a big step to being able to understand and apply them. It's clear the author doesn't have a lot of familiarity with them.

mwexler14y ago

Unless you are being sarcastic, in which case, forgive me for missing it.

srowen14y ago

Hey all, I'm one of main devs of Mahout and saw this article and commentary. I think it's basically right. I'd like to add my own perspective.

mark_l_watson14y ago

zgoldberg14y ago

The Google Prediction API (code.google.com/apis/predict) will help you get started with machine learning without the need to write any additional code (other than API calls)!

j / k navigate · click thread line to collapse