In particular, aspire to learn probabilistic graphical models + the libraries to train them (like Pyro, TensorFlow Probability, Edward, Stan). They have a steep learning curve, especially if you're new to the game, but the reward is great.
All of these methods have their place. SVMs have theirs, but they aren't great for probability calibration, and non-linear SVMs, like every kernel method, can scale absolutely terribly. Neural networks have their place, sometimes as a component of a larger statistical model, sometimes as a feature selector, sometimes in and of themselves. They're also very often the wrong choice for a problem.
Don't fall into the beginner trap: people tend to mistake 'what is the hottest research topic' for 'what is the right solution to my problem given my constraints (data limitations, time limitations, skill limitations, etc.)'. Be realistic, don't use magical thinking, and have a strong basis in statistics to weed out the beautiful non-bullshit from the bullshit that is frustratingly prevalent (everyone and their mother is an ML expert today).
EDIT: I want to also clarify: I don't mean to suggest the author is new to ML, I just mean this as general advice for anyone coming here who is new to DS/ML. The article looks great!
A strong basis in statistics is certainly a great thing, but that basis can be maximum likelihood plus Bayes' law (i.e. "MAP" estimation which is more of a hack to ML than an actual Bayesian method) and still provide the big picture for almost everything.
Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
I don’t know, I think it depends on what you mean by Bayesian. I would say understanding loss functions and regularization requires some understanding of Bayesian stats (just knowing that it comes from log p(x|q) + log p(q) and what both of those terms mean).
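To spell out the connection between regularization and log p(x|q) + log p(q), here is a short derivation. It assumes, purely for illustration, a zero-mean Gaussian prior on the parameters q with variance \tau^2 (the symbol \tau is not from the comment above):

```latex
\begin{align}
\hat{q}_{\text{MAP}}
  &= \arg\max_q \; \log p(q \mid x)
   = \arg\max_q \; \bigl[ \log p(x \mid q) + \log p(q) \bigr] \\
  &= \arg\min_q \; \bigl[ \underbrace{-\log p(x \mid q)}_{\text{loss}}
     \; \underbrace{-\, \log p(q)}_{\text{regularizer}} \bigr] \\
  &= \arg\min_q \; \Bigl[ -\log p(x \mid q) + \frac{1}{2\tau^2} \lVert q \rVert^2 \Bigr]
     \quad \text{if } q \sim \mathcal{N}(0, \tau^2 I).
\end{align}
```

Under that assumption the prior term is exactly an L2 (ridge / weight decay) penalty with strength 1/(2\tau^2); a Laplace prior would give an L1 penalty instead.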
> Graphical models and Bayesian methods generally may make a comeback but such approaches have been superseded by other methods for good reasons, i.e. scaling
Can you be more specific here? It sounds like you’re talking about a particular problem or class of methods. PGMs/Bayesian methods can mean basically anything from logistic regression to running HMC on some hierarchical model using 10,000 CPU hours. I just mean that aspiring to learn PGMs will force you to quickly gain a deeper understanding of, and appreciation for, Bayesian stats, and then you can build on that for years and years. But it depends on what you’re interested in doing; there’s a difference between model building and inference. You can spend your whole life using the same loss function and just focus on making your NN architecture better, and you don’t need much Bayesian stats to do that.
> i.e. "MAP" estimation which is more of a hack to ML than an actual Bayesian method
Huh? Maybe we mean different things by Bayesian — the mode of your posterior seems pretty Bayesian to me!
> Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
Would agree that optimization is an important part of ML/DS, but since nowadays virtually all of the most popular optimization algorithms are available at our fingertips in e.g. pytorch, I would still think it's better to start with trying to build a fundamental understanding of how to frame problems. But that’s colored by my own experience and background; people’s priorities should be different depending on what they want to do.
The lore I've heard is that most new deep learning training algorithms (optimization algorithms) only work better on particular special cases, and it is hard to do better than the established algorithms in general.
I'm also not sure why you're saying they're applicable beyond deep learning--how do you plan to train a PGM or SVM using Adam?
As in, rather than learning in depth all the low level parts then finally putting it together at the end, start with a surface high-level understanding of a working prototype then expand into the details of how everything works inside.
In the case of ML, this could mean starting with a 5 line SciKit-learn prototype of a random forest model, seeing some working predictions, then expanding knowledge from there - what data is going in and what is coming out? What’s a classifier? What’s a decision tree? Etc
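A minimal sketch of what such a prototype might look like. The iris dataset and the specific settings (`n_estimators=100`, a 25% test split) are my own choices for illustration, not something the comment prescribes:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Iris is a small built-in dataset, used here only as a stand-in for your own data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)     # one predicted class label per test row
accuracy = model.score(X_test, y_test)  # fraction of test rows predicted correctly
print(predictions[:5], accuracy)
```

From here the questions in the comment follow naturally: what are the rows and columns of `X`, what do the labels in `y` mean, and what is each tree in the forest actually doing?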
This would be in contrast to picking up one of the plethora of “ML” textbooks that mostly only describe the math behind all the algorithms. Which is not where you should begin, in my view (years of teaching experience). The use of such textbooks is as a reference to fill in details once you are curious about them.
And more than anything, the best way to learn practical ML is to “apprentice” to some experienced practitioners or team who are willing to act as mentors.
Anyway whatever works. Ultimate aim is to learn and have fun.
'State of the art' does not always mean 'best for your task', and in fact lately depending on your field SOTA sometimes simply means 'unaffordable' for anyone whose budget is under 1 million dollars.
Try linear methods first.
Ensembles of decent models are usually good models. The point above about probability calibration can be at least somewhat mitigated by using ensemble averages.
Don't just assume "the $MODEL will figure it out" if you give it shitloads of degrees of freedom. Machine learning efficiency all comes down to efficiency of representation, and feature engineering can achieve huge payoffs if/when you incorporate domain knowledge and expertise.
Once you gain a perspective into the "universality" of statistical methods, optimization, and Bayesian probability theory, your work will become a lot easier to reason about. As an example, try to see if you can explain why least-squares fit results from the assumption that model residuals are normally distributed (and what connections this may have to statistical physics!).
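For anyone who wants to check their answer to the least-squares exercise, here is one way to write it down. It assumes a model f(x; \theta) with independent, zero-mean Gaussian residuals of fixed variance \sigma^2 (the notation is mine):

```latex
\begin{align}
y_i &= f(x_i; \theta) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \\
\log L(\theta)
  &= \sum_{i=1}^{n} \log \mathcal{N}\bigl(y_i \mid f(x_i; \theta), \sigma^2\bigr)
   = -\frac{n}{2} \log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i; \theta)\bigr)^2 \\
\hat{\theta}_{\text{ML}}
  &= \arg\max_\theta \, \log L(\theta)
   = \arg\min_\theta \, \sum_{i=1}^{n} \bigl(y_i - f(x_i; \theta)\bigr)^2 .
\end{align}
```

The first term does not depend on \theta, so maximizing the likelihood is the same as minimizing the sum of squared residuals.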
Also, if you know a few things about the data it becomes a little easier to explain what your model is doing and why it is producing those results.
Found a good resource which explained the trust component: https://arxiv.org/pdf/1602.04938.pdf
About Probabilistic Graphical Models, is there a book other than Daphne Koller's book that you would suggest?
Bishop's Pattern Recognition and Machine Learning has a chapter that's free online: https://www.microsoft.com/en-us/research/wp-content/uploads/...
https://faculty.marshall.usc.edu/gareth-james/ISL/
Elements of Statistical Learning
https://web.stanford.edu/~hastie/ElemStatLearn/
Machine Learning: A Probabilistic Perspective
The former is a much-recommended book since it's very comprehensive, builds everything from the ground up, and was the basis for the entire course. The latter is a beast of its own, and we simply covered what was effectively the first chapter as part of the course.
Want to whet my appetite for your suggestion.
The above example is contrived, but makes more sense in the case of language modelling. Since a bag-of-words vector, containing say counts of words seen in a document, is typically sparse (most documents only contain a limited portion of the full vocabulary), a frequentist estimate of word probability will say that certain words can never occur, just because it's never seen them. The Bayesian estimate will still assign some non-zero chance of seeing that word.
Practically speaking, this leads to the idea of "smoothing" in tf-idf (term frequency-inverse document frequency) vectors, by adding 1 to document frequencies. You don't need Bayesian statistics to do this, but maybe you never would have thought of it otherwise!
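A small sketch of the underlying idea on word probabilities (not the tf-idf formula itself). The vocabulary and document are made up, and the "Bayesian" estimate here is the posterior mean under a uniform Dirichlet prior, which is one common choice rather than the only one:

```python
from collections import Counter

vocabulary = ["cat", "dog", "fish", "bird"]  # hypothetical toy vocabulary
document = ["cat", "dog", "cat"]             # "fish" and "bird" never appear

counts = Counter(document)
n_tokens = len(document)
vocab_size = len(vocabulary)

# Frequentist (maximum likelihood) estimate: plain relative frequency.
mle = {w: counts[w] / n_tokens for w in vocabulary}

# Posterior mean under a uniform Dirichlet(1, ..., 1) prior: add 1 to every count.
smoothed = {w: (counts[w] + 1) / (n_tokens + vocab_size) for w in vocabulary}

print(mle["fish"], smoothed["fish"])
```

The maximum likelihood estimate says "fish" can never occur, while the smoothed estimate gives it a small non-zero probability and still sums to 1 over the vocabulary.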
* do you want to have your NN model your uncertainty as well as the mean? How do you incorporate that into the loss function? Hint: loss = (yhat- y)^2/sigma_hat^2 is missing a term but you wouldn’t know that if you don’t come from Bayesian stats.
* the rabbit hole goes as deep as you want. Understanding Bayesian stats removes a lot of the “ad hoc” and intuitive guesswork that goes into ML when you don’t have a solid statistical foundation for what you’re doing.
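On the uncertainty hint above: as far as I can tell, the missing term is log(sigma_hat^2), which falls out of the Gaussian negative log-likelihood. A rough numpy sketch, using synthetic data with a true sigma of 2 and pretending the mean prediction is already perfect (both are assumptions for illustration):

```python
import numpy as np

def incomplete_loss(y, y_hat, sigma_hat):
    # The loss from the hint: squared error scaled by the predicted variance.
    return np.mean((y_hat - y) ** 2 / sigma_hat ** 2)

def gaussian_nll(y, y_hat, sigma_hat):
    # Negative log-likelihood of y under N(y_hat, sigma_hat^2),
    # dropping the constant 0.5 * log(2 * pi).
    return np.mean(
        0.5 * (y_hat - y) ** 2 / sigma_hat ** 2 + 0.5 * np.log(sigma_hat ** 2)
    )

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=2.0, size=10_000)  # synthetic targets, true sigma = 2
y_hat = np.zeros_like(y)                         # pretend the mean is predicted perfectly

sigmas = np.linspace(0.5, 10.0, 96)              # candidate predicted sigmas, step 0.1
best_incomplete = sigmas[np.argmin([incomplete_loss(y, y_hat, s) for s in sigmas])]
best_nll = sigmas[np.argmin([gaussian_nll(y, y_hat, s) for s in sigmas])]

print(best_incomplete, best_nll)
```

Without the log term the loss keeps shrinking as sigma_hat grows, so the "best" sigma is just the largest one you allow. With the log term the minimum lands near the true noise level.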
I am more of a book person; if you have any other resource for probabilistic graphical models, please share here.
Bishop's "Pattern Recognition and Machine Learning" has a chapter on PGM's that's free online: https://www.microsoft.com/en-us/research/wp-content/uploads/...
Murphy's "Machine Learning: A Probabilistic Perspective" is another behemoth that covers this stuff, but it's really just your preference.
I say "aspire" because (1) depending on your background, it will likely be something that takes a while to internalize and really understand, and you will probably realize many times over that you thought you understood something that you actually didn't; (2) by learning PGM's, you learn a lot of Bayesian statistics as a side effect, hence why even learning a little bit about them is rewarding.
Once you learn a bit, I would use Pyro/other libraries and try to actually build PGM's for toy problems (or non-toy problems too..) because (1) it will force you to admit to yourself that you don't understand something, (2) the documentation for a lot of these libraries is also useful learning material, and (3) you will see once you learn these libraries that it is fairly easy to do something that would be astoundingly complex if you were to try and do it by hand.
You can basically build most standard ML algorithms as a PGM, so e.g. you can try to do logistic regression as a PGM and compare the results to scikit-learn.
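A rough sketch of that comparison without a PPL. Instead of Pyro, this writes the logistic regression model down directly (Bernoulli likelihood, Gaussian prior on the weights) and finds the MAP estimate with Newton's method in numpy. The synthetic data, `true_w`, and `prior_var` are all made up for illustration; a full Bayesian treatment would give a posterior distribution rather than a point estimate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.5])  # hypothetical "ground truth" weights
y = (rng.random(500) < 1 / (1 + np.exp(-X @ true_w))).astype(int)

prior_var = 4.0  # w ~ Normal(0, prior_var * I); an arbitrary choice

# MAP estimate via Newton's method on the negative log posterior:
#   -sum_i log p(y_i | x_i, w) + ||w||^2 / (2 * prior_var)
w_map = np.zeros(3)
for _ in range(50):
    p = 1 / (1 + np.exp(-X @ w_map))
    grad = X.T @ (p - y) + w_map / prior_var
    hess = (X * (p * (1 - p))[:, None]).T @ X + np.eye(3) / prior_var
    w_map -= np.linalg.solve(hess, grad)

# scikit-learn's L2-penalized objective matches when C equals the prior variance.
clf = LogisticRegression(C=prior_var, fit_intercept=False, tol=1e-10, max_iter=10_000)
clf.fit(X, y)
w_sklearn = clf.coef_.ravel()

print(w_map, w_sklearn)
```

The two weight vectors should agree closely, which is the point: scikit-learn's `C` is the prior variance of a Gaussian prior in disguise.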
They're the perfect blend of theoretically elegant and practically impractical. Training scales somewhere between O(n^2) and O(n^3) in the number of samples, serialized models are heavyweight, prediction is slow. They're like Gaussian Processes, except warped and without any principled way of choosing the kernel function. Applying them to structured data (mix of categorical & continuous features, missing values) is difficult. The hyperparameters are non-intuitive and tuning them is a black art.
GBMs/Random Forests are a better default choice, and far more performant. Even simpler than that, linear models & generalized linear models are my go-to most of the time. And if you genuinely need the extra predictiveness, deep learning seems like better bang for your buck right now. Fast.ai is a good resource if that's interesting to you.
Linear models are simpler. GBMs are more powerful, more flexible, and faster.
Every ML course I took had 3 weeks of problem sets on VC dimension and convex quadratic optimization in Lagrangian dual-space, while decision tree ensembles were lucky to get a mention. Meanwhile GBMs continue to win almost all the competitions where neural nets don't dominate.
I suspect my professors just preferred the nice theoretical motivation and fancy math.
You probably also know that decision tree boundaries are nonlinear and piecewise. It’s not so straightforward to find splits on continuous features.
I.e., if the data is linearly separable then why not. Even using hinge loss with NNs is not uncommon.
You probably see GBMs winning a lot of competitions compared to SVMs because a lot of competitions may have a lot of data and nonlinear decision boundaries. Some problems don’t have these characteristics.
Prediction is not that slow with linear SVMs, especially not compared to something like K-NN. The main hyperparameters which matter are the "C" value and maybe class weights if you have recall or precision requirements. The C value is something that should be grid-searched, but you might as well be grid-searching everything that matters on every ML algorithm, and in this regard SVMs are fast to iterate over (because the C value is all that matters).
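A minimal sketch of that grid search with scikit-learn. The synthetic dataset and the particular grid values are placeholders I picked, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic binary classification data as a stand-in for a real problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# C and class_weight are the two knobs mentioned above.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(LinearSVC(dual=False), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
print(best_params, best_score)
```

With only ten combinations and a linear model, the whole search runs in well under a second on data this size, which is the "fast to iterate" point.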
Applying categorical and continuous features is not difficult if you choose to do it in anything more sophisticated than sklearn. Also, pd.get_dummies() exists (though it may lead to that slow prediction you're concerned about)
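For reference, a tiny sketch of what `pd.get_dummies()` does to a mixed table. The column names and values are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41],               # continuous feature, left untouched
    "color": ["red", "blue", "red"],   # categorical feature, one-hot encoded
})

# Passing columns explicitly makes it clear which features get expanded.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```

Each category becomes its own indicator column (`color_blue`, `color_red`), so a feature with many distinct values produces a very wide, sparse matrix. That is the source of the slowdown mentioned above.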
You're most likely right with GBM or Random Forests - though they can have all sorts of issues with parallelism if you're not on the right kind of system. You talk about linear models but SVMs are usually using linear kernels anyway and are a generalization of linear models (including lasso and ridge regression models).
But at that point, they also have a lot in common with linear models. Those also seem practical in that domain (though I have less experience here, tbh). And performant, when using SGD + feature hashing like e.g. vowpal wabbit.
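A rough sketch of the SGD + feature hashing combination, using scikit-learn's `FeatureHasher` and `SGDClassifier` as a stand-in (this is not Vowpal Wabbit, just the same general idea). The toy features `plan=` and `country=` are hypothetical:

```python
import random

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

random.seed(0)
rows, labels = [], []
for _ in range(400):
    label = random.randint(0, 1)
    rows.append({
        # Informative categorical feature (determines the label in this toy data).
        "plan=" + ("premium" if label else "free"): 1,
        # Uninformative categorical feature.
        "country=" + random.choice(["us", "de", "jp", "br"]): 1,
    })
    labels.append(label)

# Hash feature names into a fixed-width sparse vector; no vocabulary is stored.
hasher = FeatureHasher(n_features=2**12, input_type="dict")
X = hasher.transform(rows)

model = SGDClassifier(random_state=0)  # linear model with hinge loss by default
model.fit(X, labels)
train_accuracy = model.score(X, labels)
print(X.shape, train_accuracy)
```

The hashed representation has a fixed width no matter how many distinct categories show up, which is what makes this approach practical for high-cardinality data.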
My beef with non-linear kernels and structured data is a longer discussion, but I find kernel methods for structured data (which is usually high-dimension but low-rank -- lots of shared structure between features, shared structure between missingness of features) to be highly problematic.
Provided your structural dimensionality is below about 10 (i.e. 10 dominant eigenvalues for your features), then KNN can be roughly O(log(N)) for prediction via a well-designed k-d tree.
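A small sketch of tree-backed KNN in scikit-learn, on made-up 5-dimensional data. It only checks that the k-d tree gives the same answers as brute-force search; it does not benchmark speed:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))           # 5 features, i.e. low-dimensional
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels
queries = rng.normal(size=(10, 5))

# Same model, two neighbor-search strategies.
tree_knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)
brute_knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)

tree_pred = tree_knn.predict(queries)
brute_pred = brute_knn.predict(queries)
print(tree_pred, brute_pred)
```

Both searches are exact, so the predictions match; the tree only changes how quickly the neighbors are found, and that advantage fades as the dimensionality grows.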
KNN is also really simple to understand, and to design features for. It also never really tends to throw up surprises, which for production is the kind of thing you want. Most importantly, the failures tend to 'make sense' to humans, so you stay out of the uncanny valley.
This tutorial looks good, and well written.
Personally, I'm quite bullish on the resurgence of SVMs as SOTA. What did it for me was Mikhail Belkin's talk at IAS.[1]
[1] https://m.youtube.com/watch?index=15&list=PLdDZb3TwJPZ5dqqg_...
I feel like I've seen more tree ensembles in the wild than SVMs, though.
For more general tabular data, models like trees, regression, and even rule-based models are more realistic.
My impression: SVMs are more of theoretical interest than practical interest. Yeah, learn your statistics. Loss functions. Additive models. Neural nets. Linear models. Decision trees, kNNs etc. SVM is more of a special interest, imho.
If someone has a suggestion on how I can improve the user experience feel free to hop in and let me know.