For example, understanding the spectral theorem makes the SVD (and hence PCA) and the DFT family of algorithms much clearer. Understand Lp norms, convexity, adjoints, loss functions, and regularization, and a whole bunch of seemingly different algorithms collapse into facets of the same thing. Hook that up to automatic differentiation and an optimization algorithm, and you can write anything from neural networks, SVMs, and regularized logistic regression to nonnegative tensor factorization in a few lines. You stop making arbitrary divisions between classification and optimization. Much the same kind of collapse can be done for the dual [2] notion of probabilistic algorithms by thinking in terms of graphs, simplices, parametrizations, families, and conjugacy.
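To make the spectral theorem / SVD / PCA connection concrete, here is a rough NumPy sketch. The toy data and the helper names (pca_via_eigh, pca_via_svd) are made up for illustration. It computes principal components two ways: by eigendecomposing the symmetric covariance matrix (which is where the spectral theorem comes in), and by taking the SVD of the centered data. The two should agree up to the sign of each component.

```python
import numpy as np

# Toy data (an assumption for illustration): three features with clearly
# different variances, so the principal directions are well separated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)  # PCA works on centered data


def pca_via_eigh(Xc):
    """PCA from the eigendecomposition of the symmetric covariance matrix."""
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    return eigvals[::-1], eigvecs[:, ::-1]  # reorder to descending


def pca_via_svd(Xc):
    """PCA from the SVD of the centered data matrix."""
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Squared singular values, rescaled, are the covariance eigenvalues;
    # the rows of Vt are the principal directions.
    return s**2 / (len(Xc) - 1), Vt.T


eig_vals, eig_vecs = pca_via_eigh(Xc)
svd_vals, svd_vecs = pca_via_svd(Xc)

# Same variances; same directions up to a sign flip per component.
same_values = np.allclose(eig_vals, svd_vals)
same_directions = np.allclose(np.abs(np.sum(eig_vecs * svd_vecs, axis=0)), 1.0)
```

Printing same_values and same_directions is a quick way to check that both routes land on the same decomposition.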
The best thing about all this is that you stop thinking "which algorithm should I use?" and start thinking "what do I want to do? What is the best mathematical model for this?" What would really be great is a machine learning language, where one could work with things akin to folds and maps on various structures and manifolds and make the incidental complexity disappear. Stuff like [3] is really encouraging in that direction.
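As a rough sketch of the "loss plus regularizer plus optimizer" view: the code below uses one generic training loop and swaps only the loss. Hinge loss with an L2 penalty gives a linear SVM; logistic loss with the same penalty gives regularized logistic regression. The gradients are written by hand here as a stand-in for automatic differentiation, and the function names, toy data, and hyperparameters are all assumptions picked for illustration.

```python
import numpy as np


def hinge_grad(margins):
    # Subgradient of max(0, 1 - m) with respect to the margin m.
    return np.where(margins < 1.0, -1.0, 0.0)


def logistic_grad(margins):
    # Derivative of log(1 + exp(-m)) with respect to the margin m.
    return -1.0 / (1.0 + np.exp(margins))


def fit(X, y, loss_grad, lam=0.01, lr=0.1, steps=500):
    """Plain gradient descent on mean loss + (lam / 2) * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # Chain rule through the margin, plus the L2 penalty gradient.
        grad = X.T @ (loss_grad(margins) * y) / len(y) + lam * w
        w -= lr * grad
    return w


def accuracy(w, X, y):
    return float(np.mean(np.sign(X @ w) == y))


# Toy data (an assumption): two well-separated blobs with labels +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(4.0, 1.0, size=(50, 2)),
               rng.normal(-4.0, 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w_svm = fit(X, y, hinge_grad)        # hinge + L2: linear SVM
w_logreg = fit(X, y, logistic_grad)  # logistic + L2: regularized logistic regression
```

The "algorithm" is the same loop both times; the only modeling decision is which loss and penalty to plug in. With a real automatic differentiation library you would pass the loss itself rather than its gradient.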
[1] The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have any element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.
http://www.princeton.edu/~wbialek/our_papers/bnt_01a.pdf
[2] http://golem.ph.utexas.edu/category/2007/01/duality_between_...
[3] http://www.ipam.ucla.edu/publications/gss2012/gss2012_10605....