[1] https://arxiv.org/pdf/2404.19756 - "Both MLPs and KANs are trained with LBFGS for 1800 steps in total."
[2] https://en.wikipedia.org/wiki/Limited-memory_BFGS
(Quasi-)Newton methods use local curvature to approximate the learning rate, which plain gradient-descent methods do not do. The post relies on Tinygrad because it is familiar to the author, and the author tinkers with batch size and learning rate, but not with the optimizer itself.
I think that even a line search for the minimum along the direction of the batch gradient can provide most of the benefits of LBFGS. It is easy to implement.
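For what it's worth, here is a minimal sketch of what a line search along the batch gradient could look like. It uses backtracking with the Armijo sufficient-decrease test, which is a swapped-in approximation rather than an exact minimum along the direction. The `loss` and `loss_grad` functions are a hypothetical ill-conditioned quadratic standing in for a real batch loss, and all names and constants are my own assumptions, not anything from the post or the paper.

```python
def backtracking_line_search(f, x, grad, step=1.0, shrink=0.5, c=1e-4, max_iter=50):
    """Pick a step size t along -grad that satisfies the Armijo condition."""
    fx = f(x)
    g2 = sum(g * g for g in grad)  # squared gradient norm
    t = step
    for _ in range(max_iter):
        candidate = [xi - t * gi for xi, gi in zip(x, grad)]
        # Accept the first step that decreases the loss "enough".
        if f(candidate) <= fx - c * t * g2:
            return t
        t *= shrink  # otherwise try a smaller step
    return t


def loss(w):
    # Hypothetical stand-in for a batch loss: curvature 1 in w[0], 25 in w[1].
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def loss_grad(w):
    return [w[0], 25.0 * w[1]]


def descend(w, steps=200):
    """Gradient descent where every step size comes from the line search."""
    for _ in range(steps):
        g = loss_grad(w)
        t = backtracking_line_search(loss, w, g)
        w = [wi - t * gi for wi, gi in zip(w, g)]
    return w
```

Calling `descend([3.0, -2.0])` needs no hand-tuned learning rate. The cost is a few extra loss evaluations per step, which is the trade-off you would have to weigh against a fixed schedule.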
One of the differences is a dynamic learning rate guided by an approximation of the local curvature.
Back in grad school I tried to use LBFGS to optimize a small LeNet network. I want to say it used over 128GB before OOM.
Also, as a quasi-Newton method, L-BFGS does not require explicit (pre-)computation of the Hessian (it implicitly and iteratively estimates its inverse in an online manner).
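To make the "implicit" part concrete, here is a rough sketch of the standard L-BFGS two-loop recursion. It computes an approximate inverse-Hessian-times-gradient product from a short history of parameter differences `s` and gradient differences `y`, without ever forming a matrix. The function and variable names are my own, and a real optimizer would add a line search, a history-size cap, and curvature checks on top of this.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximate (inverse Hessian) @ grad.

    s_hist[i] = x_{i+1} - x_i and y_hist[i] = grad_{i+1} - grad_i,
    both stored oldest first. Assumes dot(y, s) > 0 for every pair.
    """
    q = list(grad)
    alphas = []
    # First loop: newest pair to oldest.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        alphas.append((rho, alpha))
    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), (rho, alpha) in zip(zip(s_hist, y_hist), reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r
```

For a quadratic with Hessian diag(2, 10), the history `s_hist = [[1, 0], [0, 1]]`, `y_hist = [[2, 0], [0, 10]]` is enough for the recursion to turn the gradient `[4, 20]` into the exact Newton step `[2, 2]`. The memory use is only the stored vectors, so the history length is the knob that controls it.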
Second order methods are fun, actually. I like them. ;)
My intuition about KANs and visual data comes from an impression that it would be hard for a decision boundary on visual data to behave nicely if it could only be built from B-splines.
Judging the usefulness of a machine learning architecture is not a matter of determining which architecture will perform the best in all scenarios.
But MLPs are not good for everything. Problems where simulated annealing works better than auto-diff are the classic example, and the one that is easiest to visualize, at least for me.
Even if the sequence 'exists', finding it is the problem; it doesn't matter if a method can represent an unfindable sequence.
That said, IMHO, MLP vs KAN is probably safer to think of as horses for courses: they are better at different things.
At least with your definition of 'usable' being undefined.
And you can choose which ones to invert automatically using the free+Free https://invertornot.com/ API - IoN will correctly return that e.g. https://i.ameo.link/caa.png (and the other two) should be inverted.
Until that time we will be stuck using empirical methods (read: trial and error) and KANs are at best just another thing to try.
I think it’s probably worth clarifying a little here that a B-spline is essentially a little MLP, where, at least for uniform B-splines, the depth is equal to the polynomial degree of the spline. (That’s also the width of the first layer.)
So those two network diagrams are only superficially similar, but the KAN is actually a much bigger network if degree > 1 for the splines. I wonder if that contributed to the difficulty of training it. It is possible some of the “code smell” you noticed and got rid of is relatively important for achieving good results. I’d guess the processes for normalizing inputs and layers of a KAN need to be a bit different than for standard nets.
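One way to see the "depth equals degree" point is the Cox-de Boor recursion, sketched below. A degree-k basis function is built from two degree-(k-1) ones, so evaluating it recurses through k levels. This is only the textbook recursion with names of my own choosing, not the KAN paper's implementation or the post's.

```python
def bspline_basis(i, k, t, x):
    """Cox-de Boor recursion: i-th B-spline basis function of degree k at x.

    t is a non-decreasing knot list; needs i + k + 1 < len(t).
    """
    if k == 0:
        # Degree 0: indicator of the half-open knot interval [t[i], t[i+1]).
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left_den = t[i + k] - t[i]
    right_den = t[i + k + 1] - t[i + 1]
    # Terms with a zero denominator (repeated knots) are defined as zero.
    left = 0.0
    if left_den != 0:
        left = (x - t[i]) / left_den * bspline_basis(i, k - 1, t, x)
    right = 0.0
    if right_den != 0:
        right = (t[i + k + 1] - x) / right_den * bspline_basis(i + 1, k - 1, t, x)
    return left + right


# Uniform knots 0..7 give four cubic (degree 3) basis functions.
knots = [float(j) for j in range(8)]
cubic_values = [bspline_basis(i, 3, knots, 3.5) for i in range(4)]
```

Each cubic basis function touches four knot intervals and three levels of recursion, so a KAN edge with a cubic spline does noticeably more work than a single MLP weight, which is the size mismatch described above.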
""" ... the most significant factor controlling performance is just parameter count. """
""" No matter what I did, the most simple neural network was still outperforming the fanciest KAN-based model I tried. """
I suspected this was the case when I first heard about KANs. It's nice to see someone diving into it a bit more, even if it is just anecdotal.