Trying Kolmogorov-Arnold Networks in Practice (opens in new tab)

(cprimozic.net)

146 pointsAmeo1y ago22 comments

22 comments

21 comments · 8 top-level

thesz1y ago· 7 in thread

The original paper [1] used LBFGS [2], it is quasi-second-order optimization algorithm.

  [1] https://arxiv.org/pdf/2404.19756 - "Both MLPs and KANs are trained with LBFGS for 1800 steps in total."
  [2] https://en.wikipedia.org/wiki/Limited-memory_BFGS

(Quasi-)Newton methods approximate learning rate using local curvature which gradient-based methods do not do.

The post relies on Tinygrad because it is familiar to author and author tinkers with batch size and learning rate, but not with optimizer itself.

I think that even line search for minimum on the direction of the batch gradient can provide most of the benefits of LBFGS. It is easy to implement.

abetusk1y ago

You're saying that LBFGS is fundamental to the success of KANs? Why is this so?

thesz1y ago

I can only state that KAN paper used LBFGS and that LBFGS is remarkably different from SGD.

One of the differences is a dynamic learning rate guided by approximation of the local curvature.

1 more reply

adw1y ago

Will deal better with less well-conditioned parameters, maybe? That's a bit of a wild-ass guess, but it's not an entirely uninformed one.

buildbot1y ago

And if it is, good luck scaling LBFGS to anything useable, like vgg-16 scale…let alone a 7B param LLM.

Back in grad school I tried to use LBFGS to optimize a small lenet network. I want to say it used over 128GB before OOM.

1 more reply

osipov1y ago

What's your basis for claiming that Tinygrad can't compute 2nd order partial derivatives (i.e. Hessians) needed for LBFGS? Tinygrad like PyTorch uses automatic differentiation which has no problem supporting nth order derivatives.

fjkdlsjflkds1y ago

OP does not (seemingly) claim that tinygrad can't compute hessians, only that a first-order optimization method was the only thing tried.

Also, as a quasi-newton method, L-BFGS does not require explicit (pre-)computation of the hessian (it implicitly iteratively estimates its inverse in an online manner).

thesz1y ago

As someone with highly unpronoceable nickname said, my only complaint is that only first order methods are used.

Second order methods are fun, actually. I like them. ;)

davesque1y ago· 4 in thread

KANs seem like a great tool for the right job. However, based on my understanding of how they work, my intuitions tell me that they would be awful at image processing, which I think was one of the author's test beds.

ein0p1y ago

On what do you base your intuitions? The way I see it is either they are good universal approximators (like ANNs), and then they’re usable for everything, or they aren’t, and then they’re at best very narrow in application. I’ve yet to see any evidence for the latter.

davesque1y ago

Then why would you choose to use a CNN over a fully connected neural network for visual data? For the same reason you'd choose a KAN over a traditional neural network if you were trying to fit a continuous function that can easily be modeled as a combination of b-splines. Machine learning models aren't magic and their internal function very much determines the kinds of problems to which they are well applied.

My intuitions about KANs and visual data comes from an impression that it would be hard for a decision boundary on visual data to behave nicely if it could only be built from b-splines.

Judging the usefulness of a machine learning architecture is not a matter of determining which architecture will perform the best in all scenarios.

nyrikki1y ago

Universal approximation just indicates existence of a suitable sequence, not that it is findable and doesn't apply with finite numbers of neurons.

But MLPs are not good for everything. Where Simulated annealing works better than auto-diff is the classic example that is easier to visualize, at least for me.

Even if the sequence 'exists', finding it is the problem, it doesn't matter if a method can represent an unfindable sequence.

That said, IMHO, MLP vs KAN is probably safer to think of as horses for courses, they are better at different things.

At least with your definition of 'usable' being undefined.

eximius1y ago

Uninformed response: if two different universal approximators can have different modes/rates of learning, then even if they're usable in theory, they could be less good in practice.

gwern1y ago· 1 in thread

Web design note for OP: you designed your site for dark-mode, and your initial SVGs are correct, but then it clashes with your graphs which are all light-mode. You can invert them in CSS, and they'll look a lot better.

And you can choose which ones to invert automatically using the free+Free https://invertornot.com/ API - IoN will correctly return that eg https://i.ameo.link/caa.png (and the other two) should be inverted.

AmeoOP1y ago

Very good suggestion, thanks. I'll fix that

anonymousDan1y ago· 1 in thread

'hacky feeling techniques' - as opposed to the rest of DNN research??! More seriously, I wonder if some kind of hybrid approach could be possible/beneficial (e.g. KANs for some layers?)

janalsncm1y ago

That’s a fair criticism but it should count for something that we’ve spent the last 20 years ironing out standard “hacks” to get NNs working. Someday we will hopefully have some theoretical grounding for basic things like network shape and size, optimizer, learning rate schedule, etc.

Until that time we will be stuck using empirical methods (read: trial and error) and KANs are at best just another thing to try.

dahart1y ago

> And here's a neural network/multi-layer perceptron with the same number of layers and nodes: One big difference to note is that there are far fewer connections between nodes in KANs compared to neural networks/MLPs.

I think it’s probably worth clarifying a little here that a Bspline is essentially a little MLP, where, at least for uniform Bsplines, the depth is equal to the polynomial degree of the spline. (That’s also the width of the first layer.)

So those two network diagrams are only superficially similar, but the KAN is actually a much bigger network if degree > 1 for the splines. I wonder if that contributed to the difficulty of training it. It is possible some of the “code smell” you noticed and got rid of is relatively important for achieving good results. I’d guess the processes for normalizing inputs and layers of a KAN need to be a bit different than for standard nets.

slashdave1y ago

It is important to remember that the optimizers used in mainstream deep learning models have all been developed and fine tuned to work well with classic NN architectures. There is no such thing as a generic optimization algorithm.

imalexsk1y ago

the loss value he mentions for the KAN is ~1/5 (0.00011) of the NN loss (0.006). Could be a massive difference in an actual task with larger/complex datasets.

abetusk1y ago

Here are what I think are the main conclusions of the article:

""" ... the most significant factor controlling performance is just parameter count. """

""" No matter what I did, the most simple neural network was still outperforming the fanciest KAN-based model I tried. """

I suspected this was the case when I first heard about KANs. Its nice to see someone diving into a bit more, even if it is just anecdotal.

1 more reply

j / k navigate · click thread line to collapse

22 comments

21 comments · 8 top-level

thesz1y ago· 7 in thread

The original paper [1] used LBFGS [2], it is quasi-second-order optimization algorithm.

  [1] https://arxiv.org/pdf/2404.19756 - "Both MLPs and KANs are trained with LBFGS for 1800 steps in total."
  [2] https://en.wikipedia.org/wiki/Limited-memory_BFGS

(Quasi-)Newton methods approximate learning rate using local curvature which gradient-based methods do not do.

The post relies on Tinygrad because it is familiar to author and author tinkers with batch size and learning rate, but not with optimizer itself.

I think that even line search for minimum on the direction of the batch gradient can provide most of the benefits of LBFGS. It is easy to implement.

abetusk1y ago

You're saying that LBFGS is fundamental to the success of KANs? Why is this so?

thesz1y ago

I can only state that KAN paper used LBFGS and that LBFGS is remarkably different from SGD.

One of the differences is a dynamic learning rate guided by approximation of the local curvature.

1 more reply

adw1y ago

Will deal better with less well-conditioned parameters, maybe? That's a bit of a wild-ass guess, but it's not an entirely uninformed one.

buildbot1y ago

And if it is, good luck scaling LBFGS to anything useable, like vgg-16 scale…let alone a 7B param LLM.

Back in grad school I tried to use LBFGS to optimize a small lenet network. I want to say it used over 128GB before OOM.

1 more reply

osipov1y ago

fjkdlsjflkds1y ago

OP does not (seemingly) claim that tinygrad can't compute hessians, only that a first-order optimization method was the only thing tried.

Also, as a quasi-newton method, L-BFGS does not require explicit (pre-)computation of the hessian (it implicitly iteratively estimates its inverse in an online manner).

thesz1y ago

As someone with highly unpronoceable nickname said, my only complaint is that only first order methods are used.

Second order methods are fun, actually. I like them. ;)

davesque1y ago· 4 in thread

ein0p1y ago

davesque1y ago

My intuitions about KANs and visual data comes from an impression that it would be hard for a decision boundary on visual data to behave nicely if it could only be built from b-splines.

Judging the usefulness of a machine learning architecture is not a matter of determining which architecture will perform the best in all scenarios.

nyrikki1y ago

Universal approximation just indicates existence of a suitable sequence, not that it is findable and doesn't apply with finite numbers of neurons.

But MLPs are not good for everything. Where Simulated annealing works better than auto-diff is the classic example that is easier to visualize, at least for me.

Even if the sequence 'exists', finding it is the problem, it doesn't matter if a method can represent an unfindable sequence.

That said, IMHO, MLP vs KAN is probably safer to think of as horses for courses, they are better at different things.

At least with your definition of 'usable' being undefined.

eximius1y ago

Uninformed response: if two different universal approximators can have different modes/rates of learning, then even if they're usable in theory, they could be less good in practice.

gwern1y ago· 1 in thread

AmeoOP1y ago

Very good suggestion, thanks. I'll fix that

anonymousDan1y ago· 1 in thread

'hacky feeling techniques' - as opposed to the rest of DNN research??! More seriously, I wonder if some kind of hybrid approach could be possible/beneficial (e.g. KANs for some layers?)

janalsncm1y ago

Until that time we will be stuck using empirical methods (read: trial and error) and KANs are at best just another thing to try.

dahart1y ago

slashdave1y ago

imalexsk1y ago

the loss value he mentions for the KAN is ~1/5 (0.00011) of the NN loss (0.006). Could be a massive difference in an actual task with larger/complex datasets.

abetusk1y ago

Here are what I think are the main conclusions of the article:

""" ... the most significant factor controlling performance is just parameter count. """

""" No matter what I did, the most simple neural network was still outperforming the fanciest KAN-based model I tried. """

I suspected this was the case when I first heard about KANs. Its nice to see someone diving into a bit more, even if it is just anecdotal.

1 more reply

j / k navigate · click thread line to collapse