
0 points · buildbot · 1y ago · 0 comments

And if it is, good luck scaling L-BFGS to anything usable, like VGG-16 scale, let alone a 7B-parameter LLM.

Back in grad school I tried to use L-BFGS to optimize a small LeNet network. I want to say it used over 128GB before it hit OOM.


thesz · 1y ago

This is why I mentioned batch gradient line search. You can combine it with conjugate gradient.
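For readers who want to see the shape of that idea, here is a rough sketch of a full-batch gradient line search combined with nonlinear conjugate gradient. The comment does not say which variants are meant, so the Polak-Ribière+ update, the Armijo backtracking rule, the toy linear-fit problem, and every function name below are assumptions made for illustration only.

```python
# Sketch only: nonlinear conjugate gradient (Polak-Ribiere+ variant, an
# assumption) with an Armijo backtracking line search on the full-batch loss.
# All names here are hypothetical and use plain Python lists.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def backtracking_line_search(loss, w, direction, grad, step=1.0, shrink=0.5, c=1e-4):
    """Halve the step until the Armijo sufficient-decrease condition holds."""
    base = loss(w)
    slope = dot(grad, direction)  # negative for a descent direction
    while step > 1e-12 and loss(axpy(step, direction, w)) > base + c * step * slope:
        step *= shrink
    return step

def conjugate_gradient_descent(loss, grad_fn, w, iters=500):
    g = grad_fn(w)
    d = [-gi for gi in g]
    for _ in range(iters):
        if dot(g, g) < 1e-20:   # gradient is effectively zero: converged
            break
        if dot(g, d) >= 0:      # not a descent direction: restart from -gradient
            d = [-gi for gi in g]
        step = backtracking_line_search(loss, w, d, g)
        w = axpy(step, d, w)
        g_new = grad_fn(w)
        # Polak-Ribiere+ coefficient: g_new . (g_new - g) / (g . g), clipped at 0
        beta = max(0.0, dot(g_new, axpy(-1.0, g, g_new)) / dot(g, g))
        d = axpy(beta, d, [-gi for gi in g_new])
        g = g_new
    return w

# Toy full-batch problem (made up for this sketch): fit y = 2x + 1.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]

def loss(w):
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in zip(XS, YS)) / len(XS)

def grad(w):
    residuals = [w[0] * x + w[1] - y for x, y in zip(XS, YS)]
    n = len(XS)
    return [2 * sum(r * x for r, x in zip(residuals, XS)) / n,
            2 * sum(residuals) / n]

w_fit = conjugate_gradient_descent(loss, grad, [0.0, 0.0])
```

The relevant point for the memory complaint above is that this loop keeps only a few parameter-sized vectors (weights, gradient, previous gradient, direction), whereas L-BFGS also stores a history of update pairs.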

And a small LeNet (I think it was the first convolutional network to obtain a good score on MNIST) is orders of magnitude bigger than the KANs in the original paper. And it will stay that way, if we believe the scaling claims from the KAN paper.
