This is why I mentioned batch gradient line search. You can combine it with conjugate gradient.
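To make that concrete, here is a rough sketch of what I mean, not a reference implementation. It runs nonlinear conjugate gradient where each step length comes from a backtracking line search on the full-batch loss. The toy linear least-squares model stands in for a network, and the function names (`loss_and_grad`, `backtracking`, `cg_minimize`), the Polak-Ribière+ update, and the Armijo constants are all my own picks for illustration.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Full-batch mean squared error and its gradient for a linear model."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def backtracking(w, d, f0, g, X, y, step=1.0, shrink=0.5, c=1e-4):
    """Armijo backtracking: shrink the step until the loss drops enough."""
    slope = g @ d  # directional derivative, negative for a descent direction
    while step > 1e-12:
        f_new, _ = loss_and_grad(w + step * d, X, y)
        if f_new <= f0 + c * step * slope:
            break
        step *= shrink
    return step

def cg_minimize(X, y, iters=200, tol=1e-10):
    """Nonlinear CG (Polak-Ribiere+) with a line search on the whole batch."""
    w = np.zeros(X.shape[1])
    f, g = loss_and_grad(w, X, y)
    d = -g  # first direction is plain steepest descent
    for _ in range(iters):
        if g @ g < tol:
            break
        if g @ d >= 0:  # not a descent direction, so restart from -g
            d = -g
        step = backtracking(w, d, f, g, X, y)
        w = w + step * d
        f, g_new = loss_and_grad(w, X, y)
        # Polak-Ribiere+ coefficient; clipping at zero acts as an automatic restart
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))
        d = -g_new + beta * d
        g = g_new
    return w, f

# Toy usage on noise-free synthetic data (assumed setup, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true
w, final_loss = cg_minimize(X, y)
```

The line search only makes sense here because the loss is evaluated on the whole batch, so the function being searched along does not change between evaluations.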
And even a small LeNet (I think it was the first convolutional network to get a good score on MNIST) is orders of magnitude bigger than the KANs in the original paper. And it will stay bigger, if we believe the scaling claims from the KAN paper.