If I understand the paper (which is questionable), that's what the author is aiming for.
I.e., he's saying:
1) We can make these amazing black boxes
2) We don't really understand them
3) But when we make them with gradient descent they end up being almost kernel machines
4) We know a lot about kernel machines, so we can use that to "remove some uncertainty"
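
To make point 3 concrete, here is a rough sketch of the one case where I'm sure the correspondence is exact: a plain linear model trained by gradient descent from zero weights. It is my own toy illustration, not code from the paper, and the names (`train_linear_gd`, `kernel_predict`, the synthetic data) are made up for the example. As I understand it, the paper's claim for nonlinear models is only approximate and uses a more involved kernel, so treat this as the simplest instance of the idea.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_linear_gd(X, y, lr=0.01, steps=500):
    """Gradient descent on 0.5 * sum((w . x_i - y_i)^2), starting from w = 0.

    Alongside the weights w, track one coefficient a_i per training example:
    the accumulated (negative, scaled) loss derivative for that example.
    """
    d = len(X[0])
    w = [0.0] * d
    a = [0.0] * len(X)
    for _ in range(steps):
        # Residuals are computed from the weights at the start of the step.
        residuals = [dot(w, xi) - yi for xi, yi in zip(X, y)]
        for j in range(d):
            w[j] -= lr * sum(r * xi[j] for r, xi in zip(residuals, X))
        for i, r in enumerate(residuals):
            a[i] -= lr * r
    return w, a

def kernel_predict(a, X_train, x):
    # Kernel-machine form: sum_i a_i * K(x_i, x), with K the plain dot product.
    return sum(ai * dot(xi, x) for ai, xi in zip(a, X_train))

# Synthetic data, purely for illustration.
random.seed(0)
true_w = [1.0, -2.0, 0.5]
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(20)]
y = [dot(true_w, xi) + 0.1 * random.gauss(0, 1) for xi in X]

w, a = train_linear_gd(X, y)
x_new = [random.gauss(0, 1) for _ in range(3)]

direct = dot(w, x_new)                    # the "black box" prediction
via_kernel = kernel_predict(a, X, x_new)  # same prediction, kernel-machine form
print(direct, via_kernel)
```

The two numbers agree up to floating-point rounding, because every gradient step adds a weighted sum of training inputs to `w`, so `w` always equals `sum_i a_i * x_i`. The learned model never "forgets" the training examples; it just stores them implicitly in the weights.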