I think that's perfectly said. Humans are prone to the same thing, but we've developed better coping mechanisms.
In ML, techniques to avoid overfitting or reporting spurious relationships are a first-order, 101 topic, especially among the type of ML engineer a hedge fund might hire (they are not hiring data science hacks).
On the flip side, I worked in a quant finance firm before that mostly did factor investing with some twists, and the overall statistical rigor was embarrassing. Even with simple regressions, nobody was asking basic robustness questions, p-hacking was daily life, directly comparing t-stats from different univariate model fits was considered “advanced feature selection.”
If a firm is goimg to do bad stats, they don’t need machine learning for that.
Is this partly a problem with interpretation? Let's say I do a binary (supervised) classification with an algorithm that is also capable of assessing probabilities. If I generate a data set consisting of a randomized bag of words, and randomly assign them to 0 and 1 categories, and run it through a supervised ML classifier, then yeah, everything in the test set will get assigned to something.
But if you look at the probability estimates resulting from the ML, you'd almost certainly see something that indicates a high degree of randomness in the assignments (various techniques such as cross validation, or probabilities that indicate a high degree of uncertainty for almost all of the predictions).
I'm not sure this is a problem with the algorithm itself, because the output from many of these algorithms does indicate low predictive value.
Spoiler: the neutral net thinks it's doing a really good job!
The paper certainly does appear to address the question of categorizing completely randomized input:
From the abstract
"...our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a ran- dom labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by com- pletely unstructured random noise.
As mlthoughts pointed out in a different comment, any kind of regression technique faces issues about goodness of fit. The thing is, there are techniques to show you that the fit isn't very good. A simple linear regression will fit randomized noise, but there are outputs that can show you that the fit isn't good and the regression may not be reliable.
The question I have here is whether ML techniques are failing in a different way, that it is fitting to randomized noise while appearing by various tests to be a very strong fit. If they're failing the same way that regression would (i.e.., someone applies it and fails to do basic tests for goodness of fit), that's a problem I suppose, but is it really a unique failing of ML or neural nets? It sounds like more like a standard misapplication of predictive modeling...
CNN's exploit spatial locality and LSTMs exploit temporal locality. The SOTA models are architect-ed with even stronger assumptions about the nature of the task. Methods like Neural Networks, Random Forests and SVMs when used as unconstrained universal function approximators for unstructured data only learn some non-linear polynomial/ exponential/logarithmic combination of data itself, without much nuance.
It is critical to help a model out by constraining the space of models it searches over to find the right answer. I think, unless we figure out a way to constrain architectures to exploit specific traits of task they are trying to solve, (universal function approximator type) ML won't succeed in the same way that it has in vision / language.
As it of now, the alternative is to use PGMs where the model is fully interpretable as a graph structured combination of explicitly parameterized random variables. PGMs work well with low data and give really good uncertainty estimates, to evaluate the quality of a model. PGMs of course suffer from the problem where they are excruciatingly slow for large datasets and require require a decent amount of prior knowledge about the problem to explicitly define the type of graph structure / random variables we are going to be using.
I think ML is most certainly capable of solving this problem, but the community is probably waiting for another break through along the lines of AlexNet/LSTMs before that it the case.