Seeing as this would be easy to do, I imagine that if it were at all plausible, given what they know, that it is getting information from anything other than the x-ray scan, they would already have tried this?
I do wonder how good a predictor something would be if it just went off the average brightness of the image. Probably very bad, but maybe better than chance? Well, better than chance on the training set is to be expected; the question, I guess, is whether it would be better than chance on the test or validation set. (I'm not confident in my understanding of the distinction between testing set and validation set. Is the idea that if you are using the score on the testing set to decide when to stop training, and maybe which hyperparameters to use, and other things that determine which model you pick, then you only try the model on the validation set once you have decided on your final version of the model?)
It's confusing, not least because people refer to "testing" when they mean "validation".
So, suppose you have a dataset, let's call it D, and it doesn't matter what's in it other than "instances". To train a classifier you start by creating two partitions of D: a training partition (the "training set"), and a testing partition (the "testing set"). We'll denote them by T₁ for the training set and T₂ for the testing set.
It's typical to use most of D as a training set, for example you may choose 80% of D to be T₁ and 20% to be T₂. Obviously T₁ ∩ T₂ = ∅ and T₁ ∪ T₂ = D.
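A minimal sketch of that 80/20 split, in plain Python. The function name `train_test_split`, the `seed` parameter and the stand-in dataset of 100 integers are made up for illustration and are not from any particular library.

```python
import random

def train_test_split(instances, train_fraction=0.8, seed=0):
    """Shuffle a copy of D and cut it into disjoint T1 (training) and T2 (testing)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

D = list(range(100))            # stand-in "instances"; real data would go here
T1, T2 = train_test_split(D)

# T1 and T2 are disjoint and together cover all of D.
print(len(T1), len(T2))
print(set(T1) & set(T2) == set(), set(T1) | set(T2) == set(D))
```

Shuffling before cutting matters if D is stored in some meaningful order (by date, by class, by hospital), otherwise T₂ can end up systematically different from T₁.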
Now, because T₁ is four times the size of T₂ it's very likely that when you test your classifier on T₂, it will appear much better than it is, just because most of the instances in T₁ aren't represented (by similar instances) in T₂. This is called overfitting to the training set. One way to mitigate it is to perform cross-validation, the most common type of which is k-fold cross-validation.
In k-fold cross-validation, you further partition T₁ into k partitions, or "folds", and then hold out each i'th partition, for i ∈ [1,k], use the remaining k-1 partitions as a training set and test on the i'th held-out partition _during training_. So you train your classifier on partitions 1 ... k except i, test it on partition i, and repeat this process for all i, recording the performance (accuracy, F1, ROC AUC, etc., whatever your metric is). Then you choose the model that performed the best on your chosen metric.
And then you test it on T₂.
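Here is a rough sketch of that loop, taken fairly literally: train on k-1 folds of T₁, score on the held-out fold, keep the best-scoring model, then touch T₂ exactly once. Everything concrete in it is an assumption for illustration: the synthetic one-feature data, the toy "classifier" (a threshold halfway between the class means), and the helper names `k_folds`, `train` and `accuracy`. Retraining on all of T₁ before the final test is another common variant that is not shown here.

```python
import random

def k_folds(instances, k):
    """Split instances into k disjoint folds of near-equal size."""
    return [instances[i::k] for i in range(k)]

def train(rows):
    """Toy classifier: a threshold halfway between the two class means."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, rows):
    """Fraction of rows where 'x > threshold' agrees with the label."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

# Synthetic stand-in for D: 50 instances per class, one noisy feature each.
rng = random.Random(0)
D = [(rng.gauss(2.0 * y, 1.0), y) for y in [0, 1] * 50]
rng.shuffle(D)
T1, T2 = D[:80], D[80:]          # 80/20 split as above

k = 5
folds = k_folds(T1, k)
results = []                     # (validation score, model) for each fold
for i in range(k):
    validation_set = folds[i]    # the held-out i'th fold
    training_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
    model = train(training_rows)
    results.append((accuracy(model, validation_set), model))

fold_scores = [score for score, _ in results]
best_score, best_model = max(results)     # best on the validation folds
test_score = accuracy(best_model, T2)     # T2 is used once, at the very end

print("validation scores per fold:", fold_scores)
print("score on T2:", test_score)
```

Note that T₂ never appears inside the loop; only the folds of T₁ are used to compare models.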
To avoid confusion between the k folds of T₁ that you use for testing your training models during cross-validation, on the one hand, and T₂, that you use for testing the model that performed best on cross-validation, on the other hand, we call the testing process performed on the k folds "validation" and each i'th subset of T₁ used for validation a "validation set". And we just call T₂ the "testing set".
The confusion arises because we do actually _test_ on sub-sets of T₁. But T₂ is always the "testing set" and it's never "seen" during training.
As to hyperparameter tuning, this is done _on the testing set_, i.e. T₂. This is A Very Bad Thing™ but there you go. Once you train a classifier and find out that it sucks on T₂, what do you do? Well, you tune the classifier's hyperparameters. Or do a grid search to automate the process. So eventually you overfit your classifier to the test set, because you now essentially have no "unseen" data instances in T₂: the classifier didn't see the instances in T₂ during training but the trainer did, or, worse, the grid search did, and the classifier's hyperparameters were tuned according to that knowledge. How to avoid that is a big question, but anyway that's what is done in practice, and the reason for that is that when you do Big Data, you end up needing so much data that despite having terabytes of it, you never have enough.
That... that doesn't influence one of the presumed ways the NN categorizes images: the trend in bone geometry. The "blobs", while fuzzy, still largely retain the relative proportions to each other. Or, in other words, proportions of image elements are invariant for operations of scaling and of blurring.
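A tiny illustration of the scaling half of that claim, using a made-up one-dimensional "scan line" with two blobs. The helper names `run_lengths` and `upscale` are hypothetical, and blurring is not covered by this sketch.

```python
def run_lengths(row):
    """Lengths of consecutive runs of nonzero pixels in a 1-D row."""
    lengths, current = [], 0
    for pixel in row:
        if pixel:
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths

def upscale(row, factor):
    """Nearest-neighbour upscaling: repeat every pixel `factor` times."""
    return [pixel for pixel in row for _ in range(factor)]

# Two "blobs", 6 and 3 pixels wide, separated by background.
row = [0] * 2 + [1] * 6 + [0] * 4 + [1] * 3 + [0] * 2

small = run_lengths(row)               # [6, 3]
large = run_lengths(upscale(row, 3))   # [18, 9]

# Absolute widths change with scale, their ratio does not.
print(small[0] / small[1], large[0] / large[1])   # 2.0 2.0
```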