What do you mean, not working? That the AI was randomly choosing the correct race 82% of the time by luck?
I'm confused by what you're implying. It seems to me that the authors went through many steps to pinpoint how the AI was making this identification, and that everyone was baffled that even with most of the x-ray information removed (8x8 pixels compared to, say, 4k), it was somehow still correctly picking the race.
What would this "something else entirely" that you are implying actually be?
No; as with the article I linked elsewhere in the thread (https://techcrunch.com/2018/12/31/this-clever-ai-hid-data-fr...), I mean that the AI might have found some other indicator: filenames in the data set, metadata in the images that included the patient name, differences in the length of the patient name (often redacted by black rectangles in x-rays in training data), or any number of other factors.
This happens all the time in science. As another recent example of "whoops, turned out we were measuring the wrong thing", https://en.wikipedia.org/wiki/Faster-than-light_neutrino_ano...
Another example around AI: https://www.vox.com/recode/2019/12/12/20993665/artificial-in...
> One such résumé-screening tool identified being named Jared and having played lacrosse in high school as the best predictors of job performance, as Quartz reported.
Are lacrosse players naturally better workers? Probably not. Are they whiter, wealthier, better networked, etc. than the general population? Probably. These sorts of things - as with the 8x8 pixel example - start to point to confounding variables that need to be worked out and accounted for.
The paper quite explicitly goes into testing and dissecting what exactly the AI detects. Two observations:
- the classification clearly was primarily based on the visual content rather than spurious metadata, because various transformations of the visual content had the expected impact on classification correctness
- the classification clearly wasn't based on one specific feature of the visual content but rather on multiple factors in the visuals, because various transformations to features (including masking out specific features like bone density) produced results matching expectations (usually gradual decrease in accuracy, with some thresholds).
Conversely, if the classification were primarily based on factors other than the visual content, the visual transformations would have had a negligible effect, possibly up to a threshold, and would then throw the AI completely off.
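To make the degradation argument concrete, here is a toy sketch in Python. It is not the paper's method or data: the synthetic images, the low-frequency class pattern, the nearest-centroid classifier and every name in it are assumptions made up for illustration. It only shows the shape of the test: degrade the visual content step by step and watch how accuracy responds.

```python
import numpy as np

def block_average(images, factor):
    """Downsample square images by averaging non-overlapping factor x factor blocks."""
    n, h, w = images.shape
    return images.reshape(n, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def make_data(rng, n, size=32, signal=0.15):
    """Synthetic stand-in data: a weak class-dependent low-frequency pattern in heavy noise."""
    labels = rng.integers(0, 2, n)
    yy, xx = np.mgrid[0:size, 0:size] / size
    pattern = np.sin(2 * np.pi * xx) * np.cos(2 * np.pi * yy)
    sign = np.where(labels == 1, 1.0, -1.0)[:, None, None]
    images = signal * sign * pattern + rng.normal(0.0, 1.0, (n, size, size))
    return images, labels

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit one mean image per class, then classify test images by the closest mean."""
    flat_train = train_x.reshape(len(train_x), -1)
    flat_test = test_x.reshape(len(test_x), -1)
    centroids = np.stack([flat_train[train_y == c].mean(axis=0) for c in (0, 1)])
    dists = ((flat_test[:, None, :] - centroids[None]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == test_y).mean())

rng = np.random.default_rng(0)
train_x, train_y = make_data(rng, 2000)
test_x, test_y = make_data(rng, 1000)

# factor 1 keeps 32x32, factor 4 gives 8x8, factor 32 collapses to a single pixel
accuracy_by_factor = {}
for factor in (1, 4, 32):
    accuracy_by_factor[factor] = nearest_centroid_accuracy(
        block_average(train_x, factor), train_y,
        block_average(test_x, factor), test_y,
    )
    print(f"downsample x{factor}: accuracy {accuracy_by_factor[factor]:.3f}")
```

In this toy setup the signal is spread across the whole image at low spatial frequency, so accuracy barely moves at 8x8 and only drops to chance once the image is a single averaged pixel. A classifier keyed on something outside the pixels would not react to the downsampling at all.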
The same may be true here, and I think it's the most likely explanation.
I'd be interested in whether the same model can be trained to predict patient wealth, hair color, style of clothing, religion, etc. from the same x-ray data sets.
Trained neural networks cannot tell us why they give an answer, and the best tool we have to explore that is to wiggle the inputs and see how the black box responds. That is a major concern for the whole space. Figuring out how to tag data with enough information to generate a "why" was an active area of research ten years ago and still is.
https://www.darpa.mil/program/explainable-artificial-intelli...
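For anyone who hasn't seen "wiggle the inputs" in practice, a minimal occlusion-sensitivity sketch in Python looks roughly like this. The names occlusion_map and toy_black_box are hypothetical, and the "model" is a made-up stand-in that secretly reads only one corner of the image (think of a burned-in label); a real network would be plugged in as the predict function instead.

```python
import numpy as np

def occlusion_map(predict, image, patch=4, fill=0.0):
    """Blank out one patch at a time and record how much the model's score drops."""
    base = predict(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            heat[i // patch, j // patch] = base - predict(occluded)
    return heat

def toy_black_box(image):
    # Hypothetical stand-in for a trained model: it only looks at the top-left 4x4 corner.
    return float(image[:4, :4].mean())

image = np.ones((16, 16))
heat = occlusion_map(toy_black_box, image)

# The patch whose removal hurts the score most is where the model is "looking".
hottest = np.unravel_index(heat.argmax(), heat.shape)
print("most influential patch (row, col):", hottest)
```

The map tells you where the score is sensitive, not why, which is exactly the limitation being complained about.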
Perhaps hospitals that treat a disproportionate share of poor people (who themselves are disproportionately not white) tend to use a different brand of X-ray film, and that brand has different contrast ratios than the brand preferred by rich hospitals. Thus, they'd be detecting the different brand of X-ray film rather than anything about the patients themselves.
Of course, at this level it's still hard to imagine generating that 82% hit rate. But maybe there are multiple factors along these lines.
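A quick simulation of that scenario in Python, with every number invented for illustration (two hypothetical hospitals, an assumed 70/30 vs 30/70 demographic mix, and an image contrast that depends only on the equipment):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Two hypothetical hospitals with different (assumed) demographic mixes.
hospital = rng.integers(0, 2, n)
p_group = np.where(hospital == 1, 0.7, 0.3)
group = (rng.random(n) < p_group).astype(int)

# Image contrast depends only on the hospital's equipment, never on the patient.
contrast = np.where(hospital == 1, 1.2, 1.0) + rng.normal(0.0, 0.05, n)

# A "classifier" that thresholds contrast is really just guessing the hospital.
predicted = (contrast > 1.1).astype(int)
accuracy = float((predicted == group).mean())

# Within a single hospital the contrast carries no information about the patient.
within_corr = float(np.corrcoef(contrast[hospital == 1], group[hospital == 1])[0, 1])

print(f"accuracy from equipment alone: {accuracy:.3f}")
print(f"contrast/group correlation inside one hospital: {within_corr:.3f}")
```

With these made-up numbers the proxy tops out near the 70% demographic skew, well short of 82%, and it disappears entirely once you look inside one hospital. Stratifying by site or vendor is the obvious check for this kind of leak.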
Most of us radiology folk abandoned film 20 years ago and went to digital systems (CR or DR). This doesn’t negate your query though, as vendors do have different technologies and their images do not look the same.