For example, there have been plenty of issues like racial bias in the computer vision algorithms that police use, which is effectively data science on pictures. But nobody knows why the issue occurs, and nobody can fix it specifically without risking breaking a thousand unrelated things.
Sure we do. Photography itself has a racial bias. [0] Different skin tones produce different levels of detail, and it has been an uphill battle to be able to capture those details since the advent of photography.
So long as facial recognition relies on photography, and photography is flawed, every dataset is biased. That in turn exacerbates the bias of the AI, which already has its own set of problems stemming from how those datasets are assembled.
[0] https://www.nytimes.com/2019/04/25/lens/sarah-lewis-racial-b...
They do know why that occurs: the data set is biased.
edit: And a statistical analysis isn't some sort of magic data genie. Statistics can give rigorous results because it makes strong assumptions; if those assumptions don't hold, the results aren't rigorous anymore. A trillion-parameter model can pull interactions out of your data that almost no statistical analysis of the data would identify ahead of time. So what you need to analyze is the model: try to infer why it's predicting certain results, and then work backwards from there.
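To make "analyze the model and work backwards" concrete, here's a minimal sketch with entirely made-up toy data: the training set over-represents positives for one group, and auditing the model's *predictions* per group (rather than eyeballing the raw data) is what surfaces the skew. The groups, counts, and the threshold "model" are all hypothetical illustrations, not anyone's real system.

```python
# Toy illustration (hypothetical data): audit a model's outputs per group
# to work backwards to the bias in the training set.
from collections import Counter

# Fake training data as (group, label) pairs. The sampling imbalance is
# the injected bias: group "b" has 4x the positive rate of group "a".
train = [("a", 0)] * 90 + [("a", 1)] * 10 + [("b", 0)] * 60 + [("b", 1)] * 40

# A deliberately dumb "model": flag anyone whose group base rate > 0.25.
base_rate = {
    g: sum(y for gg, y in train if gg == g) / sum(1 for gg, _ in train if gg == g)
    for g in {"a", "b"}
}
model = lambda g: int(base_rate[g] > 0.25)

# The audit step: look at what the model predicts for each group.
preds = Counter((g, model(g)) for g, _ in train)
print(base_rate)                 # {'a': 0.1, 'b': 0.4}
print(model("a"), model("b"))    # 0 1 -- every member of group "b" is flagged
```

The point of the sketch is the order of operations: the disparity shows up first in the per-group predictions, and only then do you trace it back to the sampling imbalance in `train`.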
Second: even if you do, you're going to have a hard time controlling for the fact that the police and the criminal justice system have a long history of disproportionately enforcing the law against people of color. The base data about who commits crimes, gets convicted, etc. for well over a hundred years is going to reflect this bias.
I'm not claiming to have done my homework here either, same disclaimer applies. I suspect you could find somebody who does study this if you wanted to look.
All of the above assumes we're discussing the US, btw.
Too bad I wasn't a data scientist, or else I could have gotten a passing grade by claiming the questions were chosen from a biased data set, or retaken the exam until the data set matched the questions I studied for, at which point the data set would no longer be biased, lol.
Funny line of work, this data 'science', where you only use the results that fit the narrative you wanted in the first place.
We're in full doublethink mode, just keep repeating data 'science', 'science', 'science'. :)