Humans have a myriad of visible characteristics: height, weight, skin color, eye color and shape, hair color and type, shapes of facial features, and so on.
All these characteristics vary continuously.
If you picture that data as points filling a high-dimensional cube, there isn't going to be any empty space in it.
Some areas are going to be denser than others, but there are no gaps.
What you're trying to do with "race" is cluster this data.
But there really is only one cluster. Might as well call rand() a million times to get a bunch of points in 0..1, and cluster that.
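
To make the rand() point concrete, here is a rough sketch in Python. It uses 20,000 points instead of a million to keep it quick, and a hand-rolled one-dimensional k-means (the helper names `kmeans_1d` and `boundaries` are just made up for this example). The assumption is that uniform random numbers stand in for "one blob with no gaps".

```python
import random

random.seed(0)
points = [random.random() for _ in range(20000)]  # uniform in 0..1: one blob

# Histogram with 10 equal bins: every bin ends up about equally full.
bins = [0] * 10
for x in points:
    bins[min(int(x * 10), 9)] += 1

def kmeans_1d(data, k, iterations=10):
    """Plain Lloyd's k-means in one dimension; centers start evenly spaced."""
    centers = [(i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        sums, counts = [0.0] * k, [0] * k
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            sums[nearest] += x
            counts[nearest] += 1
        # Move each center to the mean of its points (keep it if it got none).
        centers = [sums[j] / counts[j] if counts[j] else centers[j] for j in range(k)]
    return sorted(centers)

def boundaries(centers):
    """Cut points halfway between neighboring centers."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]

cuts_k2 = boundaries(kmeans_1d(points, 2))
cuts_k3 = boundaries(kmeans_1d(points, 3))

print("points per bin:", bins)
print("cut when asked for 2 clusters:", [round(c, 2) for c in cuts_k2])
print("cuts when asked for 3 clusters:", [round(c, 2) for c in cuts_k3])
```

The algorithm happily returns as many "clusters" as you ask for, and the cuts land wherever the requested count puts them (about the middle for two, about the thirds for three). Nothing in the data picks the number; you do.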
Oh, but you've seen black people! And white people!
Well yeah, but it's all those people filling the spaces in between any two points that make it impossible to draw the line.
The only way to draw the line is to make a call on where to draw it, that is, to make an arbitrary choice. Without that choice (a cutoff, or the number of clusters you ask for), a clustering algorithm has no basis for splitting the data.
Yes, there are high-density peaks in this data, especially if you look at any single characteristic.
Yes, you can separate the peaks. But deciding where to put the threshold is a choice, a social construct, and it can leave a lot of points without a "race" label (which race is Irish-Mexican?) and/or change which peaks make the cutoff (are Armenians a race, or noise in the dataset?).
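
Here is a small sketch of how much the threshold decides. Everything in it is invented for illustration: a single made-up trait on a 0..10 scale, three overlapping bumps of different heights (`PEAKS`, `SPREAD`), and a hypothetical helper `label_by_threshold` that counts a "group" as any unbroken stretch where the density is at or above the cutoff.

```python
import math

# Hypothetical trait density: three overlapping bumps, never dropping to zero.
PEAKS = [(2.0, 0.50), (5.0, 0.15), (8.0, 0.35)]  # (center, weight), made up
SPREAD = 0.8                                     # same width for every bump

def density(x):
    norm = SPREAD * math.sqrt(2 * math.pi)
    return sum(w * math.exp(-((x - c) ** 2) / (2 * SPREAD ** 2)) / norm
               for c, w in PEAKS)

STEP = 0.05
grid = [(i + 0.5) * STEP for i in range(200)]    # bin centers covering 0..10
heights = [density(x) for x in grid]
total_mass = sum(heights) * STEP

def label_by_threshold(threshold):
    """Return (number of groups, share of people left without a label)."""
    groups, unlabeled_mass, inside = 0, 0.0, False
    for h in heights:
        if h >= threshold:
            if not inside:
                groups += 1          # a new stretch above the cutoff starts
            inside = True
        else:
            unlabeled_mass += h * STEP
            inside = False
    return groups, unlabeled_mass / total_mass

thresholds = [0.03, 0.06, 0.10, 0.20]
results = [label_by_threshold(t) for t in thresholds]
for t, (groups, unlabeled) in zip(thresholds, results):
    print(f"threshold {t:.2f}: {groups} group(s), {unlabeled:.0%} unlabeled")
```

With these made-up numbers the group count goes 1, 3, 2, 1 as the cutoff rises, and the unlabeled share grows each time. Same data, four different answers to "how many groups are there?".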
>Maybe if diseases are distributed differently across different races, it can make testing and treating them more cost effective
The scientists have two options:
A) Look at the original data that was used to assign the race label (skin color, hair type, etc.), and see if any of it correlates with diseases.
B) Look at the data, cluster it using an arbitrary choice to be able to get more than one cluster, ignore a lot of people below the threshold, assign the labels, IGNORE THE RAW DATA, and then look for correlations between labels and diseases.
Which approach do you think is more scientific?
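
A toy comparison of the two options, on fully synthetic data. The big assumption, stated up front: the outcome here depends on the raw trait itself plus noise, because that is the scenario being argued about; the cutoffs `LOW_CUT` and `HIGH_CUT` and the label names are arbitrary, and none of this is real medical data.

```python
import random

random.seed(1)

# Made-up people: one continuous trait in 0..1 and an outcome that,
# BY ASSUMPTION, follows the trait plus some noise.
people = []
for _ in range(20000):
    trait = random.random()
    people.append((trait, trait + random.gauss(0, 0.1)))

def mean(values):
    return sum(values) / len(values)

# Option B: threshold the trait into labels, drop everyone in between.
LOW_CUT, HIGH_CUT = 0.4, 0.6  # arbitrary cutoffs

def label(trait):
    if trait < LOW_CUT:
        return "low"
    if trait > HIGH_CUT:
        return "high"
    return None  # no label for the middle

labeled = [(label(t), t, y) for t, y in people if label(t) is not None]
unlabeled_share = 1 - len(labeled) / len(people)
group_mean = {name: mean([y for l, _, y in labeled if l == name])
              for name in ("low", "high")}
mse_labels = mean([(y - group_mean[l]) ** 2 for l, _, y in labeled])

# Option A: fit a least-squares line on the raw trait, using everyone.
mx = mean([t for t, _ in people])
my = mean([y for _, y in people])
slope = (sum((t - mx) * (y - my) for t, y in people)
         / sum((t - mx) ** 2 for t, _ in people))
intercept = my - slope * mx
# Score it on the same labeled people so the comparison is like for like.
mse_raw = mean([(y - (intercept + slope * t)) ** 2 for _, t, y in labeled])

print(f"people left without a label: {unlabeled_share:.0%}")
print(f"error using labels only:     {mse_labels:.4f}")
print(f"error using the raw trait:   {mse_raw:.4f}")
```

Under that assumption, the label-only estimate is noisier even for the people who got a label, and it says nothing at all about the ones who didn't. The raw-data fit covers everyone.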