"A 2000 study found that 87 percent of the U.S. population can be identified using a combination of their gender, birthdate and zip code."
https://en.wikipedia.org/wiki/Data_re-identification
UK postcodes are more specific I believe.
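The quasi-identifier point in that quote can be made concrete with a toy sketch (hypothetical records, standard library only): count how many records are unique on the (gender, birthdate, zip) combination. The size of the smallest group is also the k in k-anonymity.

```python
from collections import Counter

# Toy records: gender, birthdate, zip are "quasi-identifiers" --
# none identifies anyone alone, but together they often single people out.
records = [
    ("F", "1984-03-12", "90210"),
    ("M", "1984-03-12", "90210"),
    ("F", "1991-07-04", "10001"),
    ("F", "1991-07-04", "10001"),
    ("M", "1962-11-30", "73301"),
]

counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
print(f"{len(unique)} of {len(records)} records are unique on these three fields")

# k-anonymity: the dataset is k-anonymous for k = the smallest group size.
k = min(counts.values())
print(f"dataset is {k}-anonymous")
```

With k = 1 anywhere in the dataset, at least one person is fully singled out by these three fields alone, which is exactly the mechanism behind the 87 percent figure.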
"Data that directly identifies patients will be replaced with unique codes in the new data set, but the NHS will hold the keys to unlock the codes “in certain circumstances, and where there is a valid legal reason”, according to its website. "
Does anyone actually know what that means? I wouldn't know, looking at a medical record, how much data would need to be removed to make it anonymous. It likely depends on the record. And there are different answers that could each be right (so which one are they using?).
Edit: as an example, I was in a car accident at 17 and broke my jaw. If you Google my name and the location you'll see the news article. I was the only person treated at the local hospital for a broken jaw that day. You just de-anonymised me.
Or go again: I'm the only person from my town who went to my university in the years I attended. Just look for someone treated in <home area> outside term time and at <uni medical clinic> during term time, 2002-2008.
Giving access to a nation's healthcare data for statistical and ML uses could speed up the development of ML diagnostics enormously.
Once you scrub the source data to remove birth dates, report creation dates and zip codes, it should be sufficiently anonymized not to be traceable back to any individual. We could enable some level of differential privacy on top as well.
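For anyone unfamiliar with the differential-privacy idea mentioned above, here's a minimal sketch of its simplest form, the Laplace mechanism: a counting query has sensitivity 1 (one person can change the count by at most 1), so adding Laplace noise with scale 1/epsilon makes the released count epsilon-differentially private. The cohort query is made up for illustration; the noise is sampled via the Laplace inverse CDF.

```python
import math
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon suffices (smaller epsilon = more privacy = more noise).
    Noise is drawn via the Laplace inverse CDF from u ~ Uniform(-0.5, 0.5).
    """
    u = random.random() - 0.5
    scale = 1.0 / epsilon
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(0)
# Hypothetical query: how many patients in a cohort have condition X?
noisy = dp_count(42, epsilon=0.5)
print(round(noisy, 2))
```

The noisy answers are unbiased, so aggregate statistics stay usable while any single patient's presence in the data is masked; that's the trade this comment is gesturing at.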
ML's two big leaps of the last decade, 2012's CNNs and 2017-18's pre-trained transformers, both came off the back of a leap in data availability (ImageNet for CNNs, and scraping the entire internet for BERT).
Individual hospitals and the startups they bankroll have their own in-house ML teams, but closed data and an unwillingness to disseminate have made the field move at a snail's pace. Additionally, generalizability of any kind won't be achieved until the data is scaled up past small geographic pockets and patient sets; this is especially true in medicine, which has a long-tail problem. Lastly, aggregating data lends a natural anonymity to each person whose data is shared within the dataset.
IMO, disease diagnostics is one of the most natural framings for a problem in ML. It's a purely technical trade where data and decisions have a degree of exactness, and concepts like conditional probability are a natural fit. The only problem is that the pipeline is still largely analog. This means the data collected about a doctor's diagnostic process comes out incomplete, and privacy protections keep it at a scale small enough to make ML difficult.
"Data that directly identifies patients will be replaced with unique codes in the new data set, but the NHS will hold the keys to unlock the codes “in certain circumstances, and where there is a valid legal reason”, according to its website."
It's hard to compare NLP (such as pretrained transformer models) to medical ML, because misdiagnosis has real and potentially fatal consequences. The focus should be on small-scale, explainable ML, not brute-forcing patterns across large populations (which is more useful to insurance companies than to clinicians). FWIW, I'm a massive fan of the potential of CV in diagnosis and in spotting abnormalities early, but I think the proposed opening up of data is absolutely the wrong way to drive innovation in this field.
This link is also relevant for people registered with the NHS: https://www.nhs.uk/your-nhs-data-matters/manage-your-choice/
Whilst legally permissible, making highly sensitive information opt-out rather than opt-in is detestable, and shows the GDPR doesn't go far enough.
A case in point: some dentists in our area are now asking for a comprehensive medical questionnaire (far more than just dental history or medical conditions that might affect appropriate dental treatment) to be completed for any new patient, and then emailed to them. There's not even a pretence of acceptable security and privacy protections. With some other dentists, they ask you to use an online system to send them similar information, and that system is run by a commercial entity not based in this country.
Given how badly the figures right at the top of the health service and the members of the Government who are responsible for it dropped the ball when it came to privacy and COVID apps, it's hard to have much faith in them to properly operate centralised systems that hold substantial information about everyone for very generic-sounding purposes like "planning" and "research".
The fact that this particular opt-out can be completed easily online by adults yet requires a parent or guardian to jump through hoops involving filling in PDF forms in order to opt out of sharing sensitive data about a child says a lot about the level of ethics involved here, and none of it is good.
As a final observation, nothing about opting out of this kind of generic, large-scale system precludes participating in legitimate research conducted with appropriate safeguards and ethical standards. Doctors in a certain field may be working with a research group to investigate a particular condition in their field and its treatment, and can forward an invitation to any of their patients who might have that condition explaining the research and asking if the patient would be willing to participate.

I've seen one of these, and the information provided was very clear about exactly what data would be shared, what it would be used for, who would have access to it and with what safeguards to prevent unauthorised access, and the arrangements for destroying it after the research had been completed -- basically everything you'd hope a responsible organisation doing legitimate medical research would be careful about. So the kind of useful research where someone privacy-conscious might still choose to participate for the greater good isn't necessarily undermined by opting out of generic data-sharing consents.
Citation needed (sorry).
It should be noted that the NHS holds one of the largest, longest-running, most standardised medical record sets in the world. This is because it covers the whole (67m person) country, and because the NHS is so old that its records go back further and are centralised to a degree you don't see in private-healthcare systems, federal systems, or smaller countries. That makes this of interest imho.
The NHS DBs have been used for really good medical research in the past (exactly because of the reasons above). That's fine. But it's different to just sharing the data with anyone...