This is a pretty wild leap. Code has a lot of hooks for training via hill-climbing during post-training. During post-training, you can literally set up arbitrary scenarios and give the bot more or less real feedback (actual programs, actual tests, actual compiler errors).
It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.
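To make the "real feedback" point concrete for code, here is a minimal sketch of what a verifiable reward can look like: run the candidate against tests and score pass/fail. The function name `reward_from_tests` and the 0/1 scoring are my own assumptions for illustration, not anyone's actual training setup, and a real pipeline would run this in a sandbox with time limits rather than a bare `exec`.

```python
def reward_from_tests(candidate_source: str, test_source: str):
    """Score a candidate program by running it against assertion-style tests.

    Hypothetical helper for illustration only: returns (reward, feedback),
    where reward is 1.0 on success and 0.0 on any failure, and feedback is
    the name of the error that occurred (or "ok").
    """
    namespace = {}
    try:
        # A syntax error here plays the role of an "actual compiler error".
        exec(compile(candidate_source, "<candidate>", "exec"), namespace)
        # Failing assertions play the role of "actual tests".
        exec(compile(test_source, "<tests>", "exec"), namespace)
    except Exception as err:  # SyntaxError, AssertionError, NameError, ...
        return 0.0, type(err).__name__
    return 1.0, "ok"


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5"
    print(reward_from_tests("def add(a, b):\n    return a + b", tests))
    print(reward_from_tests("def add(a, b):\n    return a - b", tests))
```

The point is that the signal is cheap, automatic, and hard to argue with. It's not obvious what plays the role of `test_source` for a diagnosis.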
The AI coding improvement should be partially transferable to other disciplines without recreating the training environment that made it possible in the first place. The model itself has learned what correct solutions "feel like", and the training process and meta-knowledge must have improved a huge amount.
ER staff are frequently making inferences based on a variety of things like the weather, what the pt is wearing, what smells are present, and a whole lot of other intangibles. Frequently the patients are just outright lying to the doctor. An AI will not pick up on any of that.
It will if it trains on data like that. It's all about the training data.
Diagnostic standards in medicine (at least emergency, but I think other specialties too) are largely a joke -- ultimately the standard is often either autopsy or "expert consensus."
We get to bill more for more serious diagnoses. The number of patients I see with a "stroke" or "heart attack" diagnosis who clearly had no such thing is truly wild.
We can be sued for tens of millions of dollars for missing a serious diagnosis, even if we know an alternative explanation is more likely.
If AI is able to beat an average doctor, it will be due to alleviating perverse incentives. But I can't imagine where we could get training data that would let it be any less of a fountain of garbage than many doctors.
Without a large amount of good training data, how could AI possibly be good at doctoring IRL?
Pt's chart is complex/wrong? Gotta ingest that into context.
Chart contains images/scanned and not OCR'd text? Gotta do an image recognition pass.
Diagnosis needs to know what the pt's wearing (i.e. radiation badge)? Gotta do an image recognition pass.
Diagnosis needs to know what the weather's like? Internet API access of some kind. Hope the WAN/API are all working! If they're not, do you fail open or closed?
Patient might be lying? Gotta do video/audio analysis to assess that likelihood--oh, and train a model that fully solves one of the holy grails of computer vision/audio analysis reliably and with a super low false-positive rate before you do. And if it guesses wrong, enjoy the incredibly easy-to-prosecute lawsuit.
Patient might be lying, but the biggest clue is e.g. smell of alcohol on their breath? Now you need some sort of olfactory sensor kit and training for it--a lot more than just "low quality body cam and a mic".
Patient's ODing on a street drug that became abundant in the last few months? Gotta somehow learn about recent local medical/police history that post-dates the training set, or else you might be pouring gas on a fire if you give them Narcan. And that's assuming you know enough to search for information about that drug, and that they didn't lie to you about what they took. Addicts never do that.
Failures in each of those systems bring down the chance of an effective diagnosis, so they need a fairly obsessive amount of model introspection/thinking/double-checking, and humans on standby as a fallback if the AI's less than confident (assuming that LLMs can be given a sense of a confidence level in the future, versus the current state of the art of "text-predict a guess about what your confidence level might be").
Put that all together, and even with the AI compute speed available years from now and a perfectly trained futuristic model that's preternaturally good at this stuff, I'm not sure that the reliability and, more importantly, the turnaround time of that diagnostic pass is going to be any good compared to a human ER doc.
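A back-of-envelope sketch of why the chained subsystems above worry me. It assumes each stage fails independently, which is a simplification, and the 0.95 figure is a made-up placeholder, not a measurement of any real system. The helper names and the 0.9 fallback threshold are also hypothetical.

```python
from math import prod

# Hypothetical stage list, taken from the scenarios above.
STAGES = [
    "chart ingestion",
    "OCR / image pass on scanned records",
    "clothing / badge recognition",
    "weather API lookup",
    "deception detection",
    "olfactory sensing",
    "recent local drug info lookup",
]


def chain_reliability(stage_reliabilities):
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stage_reliabilities)


def needs_human_fallback(confidence, threshold=0.9):
    """Hand off to a human whenever end-to-end confidence is below threshold."""
    return confidence < threshold


if __name__ == "__main__":
    # Placeholder: pretend every stage is right 95% of the time.
    overall = chain_reliability([0.95] * len(STAGES))
    print(f"end-to-end reliability: {overall:.3f}")
    print("hand off to human:", needs_human_fallback(overall))
```

Seven stages at 95% each multiply out to roughly 0.70 end to end, so under these assumptions the human on standby gets called nearly every time anyway.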
I suspect even prose is largely considered acceptable in professional uses because we haven’t developed a sensitivity to the artifice, and we probably won’t catch up to the LLMs in that arms race for a bit. However, we always manage to develop a distaste for cheap imitations and relegate them to somewhere between the ‘utilitarian ick’ and ‘trashy guilty pleasure’ bins of our cultures, and I predict this will be the same. The cultural response is already bending in that direction, and AI writing in the wild (the only part that culturally matters) sounds the same to me as it did a year and a half ago. I think they’re prairie dogging, but when(/if) they drop that bomb is entirely a matter of product development. You can’t un-drop a bomb and it will take a long time to regain status as a serious tool once society deems it gauche.
The assumption that LLMs figuring out coding means they can figure out anything is a classic case of Engineer’s Disease. Unfortunately, this hubris seems damn near invisible to folks in the tech industry these days.
Claude can’t really write OpenSCAD, and when I was debugging some map projections code last week, it struggled a lot more than usual.