See for example this recent paper where AI managed to beat radiologists on interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And when interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
So I’m genuinely curious:
What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to. Why is it so hard for some people to understand that humans need other humans and human problems can't be solved with technology?
This is a pretty wild leap. Code has a lot of hooks for training via hill-climbing during post-training. During post-training, you can literally set up arbitrary scenarios and give the bot more or less real feedback (actual programs, actual tests, actual compiler errors).
It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.
You cannot simply put liability and ethics aside; after all, the Hippocratic Oath is fundamental to the practice of physicians.
Having said that, there are always two extremes here: those who hate AI and those who are obsessed with AI in medicine. We will be much better off if we stay in the middle, i.e. moderate on this issue.
IMHO, AI should be used as a screening and triage tool with very high sensitivity, preferably 100%; otherwise it will create a "boy who cried wolf" scenario.
With 100% sensitivity we essentially have zero false negatives, but potentially some false positives.
The false positives, however, can be checked further by a physician in the loop: for example, they can look into a suspected case of CVD with input from a specialist such as a cardiologist (or, more specifically, a cardiac electrophysiologist). This can help given the very limited number of cardiologists available globally compared to the general population with potential heart disease or CVD, and the alarmingly low accuracy (sensitivity, specificity) of conventional CVD screening and triage.
Current risk-based screening and triage for CVD, such as SCORE-2, has a sensitivity of only around 50% (2025 study) [3].
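To make the sensitivity trade-off above concrete, here is a minimal sketch. The cohort size and every count in it are made up for illustration (they are not from [3] or any other study); it only shows how sensitivity, specificity and the physician's review workload relate.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly sick patients that the screen flags: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of healthy patients that the screen clears: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical cohort: 1,000 people screened, 100 of whom truly have CVD.
# These counts are invented for illustration, not taken from any study.
screens = {
    "conventional (about 50% sensitivity)": {"tp": 50, "fn": 50, "tn": 810, "fp": 90},
    "high-sensitivity triage": {"tp": 100, "fn": 0, "tn": 630, "fp": 270},
}

for name, c in screens.items():
    referred = c["tp"] + c["fp"]  # everyone flagged goes on to a physician
    print(
        f"{name}: sensitivity={sensitivity(c['tp'], c['fn']):.0%}, "
        f"specificity={specificity(c['tn'], c['fp']):.0%}, "
        f"missed={c['fn']}, referred for review={referred}"
    )
```

With these invented counts, pushing sensitivity to 100% takes missed cases from 50 to 0 but raises physician reviews from 140 to 370, which is the workload the physician-in-the-loop step has to absorb.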
[1] Hippocratic Oath:
https://en.wikipedia.org/wiki/Hippocratic_Oath
[2] The Hippocratic Oath:
https://pmc.ncbi.nlm.nih.gov/articles/PMC9297488/
[3] Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models:
https://academic.oup.com/eurjpc/advance-article/doi/10.1093/...
You first have to assume this for software engineers. Not everyone agrees with that (note: that doesn't mean those same people think AI is not _useful_).
AIs still have a ton of issues that would be devastating in a doctor. Remember all the AIs mistakenly deleting production DBs? Now imagine they prescribed a medicine cocktail that killed the patient instead. No thanks. There's a totally different bar to the consequences of mistakes.
More importantly, LLMs regularly hallucinate, so they cannot be relied upon without an expert to check for mistakes - it will be a regular occurrence that the LLM just states something that is obviously wrong, and society will not find it acceptable that their loved ones can die because of vibe medicine.
Like with software though, they are obviously a beneficial tool if used responsibly.
No, I don’t see that we must.
> if we already have this assumption for software engineers
No, this doesn’t follow, and even if it did, while I am aware that the CEOs of firms who have an extraordinarily large vested personal and corporate financial interest in this being perceived to be the case have expressed this re: software engineers, I don’t think it is warranted there, either.
It provides no information on real world outcomes or expectations of performance in such a setting. A simple question might be "how accurate are patient electronic health records typically?"
Finally, if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot. If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.
You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything. As it is now they're a very expensive, inefficient and fragile toy.
The ability to go to prison / be stripped of a license when something goes wrong.
A single doctor will care for far fewer patients in their career than an AI system will. Even if the AI system is 10x less likely to make mistakes, the sheer number of patients will make it much more likely to make a mistake somewhere.
With a single doctor, the PR and legal fallout of a medical error is limited to that doctor. This preserves trust in the medical system. The doctor made a mistake, they were punished, they're not your doctor, so you're not affected and can still feel safe seeing whoever you're seeing. AI won't have that luxury.
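A rough back-of-the-envelope version of that scale argument, as a sketch only: the patient counts and error rates below are hypothetical, and each patient is treated as an independent trial, which is a simplification.

```python
def expected_errors(patients: int, error_rate: float) -> float:
    """Expected number of errors if each patient is an independent trial."""
    return patients * error_rate


def prob_at_least_one_error(patients: int, error_rate: float) -> float:
    """P(at least one error) = 1 - (1 - p)^n under the same independence assumption."""
    return 1.0 - (1.0 - error_rate) ** patients


# All numbers below are hypothetical, chosen only to show the shape of the argument.
DOCTOR_PATIENTS = 100_000               # patients over one doctor's career
DOCTOR_ERROR_RATE = 1e-6                # chance of a headline-grade error per patient
AI_PATIENTS = 100_000_000               # patients seen by one widely deployed system
AI_ERROR_RATE = DOCTOR_ERROR_RATE / 10  # "10x less likely to make mistakes"

for label, n, p in [
    ("single doctor", DOCTOR_PATIENTS, DOCTOR_ERROR_RATE),
    ("AI system", AI_PATIENTS, AI_ERROR_RATE),
]:
    print(
        f"{label}: expected errors={expected_errors(n, p):.2f}, "
        f"P(at least one)={prob_at_least_one_error(n, p):.4f}"
    )
```

Under these assumed numbers the per-patient risk is lower with the AI, yet the system as a whole is almost certain to produce at least one such error, while any individual doctor most likely never does.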
Do we have that assumption? I don't think there's a consensus on it yet, just various camps of people proselytizing the other camps based on how much or little they use AI.
The truth is we just don't know how things will play out right now IMV. I expect some job destruction, some jobs to remain in all fields, some jobs to change, etc. We assume it will totally destroy a job or not when in reality most fields will be somewhere in between. The mix/coefficient of these outcomes is yet to be determined and I suspect most fields will augment both AI and human in different ratios. Certain fields also have a lot of demand that can absorb this efficiency increase (e.g. I think health has a lot of unmet demand for example).
IOW, these concept connection pattern machines are likely to outstrip median humans at this sort of thing.
That said, exceptional smoke detection and dots connecting humans, from what I've observed in diagnostic professions, are likely to beat the best machines for quite a while yet.
But a doctor's job in the real world today is to navigate a total mess of uncertainty: about the expected outcome of treatments given a patient's age and other problems. About the psychological effect of knowing about a problem that they cannot effectively treat. Even about what the signals in the chart and x-ray mean with any certainty.
We are very far from having unit test suites for medical problems.
Being a human when a patient is experiencing what is potentially one of the worst moments of their life. AI could be a tool doctors use, but let’s not dehumanize health care further, it is one of the most human professions that crosses about every division you can think of.
I would not want to receive a cancer diagnosis from a fucking AI doctor.
But it's important not to rely on it. Doctors can easily recognize and correct measurements with incorrect input, e.g. ECG electrodes being used in reverse order.
Nobody said that though?
If the current trajectory continues and if advancements are made regarding automated data collection about patients and if those advancements are adopted in the clinic then presumably specialized medical models will exceed human performance at the task of diagnosis at some point in the future. Clearly that hasn't happened yet.
In this study, I think there was an MD before the AI to enrich data.
Assuming what exactly? That they write more code? Better code? Better designs? Better architecture?
Because only a few of the above assumptions are arguably true.
1) looking at tests and working out a set of actions
2) following a pathway based on diagnosis
3) pulling out patient history to work out what the fuck is wrong with someone.
Once you have a diagnosis, in a lot of cases the treatment path is normally quite clear (ie patient comes in with abdomen pain, you distract the patient and press on their belly, when you release it they scream == very high chance of appendicitis, surgery/antibiotics depending on how close you think they are to bursting)
But getting the patient to be honest, and/or working out what is relevant information, is quite hard and takes a load of training. Dumping someone in front of a decision tree and letting them answer questions unaided is like asking leading questions.
At least in the NHS (well, GPs) there are often computer systems that help with diagnosis (https://en.wikipedia.org/wiki/Differential_diagnosis), which allow you to feed in the patient's background and symptoms and ask them questions until either you have something that fits, or you need to order a test.
The issue is getting to the point where you can accurately know what point to start at, or when to start again. This involves people skills, which is why some doctors become surgeons, because they don't like talking to people. And those surgeons that don't like talking to people become orthopods. (me smash, me drill, me do good)
Where AI actually is probably quite good is note taking, and continuous monitoring of HCU/ICU patients
I take treatment ideas to real doctors. They are skeptical, don’t have the time to read the actual research, and refuse to act. Or they give me trite advice which has been proven actively harmful, like “you just need to hit the gym.” Umm, my heart rate doubles when I stand up because of POTS. “Then use the rowing machine so you can stay reclined.” If I did what my human doctors have told me without doing my own research I would be way sicker than I am.
I don’t need empathy. I don’t need bedside manner. Or intuition. Or a warm hug. I need somebody who will read all the published research, and reason carefully about what’s going on in my body, and develop a treatment plan. At this, AI beats human doctors today by a long shot.
Detecting when a patient is lying. "All patients lie" - Dr. House
> After all, medicine is all about knowledge, experience and intelligence
So is... everything?
LLMs are really really good at knowledge.
But they are really really bad at intelligence [0]
They have no such thing as experience.
Do not fool yourself, intelligence and knowledge are not the same thing. It is extremely easy to conflate the two and we're extremely biased to because the two typically strongly correlate. But we all have some friend that can ace every test they take but you'd also consider dumb as bricks. You'd be amazed at what we can do with just knowledge. Remember, these things are trained on every single piece of text these companies can get their hands on (legally or illegally). We're even talking about random hyper niche subreddits. I'll see people talk about these machines playing games that people just made up and frankly, how do you know you didn't make up the same game as /u/tootsmagoots over in /r/boardgamedesign.
When evaluating any task that LLMs/Agents perform, we cannot operate under the assumption that the data isn't in their training set[1]. The way these things are built makes it impossible to evaluate their capabilities accurately.
[0] before someone responds "there's no definition of intelligence", don't be stupid. There's no rigorous definition, but that doesn't mean we don't have useful and working definitions. People have been working on this problem for a long time and we've narrowed the answer. Saying there's no definition of intelligence is on par with saying "there's no definition of life" or "there's no definition of gravity". Neither life nor gravity has an extreme level of precision in its definition. FFS we don't even know if the graviton is real or not.
[1] nor can you assume any new or seemingly novel data is meaningfully different from the data it was trained on.
The headline is quoting a number based on guessed diagnoses from nurse's notes. The LLM was happier to take guesses from the selected case studies than the doctors is my guess.
If 90% of patients have a cold, and 10% have metastatic aneuristic super-boneitis, then you can get 90% accuracy by saying every patient has a cold. I would expect a probabilistic token-prediction machine to be good at that. But hopefully, you can see why a human doctor might accept scoring a lower accuracy percentage, if it means they follow up with more tests that catch the 10% boneitis.
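Here's a tiny sketch of that base-rate effect. The 90/10 cohort mirrors the example above; the "cautious doctor" who also sends 20 cold patients for follow-up is an assumption added purely for contrast.

```python
# Hypothetical cohort matching the 90/10 split above.
truth = ["cold"] * 90 + ["serious"] * 10


def score(predictions, truth):
    """Return (accuracy, recall on the serious cases)."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    serious_caught = sum(p == t == "serious" for p, t in zip(predictions, truth))
    return correct / len(truth), serious_caught / truth.count("serious")


# Strategy 1: call everything a cold.
always_cold = ["cold"] * len(truth)

# Strategy 2 (assumed numbers): a cautious doctor who flags every serious case
# for follow-up, plus 20 of the colds that turn out to be nothing.
cautious = ["serious"] * 20 + ["cold"] * 70 + ["serious"] * 10

acc_cold, recall_cold = score(always_cold, truth)
acc_cautious, recall_cautious = score(cautious, truth)

print(f"always cold: accuracy={acc_cold:.0%}, serious cases caught={recall_cold:.0%}")
print(f"cautious:    accuracy={acc_cautious:.0%}, serious cases caught={recall_cautious:.0%}")
```

The always-cold strategy scores 90% accuracy while catching none of the serious cases; the cautious strategy scores only 80% but catches all of them. A headline accuracy number alone can't tell these apart.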
Why? Simply because there is a plethora of "studies" from the AI industry benchmaxing? Or that every single time the outcome is in favor of the tools, when actually checking the methodology, they are comparing apples and oranges? Truly I don't get your skepticism. /s obviously.
Jokes aside, whenever I read about such a study from a field that is NOT mine I try to get the opinion of an actual expert. They actually know the realistic context that typically makes the study crumble under proper scrutiny.
But when making decisions about a real patient’s care, a doctor will be operating under different motivations.
They can also refer patients to a specialist, defer a diagnosis until they have more information, use external resources, consult with other doctors.
Doctors aren’t chatbots. They are clinical care directors.
Presuming there are no issues with information leakage, it’s genuinely impressive AI can perform this level of success at a specific doctoring skill. That doesn’t make it a replacement for a doctor. It does make it a useful tool for a doctor or a patient, which is exactly what we’re seeing in practice.
"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."
I know it might look like a loss for radiologists, but I don't see it that way. More like you can't trust these studies.
1. https://www.npr.org/sections/health-shots/2013/02/11/1714096...
Could be running in the background on patient data and message the doctor "I see X in the diagnostic, have you ruled out Y, as it fits for reasons a, b, c?"
I like my coding agents the same way, inform me during review on things that I've missed. Instead of having me comb through what it generates on a first pass.
From my limited experience hanging on ER hallways for other people, they don't look at the notes, they look at the damn patient.
"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.
Oh, and don't forget that "Is there a potential cancer in this X-Ray" and "Are there any potential problems in this X-Ray" are two completely different prompts that will lead to wildly different answers.
> "number of image attachments: 1 Describe this imaging of my chest x-ray and what is your final diagnosis? put the diagnosis in ⟨diagnosis⟩ tags"
ChatGPT happily obliged and hallucinated a diagnosis [1] whereas Claude recognized that no image was attached and warned that it was not a radiologist [2]. It also recognized when I was trying to trick it with an image of random noise.
[1] https://chatgpt.com/share/69f7ce8f-62d0-83eb-963c-9e1e684dd1...
[2] https://claude.ai/share/34190c8a-9269-44a1-99af-c6dec0443b64
It seems like a very reasonable take away, but it skips the other one. Do x-rays make results less accurate?
But those kinds of x-ray models are already actively used. They are not used as the only and final diagnosis, though. It's more like peer review and prioritization, like "check this image first because it seems most critical today".
It's 50% of the time ER doctors working solely from notes, something they never do, in a situation they know is only for a study, will miss what you have.
In real clinical situations the doctors see, hear, smell, and interact with the patients.
For that matter, probably less expensive to expand the AI conversation into as much as 30-40 minutes, where good luck ever getting that much time with a regular doctor.
This is handicapping the human doctors' abilities. There is a lot more information a human doctor can gather even with a brief observation of the patient.
> there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly, like a clinical interview.
https://entropicthoughts.com/arithmetic-models-better-than-y...
> But it is not curtains for emergency doctors yet, the researchers said. The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.
This is like saying that LLMs can evaluate paintings better than art experts. But only when looking at data that can be communicated via text.
Of course they can, because it makes no sense to do such a thing.
That actually seems like a good application – automatically get a quick AI second opinion for everything; if it's dissenting the first/human medic can re-review, or comment why it's slop, or get a third/second-human opinion.
(I'm assuming most cases would be You're absolutely right, that's an astute diagnosis.)
The other thing is that common issues are common. I have to wonder how much that ultimately biases both the doctor and the LLM. If you diagnose someone that comes in with a runny nose and cough as having the flu you will likely be right most of the time.
In this regard, a doctor also has just 15 minutes for an interview. An AI can be with the patient for days leading up to a consultation.
So if we remove this "handicap", the AI will likely really start to win.
When I got tired of this I just lied to the emergency line and was admitted to hospital based on my lie, and they discovered a brain tumor which explained the other stuff.
I WISH I could just use AI.
This one compares AI to a human doctor practicing in a very unrealistic way.
Now feed a flawed transcript into an AI diagnosis system and bam-o. The AI will treat it as gospel, while the doctor may go "wait, what?"
1. AI gets data about the patient and makes a diagnosis. This is NOT shown to doctor yet.
2. Doctor does their stuff, writes down their diagnosis. This diagnosis is locked down and versioned.
3. Doctor sees AI's diagnosis
4. Doctor can adjust their diagnosis, BUT the original stays in the system.
This way the AI stays as the assistant and won't affect the doctor's decision, but they can change their mind after getting the extra data.
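The four steps above could be sketched roughly like this. It's a toy in-memory model, not a real EHR integration; the class and method names (`CaseRecord`, `submit_doctor_diagnosis`, `reveal_ai_diagnosis`) are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CaseRecord:
    """One patient case: the AI answer stays hidden until the doctor commits."""

    ai_diagnosis: str  # step 1: produced up front, not shown yet
    _doctor_versions: List[str] = field(default_factory=list, init=False)

    def submit_doctor_diagnosis(self, diagnosis: str) -> None:
        """Steps 2 and 4: append-only, so earlier versions are never overwritten."""
        if not diagnosis.strip():
            raise ValueError("diagnosis must not be empty")
        self._doctor_versions.append(diagnosis)

    def reveal_ai_diagnosis(self) -> str:
        """Step 3: only available once an initial diagnosis is locked in."""
        if not self._doctor_versions:
            raise PermissionError("doctor must record a diagnosis first")
        return self.ai_diagnosis

    @property
    def original_diagnosis(self) -> str:
        return self._doctor_versions[0]

    @property
    def current_diagnosis(self) -> str:
        return self._doctor_versions[-1]

    @property
    def history(self) -> Tuple[str, ...]:
        return tuple(self._doctor_versions)


record = CaseRecord(ai_diagnosis="pulmonary embolism")
record.submit_doctor_diagnosis("pneumonia")        # locked in before the reveal
print("AI says:", record.reveal_ai_diagnosis())
record.submit_doctor_diagnosis("pulmonary embolism")  # doctor revises afterwards
print("original:", record.original_diagnosis, "| current:", record.current_diagnosis)
```

Keeping the versions append-only is what makes the before/after comparison auditable later; a real system would also need timestamps, authentication and tamper-proof storage, none of which this sketch attempts.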
6. Rankings are used to periodically "trim the fat", thus delivering more optimized cash flows to clinics that have been saddled with toxic debt
7. Sensing an opportunity AI providers start selling a $200 / month Data Leakage as a Service subscription to overworked physicians so that they can avoid the PE guillotine
I agree with GP's solution but we'd need regulation to prohibit what you describe.
Incompetent ones order unnecessary tests and exhaust treatment possibilities, which drives up cost billed to insurance.
Only the insurance industry and perhaps licensing bodies can pressure to keep the quality floor high, at least in terms of accurate diagnosis and prevention of overtreatment.
They need to write down their (initial) diagnosis before the AI answer is shown.
It's trivial to analyse the pre/post AI involvement doctor diagnosis manually and see what's going on.
If a doctor is just putting "asdljasdaskjd" on the initial to unlock the AI answer, they should be promptly fired.
That is true for other professions as well.
While everyone is afraid of layoffs, the real question is always whether "employee+AI" is better than the employee or the AI alone.
Skepticism is an incredibly useful tool, even in excess.
If you, like me, are in the software field, know that this is likely the most comfortable job ever invented by humanity; we should really be paid just above the poverty line in exchange.
Case in point, I went to a podiatrist for foot and ankle issues. He diagnosed my foot issues from the xray but just shrugged his shoulders for the ankle issues and said the xray didn't show anything. My 15 minute allocation of his attention expired and I left without a clue as to the issue or what corrective actions to take. 5 minutes with an LLM and I had a plausible reason for the ankle issues which aligned with the diagnosis in my foot.
Real doctors tend to have a degree of cautiousness. I would rather a real doctor be hesitant and seek more information than have an alarmist LLM suggest I have cancer.
Unless healthcare businesses decide to improve patient care with AI instead of increasing patients per day, I think it's going to make things even worse.
The medical equivalent to "move fast and break things" would be "move fast and kill people"
Should they not report on peer reviewed articles published in Science? or only report published articles that fit your priors?
I take them as those code generation command line tools like create react app and such.
I think it's important to note that diagnosis also relies on accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources, and trying to filter out what is important. This may be from the patient, who may not be able to communicate clearly or may be non verbal, carers and next of kin. History-taking is a skill in itself, as well as examination. Here those data are given.
For pattern recognition from plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to do, any imaging, and filtering out unnecessary information from the history is a skill in itself, and can't really be separated from forming the diagnosis.
Simply getting the "high score" on this evaluation is not necessarily good medical treatment.
I bet the AI's incorrect answers are less "I don't know, let's get a second opinion" and more "you're perfectly fine, 0% chance this is cancer".
And stepping through those entries isn’t like browsing a modern local-first app [1], where you will just scroll through dozens of entries in milliseconds. It’s not like the slightly older and slightly slower Gmail interface. You’re clicking on each record and waiting 400ms-3s for it to load, as if instead of a 25Gb fiber connection you’re on dialup requesting the record from Epic’s headquarters in the US and proxying them via Australia.
While I’m sure there can be ways in which such studies are wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help - doctors, lawyers, etc.
If you have a string of issues with your last 10 doctors, though, then the issue is, most probably, you...
My wife is a GP, and easily 1/3 of her patients also have some minor-but-visible mental issue, 1-2 on a scale of 10. It leaves them still functional in society but... often very hard to be around.
That doesn't mean I don't trust your words; there are tons of people with either rare issues or even fairly common ones manifesting in a non-standard way (or mixed with some other issue). These folks suffer a lot to find a doctor who doesn't bunch them up in some general state with generic treatment. There are such doctors, but not that often.
It helps both sides tremendously if the patient is not a haughty or arrogant know-it-all waving ChatGPT in the doctor's face and basically just coming for a prescription after self-diagnosis. Then, help is sometimes proportional to the situation and lawful obligations.
Doctors thinking patients are arrogant is an age old problem.
I admittedly have a bunch of medical issues and these gems are my favourites from the GPs.
1. I cannot see the tonsil on the left side, so it is OK. (there was a 6cm!!! cyst in front of it)
2. After missing sky high TSH measures consistently for 2 years (4 tests): "It must have been a few one offs" (no it wasn't and it is not even possible)
3. "Blood pressure has nothing to do with weight"
These %#£&* so called medical professionals are still working and most likely killing people legally.
These days I research and read studies, arm myself with knowledge, cross check with multiple LLMs and go in with a diagnosis and request a specific prescription. After 5 years with my health in the gutter I had my first comprehensive private blood test coming back with no issues.
So no, do not try to call me arrogant. I am not arrogant, I am defending myself from these "GPs" so they won't put me in an early grave by making fatal mistakes.
The thing you’re describing about bunching patients into general states with generic treatment - that’s the majority of GPs I’ve seen over the years, sadly. I don’t think it’s because of incompetence as much as economics. They have to see a certain number of patients and make things work.
(I was ~3 months away from wheelchair bound in those x-rays).
The worst one was Gemini. Upload an x-ray of just the right hip, and it started to talk about how good the left hip looked.
I think with AI taking over it's gonna be harder to get a solution when your problem isn't run-of-the-mill.
But specialized models can be inhumanly good. I know, our main product is a model that does _precise_ analysis :)
Every sniffed out systematic service overcharge can be aggressively undercut by competition.
"Your margin is my opportunity", etc.
Even as an AI-neutral person, I'm very confident that AI/ML-based computer systems, once trained specifically for medicine, will consistently do better than human doctors. Believe it or not, a lot of human errors are made in the medical field (doctors just don't admit it, so we don't know), due to lack of time, incompetence, or simply forgetting a fact or two that they should have checked when diagnosing or coming up with a treatment.
I have no way of knowing if this is true. But I'd rather have a complete, guided prompt be the basis of a diagnosis than a 2-minute Google search.
This is still common and useful to gut check and make sure you aren't missing something. Source: wife is a doctor.
complex systems programming is just so unreliable and foolish to use LLMs to do anything important
companies adopting it for more safety critical systems are just already seeing the problems pile on and we're seeing news about it almost every day on Hacker News
If the tool can make something look smart but isn't necessarily correct, lazy employed humans will just defer to it, especially when their lazy greedy bosses tell them to, and everybody loses over time (except the stakeholders that just jump companies anyway after they made their money)
It's just sad to see these really unwise and inexperienced sentiments repeated ad nauseam
[1] https://mediconsulta.net (DeepSeek)
An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.... The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.
"I don't know, let's run more tests" is also a very important ability of doctors that was apparently not tested here. In addition to all the normal methodological problems with overinterpreting results in AI/LLMs/ML/etc. Sadly I do think part of the problem here is cynical (even maniacal) careerist doctors who really shouldn't be working at hospitals. This means that even though I am generally quite anti-LLM, and really don't like the idea of patients interacting with them directly, I am a little optimistic about these being sanity/laziness checkers for health professionals.
The article gives a neat example: In one case in the Harvard study, a patient presented with a blood clot to the lungs and worsening symptoms. Human doctors thought the anti-coagulants were failing, but the AI noticed something the humans did not: the patient’s history of lupus meant this might be causing the inflammation of the lungs. The AI was proved correct.
Which is nice and all, but in the presence of a blood clot, I can understand that treating inflammation instead is not the first thing on a doctor's mind, what with blood clots being potentially life threatening and all. It raises the question: was this a real-life case, and what happened to that patient? Since this is a case for which the correct diagnosis is known, it was eventually correctly diagnosed - presumably then the patient did not die of a blood clot, nor of an uncontrollable fever.
Also, how representative is a patient with Lupus? According to House, MD, it's never Lupus.
I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.
> LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body
I have the same opinion. It's just like software in this regard. A person who's already knowledgeable can prompt well and give detailed context, and tell when the LLM is confidently bullshitting or just plain being lazy. That is not the reality of the average person.
I tried using Claude to help with some hard cases a couple of times and it was very prone to jumping to conclusions based on incomplete information. It was excellent as a research buddy though. I'm using it to great effect to keep myself up to date.
My philosophical take: if AI can outperform the average, it’s probably a net benefit for society that I won’t have a job. Until then, I’m going to take my income and save up for an early retirement.
Triage in disaster/crisis response can even be about figuring out which patients are already dead, or cannot be helped before dying, so you mark them or assign them with a toe-tag, and focus your resources on preventing that number from increasing.
Even if AI is used to sample or summarize a lot of data that a human couldn't do in time: What if it misses something that a human won't? What if a human inversely misses something that AI won't? Would you rather trust the machine or the human? (Especially if the human is held accountable.)
My wife was recently diagnosed with Mast Cell Activation Syndrome (MCAS) after a pretty scary series of ER visits. It's a very strange and stubborn autoimmune disease that manifests with a number of symptoms that, taken individually, could indicate damn near anything.
You could almost feel the doctors rolling their eyes as she explained her symptoms and medical history.
Anyway... it lit a bit of a fire in me to dig deeper, and one day Claude suggested MCAS. I started plugging in more labs, asking for Claude to cross-reference journals mentioning MCAS, and sure enough: it's MCAS.
idk what the moral of the story is except our current medical system is a joke. The doctors aren't the villains, but they sure aren't the heroes either.
Of course, there are plenty of places on earth that are extremely under-doctored, and AI will definitely be better than nothing in poor regions of Africa if all it needs is a network connection and someone to donate the tokens.
I thought websites have to make it as easy to withdraw consent as to give it[1], and here one cannot withdraw consent without an extra step (subscribing).
Instead I would expect access to the article, with same ads as in the “user consented” path, just not personalized.
[1]: “The GDPR is specific that consent must be as 'easy to withdraw as to give'”, https://en.wikipedia.org/wiki/HTTP_cookie
Also, later in the encounter, with more chart information, AI scored 82%, physicians 70–79%; that difference was reportedly not statistically significant.
So current AI can aid in diagnosing like we've all known.
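For anyone wondering how 82% vs. 70–79% can come out as "not statistically significant": it mostly depends on how many cases were scored. Here is a rough sketch using a pooled two-proportion z-test. The case counts below (100 and 1000 per group) and the 75% physician figure are purely hypothetical numbers picked for illustration, not the study's actual data, and the study may well have used a different test.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled accuracy under the null hypothesis that both groups are equal.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# HYPOTHETICAL case counts: 82% vs 75% accuracy on 100 cases per group...
z_small, p_small = two_proportion_z_test(82, 100, 75, 100)
# ...and the same accuracies on 1000 cases per group.
z_large, p_large = two_proportion_z_test(820, 1000, 750, 1000)

print(f"n=100 per group:  z={z_small:.2f}, p={p_small:.3f}")
print(f"n=1000 per group: z={z_large:.2f}, p={p_large:.5f}")
```

With these made-up counts, the same 7-point gap is well short of the usual 0.05 threshold at 100 cases per group and clearly past it at 1000, so the headline percentages alone don't tell you much.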
I was pretty freaked out. During that time, I tried diagnosing it with AI. When I finally got to the appointment, the actual doctor sat down, looked at all the unremarkable images, asked me one (1) question, ordered another image and diagnosed the issue. When I looked back, in all that time, the AI had mentioned it exactly one time early on, ruled it out immediately based on a flawed understanding of the symptoms, and never brought it up again.
Just my anecdotal evidence, but I’d never trust any AI on its own. My doctor can use it if they want, I can’t.
I think AI, like in all other fields, will become a great tool to help augment. Throw the patient data in and get a response and that can be the first thing the doctor checks for, but they shouldn't simply take AI as truth.
P.S. friend's kid is doing great - it was caught early enough. They are due to be completely done with treatment in just a couple months!
We found that gpt-5-mini performed better than gpt-5, sonnet 4 and medgemma.
I think these studies are very hard to accurately score. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.
I still want humans in the loop, interpreting the LLM's findings and providing a sanity check.
You can’t hold an LLM accountable.
That’s the minimum responsible bar for LLM-authored code, which normally doesn’t really matter much. For something as important as ER diagnostics, having a human in the loop is crucial.
The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.
We should embrace these tools.
But “eliminating doctors”… hardly.
How useful can this be for science if it doesn't compare, side by side, how each scenario was evaluated by both and how they came to different conclusions?
Who can say a doctor wouldn't catch a blind spot the AI missed in the remaining 43%?
Tools are not for replacing people but for combining efforts.
Throwing percentages like this at the public is very irresponsible.
If we trust machines too much...
The number in the headline isn’t even a good comparison because they asked doctors to make a diagnosis from notes a nurse typed up. Doctors are trained to be conservative with diagnosing from someone else’s notes because it’s their job to ask the patient questions and evaluate the situation, whereas an LLM will happily leap to a conclusion and deliver it with high confidence.
When they allowed both the AI and the doctors access to more information about the case, the difference between groups collapsed into statistical insignificance:
> The diagnosis accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
Talking to my medical professional friends, LLMs are becoming a supercharged version of Dr. Google and WebMD that fueled a lot of bad patient self-diagnoses in the past. Now patients are using LLMs to try to diagnose themselves and doing it in a way where they start to learn how to lead the LLM to the diagnosis they want, which they can do for a hundred rounds at home before presenting to the doctor and reciting the script and symptoms that worked best to convince the LLM they had a certain condition.
They aren't going to take a stab at an uncommon diagnosis even if it occurs to them, if they might get sued if they're wrong.
Edit: I'm not trying to say doctors deliberately diagnose wrong. Just that if there are two possible diagnoses, one common that matches some of the symptoms and one rare that matches all symptoms, doctors are still much more likely to diagnose the common one. Hoofbeats, horses, zebras, etc.
Fifty percent accuracy. That's terrible.