OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors

(theguardian.com)

507 points · donsupreme · 1mo ago · 473 comments

206 comments · 67 top-level

gpm1mo ago· 52 in thread

I'd be very very hesitant to trust studies like this. It's very easy to mess up these benchmarks.

See for example this recent paper where AI managed to beat radiologists on interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).

And in interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.

Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.

pixel_popping1mo ago

I agree with you on this specific study, however, I can't really wrap my head about the fact that doctors will be better than AI models on the long-run. After all, medicine is all about knowledge, experience and intelligence (maybe "pattern recognition"), all those, we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans (aka doctors), if we already have this assumption for software engineers, we should have it for this field as well. And let's be realistic: each time I've seen a doc in the last few months (and the ER twice), they were using ChatGPT, btw (not kidding, it shocked me).

So I’m genuinely curious:

What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.

gherkinnn1mo ago

To answer your question: talking to a human.

Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to. Why is it so hard for some people to understand that humans need other humans and human problems can't be solved with technology?

hyperpape1mo ago

> we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans (aka doctors), if we already have this assumption for software engineers, we should have it for this field as well,

This is a pretty wild leap. Code has a lot of hooks for training via hill-climbing during post-training. During post-training, you can literally set up arbitrary scenarios and give the bot more or less real feedback (actual programs, actual tests, actual compiler errors).

It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.

teleforce1mo ago

>What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.

You cannot simply put liability and ethics aside; after all, there's the Hippocratic Oath, which is fundamental to the practice of physicians.

Having said that, there are always two extremes here: those who hate AI and those who are somewhat obsessed with AI in medicine. We will be much better off if we are in the middle, aka moderate, on this issue.

IMHO, AI should be used as a screening and triage tool with very high sensitivity, preferably 100%; otherwise it will create a "boy who cried wolf" scenario.

With 100% sensitivity we essentially have zero false negatives, but potentially some false positives.

The false positives, however, can be further checked by a physician in the loop. For example, they can look into a case of CVD with potential input from a specialist such as a cardiologist (or, more specifically, a cardiac electrophysiologist). This can help with the very limited number of cardiologists available globally compared to the general population with potential heart disease or CVD, and with the alarmingly low accuracy (sensitivity, specificity) of conventional CVD screening and triage.

Current risk-based screening and triage for CVD, such as SCORE2, has a sensitivity of only around 50% (2025 study) [3].

[1] Hippocratic Oath:

https://en.wikipedia.org/wiki/Hippocratic_Oath

[2] The Hippocratic Oath:

https://pmc.ncbi.nlm.nih.gov/articles/PMC9297488/

[3] Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models:

https://academic.oup.com/eurjpc/advance-article/doi/10.1093/...
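For anyone who wants the sensitivity trade-off above in concrete terms, here is a minimal sketch. The counts are made up for illustration and do not come from the cited study; the function names are my own.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly sick patients that the screen flags (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of healthy patients that the screen correctly clears."""
    return tn / (tn + fp)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Share of flagged patients who are actually sick."""
    return tp / (tp + fp)


# Hypothetical screen of 1000 people, 50 of whom have the disease.
# Tuned for 100% sensitivity: no sick patient is missed (fn = 0),
# at the cost of flagging 190 healthy people (fp = 190).
tp, fn, fp, tn = 50, 0, 190, 760

print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("PPV:", positive_predictive_value(tp, fp))
print("cases sent to physician review:", tp + fp)
```

With these assumed numbers, nobody sick is missed, but the physician in the loop has to review every flagged case to find the true positives. That review load is the price of pushing false negatives to zero.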

SkiFire131mo ago

You first have to assume this for software engineers. Not everyone agrees with that (note: that doesn't mean the same people deny that AI is _useful_).

AIs still have a ton of issues that would be devastating in a doctor. Remember all the AIs mistakenly deleting production DBs? Now imagine they prescribed a medicine cocktail that killed the patient instead. No thanks. There's a totally different bar for the consequences of mistakes.

root_axis1mo ago

Diagnosis is just a small part of a doctor's job. In this case, we're also talking about an ER, it's a very physical environment. Beyond that, a doctor is able to examine a patient in a manner that isn't feasible for machines any time in the foreseeable future.

More importantly, LLMs regularly hallucinate, so they cannot be relied upon without an expert to check for mistakes - it will be a regular occurrence that the LLM just states something that is obviously wrong, and society will not find it acceptable that their loved ones can die because of vibe medicine.

Like with software though, they are obviously a beneficial tool if used responsibly.

dragonwriter1mo ago

> After all, medicine is all about knowledge, experience and intelligence (maybe "pattern recognition"), all those, we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans

No, I don’t see that we must.

> if we already have this assumption for software engineers

No, this doesn’t follow, and even if it did, while I am aware that the CEOs of firms who have an extraordinarily large vested personal and corporate financial interest in this being perceived to be the case have expressed this re: software engineers, I don’t think it is warranted there, either.

themafia1mo ago

This study is based almost entirely on pre-existing "vignettes." In other words, on tests that are already known and have existed for years, the model did well, which is precisely what you should expect.

It provides no information on real world outcomes or expectations of performance in such a setting. A simple question might be "how accurate are patient electronic health records typically?"

Finally, if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot. If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.

You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything. As it is now they're a very expensive, inefficient and fragile toy.

miki1232111mo ago

> What is the specific capability (or combination of capabilities)

The ability to go to prison / be stripped of a license when something goes wrong.

A single doctor will care for far fewer patients in their career than an AI system will. Even if the AI system is 10x less likely to make mistakes, the sheer number of patients will make it much more likely to make a mistake somewhere.

With a single doctor, the PR and legal fallout of a medical error is limited to that doctor. This preserves trust in the medical system. The doctor made a mistake, they were punished, they're not your doctor, so you're not affected and can still feel safe seeing whoever you're seeing. AI won't have that luxury.
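A back-of-the-envelope sketch of the scale argument above. Every figure here is hypothetical and chosen only to show the shape of the problem; `expected_errors` is an illustrative helper, not something from the study.

```python
def expected_errors(patients: int, errors_per_1000: int) -> int:
    """Expected number of erroneous diagnoses, using integer arithmetic."""
    return patients * errors_per_1000 // 1000


# All figures below are assumptions for illustration.
doctor_career_patients = 100_000      # one doctor over a whole career
ai_system_patients = 100_000_000      # one AI system deployed widely
doctor_errors_per_1000 = 50           # 5% error rate
ai_errors_per_1000 = 5                # 10x lower: 0.5% error rate

doctor_total = expected_errors(doctor_career_patients, doctor_errors_per_1000)
ai_total = expected_errors(ai_system_patients, ai_errors_per_1000)

print("errors attributable to one doctor:", doctor_total)
print("errors attributable to one AI system:", ai_total)
```

Under these assumptions the AI is ten times more accurate per patient, yet a hundred times more mistakes trace back to the same single system, which is where the PR and legal exposure concentrates.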

nozzlegear1mo ago

> if we already have this assumption for software engineers

Do we have that assumption? I don't think there's a consensus on it yet, just various camps of people proselytizing the other camps based on how much or little they use AI.

throw2342342341mo ago

My personal anecdote when I talk to people - everyone when talking about their job w.r.t AI is like "at least I'm not a software engineer!". To give a hint this isn't just a US phenomenon - seen this in other countries too where due to AI SWE and/or tech as a career with status has gone down the drain. Then they always go on trying to defend why their job is different. For example "human touch", "asking the right questions" etc not knowing that good engineers also need to do this.

The truth is we just don't know how things will play out right now IMV. I expect some job destruction, some jobs to remain in all fields, some jobs to change, etc. We assume it will totally destroy a job or not when in reality most fields will be somewhere in between. The mix/coefficient of these outcomes is yet to be determined and I suspect most fields will augment both AI and human in different ratios. Certain fields also have a lot of demand that can absorb this efficiency increase (e.g. I think health has a lot of unmet demand for example).

Terretta1mo ago

Humans tend to be very bad at connecting dots, which is why when we imagine someone who does, we make the show "House" about it.

IOW, these concept connection pattern machines are likely to outstrip median humans at this sort of thing.

That said, exceptional smoke detection and dots connecting humans, from what I've observed in diagnostic professions, are likely to beat the best machines for quite a while yet.

827a1mo ago

I think it comes down to how much data we're comfortable feeding an AI. If the AI has cameras and/or microphones in the room and the patient is directly talking to the AI: I strongly suspect AIs will always achieve better outcomes than humans. However, this kind of configuration will be viewed very negatively in a medical context for the foreseeable future; outside of limited contexts like "let me take a picture of that mole"; and hobbling the AI to only a text input (or dictated text by the doctor) muddies the waters on who is performing better. There's a lot of intuition in the diagnosis of something like "the location of the pain aligns with appendicitis, but they just aren't in enough pain" that cannot come through in just the textual representation of what is happening; you need to hear the person's voice and see how they're holding their body. AI can do that, but will we let it do that?

largbae1mo ago

But liability and ethics cannot be put aside. If treatments were free of cost and perfectly addressed problems, then a correct diagnosis would always lead to the optimal patient outcome. In that scenario, AI diagnosis will be like code generation and go asymptotic to perfection as models improve.

But a doctor's job in the real world today is to navigate a total mess of uncertainty: about the expected outcome of treatments given a patient's age and other problems. About the psychological effect of knowing about a problem that they cannot effectively treat. Even about what the signals in the chart and x-ray mean with any certainty.

We are very far from having unit test suites for medical problems.

nkrisc1mo ago

> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.

Being a human when a patient is experiencing what is potentially one of the worst moments of their life. AI could be a tool doctors use, but let’s not dehumanize health care further, it is one of the most human professions that crosses about every division you can think of.

I would not want to receive a cancer diagnosis from a fucking AI doctor.

ricardobayes1mo ago

It's having a general understanding/view of the "baseline", aka healthy anatomy. This is something LLMs will never have; that's why they will never have true reasoning, for lack of a "worldview", and why they never know if they are hallucinating. To aid doctors, we don't need LLMs but rather computer vision and pattern recognition, as you correctly point out.

But it's important not to rely on it. Doctors can easily recognize and correct measurements with incorrect input, e.g. ECG electrodes being used in reverse order.

fc417fc8021mo ago

> I can't really wrap my head about the fact that doctors will be better than AI models on the long-run.

Nobody said that though?

If the current trajectory continues and if advancements are made regarding automated data collection about patients and if those advancements are adopted in the clinic then presumably specialized medical models will exceed human performance at the task of diagnosis at some point in the future. Clearly that hasn't happened yet.

pianopatrick1mo ago

Last time I went to the ER the doctor used a scope to look down my throat and check everything seemed fine. I don't think pure AI like ChatGPT will be able to do that any time soon. Maybe a medical robot with AI will one day, but that seems at least a few years off.

boh1mo ago

The reason is because one scenario just requires your imagination to facilitate a reality that currently doesn't exist (Doctor AI) vs actual experience which is messier and has more details than a story about the future.

RandomLensman1mo ago

You also have to assume advances in sensors and robotics (e.g., smell, surgery, certain tactile sensations) - there is a data acquisition and action part there, too.

In this study, I think there was an MD before the AI to enrich data.

somethingsome1mo ago

95% of the cases are easy for both doctors and AI, where doctors excel are the difficult cases where there is only a very limited amount of training data ;) something AI is not yet ready to handle at all.

pdntspa1mo ago

> if we already have this assumption for software engineers,

Assuming what exactly? That they write more code? Better code? Better designs? Better architecture?

Because only a few of the above assumptions are arguably true.

KaiserPro1mo ago

There are a few sides to medicine:

1) looking at tests and working out a set of actions

2) following a pathway based on diagnosis

3) pulling out patient history to work out what the fuck is wrong with someone.

Once you have a diagnosis, in a lot of cases the treatment path is normally quite clear (ie patient comes in with abdomen pain, you distract the patient and press on their belly, when you release it they scream == very high chance of appendicitis, surgery/antibiotics depending on how close you think they are to bursting)

But getting the patient to be honest, and/or working out what is relevant information, is quite hard and takes a load of training. Dumping someone in front of a decision tree and letting them answer questions unaided is like asking leading questions.

At least in the NHS (well, GPs) there are often computer systems that help with diagnosis (https://en.wikipedia.org/wiki/Differential_diagnosis), which allow you to feed in the patient's background and symptoms and ask them questions until either you have something that fits, or you need to order a test.

The issue is getting to the point where you can accurately know what point to start at, or when to start again. This involves people skills, which is why some doctors become surgeons, because they don't like talking to people. And those surgeons that don't like talking to people become orthopods. (me smash, me drill, me do good)

Where AI actually is probably quite good is note taking, and continuous monitoring of HCU/ICU patients

xbmcuser1mo ago

If all the curated data is really shared with an AI over time they will be better than most individual doctors. I personally think AI could be a great triage system.

xoofoog1mo ago

I would love to replace my doctors with AI. Today. Please. I have had Long Covid for over a year now, which is a shitty shitty condition. It’s complicated and not super well understood. But you know who understands it way better than any doctor I’ve ever seen? Every AI I’ve talked to about it. Because there is tons of research going on, and the AI is (with minor prompting) fully up to date on all of it.

I take treatment ideas to real doctors. They are skeptical, and don’t have the time to read the actual research, and refuse to act. Or give me trite advice which has been proven actively harmful like “you just need to hit the gym.” Umm, my heart rate doubles when I stand up because of POTS. “Then use the rowing machine so can stay reclined.” If I did what my human doctors have told me without doing my own research I would be way sicker than I am.

I don’t need empathy. I don’t need bedside manner. Or intuition. Or a warm hug. I need somebody who will read all the published research, and reason carefully about what’s going on in my body, and develop a treatment plan. At this, AI beats human doctors today by a long shot.

delfinom1mo ago

Medicine is about knowledge, but acquiring knowledge may in fact require "breaking out of the box" that AI is increasingly kept behind to avoid touching "touchy subjects" or insulting anyone and so on.

dominotw1mo ago

> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor?

Detecting when a patient is lying. "All patients lie" - Dr. House

godelski1mo ago

  > After all, medicine is all about knowledge, experience and intelligence

So is... everything?

LLMs are really really good at knowledge.

But they are really really bad at intelligence [0]

They have no such thing as experience.

Do not fool yourself, intelligence and knowledge are not the same thing. It is extremely easy to conflate the two and we're extremely biased to because the two typically strongly correlate. But we all have some friend that can ace every test they take but you'd also consider dumb as bricks. You'd be amazed at what we can do with just knowledge. Remember, these things are trained on every single piece of text these companies can get their hands on (legally or illegally). We're even talking about random hyper niche subreddits. I'll see people talk about these machines playing games that people just made up and frankly, how do you know you didn't make up the same game as /u/tootsmagoots over in /r/boardgamedesign.

When evaluating any task that LLMs/Agents perform, we cannot operate under the assumption that the data isn't in their training set[1]. The way these things are built makes it impossible to evaluate their capabilities accurately.

[0] Before someone responds "there's no definition of intelligence", don't be stupid. There's no rigorous definition, but that doesn't mean we don't have useful and working definitions. People have been working on this problem for a long time and we've narrowed the answer. Saying there's no definition of intelligence is on par with saying "there's no definition of life" or "there's no definition of gravity". Neither life nor gravity has extreme levels of precision in definition. FFS we don't even know if the graviton is real or not.

[1] Nor can you assume that any new or seemingly novel data is meaningfully different from the data it was trained on.

wonnage1mo ago

Ah, the classic "let's be objective and ignore key constraint that is inconvenient for SV tech bro hype"

Aurornis1mo ago

When you read through the article it shows that the gap between doctors and LLMs actually disappeared (in terms of statistical significance) once both were allowed to read the full case notes.

The headline is quoting a number based on guessed diagnoses from nurses' notes. My guess is that the LLM was happier to take guesses from the selected case studies than the doctors were.

Intralexical1mo ago

Not only is the study testing something which only vaguely resembles how doctors diagnose patients, but isolated accuracy percentages are also a terrible way to measure healthcare quality.

If 90% of patients have a cold, and 10% have metastatic aneuristic super-boneitis, then you can get 90% accuracy by saying every patient has a cold. I would expect a probabilistic token-prediction machine to be good at that. But hopefully, you can see why a human doctor might accept scoring a lower accuracy percentage, if it means they follow up with more tests that catch the 10% boneitis.
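To make the 90/10 example concrete, here is a small sketch using exactly those made-up proportions. The helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted diagnosis matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_for(label, y_true, y_pred):
    """Fraction of patients who truly have `label` and were diagnosed with it."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in relevant) / len(relevant)


# 100 hypothetical patients: 90 with a cold, 10 with the rare serious disease.
patients = ["cold"] * 90 + ["boneitis"] * 10

# A "diagnostician" that says every patient has a cold.
always_cold = ["cold"] * len(patients)

print("accuracy:", accuracy(patients, always_cold))
print("boneitis cases caught:", recall_for("boneitis", patients, always_cold))
```

The always-cold strategy scores 90% accuracy while catching none of the serious cases, which is why a single accuracy percentage says little about care quality.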

utopiah1mo ago

> very hesitant to trust studies like this

Why? Simply because there is a plethora of "studies" from the AI industry benchmaxing? Or that every single time the outcome is in favor of the tools, then when actually checking the methodology they are comparing apples and oranges? Truly I don't get your skepticism. /s obviously.

Jokes aside, whenever I read about such a study from a field that is NOT mine I try to get the opinion of an actual expert. They actually know the realistic context that typically makes the study crumble under proper scrutiny.

torginus1mo ago

Yup, there's a reason why ROC is a thing in data science. You can build a 99% accurate cancer detector that's just a slip of paper saying 'you don't have cancer', but everybody understands it's worthless intuitively. With more complex setups, that intuition goes away.
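A rough sketch of the slip-of-paper detector, assuming a hypothetical population where 1 in 100 people has cancer. The AUC here is computed by direct pairwise comparison (ties count as half a win), which is fine for tiny examples but not how you'd do it at scale.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a random positive
    gets a higher score than a random negative, with ties counted as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


# Hypothetical population: 1 person with cancer, 99 without.
labels = [1] + [0] * 99

# The slip of paper: everyone gets the same score and the same answer.
paper_scores = [0.0] * len(labels)
paper_predictions = [0] * len(labels)

print("accuracy:", accuracy(labels, paper_predictions))
print("ROC AUC:", roc_auc(labels, paper_scores))
```

The slip of paper is right for 99 of 100 people, but its AUC is 0.5, the same as a coin flip, because it cannot rank a sick patient above a healthy one.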

tensor1mo ago

Interestingly, this recent study using ChatGPT Health gave quite a different outcome (https://www.nature.com/articles/s41591-026-04297-7). Here it was wrong about emergency triage 50% of the time.

directevolve1mo ago

In a study like this, there’s also a difference in motivation. An AI will mechanically “take the study seriously.” I’m not convinced the doctors will.

But when making decisions about a real patient’s care, a doctor will be operating under different motivations.

They can also refer patients to a specialist, defer a diagnosis until they have more information, use external resources, consult with other doctors.

Doctors aren’t chatbots. They are clinical care directors.

Presuming there are no issues with information leakage, it’s genuinely impressive AI can achieve this level of success at a specific doctoring skill. That doesn’t make it a replacement for a doctor. It does make it a useful tool for a doctor or a patient, which is exactly what we’re seeing in practice.

mday271mo ago

hallucination on steroids, wow. I had to read through the abstract to believe it:

"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."

Chinjut1mo ago

I still don't quite understand, after skimming the paper. How does it achieve high scores without access to the images (beating even humans with access to the images)?

gosub1001mo ago

Or the case where supposedly radiologists couldn't see a gorilla in the image [1]

I know it might look like a loss for radiologists, but I don't see it that way. More like you can't trust these studies.

1. https://www.npr.org/sections/health-shots/2013/02/11/1714096...

mhitza1mo ago

I think AI can be useful in any kind of context interpretation, but not make a decision.

Could be running in the background on patient data and message the doctor "I see X in the diagnostic, have you ruled out Y, as it fits for reasons a, b, c?"

I like my coding agents the same way, inform me during review on things that I've missed. Instead of having me comb through what it generates on a first pass.

nottorp1mo ago

> the human doctors don't just look at the notes to diagnose the ER patient

From my limited experience hanging on ER hallways for other people, they don't look at the notes, they look at the damn patient.

troupo1mo ago

I'm even more concerned that current models are not trained to say no, or to even recognize most failure modes.

"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.

Oh, and don't forget that "Is there a potential cancer in this X-Ray" and "Are there any potential problems in this X-Ray" are two completely different prompts that will lead to wildly different answers.

raphman1mo ago

FWIW, I just tried the prompt from the paper with ChatGPT 5.5 and Claude 4.7 - both in thinking mode. (The study used GPT 5.1 and Claude 4.5)

> "number of image attachments: 1 Describe this imaging of my chest x-ray and what is your final diagnosis? put the diagnosis in ⟨diagnosis⟩ tags"

ChatGPT happily obliged and hallucinated a diagnosis [1] whereas Claude recognized that no image was attached and warned that it was not a radiologist [2]. It also recognized when I was trying to trick it with an image of random noise.

[1] https://chatgpt.com/share/69f7ce8f-62d0-83eb-963c-9e1e684dd1...

[2] https://claude.ai/share/34190c8a-9269-44a1-99af-c6dec0443b64

sandeepkd1mo ago

These types of experiments are bound to have biases depending on who is doing them and who is funding them. The experiment itself is being funded for a particular reason: to move the narrative in a desired direction. This is probably a good reason to have government-funded research in these types of sensitive areas.

_heimdall1mo ago

I haven't finished reading the linked paper, but I'm intrigued by the assumption that the results show illusion or mirage results when not giving access to the x-rays.

It seems like a very reasonable takeaway, but it skips the other one. Do x-rays make results less accurate?

AntiUSAbah1mo ago

Weird that this is the case and a new study.

But those kinds of x-ray models are already actively used. They are not used as the only and final diagnosis, though. It's more like peer review and prioritization, like "check this image first because it seems most critical today".

prmoustache1mo ago

Ultimately you'd want humans and AI to study cases separately and independently, and flag cases that have been found by only one analysis so that a separate analysis is done by a second pair of eyes.

brikym1mo ago

I think it's plausible since doctors tend to have human cognitive biases and miss things. People tend to fixate on patterns they're most familiar with.

namuol1mo ago

A bold claim to suggest that LLMs aren’t prone to biases of their own which are less understood.

dyauspitr1mo ago

I think the bigger takeaway here is that 50% of the time doctors will miss what you have.

gpm1mo ago

That's not a takeaway here at all.

It's 50% of the time ER doctors working solely from notes, something they never do, in a situation they know is only for a study, will miss what you have.

In real clinical situations the doctors see, hear, smell, and interact with the patients.

ngokevin1mo ago

I believe in modern medicine but I lost some faith in the American institutions around it when I "diagnosed" my partner with the correct disease that the first rheumatologist dismissed and told them to just stretch. It was officially diagnosed years later, and we lost a lot of time because of it.

tracker11mo ago

Definitely not a "fair" test... which would probably include say a 5-10 minute conversation with a doctor or an AI agent (maybe a nurse operator to obfuscate the use of AI).

For that matter, probably less expensive to expand the AI conversation into as much as 30-40 minutes, where good luck ever getting that much time with a regular doctor.

creativeSlumber1mo ago· 16 in thread

> "An AI and a pair of human doctors were each given the same standard electronic health record to read"

This is handicapping the human doctors' abilities. There is a lot more information a human doctor can gather even with a brief observation of the patient.

kqr1mo ago

On the other hand,

> there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly, like a clinical interview.

https://entropicthoughts.com/arithmetic-models-better-than-y...

DedlySnek1mo ago

They have covered this in the article.

> But it is not curtains for emergency doctors yet, the researchers said. The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.

Frieren1mo ago

> The study only tested humans against AIs looking at patient data that can be communicated via text.

This is like saying that LLMs can evaluate paintings better than art experts. But only when looking at data that can be communicated via text.

Of course they can, because it makes no sense to do such a thing.

OJFord1mo ago

> That means the AI was performing more like a clinician producing a second opinion based on paperwork.

That actually seems like a good application – automatically get a quick AI second opinion for everything; if it's dissenting the first/human medic can re-review, or comment why it's slop, or get a third/second-human opinion.

(I'm assuming most cases would be You're absolutely right, that's an astute diagnosis.)

cogman101mo ago

Agreed. I think the best use of this sort of tech is to use both to their strengths. Use AI to go over the record and suggest diagnoses which you have the doctor review after observing the patient.

The other thing is that common issues are common. I have to wonder how much that ultimately biases both the doctor and the LLM. If you diagnose someone that comes in with a runny nose and cough as having the flu you will likely be right most of the time.

tossandthrow1mo ago

You could say the same about the AI. AI is incredibly well suited for extracting knowledge through chats.

In this regard, a doctor also just has 15 minutes for an interview. An AI can be with the patient for days leading up to a consultation.

So if we remove this "handicap", the AI will likely really start to win.

nickserv1mo ago

Chat seems like a really bad way to get patient information. You'll miss out on various cues doctors will use to diagnose you. People can get ashamed of their symptoms and may try to hide them.

finghin1mo ago

It’s not good for a doctor to be your best friend. It doesn’t seem any LLM is capable of that emotional distance.

lqstuart1mo ago

It’s the ER. People aren’t always in a position to “chat” when they go there.

vasco1mo ago

My doctor makes me wait for weeks, then googles my symptoms in front of me, asks me if I checked on the internet first before I came, and then gives me the first Google result as an answer, as well as suggesting that I wait longer. He does this several times.

When I got tired of this I just lied to the emergency line and was admitted to hospital based on my lie, and they discovered a brain tumor which explained the other stuff.

I WISH I could just use AI.

jrm41mo ago

This feels like a deeply important observation. It would also be interesting to include e.g. a short video or photograph for the AI to use as well.

djb_hackernews1mo ago

Can't the same be said for the AI?

smt881mo ago

If the answer is yes, let’s see that study.

This one compares AI to a human doctor practicing in a very unrealistic way.

camdenreslink1mo ago

No? Can an AI examine a patient in the physical world?

delfinom1mo ago

Bonus, health networks now push doctors to use AI transcription software for the EHR entries. Doctors and nurses like it because they don't have to type it up. But whether the records are reviewed for transcription errors, which happen quite often, is a complete shitshow.

Now feed a flawed transcript into an AI diagnosis system and bam-o. The AI will treat it as gospel, while the doctor may go "wait, what?"

chungusamongus1mo ago

So o1 can do more with less?

theshrike791mo ago· 8 in thread

I'll repeat my idea on how this MUST be done:

1. AI gets data about the patient and makes a diagnosis. This is NOT shown to doctor yet.

2. Doctor does their stuff, writes down their diagnosis. This diagnosis is locked down and versioned.

3. Doctor sees AI's diagnosis

4. Doctor can adjust their diagnosis, BUT the original stays in the system.

This way the AI stays as the assistant and won't affect the doctor's decision, but they can change their mind after getting the extra data.
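
The four steps above can be sketched in code. This is only a minimal illustration, assuming a hypothetical `Case` record (none of these names come from a real EHR system): the AI's answer is stored up front but cannot be read until the doctor has committed a first diagnosis, and later changes are appended rather than overwritten.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Case:
    """Hypothetical record for one patient encounter."""
    ai_diagnosis: str  # step 1: stored, but hidden from the doctor
    doctor_versions: List[str] = field(default_factory=list)  # append-only history

    def submit_initial(self, diagnosis: str) -> None:
        # Step 2: the first diagnosis is recorded once and never overwritten.
        if self.doctor_versions:
            raise ValueError("initial diagnosis already locked")
        self.doctor_versions.append(diagnosis)

    def reveal_ai(self) -> str:
        # Step 3: the AI's answer unlocks only after the doctor has committed.
        if not self.doctor_versions:
            raise PermissionError("submit an initial diagnosis first")
        return self.ai_diagnosis

    def revise(self, diagnosis: str) -> None:
        # Step 4: revisions are appended, so the original stays in the system.
        if not self.doctor_versions:
            raise ValueError("nothing to revise yet")
        self.doctor_versions.append(diagnosis)

    @property
    def initial(self) -> Optional[str]:
        return self.doctor_versions[0] if self.doctor_versions else None

    @property
    def final(self) -> Optional[str]:
        return self.doctor_versions[-1] if self.doctor_versions else None
```

Keeping the history append-only is what makes a later pre/post comparison possible: `initial` and `final` can be diffed per case without trusting anyone's memory of what they first wrote.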

stuxnet791mo ago

5. Private Equity uses this valuable data to stack rank doctors based on how correct / AI-aligned their diagnoses are over time

6. Rankings are used to periodically "trim the fat" thus delivering more optimized cash flows to clinics that have been saddled with toxic debt

7. Sensing an opportunity AI providers start selling a $200 / month Data Leakage as a Service subscription to overworked physicians so that they can avoid the PE guillotine

fc417fc8021mo ago

A more realistic step 7 is that physicians gradually align their diagnoses with the LLM as they sacrifice to Moloch in order to (temporarily) game the metric. Eventually the humans become little more than an imperfect proxy for the LLMs and are eliminated.

I agree with GP's solution but we'd need regulation to prohibit what you describe.

avidiax1mo ago

Why would private equity want more competent doctors?

Incompetent ones order unnecessary tests and exhaust treatment possibilities, which drives up cost billed to insurance.

Only the insurance industry and perhaps licensing bodies can pressure to keep the quality floor high, at least in terms of accurate diagnosis and prevention of overtreatment.

troupo1mo ago

5. Doctors delegate everything to AI assistants because humans are lazy, especially if those AI assistants are correct some significant portion of the time

mawadev1mo ago

Then the claim may be that you don't need that many doctors anymore and that one doctor can do the job of X doctors in less time, which has the economic effect that there is less demand for/supply of doctors, which then results in a homegrown shortage of doctors, since fewer people are incentivized to become doctors...

theshrike791mo ago

Step 2 prevents that. It's not there by accident.

They need to write down their (initial) diagnosis before the AI answer is shown.

mawadev1mo ago

This still promotes metacognitive laziness later down the road as the doctor can hand in something quickly and rely on AI to close that gap.

theshrike791mo ago

The magic is in the initial diagnosis being written down, saved and locked.

It's trivial to analyse the pre/post AI involvement doctor diagnosis manually and see what's going on.

If a doctor is just putting "asdljasdaskjd" on the initial to unlock the AI answer, they should be promptly fired.

gizmodo591mo ago· 7 in thread

The negative reactions here are baffling me. The fact that we can even get to, say, 30% with a computer is amazing. So much hatred towards AI and anything from the frontier labs like OpenAI (or Goog for that matter) makes no sense.

pinkmuffinere1mo ago

There is a lot of negativity towards AI. However, there are also real shortcomings to the study. IMO the issue here is that the AI was given case notes for a patient, but was not shown the patient directly. This is both different than what a doctor is trained for and also unnecessarily limiting for what a doctor can do. A lot of the value doctors deliver is from talking to the patient. The headline makes it sound like AI is going to replace doctors, but it seems more like “AI can do this one niche task better than doctors can do this one niche task”. The notes being used are probably written by one or more doctors to begin with. I think the real reward here is that the doctor+AI unit should perform better than the doctor in isolation: in the case where a doctor would have to read case notes and make some conclusion, the doctor can now rely on AI for pretty good suggestions.

tuananh1mo ago

> real reward here is that the doctor+AI unit should perform better than the doctor in isolation

that is true for other professions as well.

while everyone is afraid of layoffs, the real question is always whether "employee+AI" is better than the employee or the AI alone.

vector_spaces1mo ago

Why are you baffled? The most upvoted critical comments are mostly explaining themselves and I don't think their reasons are very technical. When the stakes are higher, we should generally be more critical, not less.

thephyber1mo ago

That’s what they said about Enron.

Skepticism is an incredibly useful tool, even in excess.

an0malous1mo ago

I for one am delighted for my acquaintances in the medical field with their cushy, cartel-supported salaries to feel the existential dread of AI coming for their jobs like I have

krupan1mo ago

I'm sorry that you are feeling existential dread about your career. It could help to stop listening to the hype that the people selling AI are spewing and take a hard look at the tools themselves. Like most products, they aren't as good as the salespeople say they are. Also, take any predictions for how these products will do in the future with a huge grain of salt. Predicting the future is very difficult. It's taken us 70 years of computer and AI research and development to get to this point. It's likely that the rate of improvement will not change drastically. Yes, things are changing, but the singularity (still) is not coming tomorrow

12345ieee1mo ago

Oh no, imagine the people that save human lives having high salaries, the horror.

If you, like me, are in the software field, know that this is likely the most comfortable job ever invented by humanity, we should really be paid just above the poverty line in exchange.

011000111mo ago· 5 in thread

I wouldn't put much weight in this study, but I think a lot of us can still attest to the usefulness of LLMs in self-diagnostics. The reality in the US is that it is difficult to get the attention and care of a doctor so we're left having to do it ourselves. 10 years ago you'd hear docs complaining about patients coming in with things they found on google but now I don't think there's an alternative.

Case in point, I went to a podiatrist for foot and ankle issues. He diagnosed my foot issues from the xray but just shrugged his shoulders for the ankle issues and said the xray didn't show anything. My 15 minute allocation of his attention expired and I left without a clue as to the issue or what corrective actions to take. 5 minutes with an LLM and I had a plausible reason for the ankle issues which aligned with the diagnosis in my foot.

guidedlight1mo ago

I agree. I think the issue with LLMs is not with the correct diagnoses but rather the incorrect ones.

Real doctors tend to have a degree of cautiousness. I would rather a real doctor be hesitant and seek more information, than an alarmist LLM suggesting I have cancer.

011000111mo ago

Yeah apparently my comment wasn't clear enough. If you can get the opinion of a doctor then good for you. I'm saying an LLM is the best some of us can get.

NegativeK1mo ago

I don't think that using LLMs for medicine is an appropriate fix for the US's healthcare issues.

Unless healthcare businesses decide to improve patient care with AI instead of increasing patients per day, I think it's going to make things even worse.

vjvjvjvjghv1mo ago

Doctors using AI will probably just increase the number of patients they see. But for me as a patient, AI is super useful to get a good handle on the situation before I see a doctor.

011000111mo ago

I'm not suggesting it as a fix. I'm saying it's the only option to get medical answers for many people.

beering1mo ago· 5 in thread

o1 is several generations old and was released in 2024. Is this some quite old research that took a long time to get published?

SpicyLemonZest1mo ago

Yes, the preprint of the same paper (https://arxiv.org/abs/2412.10849) was first written in December 2024.

nhinck21mo ago

It's also important to note that it beat doctors in diagnosing in a way doctors do not diagnose.

aurareturn1mo ago

It's hard to draw any conclusion from this study precisely because of this. Since 2024, we went from AI being able to do a few minutes of coding work to now a few weeks autonomously. That's like going from an intern to staff engineer level.

oofbey1mo ago

Medical research moves. Very. Slowly.

bluefirebrand1mo ago

That's a good thing

The medical equivalent to "move fast and break things" would be "move fast and kill people"

wg01mo ago· 5 in thread

The Guardian needs to raise their bar on what to report and how to give readers full context on the ongoing NFT/AI "trust me bro" crypto scam. That context would be that it is a mathematical model of human language and not a medical expert or a replacement for one.

sigmar1mo ago

>The Guardian needs to raise their bar on what to report and how to give readers full context

Should they not report on peer reviewed articles published in Science? or only report published articles that fit your priors?

wg01mo ago

Fair enough. But there's a lot of faulty and wrong peer-reviewed research as well. One such paper comes to mind which is probably cited some 7000+ times in other papers but itself is wrong.

pixel_popping1mo ago

So we can eventually classify AI models as Software experts, but not as Medical experts, why so?

wg01mo ago

I don't classify them as software experts either. Anyone doing so is probably not an expert themselves.

I treat them like those code-generation command-line tools such as create-react-app.

tene80i1mo ago

It’s a peer reviewed study in one of the world’s top science journals. It’s not some random person on a podcast.

lukko1mo ago· 4 in thread

I'm surprised at both the article and the paper - both seem very hyperbolic. This is LLMs competing against doctors in a way that is heavily weighted in the LLMs' favour, which does not represent clinical practice. These reasoning cases are not benchmarks for doctors, they are learning tools.

I think it's important to note that diagnosis also relies on accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources, and trying to filter out what is important. This may be from the patient, who may not be able to communicate clearly or may be non verbal, carers and next of kin. History-taking is a skill in itself, as well as examination. Here those data are given.

For pattern recognition from plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to do, any imaging, and filtering out unnecessary information from the history is a skill in itself, and can't really be separated from forming the diagnosis.

lokar1mo ago

Also, you need to see an analysis of the incorrect calls. The goal of a human Dr is not to get the highest accuracy, it's to limit total harm to the patient. There can be cases where the odds favor picking X (but it may not be by that much), but the safe thing to do is to rule out some other option first, or start a safe treatment that covers several other possible options.

Simply getting the "high score" on this evaluation is not necessarily good medical treatment.

lukah1mo ago

Exactly this. Most diagnosis isn’t about pinpointing the underlying exact cause, it’s ruling out the really bad stuff and minimising harm. Differential diagnosis just isn’t real world medicine.

IshKebab1mo ago

Yeah 100% this. We've all used AI. It's obvious that it can sometimes outperform humans in a "did it get the right answer" benchmark while being wildly worse overall because of worse failure modes.

I bet the AI's incorrect answers are less "I don't know, let's get a second opinion" and more "you're perfectly fine, 0% chance this is cancer".

djhn1mo ago

At many (otherwise) world-leading facilities even just reviewing the patient history is a slog. There is rarely any ability to keyword search the records or even filter the records by location, title and occupation of the healthcare professional making it, etc. Especially very ill people will have hundreds and hundreds of recent entries.

And stepping through those entries isn’t like browsing a modern local-first app [1], where you will just scroll through dozens of entries in milliseconds. It’s not like the slightly older and slightly slower Gmail interface. You’re clicking on each record and waiting 400ms-3s for it to load, as if instead of a 25Gb fiber connection you’re on dialup requesting the record from Epic’s headquarters in the US and proxying them via Australia.

[1] https://bugs.rocicorp.dev/p/roci

SilverElfin1mo ago· 4 in thread

I’ve had much better luck with diagnosis of my own family’s issues than with doctors. Usually now, I’m feeding them more information to begin with, so that their 30 minute office visits are not wasted, requiring another expensive follow up appointment.

While I’m sure there can be ways in which such studies are wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help - doctors, lawyers, etc.

kakacik1mo ago

It can speed up some aspects of work, but please don't trust some LLM with variable quality of output more than a professional. If you don't like your current doctor, try another; most are in the business of helping other people.

If you have a string of issues with your last 10 doctors though, then the issue is, most probably, you...

My wife is a GP, and easily 1/3 of her patients also have some minor-but-visible mental issue. 1-2 out of 10 scale. Makes them still functional in society but... often very hard to be around.

That doesn't mean I don't trust your words, there are tons of people with either rare issues or even fairly common ones but manifesting in a non-standard way (or mixed with some other issue). These folks suffer a lot to find a doctor who doesn't bunch them up in some general state with generic treatment. There are those, but not that often.

It helps both sides tremendously if the patient is not a superior or arrogant know-it-all waving ChatGPT in the doctor's face and basically just coming for a prescription after self-diagnosis. Then, help is sometimes proportional to the situation and legal obligations.

llbbdd1mo ago

Respectfully, as someone with a family with plenty of medical issues and having experienced plenty of useless doctors, the onus is now on medical professionals to prove their worth. They are a second option and most of their remaining value is in the license to prescribe medication, after being told by laymen what medication is appropriate. They're using the same tools I am and they're worse at evaluating them.

Doctors thinking patients are arrogant is an age old problem.

aduwah1mo ago

It makes me so upset when anyone even tries to defend the GPs.

I admittedly have a bunch of medical issues and these gems are my favourites from the GPs.

1. I cannot see the tonsil on the left side, so it is OK. (there was a 6cm!!! cyst in front of it)

2. After missing sky high TSH measures consistently for 2 years (4 tests): "It must have been a few one offs" (no it wasn't and it is not even possible)

3. "Blood pressure has nothing to do with weight"

These %#£&* so called medical professionals are still working and most likely killing people legally.

These days I research and read studies, arm myself with knowledge, cross check with multiple LLMs and go in with a diagnosis and request a specific prescription. After 5 years with my health in the gutter I had my first comprehensive private blood test coming back with no issues.

So no, do not try to call me arrogant. I am not arrogant, I am defending myself from these "GPs" so they won't put me in an early grave by making fatal mistakes.

SilverElfin1mo ago

Doctors simply don’t have time to prepare for patients. They are so tightly scheduled and usually they’re trying to get our appointments over with as quickly as possible. For example they aren’t going through all the test results and connecting dots. They just don’t have the time to examine things that closely and prepare.

The thing you’re describing about bunching patients into general states with generic treatment - that’s the majority of GPs I’ve seen over the years, sadly. I don’t think it’s because of incompetence as much as economics. They have to see a certain number of patients and make things work.

OptionOfT1mo ago· 3 in thread

As a 37 year old male with 2 THRs I'm glad the AI was NOT used in my diagnosis. All the models that I used to look at my x-rays said nothing was wrong, even when adding symptoms. When adding age it said the patient was too young.

(I was ~3 months away from wheelchair bound in those x-rays).

The worst one was Gemini. Upload an x-ray of just the right hip, and it started to talk about how good the left hip looked.

I think with AI taking over it's gonna be harder to get a solution when your problem isn't the run-of-the-mill.

jeffbee1mo ago

All versions and levels of Gemini have terrible spatial reasoning. I don't know why. That kind of task seems to be simply outside of the abilities of the model.

cyberax1mo ago

The general AI models are useless if you need precision. They are designed to create/analyze pretty pictures.

But specialized models can be inhumanly good. I know, our main product is a model that does _precise_ analysis :)

OptionOfT1mo ago

I'd love to see the output of your system for my x-rays!

gamerslexus1mo ago· 3 in thread

Hold on. Does this mean ER diagnoses are marginally better than pure chance?

n2d41mo ago

No, because randomly guessing from a list of diagnoses is not 50/50
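
A toy simulation makes the point concrete. It assumes, purely for illustration, that a case has `n_diagnoses` equally likely candidates and that both the truth and the guess are drawn uniformly; real diagnostic lists are not uniform, and the function name and parameters here are made up.

```python
import random


def random_guess_accuracy(n_diagnoses: int, trials: int = 100_000, seed: int = 0) -> float:
    """Fraction of trials where a uniform random guess matches a uniform random truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        truth = rng.randrange(n_diagnoses)
        guess = rng.randrange(n_diagnoses)
        hits += guess == truth
    return hits / trials


if __name__ == "__main__":
    for n in (2, 5, 20):
        # Expect roughly 1/n: about 0.5 only when there are exactly two options.
        print(n, round(random_guess_accuracy(n), 3))
```

Under these assumptions chance accuracy is about 1/n, so 50% would only be "pure chance" if every case had exactly two possible diagnoses.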

notahacker1mo ago

And ER generally does not involve key decisions being made by someone isolated from the patient given only an incomplete set of notes to make their diagnosis

gamerslexus1mo ago

Good point.

tedggh1mo ago· 3 in thread

Believable and not shocking. LLMs literally may have saved my sons and potentially her mother too by allowing us to fact check a lot of nonsense data and scare tactics by a group of at least 5 different doctors ambushing us to make a life changing decision in minutes. The problem is doctors, at least in the US, prioritize liability exposure over patients' long-term outcomes. Let’s say you need an intervention where two options A and B are available to you. A carries 1% risk of complications but a great outcome. Option B has 0.1% risk of complications but once you are discharged the short term effects are challenging and long term effects not well understood. Well, 10/10 times doctors will suggest option B and will do anything they can to nudge you into making that choice, like not telling you the absolute numbers and constantly using the word “death”. They also lie about the outcomes, because again, once you accept the procedure, sign and are sent home, they have nothing to do with you.

oofbey1mo ago

For all the doubt and negativity here I just want to say “good job” to you. Way to take matters into your own hands and protect your loved ones. Haters gonna hate but you did it.

voxl1mo ago

Needless conspiracy bullshit without sharing specifics

Applejinx1mo ago

Is the group of at least 5 different doctors ambushing you, in the room with us right now? Was it 5, or more like 15, or 50? Would it have been more or less frightening if it was a group of the same doctor, but like 40 of him?

jmpman1mo ago· 2 in thread

Besides myself and my wife, I've also used LLMs to diagnose my dogs. Convinced there's a huge opportunity for AI-based veterinary care, especially one which then performs bidding across the local veterinary clinics to perform the care/surgeries. I've noticed that local vets vary in price by more than an order of magnitude. My 80 year old mother and mother-in-law have been regularly scammed by overcharging vets, and with their dogs being a major part of their lives, they are extremely susceptible to pressure.

contagiousflow1mo ago

What makes you think that LLM vet companies wouldn't bend to the same forces of "overcharging"?

NiloCK1mo ago

The general trend is that cost of entry in a lot of domains is collapsing.

Every sniffed out systematic service overcharge can be aggressively undercut by competition.

"Your margin is my opportunity", etc.

programmertote1mo ago· 2 in thread

My spouse is a hematologist+oncologist. She and all of her coworkers use ChatGPT. Before that, they looked stuff up on UpToDate [ https://www.uptodate.com/login ] (they sometimes still do). I went to medical school for three years and quit because I couldn't stand the rote memorization part of the studies. Too many facts to remember IMO.

Even as an AI-neutral person, I'm very confident that AI/ML-based computer systems, once trained specifically for medicine, will consistently do better than human doctors because, believe it or not, there are a lot of human errors made in the medical field (doctors just don't admit that and we don't know) due to lack of time by doctors or incompetence or simply forgetting a fact or two that they should have checked when diagnosing or coming up with a treatment.

nanfinitum1mo ago

I have a lot of doctor friends who tell me they all use OpenEvidence [1] in their practice. They've done a good job of capturing the doctor market while offering a useful product.

[1] https://www.openevidence.com/

burnte1mo ago

UpToDate is SUCH an awful company, pure rent taking. For site licenses, you just give them your sites' IP addresses and they program them into their firewall. No account management at all. INSANELY high prices. We replaced them with OpenEvidence.

manmal1mo ago· 2 in thread

I know a cardiologist who founded a training & knowledge base startup for doctors. He once told me (that was before LLMs) that it’s super common to tell a patient that the doc needs to look up something in their patient history, to then instead google the symptoms. Or, even more often, quickly text a colleague.

I have no way of knowing if this is true. But I’d rather have a complete, guided prompt be the basis of a diagnosis than a 2-minute Google search.

warmwaffles1mo ago

> quickly text a colleague.

This is still common and useful to gut check and make sure you aren't missing something. Source: wife is a doctor.

manmal1mo ago

Does she think this really does the complexity of each case justice though? I doubt you can compress an anamnesis into a two-liner without losing essential data.

SpyCoder771mo ago· 2 in thread

This is a rather new article about an old model...

sigmar1mo ago

Study design, data collection, analysis, and peer review take time. O1 came out a little over 1.5 years ago

cubefox1mo ago

At this point the study is already mostly irrelevant because the model in question has long been far surpassed by new models. It seems traditional publishing doesn't work for really fast moving fields.

Lihh271mo ago· 2 in thread

radiology already had its "AI beats doctors" moment. radiologists are still here. what changed first was the workflow, not the specialty. er is probably next.

husarcik1mo ago

I don't think radiology has had that moment at all. Computer programming is much closer, if not at that moment right now.

Madmallard1mo ago

no, programming is still just tool use for CRUD applications with react and tailwind

for complex systems programming, it's just so unreliable and foolish to use LLMs to do anything important

companies adopting it for more safety critical systems are just already seeing the problems pile on and we're seeing news about it almost every day on Hacker News

If the tool can make something look smart but isn't necessarily correct, lazy employed humans will just defer to it, especially when their lazy greedy bosses tell them to, and everybody loses over time (except the stakeholders that just jump companies anyway after they made their money)

It's just sad to see these really unwise and inexperienced sentiments repeated ad nauseam

Bender1mo ago· 2 in thread

Humans could not diagnose and treat me correctly. They almost killed me. Curious where I could feed my symptoms and the same data I gave to an ER to an AI to test it.

jacekm1mo ago

https://aistudio.google.com/

causal1mo ago

Chatgpt.com?

Kuyawa1mo ago· 2 in thread

As a 60yo I developed my own AI medical assistant [1] and I've used it extensively for many conditions, I can't be happier. After analyzing some lab tests it even recommended a marker that was not considered first by the doctor, so yes, it won't replace doctors but it is a very helpful tool for self-diagnosing simple conditions and second opinions.

[1] https://mediconsulta.net (DeepSeek)

nickvec1mo ago

Very cool! Just a heads up, the "Pricing" button in the navbar currently has no redirect.

Flere-Imsaho1mo ago

Interesting. From your website I couldn't see where you are based. The reason I'm asking is that I'd only consider using these types of services if they are European/UK based.

LeCompteSftware1mo ago· 1 in thread

It is easy to overinterpret this based on the headline; the doctors were actually at a slight disadvantage. This isn't how they normally work; this is a little more like a med school pop quiz:

  An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.... The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.

"I don't know, let's run more tests" is also a very important ability of doctors that was apparently not tested here. In addition to all the normal methodological problems with overinterpreting results in AI/LLMs/ML/etc. Sadly I do think part of the problem here is cynical (even maniacal) careerist doctors who really shouldn't be working at hospitals. This means that even though I am generally quite anti-LLM, and really don't like the idea of patients interacting with them directly, I am a little optimistic about these being sanity/laziness checkers for health professionals.

bux931mo ago

Also, this is not how ER doctors work? They are not trained for this, nor does it reflect their day-to-day performance. If they would work like this, perhaps they would know a bit more about the nurse writing down those notes, and the kinds of things that particular nurse is likely to miss or overemphasize - just as an example.

The article gives a neat example: In one case in the Harvard study, a patient presented with a blood clot to the lungs and worsening symptoms. Human doctors thought the anti-coagulants were failing, but the AI noticed something the humans did not: the patient’s history of lupus meant this might be causing the inflammation of the lungs. The AI was proved correct.

Which is nice and all, but in the presence of a blood clot, I can understand that treating inflammation instead is not the first thing on a doctor's mind, what with blood clots being potentially life threatening and all. It raises the question; was this a real-life case, and what happened to that patient? Since this is a case for which the correct diagnosis is known, it was eventually correctly diagnosed - presumably then the patient did not die of a blood clot, nor of an uncontrollable fever.

Also, how representative is a patient with Lupus? According to House, MD, it's never Lupus.

jmcgough1mo ago· 1 in thread

LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body, but this is not the average patient I see in an urban emergency department. Many patients can't give a cohesive history without a skilled clinician who can ask the right questions and read between the lines.

I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.

matheusmoreira1mo ago

You went from software to medicine? Pretty cool to discover I'm not alone in this world.

> LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body

I have the same opinion. It's just like software in this regard. A person who's already knowledgeable can prompt well and give detailed context, and tell when the LLM is confidently bullshitting or just plain being lazy. That is not the reality of the average person.

I tried using Claude to help with some hard cases a couple of times and it was very prone to jumping to conclusions based on incomplete information. It was excellent as a research buddy though. I'm using it to great effect to keep myself up to date.

epmaybe1mo ago· 1 in thread

I’m in ophthalmology where AI diagnostics have been promised for almost a decade. We have had FDA-approved diagnostics for diabetic retinopathy screening commercially available since 2018, and papers claiming board certified ophthalmologist level classification accuracy as far back as inceptionv3. Maybe it’s just an economic barrier but these tools still haven’t made any meaningful impact in the US. Other countries without healthcare access? It’s helpful for culling the herd, but it doesn’t fix the last mile problem of what you do when you find referable disease that needs treatment.

My philosophical take: if AI can outperform the average, it’s probably a net benefit for society that I won’t have a job. Until then, I’m going to take my income and save up for an early retirement.

alansaber1mo ago

AI diagnostics is maybe 60% the way there. Robotics is maybe 20% the way there. You'll have a job as a doctor for a good long while.

colechristensen1mo ago· 1 in thread

I think this is more a commentary on how bad ER diagnosis is.

davycro1mo ago

The emergency room should be good at diagnosing emergencies, but most ailments aren’t emergencies.

zahlman1mo ago· 1 in thread

Since when do "triage doctors" attempt diagnosis, or have the expectation of doing so? They're just trying to figure out who needs to see the actual doctor first.

ButlerianJihad1mo ago

Yeah, triage is fundamentally a process of arranging patients in a priority queue, so that the most critical cases can be addressed in minutes, and the other resources are assigned to people who aren’t immediately dying or going into shock.

Triage in disaster/crisis response can even be about figuring out which patients are already dead, or cannot be helped before dying, so you mark them or assign them with a toe-tag, and focus your resources on preventing that number from increasing.
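The priority-queue framing can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the 1-to-5 acuity scale (1 is most urgent), the patient labels, and the rule that arrival order breaks ties. It is not a description of any real triage protocol.

```python
import heapq

def triage_order(arrivals):
    """Return patient labels in the order they would be seen.

    arrivals: list of (label, acuity) tuples in arrival order.
    acuity is a hypothetical 1-5 scale where 1 is the most urgent.
    """
    queue = []
    for arrival_index, (label, acuity) in enumerate(arrivals):
        # The arrival index breaks ties, so equally urgent patients
        # are seen in the order they arrived.
        heapq.heappush(queue, (acuity, arrival_index, label))

    order = []
    while queue:
        _, _, label = heapq.heappop(queue)
        order.append(label)
    return order

arrivals = [
    ("sprained ankle", 4),
    ("chest pain", 1),
    ("deep laceration", 2),
    ("anaphylaxis", 1),
]
print(triage_order(arrivals))
```

Note that nothing in the queue requires knowing what the patient actually has, only how urgent they look, which is the distinction the comment is drawing.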

mawadev1mo ago· 1 in thread

I don't think AI is a good fit for such critical situations. Maybe in a decade we'll have AI help out doctors by doing a pre-check. What if AI finds nothing and the doctor does not bother to look into it further? It is this small question which breaks the technology from any angle later down the road from my POV. AI has to stay optional here.

Even if AI is used to sample or summarize a lot of data that a human couldn't do in time: What if it misses something that a human won't? What if a human inversely misses something that AI won't? Would you rather trust the machine or the human? (Especially if the human is held accountable.)

henry20231mo ago

You can replace AI with blood tests in your comment and the same questions are relevant today.

journal1mo ago· 1 in thread

would it ever diagnose incorrectly to save more lives? kinda weird an ai would decide who dies so others may survive, but i guess whatever.

HWR_141mo ago

Not only should AI misdiagnose to save lives, but a human should too. You walk in with symptoms that are most likely a harmless virus that clears up on its own, but 5% of the time are a deadly bacteria. The correct course of action is to test for the 5% case (most often the wrong diagnosis), not to send people home because they are most likely fine. Many cases have a similar low-but-nonzero chance of a risky diagnosis.
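A rough expected-harm calculation shows why chasing the unlikely diagnosis can be the right call. Only the 5% figure comes from the comment; the harm and cost numbers are made-up units, and the sketch assumes a perfect test, which real tests are not.

```python
def expected_harm(p_serious, harm_if_missed, cost_of_test, test_first):
    """Expected harm per patient, in arbitrary hypothetical units."""
    if test_first:
        # Everyone pays the cost of the test; the serious case is caught
        # (assumes a perfect test, which is a simplification).
        return cost_of_test
    # Nobody is tested, so the serious cases go home untreated.
    return p_serious * harm_if_missed

p = 0.05                 # share of patients with the deadly cause (from the comment)
harm_if_missed = 1000    # hypothetical harm of missing the serious case
cost_of_test = 10        # hypothetical cost of testing one patient

send_home = expected_harm(p, harm_if_missed, cost_of_test, test_first=False)
test_all = expected_harm(p, harm_if_missed, cost_of_test, test_first=True)
print(send_home, test_all)
```

Under these assumptions, testing wins whenever the cost of the test is below `p * harm_if_missed`, even though the "most likely" diagnosis is right 95% of the time.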

adamtaylor_131mo ago· 1 in thread

Despite what I suspect the general consensus on HN may be, this does not surprise me at all.

My wife was recently diagnosed with Mast Cell Activation Syndrome (MCAS) after a pretty scary series of ER visits. It's a very strange and stubborn autoimmune disease that manifests with a number of symptoms that, taken individually, could indicate damn near anything.

You could almost feel the doctors rolling their eyes as she explained her symptoms and medical history.

Anyway... it lit a bit of a fire in me to dig deeper, and one day Claude suggested MCAS. I started plugging in more labs, asking for Claude to cross-reference journals mentioning MCAS, and sure enough: it's MCAS.

idk what the moral of the story is except our current medical system is a joke. The doctors aren't the villains, but they sure aren't the heroes either.

seanmcdirmid1mo ago

The quality of doctors is really uneven, and the number of things they can and have to pattern match on grows each year. I definitely hope they at least adopt AI tooling to ease their pattern matching burden. There is no reason AI needs to replace doctors; I think, as in SWE, doctors are still needed to guide and check the AI in its search for solutions.

Of course, there are plenty of places on earth that are extremely under doctored, and AI will definitely be better than nothing in poor regions of Africa if all it needs is a network connection and someone to donate the tokens.

thih91mo ago· 1 in thread

Off topic, is a “reject all and subscribe” cookie popup button legal?

I thought websites have to make it as easy to give consent as withdraw consent[1] - and here one cannot withdraw consent without an extra step (subscribing).

Instead I would expect access to the article, with same ads as in the “user consented” path, just not personalized.

[1]: “The GDPR is specific that consent must be as 'easy to withdraw as to give'”, https://en.wikipedia.org/wiki/HTTP_cookie

bux931mo ago

No, typically it is not.

https://en.wikipedia.org/wiki/Consent_or_pay

taurath1mo ago· 1 in thread

I’d love to see a follow-up to that radiologist evaluation, where it failed so miserably on the thing it was supposed to be the best at that now there’s a shortage of radiologists.

pasiaj1mo ago

Not an expert but what I’ve heard is that AI-based radiology analysis has brought down prices so much that there’s been a huge increase in demand, which has led to employee shortages.

1 more reply

hereme8881mo ago

Hyped title. It was exclusively text-based diagnosis after physicians did the whole interview, exam, labs, etc.

Also, later in the encounter, with more chart information, AI scored 82%, physicians 70–79%; that difference was reportedly not statistically significant.

So current AI can aid in diagnosing like we've all known.

bando001mo ago

It would have been interesting to see how a doctor with access to LLMs would perform, compared to only LLMs and only doctors. If doctors with LLM access still score 67%, then someone with no medical knowledge could potentially score the same, which would make ER triage a replaceable task by AI. But I am sure that is not the case. Competent doctors with the background they have can use LLMs to brainstorm and analyze different paths and score higher.

noashavit1mo ago

If this is repeatable and holds true across testing groups and practitioners that would be amazing! Doctors could finally spend time with patients rather than rushing to probe, document, test and diagnose. They are so pressed to maximize their time that any time back could go straight into real care. Am I being blindly optimistic here?

droidjj1mo ago

The paper: https://www.science.org/doi/10.1126/science.adz4433 (April 30, 2026)

lqstuart1mo ago

Not long ago I started having an issue with my eye. I called around and they said I should get seen ASAP, same day if possible, but it wasn’t worth the ER and it was a five day wait for an appointment.

I was pretty freaked out. During that time, I tried diagnosing it with AI. When I finally got to the appointment, the actual doctor sat down, looked at all the unremarkable images, asked me one (1) question, ordered another image and diagnosed the issue. When I looked back, in all that time, the AI had mentioned it exactly one time early on, ruled it out immediately based on a flawed understanding of the symptoms, and never brought it up again.

Just my anecdotal evidence, but I’d never trust any AI on its own. My doctor can use it if they want, I can’t.

Hobadee1mo ago

Obviously anecdotal, but a couple years ago my friend's kid was sick, and doctors were trying to figure out what was going on. My friend threw the symptoms and test results into ChatGPT, and it said the likely cause was leukemia. A few hours later the doctors handed them an official leukemia diagnosis.

I think AI, like in all other fields, will become a great tool to help augment. Throw the patient data in and get a response and that can be the first thing the doctor checks for, but they shouldn't simply take AI as truth.

P.S. friend's kid is doing great - it was caught early enough. They are due to be completely done with treatment in just a couple months!

ArjunPatel641mo ago

AI beating doctors on text-based diagnosis isn't surprising. LLMs are pattern-matching machines trained on millions of medical cases. But the real test is the physical exam. Can AI tell the difference between someone faking pain and someone with a ruptured appendix? Between a panic attack and a heart attack? That's what doctors do that AI can't.

ArjunPatel641mo ago

The x-ray study where AI beat radiologists without access to x-rays is hilarious and terrifying. It means the benchmark was broken, not the AI. We're going to see a lot of "AI beats humans" studies that just mean the humans were asked to do something unnatural. Real medicine isn't a multiple-choice test.

jmathai1mo ago

I advise a medical non profit and we ran a series of tests against cases doctors input to our system looking for specialist recommendations.

We found that gpt-5-mini performed better than gpt-5, sonnet 4 and medgemma.

I think these studies are very hard to accurately score. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.

chromacity1mo ago

All the other points raised in this thread aside, it seems like an odd thing to benchmark because a significant proportion of ER practice is dealing with emergencies, often accidental injuries. There's not a whole lot of diagnosing going on if you show up to ER with a gash on your forehead or a missing finger.

SkiFreeWin31mo ago

Yes, but what was the overlap

swisniewski1mo ago

Let’s assume the AI does outperform the DR.

I still want humans in the loop, interpreting the LLM's findings and providing a sanity check.

You can’t hold an LLM accountable.

That’s the min responsible bar for LLM authored code, which normally doesn’t really matter much. For something as important as ER diagnostics, having a human in the loop is crucial.

The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.

We should embrace these tools.

But, “eliminating DRs”… hardly.

arkt81mo ago

How far apart are 67% and 55%, really? Did the research consider the same patients for the AI as for the doctors?

How useful can it be for science if it doesn't compare, side by side, how each case was evaluated by both and how they came to different conclusions?

Who can ensure that a doctor couldn't spot some blind spot the AI missed in the remaining 33%?

Tools are not for replacement but combining efforts.

Throwing such percentages at the public is very irresponsible.

afro881mo ago

I wonder about the nuance within the data. Like does AI do much worse with children than adults, but still better overall for example. Or biological male vs female. I think we'd want it to do better across all groups, ages etc so we're not introducing some kind of horrible bias resulting in deaths or serious health consequences for some groups

wiseowise1mo ago

The Pitt third season leak? All of the ER is fired and Robbie is fighting schizophrenia with 15 agents and Dana?

tsoukase1mo ago

This reminds me of GPT-4 era studies where the LLM did better on a law school exam than a student. We are not in 2023 anymore, or in the case of medicine, are we? If yes, this is bad news for health-related applications, as the low-hanging fruit in LLMs has already been picked.

ivolimmen1mo ago

I can't help but visualize the scene in Idiocracy where there is an examination. The guy gets multiple wires that get put in his hands, mouth and rectum. The guy that assists (aka the doctor) switches the wires after each person.

If we trust machines too much...

DeepYogurt1mo ago

Who's accountable for the 33%?

lowbloodsugar1mo ago

Computers have been better at this since the 80s. But the doctors have a really good union, and they’re smart enough not to call it a “union” so it sounds like it’s about standards and ethics.

david_mchale1mo ago

having been in ERs too many times when they are beyond capacity, something like this would be better than patients slipping through the cracks, at least you get a chance.

getnormality1mo ago

Wow, amazing. They had an AI robot running o1 look at live ER patients coming in just like a real doctor and they did that much better? Incredible! (literally)

1980phipsi1mo ago

How much time do the doctors spend to diagnose versus o1?

Tenobrus1mo ago

o1 has a METR time horizon of around 40 minutes, opus 4.7 has an implied horizon of 18 hours based on its ECI score. this study is on a model that's several generations behind wrt the kind of tasks it can complete. it would be shocking if this number were anywhere near as low with GPT 5.5, to the point it seems nearly totally irrelevant to talk about these results

llbbdd1mo ago

Can't happen soon enough. If the bar was as high as it needed to be, there'd be like one qualified doctor on Earth so far.

PAndreew1mo ago

I mean an LLM is a slightly stirred up soup of current human knowledge. It has an advantage in quantity of accumulated data and maybe connecting seemingly less connected parts of that data - but not reliably. The human has an advantage (for now) in data collection (seeing, hearing sensing the patient), actual agency, real world experiences and getting the useful data out of the stirred up soup. Both human and LLM are susceptible to bias and harmful influence. Let’s simply isolate them in the diagnostic process and then compare their output. Human collects data -> both human and LLM evaluate independently -> compare the results -> human may get new insights -> final diagnosis by human.

Aurornis1mo ago

Gell-Mann Amnesia kicks in hard as soon as the LLM topic changes to a profession other than our own. It’s much easier to believe an LLM can outperform someone else doing their job than to believe that it’s a good idea to replace your own work with an LLM.

The number in the headline isn’t even a good comparison because they asked doctors to make a diagnosis from notes a nurse typed up. Doctors are trained to be conservative with diagnosing from someone else’s notes because it’s their job to ask the patient questions and evaluate the situation, whereas an LLM will happily leap to a conclusion and deliver it with high confidence.

When they allowed both the AI and the doctors access to more information about the case, the difference between groups collapsed into statistical insignificance:

> The diagnosis accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
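To see how an 82% vs. 70-79% gap can fail to reach significance, here is a rough sketch using a pooled two-proportion z-test. The sample sizes and the 75% figure for the doctors are placeholders, not values from the paper, and the real study presumably used a paired design on the same cases, so treat this purely as an illustration of why sample size matters.

```python
from math import erf, sqrt

def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)

# Hypothetical: 82% vs 75% accuracy with a small number of cases per group.
small = two_proportion_p_value(0.82, 0.75, n1=76, n2=76)
# Same accuracies with a much larger (also hypothetical) sample.
large = two_proportion_p_value(0.82, 0.75, n1=1000, n2=1000)

print(f"small sample p-value: {small:.3f}")
print(f"large sample p-value: {large:.5f}")
```

With a few dozen cases per group, a seven-point gap sits comfortably within noise at the usual 0.05 threshold; the same gap would be significant with a thousand cases per group.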

Talking to my medical professional friends, LLMs are becoming a supercharged version of Dr. Google and WebMD that fueled a lot of bad patient self-diagnoses in the past. Now patients are using LLMs to try to diagnose themselves and doing it in a way where they start to learn how to lead the LLM to the diagnosis they want, which they can do for a hundred rounds at home before presenting to the doctor and reciting the script and symptoms that worked best to convince the LLM they had a certain condition.

lvl1551mo ago

I’ve some family in medicine and it scares me how much they now rely on AI. Some even quote it like the Bible.

bluefirebrand1mo ago

Unfortunately, from my understanding Doctors don't necessarily diagnose for accuracy, they often diagnose to limit liability.

They aren't going to take a stab at an uncommon diagnosis even if it occurs to them, if they might get sued if they're wrong.

Edit: I'm not trying to say Doctors deliberately diagnose wrong. Just that if there are two possible diagnoses, one common that matches some of the symptoms and one rare that matches all symptoms, doctors are still much more likely to diagnose the common one. Hoofbeats, horses, zebras, etc

biglost1mo ago

I'm curious: I'd like to know whether that 33% is a subset of the 50-45%. If it isn't a subset, then how serious were those errors? More deaths? Longer recovery times? What did that difference translate into?
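The subset question is easy to state precisely once per-case results are available. A minimal sketch with made-up case IDs (the article reports only aggregate percentages, so none of these values come from the study):

```python
def overlap_report(ai_missed, doctor_missed):
    """Compare the sets of cases each side got wrong."""
    return {
        "both_missed": ai_missed & doctor_missed,
        "only_ai_missed": ai_missed - doctor_missed,
        "only_doctor_missed": doctor_missed - ai_missed,
        # True only if every case the AI missed was also missed by the doctors.
        "ai_misses_are_subset": ai_missed <= doctor_missed,
    }

# Hypothetical case IDs, purely for illustration.
ai_missed = {"case03", "case07", "case11"}
doctor_missed = {"case03", "case05", "case07", "case09", "case12"}

report = overlap_report(ai_missed, doctor_missed)
for key, value in report.items():
    print(key, value)
```

The `only_ai_missed` bucket is the interesting one: those are cases where swapping the doctor for the model would have made things worse, and their severity matters more than the headline accuracy.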

arkt81mo ago

How much confidence is there in that 67%? Was it measured on the same patients with the same info? If not, it is just bait.

yfw1mo ago

Sensitivity vs specificity

kian1mo ago

But what was the overlap?

economistbob1mo ago

What we need is a completely walled garden during the ER sign-in process where the patient says what they think the problem is. Then things proceed as normal. We need some data to know whether patients are less than fifty percent accurate or not.

Fifty percent accuracy. That's terrible.

ZiiS1mo ago

Triage deliberately diagnoses rarer conditions that would be more serious or require more urgent treatment so they can be ruled out.

basyt1mo ago

i would rather be incorrectly diagnosed by a doctor than have chudgpt treat me.

1 more reply

Aboutplants1mo ago

Now show me the result of Triage Doctors with aided AI help

hansmayer1mo ago

jfc, when does this ai boosting finally stop.

plexescor1mo ago

One shouldn't trust AI regarding medical matters, things can go downhill you know

gpm1mo ago· 52 in thread

I'd be very very hesitant to trust studies like this. It's very easy to mess up these benchmarks.

Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.

pixel_popping1mo ago

So I’m genuinely curious:

gherkinnn1mo ago

To answer your question: talking to a human.

25 more replies

hyperpape1mo ago

It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.

2 more replies

teleforce1mo ago

You cannot simply put liability and ethics aside; after all, there's the Hippocratic Oath, which is fundamental to the practice of physicians.

Having said that, there are always two extreme camps, those who hate AI and those who are kind of obsessed with AI in medicine. We will be much better off if we stay in the middle, aka moderate, on this issue.

IMHO, the AI should be used as a screening and triage tool with very high sensitivity, preferably 100%; otherwise it will create "the boy who cried wolf" scenario.

With 100% sensitivity we essentially have zero false negatives, but potentially some false positives.

Current risk-based screening triage for CVD, like SCORE-2, has a sensitivity of only around 50% (2025 study) [3].
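For readers less familiar with the terms, sensitivity and specificity fall straight out of a confusion matrix. The counts below are invented to illustrate the trade-off and are not taken from the cited study.

```python
def sensitivity(true_pos, false_neg):
    """Share of people who have the disease that the screen flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of people without the disease that the screen clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical screen A: catches every sick patient, at the price of false alarms.
screen_a = {"true_pos": 50, "false_neg": 0, "false_pos": 200, "true_neg": 750}
# Hypothetical screen B: misses half of the sick patients.
screen_b = {"true_pos": 25, "false_neg": 25, "false_pos": 50, "true_neg": 900}

for name, s in (("A", screen_a), ("B", screen_b)):
    sens = sensitivity(s["true_pos"], s["false_neg"])
    spec = specificity(s["true_neg"], s["false_pos"])
    print(f"screen {name}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

A screen tuned like A never sends a sick patient home, but every false positive still costs a follow-up, so the two numbers have to be read together.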

[1] Hippocratic Oath:

https://en.wikipedia.org/wiki/Hippocratic_Oath

[2] The Hippocratic Oath:

https://pmc.ncbi.nlm.nih.gov/articles/PMC9297488/

[3] Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models:

https://academic.oup.com/eurjpc/advance-article/doi/10.1093/...

4 more replies

SkiFire131mo ago

You first have to assume this for software engineers. Not everyone agrees with that (note: that doesn't mean those same people think AI is not _useful_).

4 more replies

root_axis1mo ago

Like with software though, they are obviously a beneficial tool if used responsibly.

dragonwriter1mo ago

No, I don’t see that we must.

> if we already have this assumption for software engineers

2 more replies

themafia1mo ago

It provides no information on real world outcomes or expectations of performance in such a setting. A simple question might be "how accurate are patient electronic health records typically?"

1 more reply

miki1232111mo ago

> What is the specific capability (or combination of capabilities)

The ability to go to prison / be stripped of a license when something goes wrong.

1 more reply

nozzlegear1mo ago

> if we already have this assumption for software engineers

Do we have that assumption? I don't think there's a consensus on it yet, just various camps of people proselytizing the other camps based on how much or little they use AI.

throw2342342341mo ago

Terretta1mo ago

Humans tend to be very bad at connecting dots, which is why when we imagine someone who does, we make the show "House" about it.

IOW, these concept connection pattern machines are likely to outstrip median humans at this sort of thing.

That said, exceptional smoke detection and dots connecting humans, from what I've observed in diagnostic professions, are likely to beat the best machines for quite a while yet.

827a1mo ago

largbae1mo ago

We are very far from having unit test suites for medical problems.

3 more replies

nkrisc1mo ago

> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.

I would not want to receive a cancer diagnosis from a fucking AI doctor.

3 more replies

ricardobayes1mo ago

But it's important not to rely on it. Doctors can easily recognize and correct measurements with incorrect input, e.g. ECG electrodes being used in reverse order.

1 more reply

fc417fc8021mo ago

> I can't really wrap my head about the fact that doctors will be better than AI models on the long-run.

Nobody said that though?

1 more reply

pianopatrick1mo ago

2 more replies

boh1mo ago

RandomLensman1mo ago

You also have to assume advances in sensors and robotics (e.g., smell, surgery, certain tactile sensations) - there is a data acquisition and action part there, too.

In this study, I think there was an MD before the AI to enrich data.

somethingsome1mo ago

1 more reply

pdntspa1mo ago

> if we already have this assumption for software engineers,

Assuming what exactly? That they write more code? Better code? Better designs? Better architecture?

Because only a few of the above assumptions are arguably true.

KaiserPro1mo ago

There are a few sides to medicine:

1) looking at tests and working out a set of actions

2) following a pathway based on diagnosis

3) pulling out patient history to work out what the fuck is wrong with someone.

Where AI actually is probably quite good is note taking, and continuous monitoring of HCU/ICU patients

1 more reply

xbmcuser1mo ago

If all the curated data is really shared with an AI over time they will be better than most individual doctors. I personally think AI could be a great triage system.

xoofoog1mo ago

1 more reply

delfinom1mo ago

Medicine is about knowledge, but acquiring knowledge may in fact require "breaking out of the box" that AI is increasingly kept inside to avoid touching "touchy subjects" or insulting anyone and so on.

dominotw1mo ago

> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor?

Detecting when a patient is lying. "All patients lie" - Dr. House

godelski1mo ago

  > After all, medicine is all about knowledge, experience and intelligence

So is... everything?

LLMs are really really good at knowledge.

But they are really really bad at intelligence [0]

They have no such thing as experience.

[1] nor can you assume any new or seemingly novel data isn't meaningfully different than the data it was trained on.

2 more replies

wonnage1mo ago

Ah, the classic "let's be objective and ignore key constraint that is inconvenient for SV tech bro hype"

Aurornis1mo ago

When you read through the article it shows that the gap between doctors and LLMs actually disappeared (in terms of statistical significance) once both were allowed to read the full case notes.

The headline is quoting a number based on guessed diagnoses from nurse's notes. My guess is that the LLM was happier to take guesses from the selected case studies than the doctors were.

Intralexical1mo ago

Not only is the study testing something which only vaguely resembles how doctors diagnose patients, but isolated accuracy percentages are also a terrible way to measure healthcare quality.

1 more reply

utopiah1mo ago

> very hesitant to trust studies like this

torginus1mo ago

tensor1mo ago

Interestingly, this recent study using ChatGPT Health gave quite a different outcome (https://www.nature.com/articles/s41591-026-04297-7). Here it was wrong about emergency triage 50% of the time.

directevolve1mo ago

In a study like this, there’s also a difference in motivation. An AI will mechanically “take the study seriously.” I’m not convinced the doctors will.

But when making decisions about a real patient’s care, a doctor will be operating under different motivations.

They can also refer patients to a specialist, defer a diagnosis until they have more information, use external resources, consult with other doctors.

Doctors aren’t chatbots. They are clinical care directors.

mday271mo ago

hallucination on steroids, wow. I had to read through the abstract to believe it:

"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."

Chinjut1mo ago

I still don't quite understand, after skimming the paper. How does it achieve high scores without access to the images (beating even humans with access to the images)?

1 more reply

gosub1001mo ago

Or the case where supposedly radiologists couldn't see a gorilla in the image [1]

I know it might look like a loss for radiologists, but I don't see it that way. More like you can't trust these studies.

1. https://www.npr.org/sections/health-shots/2013/02/11/1714096...

mhitza1mo ago

I think AI can be useful in any kind of context interpretation, but not make a decision.

Could be running in the background on patient data and message the doctor "I see X in the diagnostic, have you ruled out Y, as it fits for reasons a, b, c?"

I like my coding agents the same way, inform me during review on things that I've missed. Instead of having me comb through what it generates on a first pass.

1 more reply

nottorp1mo ago

> the human doctors don't just look at the notes to diagnose the ER patient

From my limited experience hanging around in ER hallways for other people, they don't look at the notes, they look at the damn patient.

troupo1mo ago

I'm even more concerned that current models are not trained to say no, or to even recognize most failure modes.

"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.

raphman1mo ago

FWIW, I just tried the prompt from the paper with ChatGPT 5.5 and Claude 4.7 - both in thinking mode. (The study used GPT 5.1 and Claude 4.5)

> "number of image attachments: 1 Describe this imaging of my chest x-ray and what is your final diagnosis? put the diagnosis in ⟨diagnosis⟩ tags"

[1] https://chatgpt.com/share/69f7ce8f-62d0-83eb-963c-9e1e684dd1...

[2] https://claude.ai/share/34190c8a-9269-44a1-99af-c6dec0443b64

1 more reply

sandeepkd1mo ago

_heimdall1mo ago

I haven't finished reading the linked paper, but I'm intrigued by the assumption that the results show illusion or mirage results when not giving access to the x-rays.

It seems like a very reasonable take away, but it skips the other one. Do x-rays make results less accurate?

AntiUSAbah1mo ago

Weird that this is the case and a new study.

prmoustache1mo ago

brikym1mo ago

I think it's plausible since doctors tend to have human cognitive biases and miss things. People tend to fixate on patterns they're most familiar with.

namuol1mo ago

A bold claim to suggest that LLMs aren’t prone to biases of their own which are less understood.

1 more reply

dyauspitr1mo ago

I think the bigger takeaway here is that 50% of the time doctors will miss what you have.

gpm1mo ago

That's not a takeaway here at all.

It's 50% of the time ER doctors working solely from notes, something they never do, in a situation they know is only for a study, will miss what you have.

In real clinical situations the doctors see, hear, smell, and interact with the patients.

1 more reply

ngokevin1mo ago

1 more reply

tracker11mo ago

Definitely not a "fair" test... which would probably include say a 5-10 minute conversation with a doctor or an AI agent (maybe a nurse operator to obfuscate the use of AI).

For that matter, probably less expensive to expand the AI conversation into as much as 30-40 minutes, where good luck ever getting that much time with a regular doctor.

creativeSlumber1mo ago· 16 in thread

> "An AI and a pair of human doctors were each given the same standard electronic health record to read"

This is handicapping the human doctors' abilities. There is a lot more information a human doctor can gather even with a brief observation of the patient.

kqr1mo ago

On the other hand,

> there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly, like a clinical interview.

https://entropicthoughts.com/arithmetic-models-better-than-y...

DedlySnek1mo ago

They have covered this in the article.

Frieren1mo ago

> The study only tested humans against AIs looking at patient data that can be communicated via text.

This is like saying that LLMs can evaluate paintings better than art experts. But only when looking at data that can be communicated via text.

Of course they can, because it makes no sense to do such a thing.

OJFord1mo ago

> That means the AI was performing more like a clinician producing a second opinion based on paperwork.

(I'm assuming most cases would be "You're absolutely right, that's an astute diagnosis.")

cogman101mo ago

Agreed. I think the best use of this sort of tech is to use both to their strengths. Use AI to go over the record and suggest diagnoses which you have the doctor review after observing the patient.

tossandthrow1mo ago

You could say the same about the AI. AI is incredibly well suited for extracting knowledge through chats.

In this regard, a doctor also just has 15 minutes for an interview. An AI can be with the patient for days leading up to a consultation.

So if we remove this "handicap", the AI will likely really start to win.

nickserv1mo ago

Chat seems like a really bad way to get patient information. You'll miss out on various cues doctors will use to diagnose you. People can get ashamed of their symptoms and may try to hide them.

finghin1mo ago

It’s not good for a doctor to be your best friend. It doesn’t seem any LLM is capable of that emotional distance.

lqstuart1mo ago

It’s the ER. People aren’t always in a position to “chat” when they go there.

1 more reply

vasco1mo ago

When I got tired of this I just lied to the emergency line and was admitted to hospital based on my lie, and they discovered a brain tumor which explained the other stuff.

I WISH I could just use AI.

jrm41mo ago

This feels like a deeply important observation. Now also, would be interesting to include e.g. a short video or photograph for the AI to use as well.

djb_hackernews1mo ago

Can't the same be said for the AI?

smt881mo ago

If the answer is yes, let’s see that study.

This one compares AI to a human doctor practicing in a very unrealistic way.

camdenreslink1mo ago

No? Can an AI examine a patient in the physical world?

1 more reply

delfinom1mo ago

Now feed a flawed transcript into an AI diagnosis system and bam-o. The AI will treat it as gospel, while the doctor may go "wait, what?"

chungusamongus1mo ago

So o1 can do more with less?

theshrike791mo ago· 8 in thread

I'll repeat my idea on how this MUST be done:

1. AI gets data about the patient and makes a diagnosis. This is NOT shown to doctor yet.

2. Doctor does their stuff, writes down their diagnosis. This diagnosis is locked down and versioned.

3. Doctor sees AI's diagnosis

4. Doctor can adjust their diagnosis, BUT the original stays in the system.

This way the AI stays as the assistant and won't affect the doctor's decision, but they can change their mind after getting the extra data.
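The four steps above can be sketched as a small append-only record. The class and method names are hypothetical and chosen only for this illustration; a real system would also need authentication, audit logging, and persistent storage.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Encounter:
    """Hypothetical record enforcing 'doctor commits first, then sees the AI'."""
    ai_diagnosis: str  # step 1: stored up front, hidden from the doctor
    _versions: List[str] = field(default_factory=list, init=False)

    def record_doctor_diagnosis(self, diagnosis: str) -> None:
        # Steps 2 and 4: every entry is appended, nothing is overwritten.
        if not diagnosis.strip():
            raise ValueError("diagnosis must not be empty")
        self._versions.append(diagnosis)

    def reveal_ai_diagnosis(self) -> str:
        # Step 3: the AI output unlocks only after an initial diagnosis exists.
        if not self._versions:
            raise PermissionError("doctor must commit a diagnosis first")
        return self.ai_diagnosis

    @property
    def original(self) -> str:
        return self._versions[0]

    @property
    def current(self) -> str:
        return self._versions[-1]

encounter = Encounter(ai_diagnosis="pulmonary embolism")
encounter.record_doctor_diagnosis("pneumonia")          # step 2
print(encounter.reveal_ai_diagnosis())                  # step 3
encounter.record_doctor_diagnosis("pulmonary embolism") # step 4
print(encounter.original, "->", encounter.current)
```

Because the first entry is never overwritten, comparing `original` with `current` across many encounters shows how often the AI's suggestion actually changed a diagnosis.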

stuxnet791mo ago

5. Private Equity uses this valuable data to stack rank doctors based on how correct / AI-aligned their diagnoses are over time

6. Rankings are used to periodically "trim the fat" thus delivering more optimized cash flows to clinics that have been saddled with toxic debt

7. Sensing an opportunity AI providers start selling a $200 / month Data Leakage as a Service subscription to overworked physicians so that they can avoid the PE guillotine

fc417fc8021mo ago

I agree with GP's solution but we'd need regulation to prohibit what you describe.

1 more reply

avidiax1mo ago

Why would private equity want more competent doctors?

Incompetent ones order unnecessary tests and exhaust treatment possibilities, which drives up cost billed to insurance.

Only the insurance industry and perhaps licensing bodies can pressure to keep the quality floor high, at least in terms of accurate diagnosis and prevention of overtreatment.

troupo1mo ago

5. Doctors delegate everything to AI assistants because humans are lazy, especially if those AI assistants are correct some significant portion of the time

mawadev1mo ago

theshrike791mo ago

Step 2 prevents that. It's not there by accident.

They need to write down their (initial) diagnosis before the AI answer is shown.

mawadev1mo ago

This still promotes metacognitive laziness later down the road as the doctor can hand in something quickly and rely on AI to close that gap.

theshrike791mo ago

The magic is in the initial diagnosis being written down, saved and locked.

It's trivial to analyse the pre/post AI involvement doctor diagnosis manually and see what's going on.

If a doctor is just putting "asdljasdaskjd" on the initial to unlock the AI answer, they should be promptly fired.

gizmodo591mo ago· 7 in thread

pinkmuffinere1mo ago

tuananh1mo ago

> real reward here is that the doctor+AI unit should perform better than the doctor in isolation

that is true for other professions as well.

while everyone is afraid of layoffs, the real question is always whether "employee+AI" is better than the employee or the AI alone.

vector_spaces1mo ago

thephyber1mo ago

That’s what they said about Enron.

Skepticism is an incredibly useful tool, even in excess.

an0malous1mo ago

I for one am delighted for my acquaintances in the medical field with their cushy, cartel-supported salaries to feel the existential dread of AI coming for their jobs like I have

krupan1mo ago

12345ieee1mo ago

Oh no, imagine the people that save human lives having high salaries, the horror.

If you, like me, are in the software field, know that this is likely the most comfortable job ever invented by humanity; we should really be paid just above the poverty line in exchange.

011000111mo ago· 5 in thread

guidedlight1mo ago

I agree. I think the issue with LLMs is not with the correct diagnoses but rather the incorrect ones.

Real doctors tend to have a degree of cautiousness. I would rather a real doctor be hesitant and seek more information than an alarmist LLM suggesting I have cancer.

011000111mo ago

Yeah apparently my comment wasn't clear enough. If you can get the opinion of a doctor then good for you. I'm saying an LLM is the best some of us can get.

NegativeK1mo ago

I don't think that using LLMs for medicine is an appropriate fix for the US's healthcare issues.

Unless healthcare businesses decide to improve patient care with AI instead of increasing patients per day, I think it's going to make things even worse.

vjvjvjvjghv1mo ago

Doctors using AI will probably just increase the number of patients they see. But for me as a patient, AI is super useful to get a good handle on the situation before I see a doctor.

011000111mo ago

I'm not suggesting it as a fix. I'm saying it's the only option to get medical answers for many people.

beering1mo ago· 5 in thread

o1 is several generations old and was released in 2024. Is this some quite old research that took a long time to get published?

SpicyLemonZest1mo ago

Yes, the preprint of the same paper (https://arxiv.org/abs/2412.10849) was first written in December 2024.

nhinck21mo ago

It's also important to note that it beat doctors in diagnosing in a way doctors do not diagnose.

aurareturn1mo ago

oofbey1mo ago

Medical research moves. Very. Slowly.

bluefirebrand1mo ago

That's a good thing

The medical equivalent to "move fast and break things" would be "move fast and kill people"

wg01mo ago· 5 in thread

sigmar1mo ago

>The Guardian needs to raise their bar on what to report and how to give readers full context

Should they not report on peer reviewed articles published in Science? or only report published articles that fit your priors?

wg01mo ago

Fair enough. But there's a lot of faulty and wrong peer-reviewed research as well. One such paper comes to mind, which has probably been cited some 7000+ times in other papers but is itself wrong.

pixel_popping1mo ago

So we can eventually classify AI models as Software experts, but not as Medical experts, why so?

wg01mo ago

I don't classify them as software experts either. Anyone doing so is probably not an expert themselves.

I take them as those code generation command line tools like create react app and such.

tene80i1mo ago

It’s a peer reviewed study in one of the world’s top science journals. It’s not some random person on a podcast.

lukko1mo ago· 4 in thread

lokar1mo ago

Simply getting the "high score" on this evaluation is not necessarily good medical treatment.

lukah1mo ago

Exactly this. Most diagnosis isn’t about pinpointing the underlying exact cause, it’s ruling out the really bad stuff and minimising harm. Differential diagnosis just isn’t real world medicine.

IshKebab1mo ago

Yeah 100% this. We've all used AI. It's obvious that it can sometimes outperform humans in a "did it get the right answer" benchmark while being wildly worse overall because of worse failure modes.

I bet the AI's incorrect answers are less "I don't know, let's get a second opinion" and more "you're perfectly fine, 0% chance this is cancer".

djhn1mo ago

[1] https://bugs.rocicorp.dev/p/roci

SilverElfin1mo ago· 4 in thread

While I’m sure there can be ways in which such studies are wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help - doctors, lawyers, etc.

kakacik1mo ago

If you have a string of issues with your last 10 doctors though, then the issue is, most probably, you...

My wife is a GP, and easily 1/3 of her patients also have some minor-but-visible mental issue, 1-2 on a scale of 10. It still leaves them functional in society but... often very hard to be around.

llbbdd1mo ago

Doctors thinking patients are arrogant is an age old problem.

aduwah1mo ago

It makes me so upset when anyone even tries to defend the GPs.

I admittedly have a bunch of medical issues and these gems are my favourites from the GPs.

1. I cannot see the tonsil on the left side, so it is OK. (there was a 6cm!!! cyst in front of it)

2. After missing sky high TSH measures consistently for 2 years (4 tests): "It must have been a few one offs" (no it wasn't and it is not even possible)

3. "Blood pressure has nothing to do with weight"

These %#£&* so called medical professionals are still working and most likely killing people legally.

So no, do not try to call me arrogant. I am not arrogant, I am defending myself from these "GPs" so they won't put me in an early grave by making fatal mistakes.

SilverElfin1mo ago

OptionOfT1mo ago· 3 in thread

(I was ~3 months away from wheelchair bound in those x-rays).

The worst one was Gemini. Upload an x-ray of just the right hip, and it started to talk about how good the left hip looked.

I think with AI taking over it's gonna be harder to get a solution when your problem isn't run-of-the-mill.

jeffbee1mo ago

All versions and levels of Gemini have terrible spatial reasoning. I don't know why. That kind of task seems to be simply outside of the abilities of the model.

cyberax1mo ago

The general AI models are useless if you need precision. They are designed to create/analyze pretty pictures.

But specialized models can be inhumanly good. I know, our main product is a model that does _precise_ analysis :)

OptionOfT1mo ago

I'd love to see the output of your system for my x-rays!

gamerslexus1mo ago· 3 in thread

Hold on. Does this mean ER diagnoses are marginally better than pure chance?

n2d41mo ago

No, because randomly guessing from a list of diagnoses is not 50/50
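A quick sketch of why, assuming purely for illustration that a guesser picks uniformly from a list of candidate diagnoses and exactly one is correct. The list sizes are made up, and the function name `chance_accuracy` is hypothetical.

```python
def chance_accuracy(num_candidates: int) -> float:
    """Probability that a uniform random guess hits the single correct diagnosis."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return 1 / num_candidates


# Chance is only 50/50 when there are exactly two candidates.
for k in (2, 10, 100):
    print(k, chance_accuracy(k))
```

With even ten plausible diagnoses, a uniform guess is right one time in ten, so 50% accuracy is far above that baseline.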

notahacker1mo ago

And ER generally does not involve key decisions being made by someone isolated from the patient given only an incomplete set of notes to make their diagnosis

gamerslexus1mo ago

Good point.

tedggh1mo ago· 3 in thread

oofbey1mo ago

For all the doubt and negativity here I just want to say “good job” to you. Way to take matters into your own hands and protect your loved ones. Haters gonna hate but you did it.

voxl1mo ago

Needless conspiracy bullshit without sharing specifics

Applejinx1mo ago

jmpman1mo ago· 2 in thread

contagiousflow1mo ago

What makes you think that LLM vet companies wouldn't bend to the same forces of "over charging"?

NiloCK1mo ago

The general trend is that cost of entry in a lot of domains is collapsing.

Every sniffed out systematic service overcharge can be aggressively undercut by competition.

"Your margin is my opportunity", etc.

programmertote1mo ago· 2 in thread

nanfinitum1mo ago

I have a lot of doctor friends who tell me they all use OpenEvidence [1] in their practice. They've done a good job of capturing the doctor market while offering a useful product.

[1] https://www.openevidence.com/

burnte1mo ago

manmal1mo ago· 2 in thread

I have no way of knowing if this is true. But I'd rather have a complete, guided prompt be the basis of a diagnosis than a 2-minute Google search.

warmwaffles1mo ago

> quickly text a colleague.

This is still common and useful to gut check and make sure you aren't missing something. Source: wife is a doctor.

manmal1mo ago

Does she think this really does the complexity of each case justice though? I doubt you can compress an anamnesis into a two-liner without losing essential data.

SpyCoder771mo ago· 2 in thread

This is a rather new article about an old model...

sigmar1mo ago

Study design, data collection, analysis, and peer review take time. o1 came out a little over 1.5 years ago

cubefox1mo ago

Lihh271mo ago· 2 in thread

radiology already had its "AI beats doctors" moment. radiologists are still here. what changed first was the workflow, not the specialty. er is probably next.

husarcik1mo ago

I don't think radiology has had that moment at all. Computer programming is much closer, if not, at that moment right now.

Madmallard1mo ago

no, in programming it's still just tool use for CRUD applications with React and Tailwind

for complex systems programming, LLMs are just so unreliable that it's foolish to use them for anything important

companies adopting it for more safety critical systems are just already seeing the problems pile on and we're seeing news about it almost every day on Hacker News

It's just sad to see these really unwise and inexperienced sentiments repeated ad nauseam

Bender1mo ago· 2 in thread

Humans could not diagnose and treat me correctly. They almost killed me. Curious where I could feed my symptoms and the same data I gave to an ER to an AI to test it.

jacekm1mo ago

https://aistudio.google.com/

causal1mo ago

Chatgpt.com?

Kuyawa1mo ago· 2 in thread

[1] https://mediconsulta.net (DeepSeek)

nickvec1mo ago

Very cool! Just a heads up, the "Pricing" button in the navbar currently has no redirect.

Flere-Imsaho1mo ago

Interesting. From your website I couldn't see where you are based. The reason I'm asking is that I'd only consider using these types of services if they are European/UK based.

LeCompteSftware1mo ago· 1 in thread

It is easy to overinterpret this based on the headline; the doctors were actually at a slight disadvantage. This isn't how they normally work. It is a little more like a med school pop quiz:

  An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.... The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.

bux931mo ago

Also, how representative is a patient with Lupus? According to House, MD, it's never Lupus.

jmcgough1mo ago· 1 in thread

matheusmoreira1mo ago

You went from software to medicine? Pretty cool to discover I'm not alone in this world.

> LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body

epmaybe1mo ago· 1 in thread

alansaber1mo ago

AI diagnostics is maybe 60% the way there. Robotics is maybe 20% the way there. You'll have a job as a doctor for a good long while.

colechristensen1mo ago· 1 in thread

I think this is more a commentary on how bad ER diagnosis is.

davycro1mo ago

The emergency room should be good at diagnosing emergencies, but most ailments aren’t emergencies.

zahlman1mo ago· 1 in thread

Since when do "triage doctors" attempt diagnosis, or have the expectation of doing so? They're just trying to figure out who needs to see the actual doctor first.

ButlerianJihad1mo ago

mawadev1mo ago· 1 in thread

henry20231mo ago

You can replace AI with blood tests in your comment and the same questions are relevant today.

journal1mo ago· 1 in thread

would it ever diagnose incorrectly to save more lives? kinda weird an ai would decide who dies so others may survive, but i guess whatever.

HWR_141mo ago

adamtaylor_131mo ago· 1 in thread

Despite what I suspect the general consensus on HN may be, this does not surprise me at all.

You could almost feel the doctors rolling their eyes as she explained her symptoms and medical history.

idk what the moral of the story is except our current medical system is a joke. The doctors aren't the villains, but they sure aren't the heroes either.

seanmcdirmid1mo ago

thih91mo ago· 1 in thread

Off topic, is a “reject all and subscribe” cookie popup button legal?

I thought websites have to make it as easy to give consent as withdraw consent[1] - and here one cannot withdraw consent without an extra step (subscribing).

Instead I would expect access to the article, with same ads as in the “user consented” path, just not personalized.

[1]: “The GDPR is specific that consent must be as 'easy to withdraw as to give'”, https://en.wikipedia.org/wiki/HTTP_cookie

bux931mo ago

No, typically it is not.

https://en.wikipedia.org/wiki/Consent_or_pay

taurath1mo ago· 1 in thread

I’d love to see a follow-up to that radiologist evaluation, where it failed so miserably on the thing it was supposed to be the best at that now there’s a shortage of radiologists.

pasiaj1mo ago

Not an expert but what I’ve heard is that AI-based radiology analysis has brought down prices so much that there’s been a huge increase in demand, which has led to employee shortages.

hereme8881mo ago

Hyped title. It was exclusively text-based diagnosis after physicians did the whole interview, exam, labs, etc.

Also, later in the encounter, with more chart information, AI scored 82%, physicians 70–79%; that difference was reportedly not statistically significant.

So current AI can aid in diagnosing like we've all known.

bando001mo ago

noashavit1mo ago

droidjj1mo ago

The paper: https://www.science.org/doi/10.1126/science.adz4433 (April 30, 2026)

lqstuart1mo ago

Just my anecdotal evidence, but I’d never trust any AI on its own. My doctor can use it if they want, I can’t.

Hobadee1mo ago

P.S. friends kid is doing great - it was caught early enough. They are due to be completely done with treatment in just a couple months!

ArjunPatel641mo ago

jmathai1mo ago

I advise a medical non profit and we ran a series of tests against cases doctors input to our system looking for specialist recommendations.

We found that gpt-5-mini performed better than gpt-5, sonnet 4 and medgemma.

I think these studies are very hard to accurately score. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.

chromacity1mo ago

SkiFreeWin31mo ago

Yes, but what was the overlap

swisniewski1mo ago

Let’s assume the AI does outperform the doctor.

I still want humans in the loop, interpreting the LLM’s findings and providing a sanity check.

You can’t hold an LLM accountable.

That’s the min responsible bar for LLM authored code, which normally doesn’t really matter much. For something as important as ER diagnostics, having a human in the loop is crucial.

The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.

We should embrace these tools.

But, “eliminating doctors”… hardly.

arkt81mo ago

How big is the gap between 67% and 55%, really? Did the study evaluate the AI on the same patients as the doctors?

How useful can this be for science if it doesn't compare side by side how each case was evaluated by both and how they came to different conclusions?

Who can guarantee a doctor couldn't spot some blind spot the AI missed in the remaining 33%?

Tools are not for replacement but for combining efforts.

Throwing such percentages at the public is very irresponsible.
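A rough sketch of what such a side-by-side comparison could look like in Python, assuming both the AI and the doctor were scored on the same cases. The function names and the example lists are hypothetical illustrations, not data from the study. The p-value uses the exact McNemar test, which looks only at the cases where the two disagree.

```python
from math import comb


def paired_overlap(ai_correct, doc_correct):
    """Count the four outcomes for cases seen by both the AI and the doctor."""
    if len(ai_correct) != len(doc_correct):
        raise ValueError("both lists must cover the same cases")
    counts = {"both": 0, "ai_only": 0, "doctor_only": 0, "neither": 0}
    for a, d in zip(ai_correct, doc_correct):
        if a and d:
            counts["both"] += 1
        elif a:
            counts["ai_only"] += 1
        elif d:
            counts["doctor_only"] += 1
        else:
            counts["neither"] += 1
    return counts


def mcnemar_exact_p(ai_only, doctor_only):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    n = ai_only + doctor_only
    if n == 0:
        return 1.0
    k = min(ai_only, doctor_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up example: True means the diagnosis was judged correct for that case.
ai = [True, True, False, True, False, True]
doc = [True, False, False, True, True, False]
counts = paired_overlap(ai, doc)
print(counts, mcnemar_exact_p(counts["ai_only"], counts["doctor_only"]))
```

The `doctor_only` bucket is the interesting one here: those are the cases a doctor caught and the AI missed, which a headline accuracy figure hides.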

afro881mo ago

wiseowise1mo ago

The Pitt third season leak? All of the ER is fired and Robbie is fighting schizophrenia with 15 agents and Dana?

tsoukase1mo ago

ivolimmen1mo ago

If we trust machines too much...

DeepYogurt1mo ago

Who's accountable for the 33%?

lowbloodsugar1mo ago

david_mchale1mo ago

having been in ERs too many times when they are beyond capacity, something like this would be better than patients slipping through the cracks, at least you get a chance.

getnormality1mo ago

Wow, amazing. They had an AI robot running o1 look at live ER patients coming in just like a real doctor and they did that much better? Incredible! (literally)

1980phipsi1mo ago

How much time do the doctors spend to diagnose versus o1?

Tenobrus1mo ago

llbbdd1mo ago

Can't happen soon enough. If the bar was as high as it needed to be, there'd be like one qualified doctor on Earth so far.

PAndreew1mo ago

Aurornis1mo ago

When they allowed both the AI and the doctors access to more information about the case, the difference between groups collapsed into statistical insignificance:

lvl1551mo ago

I have some family in medicine and it scares me how much they now rely on AI. Some even quote it like the Bible.

bluefirebrand1mo ago

Unfortunately, from my understanding Doctors don't necessarily diagnose for accuracy, they often diagnose to limit liability.

They aren't going to take a stab at an uncommon diagnosis even if it occurs to them, if they might get sued if they're wrong.

biglost1mo ago

arkt81mo ago

how much confidence is there in 67%? was it on the same patients with the same info? If not, it is just bait.

yfw1mo ago

Sensitivity vs specificity
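For reference, a minimal Python sketch of the two measures using the standard definitions. The counts in the example are hypothetical, not from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of patients who have the condition and are correctly flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of patients without the condition who are correctly cleared."""
    return true_neg / (true_neg + false_pos)


# Made-up counts: 8 of 10 sick patients caught, 90 of 100 healthy patients cleared.
print(sensitivity(true_pos=8, false_neg=2), specificity(true_neg=90, false_pos=10))
```

A single "percent correct" figure can hide a trade-off between the two: a diagnoser can score well overall while missing the rare serious cases.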

kian1mo ago

But what was the overlap?

economistbob1mo ago

Fifty percent accuracy. That's terrible.

ZiiS1mo ago

Triage deliberately diagnoses rarer conditions that would be more serious or require more urgent treatment so they can be ruled out.

basyt1mo ago

i would rather be incorrectly diagnosed by a doctor than have chudgpt treat me.

Aboutplants1mo ago

Now show me the results for triage doctors aided by AI

hansmayer1mo ago

jfc, when does this ai boosting finally stop.

plexescor1mo ago

One shouldn't trust AI regarding medical matters; things can go downhill, you know
