While tech workers are unregulated, clinicians are highly regulated. Ultimately it is the clinician who takes on the responsibility and risk of relying on these computer systems to treat a patient; tech workers and their employers don't. Clinicians do not take risks with patients because they have to contend with malpractice lawsuits and licensing boards.
In my experience, anything that is slightly inaccurate permanently reduces a clinician's trust in the system. This matters when it comes time to renew your contracts in one, three, or five years.
You can train the clinicians on your software and modify your UI to make it clear that a heuristic should only be taken as a suggestion, but that will also result in a support request every time. Those support requests have to be resolved pretty quickly because they're part of the SLA.
I just can't imagine any hospital renewing a contract when the answer to their support requests is some form of "LLMs hallucinate sometimes." I used to hire engineers from failed companies that built non-deterministic healthcare software.
Accuracy and deploying in appropriate use cases are key for real-world use. Building guardrails, validation, continuous auditing, etc. is a larger amount of work than model training.
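As a tiny sketch of what one such guardrail can look like (hypothetical field names and a deliberately simplistic check; real validation pipelines are much more involved):

```python
def validate_extraction(source_text, extracted):
    """Hypothetical grounding check: only accept field values that appear
    verbatim in the source document; everything else is routed to human
    review instead of being trusted automatically."""
    accepted, needs_review = {}, {}
    for field, value in extracted.items():
        if value and value in source_text:
            accepted[field] = value
        else:
            needs_review[field] = value
    return accepted, needs_review

note = "Patient seen on 01/15/2024 for hypertension follow-up."
# Pretend LLM output: the diagnosis below is hallucinated (not in the note).
llm_output = {"visit_date": "01/15/2024", "diagnosis": "diabetes"}

ok, review = validate_extraction(note, llm_output)
```

The point is less the check itself than the routing: every rejected field becomes auditing and human-review work, which is where the effort piles up.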
We don't deploy in EHRs or sell to physicians or health systems. That is a very challenging environment, and I agree that it would be very difficult to appropriately deploy LLMs that way today. I know Epic is working on it, and they say it's live in some places, but I don't know if that's true.
Our main production use case for LLMD at PicnicHealth is to improve and replace human clinical abstraction internally. We've done extensive testing (only alluded to in the paper) comparing and calibrating LLMD performance vs trained human annotator performance, and for many structuring tasks LLMD outperforms human annotators. For our production abstraction tasks where LLMD does not outperform humans (or where regulations require human review), we use LLMD to improve the workflow of our human annotators. It is much easier to make sure that clinical abstractors, who are our employees doing well-defined tasks, understand the limitations in LLM performance than it would be to ensure that users in a hospital setting would.
Some nuance here — they absolutely take risks, but with informed consent.
I interpreted this as challenging whether answering PubMedQA questions as well as a physician is correlated to recommending successful care paths based on the results (and other outcomes) shown in the sample corpus of medical records.
The analogy is a joke I used to make about ML where it made for crappy self-driving cars but surprisingly good pedestrian and cyclist hunter-killer robots.
Really, LLMs aren't expert-system reasoners (yet), and if the medical records all contain the same meta-errors that ultimately kill patients, there's a GIGO problem where the failure mode of AI medical opinions is making the same errors faster and at greater scale. LLMs may be really good at finding how internally consistent an ontology made of language is, where the quality of their results is an effect of that internal logical consistency.
There's probably a pareto distribution of cases where AI is amazing for basic stuff like, "see a doctor" and then conspicuously terrible in some cases where a human is obviously better.
“This model scores higher on MMLU” or some other off-the-shelf benchmark may (likely?) have essentially nothing to do with performance on a given specific use-case, especially when it’s highly specialized.
They can give you a general idea of the capabilities of a model, but if you don't have a benchmark for what you're actually trying to do, in the end you're flying blind.
One of the sentences near the end that speaks to this is "...[this shows] a case where the type of medical knowledge reflected in common benchmarks is little help getting basic, fundamental questions about a patient right." Point being that you can train on every textbook under the sun, but if you can't say which hospital a record came from, or which date a visit happened as the patient thinks of it, you're toast -- and those seemingly throwaway questions are way harder to get right than people realize. NER can find the dates in a record no problem, but intuitively mapping out how dates are printed in EHR software and how they reflect the workflow of an institution is the critical step needed to pick the right one as the visit date -- that's a whole new world of knowledge that the LLM needs to know, which is not characterized when just comparing results on medical QA.
Giving examples of the crazy things we have to contend with is something I can (and will!) gladly talk about for hours...
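As one toy illustration of the visit-date problem (made-up record layout and field labels; real EHR printouts vary far more than this):

```python
import re
from datetime import datetime

record = """
Printed: 03/02/2024        Page 1 of 3
ENCOUNTER NOTE
Date of Service: 01/15/2024
DOB: 07/04/1988
Result finalized 01/17/2024
"""

# The NER-style step: finding every date string in the record is easy.
dates = re.findall(r"\d{2}/\d{2}/\d{4}", record)  # four candidates

# The hard part: which one is the visit date as the patient thinks of it?
# A hypothetical heuristic keyed to how this (invented) EHR prints records;
# the real knowledge is knowing which label each institution's software uses.
m = re.search(r"Date of Service:\s*(\d{2}/\d{2}/\d{4})", record)
visit_date = datetime.strptime(m.group(1), "%m/%d/%Y").date() if m else None
```

Any regex or NER model surfaces all four dates; picking the right one requires institution-specific knowledge that medical QA benchmarks never test.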
Hang on -- while this is a cool result, beating a limited number of models that you chose to include in your comparison does not qualify LLMD-8B as SOTA. (For example, Claude 3 Sonnet scores 10 percentage points higher.)
>This result confirms the power of continued pretraining and suggests that records themselves have content useful for improving benchmark performance.
In support of this conclusion, it would be informative to include an ablation study, e.g. evaluating a continued pre-training data set of the same total size but omitting medical record content from the data mix.
Also definitely a good idea on the ablation study. We had some results internally based on a production-tuned version of our model that includes a much higher weighting of records-data. It's an imperfect ablation, but it supports the story -- so I think it's there, but you're right that it would be more complete to develop and include the data directly.
I can't understand your methods without example prompts or code, so it's hard for me to interpret the data in figure 6. It will be important to document the methodology carefully to avoid concerns that your "text response" methodology is unfairly punishing other models.
In any case, since the methodology that Anthropic applied is documented and straightforward, it would be possible to do an apples to apples comparison with your model.
(I'm also very curious to know how 3.5 Sonnet performs.)
Is your text methodology based on CoT (like the "PubMedQA training dataset enriched with CoT" you trained on) or a forced single token completion like Anthropic used in their evaluation? In the latter case, I'm not sure how "text responses" differ from log probabilities at Temperature T=0 (i.e., isn't the most likely token always going to be the text response?)
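To make the single-token case concrete, here's a toy sketch (invented logits, no real model) of why greedy decoding at T=0 and picking the highest log-probability coincide when the answer is forced to one token:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    s = sum(exps.values())
    return {t: e / s for t, e in exps.items()}

# Hypothetical next-token logits for the single forced answer token.
logits = {"yes": 2.1, "no": 0.4, "maybe": -1.3}
probs = softmax(logits)

# Log-probability scoring: pick the answer option with the highest probability.
logprob_answer = max(probs, key=probs.get)

# "Text response" at T=0 (greedy decoding): argmax of the same logits.
greedy_answer = max(logits, key=logits.get)

# Softmax is monotonic, so for a single token the two methods agree.
assert logprob_answer == greedy_answer
```

If the text methodology instead lets the model generate a CoT before the answer, the two can diverge, which is why the distinction matters for interpreting figure 6.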
I think that's especially true when you look at how well GPT-4o worked out of the box -- it makes clear what you get from the battle-hardening that's done to the big commercial models. For the numbers we did include, the thought was that the most meaningful signal was that going from 8B to 70B with Llama3 actually gives you a lot in terms of mitigating that brittleness. That goes a step towards explaining the story of what we're seeing, more so than showing a bunch of comparison LLMs fall over out of the box.
In the end, we presented those models that did best with light tuning and optimization (say a week's worth of iteration or so). I anticipate that we'll have to expand these results to include OpenBio as we work through the conference reviewer gauntlet. Any others you think we definitely should work to include? Would definitely be helpful!
Have you checked out dataset building with nemotron? The nemotron synthetic data builder is quite powerful.
More than that, check out model merging. It's possible that if you merge your model with the Llama 3.1 base, it may perform much better.
Check out Maxime Labonne's work on Hugging Face.
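For reference, linear interpolation of weights is the simplest form of model merging; a minimal sketch with plain floats standing in for tensors (real merges use mergekit or similar, which also offer SLERP/TIES-style methods):

```python
def linear_merge(sd_a, sd_b, alpha=0.5):
    """Interpolate two state dicts with matching keys: alpha*a + (1-alpha)*b."""
    assert sd_a.keys() == sd_b.keys()
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy "state dicts" -- in practice these would be tensors of model weights.
fine_tuned = {"layer0.weight": 1.2, "layer0.bias": -0.4}
base_model = {"layer0.weight": 1.0, "layer0.bias": 0.0}

merged = linear_merge(fine_tuned, base_model, alpha=0.7)
# merged["layer0.weight"] is approximately 0.7*1.2 + 0.3*1.0 = 1.14
```

The intuition behind merging back toward the base model is that it can recover general capabilities that continued pretraining eroded, while keeping much of the domain adaptation.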