While tech workers are unregulated, clinicians are highly regulated. Ultimately it is the clinician who takes on the responsibility and risk of relying on these computer systems to treat a patient; tech workers and their employers don't. Clinicians do not take risks with patients because they have to contend with malpractice lawsuits and licensing boards.
In my experience, anything that is slightly inaccurate permanently reduces a clinician's trust in the system. This matters when it comes time to renew your contracts in one, three, or five years.
You can train the clinicians on your software and modify your UI to make it clear that a heuristic should only be taken as a suggestion, but that will also result in a support request every time. Those support requests have to be resolved pretty quickly because they're part of the SLA.
I just can't imagine any hospital renewing a contract when the answer to their support requests is some form of "LLMs hallucinate sometimes." I used to hire engineers from failed companies that built non-deterministic healthcare software.
Accuracy and deploying in appropriate use cases are key for real-world use. Building guardrails, validation, continuous auditing, etc. is a larger amount of work than model training.
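As a tiny sketch of what one such guardrail can look like (hypothetical field names and a deliberately simplistic check; real validation pipelines are much more involved):

```python
def validate_extraction(source_text, extracted):
    """Hypothetical grounding check: only accept field values that appear
    verbatim in the source document; everything else is routed to human
    review instead of being trusted automatically."""
    accepted, needs_review = {}, {}
    for field, value in extracted.items():
        if value and value in source_text:
            accepted[field] = value
        else:
            needs_review[field] = value
    return accepted, needs_review

note = "Patient seen on 01/15/2024 for hypertension follow-up."
# Pretend LLM output: the diagnosis below is hallucinated (not in the note).
llm_output = {"visit_date": "01/15/2024", "diagnosis": "diabetes"}

ok, review = validate_extraction(note, llm_output)
```

The point is less the check itself than the routing: every rejected field becomes auditing and human-review work, which is where the effort piles up.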
We don't deploy in EHRs or sell to physicians or health systems. That is a very challenging environment, and I agree that it would be very difficult to appropriately deploy LLMs that way today. I know Epic is working on it, and they say it's live in some places, but I don't know if that's true.
Our main production use case for LLMD at PicnicHealth is to improve and replace human clinical abstraction internally. We've done extensive testing (only alluded to in the paper) comparing and calibrating LLMD performance vs trained human annotator performance, and for many structuring tasks LLMD outperforms human annotators. For our production abstraction tasks where LLMD does not outperform humans (or where regulations require human review), we use LLMD to improve the workflow of our human annotators. It is much easier to make sure that clinical abstractors, who are our employees doing well-defined tasks, understand the limitations in LLM performance than it would be to ensure that users in a hospital setting would.
Some nuance here — they absolutely take risks, but with informed consent.
I interpreted this as challenging whether answering PubMedQA questions as well as a physician is correlated to recommending successful care paths based on the results (and other outcomes) shown in the sample corpus of medical records.
The analogy is a joke I used to make about ML where it made for crappy self-driving cars but surprisingly good pedestrian and cyclist hunter-killer robots.
Really, LLMs aren't expert-system reasoners (yet), and if the medical records all contain the same meta-errors that ultimately kill patients, there's a GIGO problem where the failure mode of AI medical opinions is making the same errors faster and at greater scale. LLMs may be really good at finding how internally consistent an ontology made of language is, where the quality of their results is an effect of that internal logical consistency.
There's probably a pareto distribution of cases where AI is amazing for basic stuff like, "see a doctor" and then conspicuously terrible in some cases where a human is obviously better.
“This model scores higher on MMLU” or some other off-the-shelf benchmark may (likely?) have essentially nothing to do with performance on a given specific use-case, especially when it’s highly specialized.
They can give you a general idea of the capabilities of a model, but if you don't have a benchmark for what you're actually trying to do, in the end you're flying blind.
One of the sentences near the end that speaks to this is "...[this shows] a case where the type of medical knowledge reflected in common benchmarks is little help getting basic, fundamental questions about a patient right." Point being that you can train on every textbook under the sun, but if you can't say which hospital a record came from, or which date a visit happened as the patient thinks of it, you're toast -- and those seemingly throwaway questions are way harder to get right than people realize. NER can find the dates in a record no problem, but intuitively mapping out how dates are printed in EHR software and how they reflect the workflow of an institution is the critical step needed to pick the right one as the visit date -- that's a whole new world of knowledge that the LLM needs to know, which is not characterized when just comparing results on medical QA.
Giving examples of the crazy things we have to contend with is something I can (and will!) gladly talk about for hours...
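As one toy illustration of the visit-date problem (made-up record layout and field labels; real EHR printouts vary far more than this):

```python
import re
from datetime import datetime

record = """
Printed: 03/02/2024        Page 1 of 3
ENCOUNTER NOTE
Date of Service: 01/15/2024
DOB: 07/04/1988
Result finalized 01/17/2024
"""

# The NER-style step: finding every date string in the record is easy.
dates = re.findall(r"\d{2}/\d{2}/\d{4}", record)  # four candidates

# The hard part: which one is the visit date as the patient thinks of it?
# A hypothetical heuristic keyed to how this (invented) EHR prints records;
# the real knowledge is knowing which label each institution's software uses.
m = re.search(r"Date of Service:\s*(\d{2}/\d{2}/\d{4})", record)
visit_date = datetime.strptime(m.group(1), "%m/%d/%Y").date() if m else None
```

Any regex or NER model surfaces all four dates; picking the right one requires institution-specific knowledge that medical QA benchmarks never test.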
Hang on -- while this is a cool result, beating a limited number of models that you chose to include in your comparison does not qualify LLMD-8B as SOTA. (For example, Claude 3 Sonnet scores 10 percentage points higher.)
>This result confirms the power of continued pretraining and suggests that records themselves have content useful for improving benchmark performance.
In support of this conclusion, it would be informative to include an ablation study, e.g. evaluating a continued pre-training data set of the same total size but omitting medical record content from the data mix.
Also definitely a good idea on the ablation study. We had some results internally based on a production-tuned version of our model that includes a much higher weighting of records-data. It's an imperfect ablation, but it supports the story -- so I think it's there, but you're right that it would be more complete to develop and include the data directly.
I can't understand your methods without example prompts or code, so it's hard for me to interpret the data in figure 6. It will be important to document the methodology carefully to avoid concerns that your "text response" methodology is unfairly punishing other models.
In any case, since the methodology that Anthropic applied is documented and straightforward, it would be possible to do an apples to apples comparison with your model.
(I'm also very curious to know how 3.5 Sonnet performs.)
Is your text methodology based on CoT (like the "PubMedQA training dataset enriched with CoT" you trained on) or a forced single token completion like Anthropic used in their evaluation? In the latter case, I'm not sure how "text responses" differ from log probabilities at Temperature T=0 (i.e., isn't the most likely token always going to be the text response?)
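To make the single-token case concrete, here's a toy sketch (invented logits, no real model) of why greedy decoding at T=0 and picking the highest log-probability coincide when the answer is forced to one token:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    s = sum(exps.values())
    return {t: e / s for t, e in exps.items()}

# Hypothetical next-token logits for the single forced answer token.
logits = {"yes": 2.1, "no": 0.4, "maybe": -1.3}
probs = softmax(logits)

# Log-probability scoring: pick the answer option with the highest probability.
logprob_answer = max(probs, key=probs.get)

# "Text response" at T=0 (greedy decoding): argmax of the same logits.
greedy_answer = max(logits, key=logits.get)

# Softmax is monotonic, so for a single token the two methods agree.
assert logprob_answer == greedy_answer
```

If the text methodology instead lets the model generate a CoT before the answer, the two can diverge, which is why the distinction matters for interpreting figure 6.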
I think that's especially true when you look at how well GPT-4o worked out of the box -- it makes clear what you get from the battle-hardening that's done to the big commercial models. For the numbers we did include, the thought was that the most meaningful signal was that going from 8B to 70B with Llama3 actually gives you a lot in terms of mitigating that brittleness. That goes a step towards explaining the story of what we're seeing, more so than showing a bunch of comparison LLMs fall over out of the box.
In the end, we presented those models that did best with light tuning and optimization (say a week's worth of iteration or so). I anticipate that we'll have to expand these results to include OpenBio as we work through the conference reviewer gauntlet. Any others you think we definitely should work to include? Would definitely be helpful!
Have you checked out dataset building with nemotron? The nemotron synthetic data builder is quite powerful.
More than that, check out model merging. It's possible that if you merge your model with the Llama 3.1 base, it may perform much better.
Check out Maxime Labonne's work on Hugging Face.
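For reference, linear interpolation of weights is the simplest form of model merging; a minimal sketch with plain floats standing in for tensors (real merges use mergekit or similar, which also offer SLERP/TIES-style methods):

```python
def linear_merge(sd_a, sd_b, alpha=0.5):
    """Interpolate two state dicts with matching keys: alpha*a + (1-alpha)*b."""
    assert sd_a.keys() == sd_b.keys()
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy "state dicts" -- in practice these would be tensors of model weights.
fine_tuned = {"layer0.weight": 1.2, "layer0.bias": -0.4}
base_model = {"layer0.weight": 1.0, "layer0.bias": 0.0}

merged = linear_merge(fine_tuned, base_model, alpha=0.7)
# merged["layer0.weight"] is approximately 0.7*1.2 + 0.3*1.0 = 1.14
```

The intuition behind merging back toward the base model is that it can recover general capabilities that continued pretraining eroded, while keeping much of the domain adaptation.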