Deterministic Quoting: Making LLMs safer for healthcare

(mattyyeung.github.io)

117 points · mattyyeung · 2y ago · 38 comments

37 comments · 14 top-level

itishappy2y ago· 6 in thread

What happens if it hallucinates the <title>?

simonw2y ago

You catch it. The hallucinated title will fail to match the retrieved text based on the reference ID.

If it hallucinates an incorrect (but valid) reference ID then hopefully your users can spot that the quoted text has no relevance to their question.

mattyyeungOP2y ago

Two possibilities:

(1) if the <title> contents (the unique reference string) don't match, then it's trivially detected. Typically the query is re-run (non-determinism comes in handy sometimes), or if problems persist we show an error message to the doctor.

(2) if a valid <title> is hallucinated, then the wrong quote is indeed displayed on the blue background. It's still a verbatim quote, but it is up to the user to handle this.

In testing, when we have maliciously shown the wrong quote, users seem to be able to identify it easily. It seems "irrelevant" is easier to detect than "wrong".
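The two cases above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the `CHUNKS` store, the `ref-…` ID format, the `resolve_quotes` and `answer_with_retries` helpers, and the retry limit are all assumptions made up for the example. Only the `<title>` tag convention comes from the thread.

```python
import re

# Hypothetical store of retrieved chunks, keyed by unique reference string.
CHUNKS = {
    "ref-a1": "Administer 5 mg once daily.",
    "ref-b2": "Do not exceed 20 mg in 24 hours.",
}

TITLE_TAG = re.compile(r"<title>(.*?)</title>")


class UnknownReference(Exception):
    """Raised when the LLM emits a reference ID that matches no chunk."""


def resolve_quotes(llm_output: str, chunks: dict[str, str]) -> str:
    """Replace each <title>ID</title> with the verbatim chunk text."""
    def substitute(match: re.Match) -> str:
        ref_id = match.group(1).strip()
        if ref_id not in chunks:
            # Case (1): the ID does not exist, so detection is trivial.
            raise UnknownReference(ref_id)
        # Case (2) is NOT caught here: a valid but wrong ID still resolves.
        return f'"{chunks[ref_id]}"'
    return TITLE_TAG.sub(substitute, llm_output)


def answer_with_retries(generate, chunks, max_attempts: int = 3) -> str:
    """Re-run the (non-deterministic) generator until all IDs resolve."""
    for _ in range(max_attempts):
        try:
            return resolve_quotes(generate(), chunks)
        except UnknownReference:
            continue
    return "Error: could not produce a verifiable quotation."
```

Note that a hallucinated but valid ID passes straight through this check, which is why case (2) is left to the reader of the quote.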

bradfox22y ago

The Galactica training paper from FAIR investigated citation hallucination quite thoroughly; if you haven't seen it, it's probably worth a look. Trained-in hashes of citations were much more reliable than a natural language representation.

resource_waste2y ago

Same thing when a human hallucinates.

Except with LLMs, you can run like 10 different models. With a human, you owe $120 and are taking medicine.

pton_xd2y ago

Except with a human there's a counter-party with assets or insurance who assumes liability for mistakes.

Although presumably if a company is making decisions using an LLM, and the LLM makes a mistake, the company would still be held liable ... probably.

If there's no "damage" from the mistake then it doesn't matter either way.

KaiserPro2y ago

> With a human, you owe $120 and are taking medicine.

Well there are protocols, procedures and a bunch of checks and balances.

The problem with the LLM is that there aren't any; it's you vs. one-shot retrieval.

Animats2y ago· 5 in thread

It's a search engine, basically?

simonw2y ago

Building better search tools is one of the most directly interesting applications of LLMs in my opinion.

mattyyeungOP2y ago

I'd put it like this: RAG = search engine, but sometimes hallucinates

RAG + deterministic quoting = search engine that displays real excerpts from pages.

tylersmith2y ago

Yes, and Dropbox is an rsync server.

robrenaud2y ago

A good, automatically run, privacy preserving search engine that uses electronic medical records might be a valuable resource for busy doctors.

nraynaud2y ago

I think the hope is that the LLM would find the needle in the haystack with more accuracy. But in jobs that matter, you check the results.

jonathan-adly2y ago· 2 in thread

I built and sold a company that does this a year ago. It was hard 2 years ago, but now pretty standard RAG with a good implementation will get you there.

The trick is, healthcare users would complain to no end about determinism. But these are “below-the-line” users - aka, folks who don’t write checks and the AI is better than them. (I am a pharmacist by training, and plain vanilla GPT4-turbo is better than me).

Don’t really worry about them. The folks who are interested and willing to pay for AI have more practical concerns - like what is my ROI and what is the implementation like.

Also - folks should be building Baymax from Big Hero 6 by now (the medical capabilities, not the rocket arm stuff). That’s the next leg up.

skybrian2y ago

Seems like that’s how things go with enterprise software - who cares if the users like it if you have a captive audience?

But I want this feature and I’ll look for software that has it.

jonathan-adly2y ago

It is not about liking it. They won't like it even with determinism. The idea is to NOT learn new things, and keep doing things the old inefficient way. More headcount and job security this way.

resource_waste2y ago· 2 in thread

I feel like this is the perfect application of running the data multiple times.

Imagine having ~10-100 different LLMs, maybe some are medical, maybe some are general, some are from a different language. Have them all run it, rank the answers.

Now I believe this can further be amplified by having another prompt ask to confirm the previous answer. This could get a bit insane computationally with 100 original answers, but I believe the paper I read found that by doing this prompt processing ~4 times, they got to some 95% accuracy.

So 100 LLMs give an answer, each time we process it 4 times, can we beat a 64 year old doctor?
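A rough sketch of the "have them all run it, rank the answers" idea, assuming each model's answer has already been collected as a short string. The `rank_answers` helper and the exact-match normalization are assumptions for illustration; real free-text answers would need something smarter than lowercasing to decide that two answers agree.

```python
from collections import Counter


def rank_answers(answers: list[str]) -> list[tuple[str, float]]:
    """Rank distinct answers by the share of models that gave them."""
    # Naive normalization (assumption): trim whitespace and lowercase.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return []
    counts = Counter(normalized)
    total = len(normalized)
    # most_common() sorts by vote count, highest first.
    return [(answer, votes / total) for answer, votes in counts.most_common()]
```

For example, `rank_answers(["Flu", "flu ", "Cold", "flu"])` puts "flu" first with a 0.75 share.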

mattyyeungOP2y ago

Unfortunately I don't believe that accuracy will scale "multiplicatively". You'll typically only marginally improve beyond 95%... and how much is enough?

Even with such a system, which will still have some hallucination rate, adding Deterministic Quoting on top will still help.

It feels to me we are a long way off LLM systems with trivial rates of hallucination.
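For context on the scaling question, here is the idealized arithmetic for majority voting, under the strong assumption that every model errs independently. The function name is made up for this sketch. Models trained on overlapping data tend to make correlated mistakes, so this figure is best read as an upper bound under that assumption, which is consistent with only marginal gains in practice.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """P(more than half of n independent voters are right), for odd n.

    Assumes each voter is correct with probability p, independently of
    the others. That independence is an idealization, not a claim about
    real LLM ensembles.
    """
    needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )
```

With `p = 0.95` and `n = 5` this works out to about 0.9988, but only if the independence assumption holds.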

resource_waste2y ago

A 95% diagnosis rate would be insane.

I believe I read doctors are only at like 30%...

budududuroiu2y ago· 2 in thread

My issue with RAG systems isn’t hallucinations. Yes, sure, those are important. My issue is recall. Given a petabyte-scale index of chunks, how can I make sure that my RAG system surfaces the “ground truth” I need, and not just “the most similar vector”?

This I think is scarier. A healthcare-oriented (or any industry) RAG retrieving a bad, but highly linguistically similar answer.
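One common way to put a number on this concern is recall@k over a labeled evaluation set: of the chunks a human marked as the ground truth for a question, what fraction appears in the top k results? The sketch below assumes such labels exist. The chunk IDs and the `recall_at_k` helper are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks found in the top-k results."""
    if not relevant_ids:
        raise ValueError("need at least one relevant chunk to compute recall")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)
```

For instance, with retrieval order `["c7", "c2", "c9", "c4"]` and ground truth `{"c2", "c4"}`, recall@3 is 0.5: the retriever returned similar chunks but missed half of what was needed.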

thenaturalist2y ago

You're correctly identifying an issue that by now I think everyone is facing globally: Realizing the bottleneck to performance or improvements of LLMs isn't necessarily quantity, but inevitably quality.

Which is a much harder problem to solve outside a few highly standardized niches/industries.

I think synthetic data generation as a means to guide LLMs over a larger than optimal search space is going to be quite interesting.

budududuroiu2y ago

To me synthetic data generation makes no sense. Mathematically your LLM is learning a distribution (let’s say of human knowledge). Let’s assume your LLM models human knowledge perfectly. In that case, what can you achieve? Just sampling the same data that your model mapped perfectly.

However, if your model's distribution is wrong, you’re basically going to have an even more skewed distribution in models trained using the synthetic data.

To me, it seems like the architecture is the next place for improvements. If you can’t synthesise the entirety of human knowledge using transformers, there’s an issue there.

The smell that points me in that direction is the fact that up until recently, you could quantise models heavily with little drop in performance, but recent Llama3 research shows that’s not the case anymore.

w10-12y ago· 1 in thread

I'm not sure determinism alone is sufficient for proper attribution.

This presumes "chunks" are the source. But it's not easy to identify the propositions that form the source of some knowledge. In the best case, you are looking for an association and find it in a sentence you've semantically parsed, but that's rarely the case, particularly for medical histories.

That said, deterministic accuracy might not matter if you can provide enough context, particularly for further exploration. But that's not really "chunks".

So it's unclear to me that tracing probability clouds back to chunks of text will work better than semantic search.

mattyyeungOP2y ago

Thanks for the thought-provoking comment.

It's all grey, isn't it? Vanilla RAG is a big step along the spectrum from LLM towards search; DQ is perhaps another small step. I'm no expert in search, but I've read that those systems are coming from the other direction, so perhaps they'll meet in the middle.

There are three "lookups" in a system with DQ: (1) The original top-k chunk extraction (in the minimalist implementation, that's unchanged from vanilla RAG, just a vector embeddings match) (2) the LLM call, which takes its pick from 1, and (3) the call-back deterministic lookup after the LLM has written its answer.

(3) is much more bounded, because it's only working with those top-k, at least for today's context constrained systems.

In any case, another way to think of DQ is a "band-aid" that can sit on top of that, essentially a "UX feature", until the underlying systems improve enough.

I also agree about the importance of chunk-size. It has "non-linear" effects on UX.
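The three lookups described in this comment can be laid out as a toy pipeline. Everything here is a stand-in chosen for the sketch: word overlap replaces the vector-embedding match, `stub_llm` replaces a real model call, and the chunk IDs are invented. The point is only to show that lookup (3) is bounded by the top-k set chosen in lookup (1).

```python
import re


def top_k_chunks(question: str, corpus: dict[str, str], k: int = 2) -> dict[str, str]:
    """Lookup 1: toy retrieval; word overlap stands in for vector similarity."""
    q_words = set(question.lower().split())

    def overlap(item: tuple[str, str]) -> int:
        return len(q_words & set(item[1].lower().split()))

    ranked = sorted(corpus.items(), key=overlap, reverse=True)
    return dict(ranked[:k])


def stub_llm(question: str, candidates: dict[str, str]) -> str:
    """Lookup 2: placeholder for the LLM call; it just cites the first candidate."""
    first_id = next(iter(candidates))
    return f"See <title>{first_id}</title>"


def deterministic_lookup(llm_output: str, candidates: dict[str, str]) -> list[str]:
    """Lookup 3: map cited IDs back to verbatim text, restricted to the top-k."""
    cited = re.findall(r"<title>(.*?)</title>", llm_output)
    return [candidates[ref] for ref in cited if ref in candidates]
```

Because `deterministic_lookup` only sees `candidates`, an ID outside the top-k resolves to nothing, even if that chunk exists elsewhere in the corpus.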

not2b2y ago· 1 in thread

I was thinking that something like this could be useful for discovery in legal cases, where a company might give up a gigabyte or more of allegedly relevant material in response to discovery demands and the opposing side has to plow through it to find the good stuff. But then I thought of a countermeasure: there could be messages in the discovery material that act as instructions to the LLM, telling it what it should not find. We can guarantee that any reports generated will contain accurate quotes, and even where they are located, so that surrounding context can be found. But perhaps, if the attacker controls the input data, things can be missed. And it could be done in a deniable way: email conversations talking about LLMs that also have keywords related to the lawsuit.

budududuroiu2y ago

Those "do-not-search-here" chunks wouldn’t be retrieved during vector search and reranking, because they would likely have a very low cross-encoder score with a question like “Who are the business partners of X?”.

yonigo102y ago· 1 in thread

A more robust approach: https://yonigottesman.github.io/2023/08/10/extractive-genera...

mattyyeungOP2y ago

Yes, extractive QA is one of the improvements beyond the "minimalist implementation" from the article. In our lingo, we'd say that's another way to create a deterministic quotation.

So far, we haven't found extractive QA (or any other technique) to significantly improve overall answer quality when compared to matching sub-string similarity. (I'd be interested to hear if you have different experience!)

There aren't a lot of applications that can be solved purely with substrings of source documentation, so having both LLM prose and quotations in the answer provides benefit (e.g. the ability to quote multiple passages). Now, we can modify the constrained generation side of things to allow for these, but that gets complicated. Or it can be done with recursive calls to the LLM, but that again requires some kind of DQ check on top.

Ultimately, both styles seem to perform similarly - and suffer from the same downsides (choosing the wrong quote and occasionally omitting useful quotes).

(Good writeup by the way, I've forwarded it to my team, thanks!)
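The comment does not say how "matching sub-string similarity" is computed, so the following is only one plausible reading: find the longest run of characters that the LLM's attempted quote shares with the source, then display that verbatim run. It uses `difflib.SequenceMatcher` from the Python standard library; the `best_verbatim_span` helper and the coverage score are assumptions of this sketch.

```python
from difflib import SequenceMatcher


def best_verbatim_span(source: str, attempted_quote: str) -> tuple[str, float]:
    """Return the longest source substring shared with the attempted quote.

    The second value is the fraction of the attempted quote covered by
    that substring (1.0 means the quote was already verbatim).
    """
    if not attempted_quote:
        return "", 0.0
    matcher = SequenceMatcher(None, source, attempted_quote, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(attempted_quote))
    span = source[match.a : match.a + match.size]
    return span, match.size / len(attempted_quote)
```

A caller could accept the span only above some coverage threshold and fall back to an error otherwise; where to set that threshold is a product decision this sketch does not settle.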

telotortium2y ago· 1 in thread

We’ve developed LLM W^X now - time to develop LLM ROP!

gojomo2y ago

Interesting analogies for LLMs! (https://en.wikipedia.org/wiki/W%5EX & https://en.wikipedia.org/wiki/Return-oriented_programming)

nextworddev2y ago· 1 in thread

Did I miss something or did the article never describe how the technique works? (Despite the “How It Works” section.)

Smaug1232y ago

It's explained at considerable length in the section _A “Minimalist Implementation” of DQ: a modified RAG Pipeline_.

mattyyeungOP2y ago· 1 in thread

Author here, thanks for your interest! Surprising way to wake up in the morning. Happy to answer questions.

sitkack2y ago

Why the coyness? You submitted the post.

simonw2y ago

I like this a lot. I've been telling people for a while that asking for direct quotations in LLM output - which you can then "fact-check" by confirming them against the source document - is a useful trick. But that still depends on people actually doing that check, which most people won't do.

I'd thought about experimenting with automatically validating that the quoted text does indeed 100% match the original source, but should even a tweak to punctuation count as a failure there?

The proposed deterministic quoting mechanism feels like a much simpler and more reliable way to achieve the same effect.
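The automatic validation idea, and the punctuation question it raises, can be made concrete with a small sketch. The `normalize` and `quote_matches` helpers are hypothetical, and the normalization rules (lowercase, drop punctuation, collapse whitespace) are just one possible choice.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop punctuation characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Unicode categories starting with "P" are punctuation.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()


def quote_matches(quote: str, source: str, strict: bool = True) -> bool:
    """Check that a quoted passage appears in the source document."""
    if strict:
        # Any tweak, even to punctuation, counts as a failure.
        return quote in source
    return normalize(quote) in normalize(source)
```

The lenient mode has a cost worth noticing in a medical setting: stripping punctuation makes "5.5 mg" and "55 mg" compare equal, so a tolerant check can wave through a quote that changes a dose.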

burntcaramel2y ago

Are there existing terms of art for this concept? It’s not like slightly unreliable writers are a new concept; think of a student writing a paper.

For example:

- Authoritative reference: https://www.montana.edu/rmaher/ee417/Authoritative%20Referen...

- Authoritative source: https://piedmont.libanswers.com/faq/135714

bradfox22y ago

Very cool. My company is building a very similar tool for nuclear engineering and power applications that face similar adoption challenges for LLMs. We're also incorporating the idea of 'many-to-many' document claim validation and verification. The UX allowing high-speed human verification of LLM-resolved claims is what we're finding most important.

DeepMind published something similar recently for claim validation and hallucination management and got excellent results.
