Automated reasoning to remove LLM hallucinations

(aws.amazon.com)

57 points · rustastra · 1y ago · 38 comments

lsy · 1y ago · 6 in thread

I find it hard to believe that anything like this will be feasible or effective beyond a certain level of complexity. It seems like a willful denial of the complexity and ambiguity of natural language, and I am not looking forward to some poor developer trying to reason their way out of a two-hundred-step paradox that was accidentally created.

And for a use-case simple enough for this system to work (e.g. regurgitate a policy), it seems like the LLM is unnecessary. After all, if your system can perfectly interpret the question and answer and see if this rule set applies, then you can likely just use the rule set to generate the answer rather than wasting resources with a giant language model.

vineyardmike · 1y ago

I don’t think this is a concern, though I do understand what you mean. I think this is really just a new way for a computer to be the “bad guy” in customer support systems.

First, they have a pretty low token limit for a “policy” so there won’t be anything too complex.

Second, they explicitly say they don’t support synonyms. It seems very likely it’ll just reject anything that doesn’t fit closely, so you’ll end up with “I’m sorry. I don’t know what the ‘bought it’ date is, please provide purchase date?” until the customer does the work of using the exact language.

It looks like it takes a policy such as “returns must be processed within 30 days of purchase” and turns it into pseudo-code-style logic such as “if {purchase date} < {today-30d} => reject”. Then it seems to parse the LLM query and apply the logic. Considering my first two points, it’ll just be used to turn GPUs into another inhuman system that helps companies avoid having to be human about customer support, while sounding more human.
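The rule quoted in this comment can be written out as a small check. This is only a minimal sketch of the idea as the commenter describes it, not how the AWS feature is implemented: the names `ReturnRequest`, `is_return_allowed`, and `check_claim` are hypothetical, and the 30-day window is taken from the example policy text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Window taken from the example policy text; everything else is illustrative.
RETURN_WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class ReturnRequest:
    purchase_date: date  # hypothetical field name


def is_return_allowed(request: ReturnRequest, today: date) -> bool:
    """Apply 'if purchase_date < today - 30d => reject'."""
    return request.purchase_date >= today - RETURN_WINDOW


def check_claim(claimed_allowed: bool, request: ReturnRequest, today: date) -> str:
    """Compare what a model claimed against what the rule derives."""
    derived = is_return_allowed(request, today)
    return "valid" if claimed_allowed == derived else "invalid"
```

The hard part the thread is arguing about is not this check but filling in `purchase_date` from free-form text, which the sketch leaves out entirely.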

jimmySixDOF · 1y ago

> It seems like a willful denial of the complexity and ambiguity of natural language

There is a recent paper and body of work that measures entropy over the returned logits to produce a "certainty" estimate for outputs and flag hallucinations. It is a lot more rigorous than the OP but, like everything in this space, needs further testing.
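The general idea of an entropy-based certainty signal can be sketched as follows. The comment does not name the paper, so this is not that paper's method: it is a generic Shannon-entropy calculation over one token's logits, and the `threshold` value is an arbitrary assumption.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def flag_uncertain(per_token_logits, threshold=1.0):
    """Return positions whose entropy exceeds the (assumed) threshold."""
    return [
        i for i, logits in enumerate(per_token_logits)
        if entropy(logits) > threshold
    ]
```

A peaked distribution gives entropy near zero and a flat one gives the maximum, `log(vocabulary size)`. High entropy signals that the model was unsure at that position, which is not the same thing as the output being wrong.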

fzzzy · 1y ago

Lately I've been thinking a lot about whether this would work. Do you have a link?

sdesol · 1y ago

I'm working on a rather naive approach that is focused on identifying errors in an LLM response by using LLMs. What I can share right now are screenshots of how it works. The basic idea is you can use other high-quality models to validate and compare against to find irregularities or errors. You can see what it looks like below:

https://app.gitsense.com/--/images/options.png

https://app.gitsense.com/--/images/validate.png

https://app.gitsense.com/--/images/models.png

The basic idea behind my chat system is, every model can be wrong, but it is unlikely that all will be wrong at the same time. This chat system is based on what I've learned when building my spelling and grammar checker. If you look at the following links, you can see that even the best models can get it wrong, but it is unlikely that others will get it wrong at the same time.

https://app.gitsense.com/?doc=6c9bada92&model=GPT-4o&samples...

https://app.gitsense.com/?doc=905f4a9af74c25f&model=Claude+3...
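The "unlikely that all will be wrong at the same time" idea amounts to a vote across models. The sketch below is a generic majority vote, not the GitSense implementation: the function names are hypothetical, answers are compared after simple lowercasing and trimming, and the model names in the usage example are placeholders.

```python
from collections import Counter


def majority_answer(answers, min_agreement=0.5):
    """Return (top_answer, agreed); agreed is True when more than
    min_agreement of the answers match the most common one."""
    if not answers:
        raise ValueError("answers must not be empty")
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized) > min_agreement


def dissenters(answers_by_model):
    """Return the sorted names of models that disagree with the majority."""
    top, _ = majority_answer(list(answers_by_model.values()))
    return sorted(
        name for name, answer in answers_by_model.items()
        if answer.strip().lower() != top
    )
```

For example, `dissenters({"model_a": "Paris", "model_b": "paris", "model_c": "Lyon"})` singles out `model_c` for review. A vote like this only catches errors that are independent across models. It does nothing for mistakes the models share.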

nomel · 1y ago

> but it is unlikely that all will be wrong at the same time.

Here's a prompt that proves this untrue, for now at least:

> A woman and her biological son are gravely injured in a car accident and are both taken to the hospital for surgery. The surgeon is about to operate on the boy when they say "I can’t operate on this boy, he’s my biological son!" How can this be?

Makes sense, considering they're built on most-likely statistics, after all.

WhitneyLand · 1y ago

When will we be able to give it a try?

I’m playing around with similar ideas, sometimes called ensembling techniques.

Metricon · 1y ago · 3 in thread

This amuses me tremendously. I began programming in the early 1980s and quickly developed an interest in Artificial Intelligence. At the time there was a great interest in the advancement of AI by the introduction of "Expert Systems" (which would later play a part in the ‘Second AI Winter’).

What Amazon appears to have done here is use a transformer-based neural network (aka LLM) to translate natural language into symbolic logic rules, which are used together in what could be identified as an Expert System.

Full Circle. Hilarious.

For reference to those on the younger side: The Computer Chronicles (1984) https://www.youtube.com/watch?v=_S3m0V_ZF_Q
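For readers who have not seen one, the core of a classic expert system is a rule engine that keeps applying if-then rules to known facts until nothing new can be derived. The sketch below is a textbook forward-chaining loop, offered as an illustration of the concept and not as a description of what Amazon built. The fact and rule names in the usage note are made up.

```python
def forward_chain(facts, rules):
    """Derive every fact reachable from `facts`.

    `rules` is a list of (premises, conclusion) pairs, where premises
    is a frozenset of fact names that must all be known.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rule base for a returns policy.
RULES = [
    (frozenset({"purchased_within_30_days", "has_receipt"}), "refund_eligible"),
    (frozenset({"refund_eligible"}), "issue_refund"),
]
```

Calling `forward_chain({"purchased_within_30_days", "has_receipt"}, RULES)` derives both `refund_eligible` and `issue_refund`. With `has_receipt` missing, nothing new is derived. In the setup described above, the LLM's job would be producing the rules and the input facts from natural language.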

nl · 1y ago

I don't see why this is hilarious at all.

The problem with expert systems (and most KG-type applications) has always been that translating unconstrained natural language into the system requires human-level intelligence.

It's been completely obvious for years that LLMs are a technology that lets us bridge that gap, and many of the best applications of LLMs are doing exactly that (e.g. code generation).

Metricon · 1y ago

To be clear, my amusement isn't that I find this technique not useful for the purpose it was created for, but that 40 years later, our pursuit of the advancement of AI has brought us somewhat back to where we already were, albeit in a more semi-automated fashion, as someone still has to create the underlying rule-set.

I do feel that the introduction of generative neural network models in both natural language and multi-media creation has been a tremendous boon for the advancement of AI; it just amuses me to see that which was old is new again.

Animats · 1y ago

Right. The trouble with that approach is that it's great on the easy cases and degrades rapidly with scale.

This sounds like a fix for a very specific problem. An airline chatbot told a customer that some ticket was exchangeable. The airline claimed it wasn't. The case went to court. The court ruled that the chatbot was acting as an agent of the airline, and so ordinary rules of principal-agent law applied. The airline was stuck with the consequence of their chatbot's decision.[1]

Now, if you could reduce the Internal Revenue Code to rules in this way, you'd have something.

[1] https://www.bbc.com/travel/article/20240222-air-canada-chatb...

gibsonf1 · 1y ago · 1 in thread

If the automated reasoning worked, why would you need an LLM and its fabrications?

Sabinus · 1y ago

To translate between the natural language of the user query to the generated formal rules and back again.

bloomingkales · 1y ago · 1 in thread

Just looking at this AWS workflow takes the joy out of programming for me.

herbst · 1y ago

Just looking at ANY AWS workflow ...

pkoird · 1y ago · 1 in thread

I'll say this again, any sufficiently advanced LLM is indistinguishable from Prolog.

darkteflon · 1y ago

I feel like this (and other comments like it in this thread) is getting at an important truth that is not yet widely appreciated - could you unpack your comment?

tomlockwood · 1y ago · 1 in thread

If this is necessary, LLMs have officially jumped the shark. And I do wonder how much of this "necessary logic" has already been added to ChatGPT and other platforms, where they've offloaded the creation of logic-based heuristics to Mechanical Turk participants, and like the old meme, AI unmasked is a bit of LLM and a tonne of IF, THEN statements.

I get the vibe VC money is being burned on promises of an AGI that may never eventuate and that there's no clear path to.

BobbyTables2 · 1y ago

Is VC money ever spent on companies seeking clear paths?

I pessimistically suspect VCs like the dark mysterious paths since they often have a bigger fool at the end (acquisition).

spartanatreyu · 1y ago · 1 in thread

Post title: Automated reasoning to remove LLM hallucinations

---

and yet, the paper that went around in March:

Paper Link: https://arxiv.org/pdf/2401.11817

Paper Title: Hallucination is Inevitable: An Innate Limitation of Large Language Models

---

Instead of trying to trick a bunch of people into thinking we can somehow ignore the flaws of post-LLM "AI" by also using the still flawed pre-LLM "AI", why don't we cut the salesman BS and just tell people not to use "AI" for the range of tasks it's not suited for?

porridgeraisin · 1y ago

> why don't we cut the salesman BS

Salesmanship is exactly the process of making money out of BS. So bit of a tautology there :-)

drew-y · 1y ago · 1 in thread

How does automated reasoning actually check a response against the set of rules without using ML? Wouldn't it still need a language model to compare the response to the rule?

rustastra (OP) · 1y ago

AIUI, a natural language question, e.g. "What is the refund policy?", gets matched against formalized contracts, and the relevant bit of the contract gets translated into natural language deterministically. At least this is the way I'd do it, but I'm not sure how it actually works.
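The flow guessed at in this comment (match the question to a formal rule, then render that rule back to text deterministically) could look roughly like the sketch below. It follows the commenter's speculation, not documented AWS behavior. The `RULES` table, its field names, and the keyword matching are all assumptions made for illustration, and a real system would need something far better than keyword lookup for the matching step.

```python
import re

# Hypothetical formalized policy: structured values plus a fixed template.
RULES = {
    "refund": {
        "window_days": 30,
        "template": "Returns must be processed within {window_days} days of purchase.",
    },
}


def match_rule(question):
    """Pick a rule by keyword; stands in for the real matching step."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for key in RULES:
        if key in words:
            return key
    return None


def answer(question):
    """Render the matched rule from its template, with no generation step."""
    key = match_rule(question)
    if key is None:
        return "No matching policy rule."
    rule = RULES[key]
    return rule["template"].format(window_days=rule["window_days"])
```

Because the answer text comes from a template filled with the rule's own values, the wording cannot drift from the policy. The matching step is where a language model would still be involved.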

majestik · 1y ago

I hadn't heard of Amazon Bedrock Guardrails before, but after reading about it, it seems similar to Nvidia NeMo Guardrails which I have heard of: https://docs.nvidia.com/nemo/guardrails/introduction.html

The approaches seem very different though. I'm curious if anyone here has used either or both and can share feedback.

nl · 1y ago

This is an interesting approach.

By constraining the domain it is trying to cover, it makes grounding the natural language question in a knowledge graph tractable.

An analogy is type inference in a computer language: it can't solve every problem but it's very useful much of the time (actually this is a lot more than an analogy because you can view a knowledge graph as an actual type system in some circumstances).
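One way to read the "knowledge graph as a type system" remark is that each predicate declares which entity types it accepts, so a candidate statement can be rejected before any reasoning happens. The sketch below is a toy illustration of that reading. The schema, entity names, and `well_typed` function are hypothetical.

```python
# Each predicate declares (subject type, object type), like a function signature.
SCHEMA = {
    "purchased": ("Customer", "Product"),
    "returned": ("Customer", "Product"),
    "shipped_by": ("Product", "Carrier"),
}

# Hypothetical entities and their types.
ENTITY_TYPES = {
    "alice": "Customer",
    "laptop": "Product",
    "acme_freight": "Carrier",
}


def well_typed(subject, predicate, obj):
    """Return True when the triple fits the predicate's declared types."""
    if predicate not in SCHEMA:
        return False
    subject_type, object_type = SCHEMA[predicate]
    return (
        ENTITY_TYPES.get(subject) == subject_type
        and ENTITY_TYPES.get(obj) == object_type
    )
```

Here `well_typed("alice", "purchased", "laptop")` passes while `well_typed("laptop", "purchased", "alice")` fails, much as a type checker rejects swapped arguments. As with type checking, passing only shows the statement is well-formed, not that it is true.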
