I was interviewing with Claypot.ai, and when I met her for my first conversation she was on a walking treadmill, very clearly more interested in a Slack conversation she was having.
She moved me on to the next round, which I thoroughly bombed, and I was respectfully told that I wouldn't be moving on. That was the right decision, but I'll never forget watching her walking motion while she looked at Slack on her second monitor almost the entire time we were talking.
Looking back, it's a bit absurd to say that GOFAI would've gotten us to AGI if only the Frame Problem could be solved. But the important point is why that sounds so absurd.
It doesn't sound absurd because we found out that the frame problem can't be solved; that's beside the point.
It also doesn't sound absurd because we found out that solving the frame problem isn't the key to GOFAI-based AGI. That's also beside the point.
It sounds absurd because the conjecture itself is... just funny. It's almost goofy, looking back, how people thought about AGI.
Hallucination is the Frame Problem of the 2023 AI Summer. Looking back from the other side of the next Winter, the whole thing will seem a bit goofy.
Meanwhile, the neural nets (and ML) researchers just trucked on, with more compute power, and pretty much ignored any theoretical issues with uncertainty. And surprisingly, with lots of amazing results.
But now they've hit the same wall: we don't actually understand how to do reasoning under uncertainty correctly. LLMs seem to solve this by just mimicking the reasoning that humans do. But because we lack a good theory of reasoning, a model can't tell when mimicking is good and when it's bad unless there are a lot of specific examples. So in the most egregious cases we get hallucinations, and have no clue how to avoid them.
LLMs transpose the problem by mimicking what humans would do
Model hallucinations seem to me like a fancy name for model results that make no sense (i.e., blatant errors). Plus it makes the model seem more humanoid.
for most of us, what we wish for is what we believe.
It'll be interesting to see what improvements (in a lab or at a company) need to happen before most people use purpose-built LLMs (or behind the scenes LLM prompts) in the apps they use every day. The answer might be "no improvements" and we're just in the lag time before useful features can be built
But the biggest problem is that they take so much compute, which slows down both research and deployment. Only a handful of giant companies can train their own LLM, and it's a major undertaking even for them. Academic researchers and everyday tinkerers can only run inference on pretrained models.
So whoever is working on this problem, good luck, because you have a lot of work to do to get Markov chains to output only facts and not just correlations from the training data.
ChatGPT is often 'confidently wrong'. I'm pretty sure I've been confidently wrong a few times too, and I've met a lot of other people in my life who've expressed that trait from time to time, intentionally or otherwise.
I think there is an inherent trade off between 'confidence', 'expression', and of course 'a-priori bias in the input'. You can learn to be circumspect when you are unsure, and you can learn to better measure your level of expertise on a subject.
But you can't escape that uncertainty entirely. On the other hand, I'm not very convinced by efforts to train LLMs on things like mathematical reasoning. Those are situations where you really do have the tools to always produce an exact answer. For these problems, the goal shouldn't be to holistically learn how to both identify and solve them, but exclusively to identify and define them, and then hand them off to exact tools suited to computing the solution.
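A minimal sketch of that division of labor, assuming a toy setup: a regex stub stands in for the model's "identify and define" step (a real system would use the LLM for extraction, and might hand off to a full CAS instead of this small arithmetic evaluator):

```python
import ast
import operator
import re

# Exact tool: evaluates pure arithmetic expressions via the AST,
# so the answer is computed, never guessed.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def exact_eval(expr: str):
    """Exactly evaluate an arithmetic expression like '12 * (3 + 4)'."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not pure arithmetic")
    return walk(ast.parse(expr, mode="eval").body)

def answer(question: str):
    # Stand-in for the LLM's job: find and extract the expression.
    m = re.search(r"[\d+\-*/(). ]+\?*$", question)
    expr = m.group(0).rstrip("? ").strip()
    return exact_eval(expr)
```

Here `answer("What is 12 * (3 + 4)?")` extracts `12 * (3 + 4)` and returns 84 from the exact evaluator; the model never has to "know" multiplication.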
I'm not sure why, but the assumptions and naivety in this opening line bother me. There are plenty of goals and problems that orders of magnitude more people are working on today.
- organic data exhaustion - we need to step up synthetic data and its validation
- imbalanced datasets - catalog, assess and fill in missing data
- backtracking - make LLMs better at combinatorial or search problems
- deduction - we need to augment the training set for revealing implicit knowledge, in other words to study the text before learning it
- defragmentation - information comes in small chunks, sits in separate silos, and context windows are short; we need to use retrieval to bring it together for analysis
tl;dr We need quantity, diversity and depth in our training sets
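As a sketch of the defragmentation point: bag-of-words cosine similarity below is a crude stand-in for learned embeddings and a vector index, but the shape of the idea is the same: score the scattered chunks against a query and stitch the best ones into one context window.

```python
import math
from collections import Counter

def vectorize(text):
    # Toy "embedding": word counts. Real systems use learned embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assemble_context(query, chunks, k=2):
    # Pull the k most relevant chunks together into one context string.
    qv = vectorize(query)
    scored = sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)
    return "\n".join(scored[:k])
```

The point isn't the scoring function; it's that retrieval, not a longer context window, is what carries information across silos.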
- LLMs aren’t very good at large-scale narrative construction. They get so distracted by low-level details that they miss the high-level structure in long text. It feels like the same problem as Stable Diffusion giving people too many fingers.
- LLMs have 2 kinds of memory: current activations (context) and trained weights. This is like working memory and long term memory. How do we add short term memory? Like, if I read a function, I summarize it in my head and then remember the summary for as long as it’s relevant. (Maybe 20 minutes or something). How do we build a mechanism that can do this?
- How do we do gradient descent on the model architecture itself during training?
- Humans have lots more tricks to use when reading large, complex text - like re-reading relevant sections, making notes, thinking quietly, and so on. Can we introduce these thinking modalities into our systems? I bet they’d behave smarter if they could do this stuff.
- How do we combine multiple LLMs into a smarter overall system? Eg, does it make sense to build committees of “experts” (LLMs taking on different expert roles) to help in decision making? Can we get more intelligence out of chatgpt by using it in a different way in a larger system?
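The committee idea from the last bullet can be sketched in a few lines. The expert functions here are stand-in stubs (in practice they'd be one LLM prompted into different roles, or different models), and majority voting is only one of many possible aggregation rules:

```python
from collections import Counter

def committee(question, experts):
    """Ask each expert the same question and take a majority vote.

    experts: callables mapping a question to an answer string.
    Returns (winning answer, fraction of experts who agreed).
    """
    votes = Counter(expert(question) for expert in experts)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(experts)
```

With stubs like `[lambda q: "yes", lambda q: "yes", lambda q: "no"]`, the committee returns `"yes"` with 2/3 agreement; the agreement fraction is a cheap, if rough, confidence signal.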
What I mean is that every part of the output of an LLM should be annotated with references to the content that is most important or relevant to it.
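One crude way to realize this, as a sketch: after generation, attach to each output sentence the source passage it most resembles. Plain string similarity from difflib is a stand-in for whatever attribution signal a real system would use (retrieval scores, attention, etc.):

```python
import difflib

def annotate(output_sentences, sources):
    """Pair each generated sentence with its most similar source passage."""
    annotated = []
    for sent in output_sentences:
        best = max(sources,
                   key=lambda s: difflib.SequenceMatcher(None, sent, s).ratio())
        annotated.append((sent, best))
    return annotated
```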
Who is leading this effort now?
To truly eliminate hallucinations, I would think you'd have to change the initial training phase. Rather than only feeding raw text and predicting next tokens, you'd need to feed propositions labeled with some probability that they are actually true. Doing this with real fidelity is clearly not possible. No one has a database of all fact claims quantified by probability of truth. But you could potentially use the same heuristics used by human learners and impart some encoding of hierarchy of evidence. Give high weight to claims made by professional scientific organizations, high but somewhat lesser to conclusions of large-scale meta-analyses in relatively mechanistic fields, give very low weight to comments on Reddit.
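As a sketch of that hierarchy-of-evidence idea: the data pipeline could attach a per-source trust weight to each proposition, which then scales its contribution to the training loss. The source categories and weights below are illustrative assumptions, not a real curriculum:

```python
# Illustrative trust weights per source type (assumed, not empirical).
SOURCE_TRUST = {
    "scientific_org": 1.0,   # professional scientific organizations
    "meta_analysis": 0.9,    # large-scale meta-analyses in mechanistic fields
    "news": 0.5,
    "reddit_comment": 0.1,
}

def weighted_examples(corpus):
    """corpus: iterable of (text, source) pairs.

    Yields (text, loss_weight) pairs; the weight would scale each
    example's loss term during training.
    """
    for text, source in corpus:
        yield text, SOURCE_TRUST.get(source, 0.3)  # default for unknown sources
```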
That is all entirely possible but the manual human labor required seems antithetical to the business goals of anyone actually doing this kind of research. Without it, though, you're seemingly limited to either playing whack-a-mole with fine tuning out specific classes of error when they're caught or relying on a dubious assumption that plausibly human-generated utterances you're trying to mimic are sufficiently more likely to be true than false.
This problem arguably goes away if people treat LLMs for what they are, generators of strings that look like plausible human-generated utterances, rather than generators of fact claims likely to be true. But if we really want strong AI, we clearly need the latter. There is a reason epistemologists have long defined knowledge as justified true belief, not just incidentally lucking into being correct.
I don't know if that is the direction, but just an example that comes to mind easily.
If someone figures out how to do this, I think their models will be far more capable and reliable.
I wonder if a deterministic but natural-language API would be any better for LLMs to integrate with? Or do LLMs already speak JSON well enough?
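On the JSON side, at least the integration boundary can be made deterministic: parse the model's output and verify the fields the downstream API actually needs, re-prompting on failure. The two-field schema here is a made-up example:

```python
import json

# Hypothetical schema for a tool call the downstream API expects.
REQUIRED = {"action": str, "target": str}

def parse_tool_call(raw: str):
    """Return the parsed call if the model's output is valid JSON with the
    required fields; otherwise None (caller can re-prompt the model)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    if not all(isinstance(obj.get(k), t) for k, t in REQUIRED.items()):
        return None
    return obj
```

Whether the model emits natural language or JSON, the receiving side never has to trust raw text; malformed output fails closed instead of propagating.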