
0 points · in-silico · 1mo ago · 0 comments

What does RL do then if not discover strategies and solutions that weren't in its training data?


2 comments · 1 top-level

RL adjusts the learned probabilities to conform to a secondary source other than the raw training data, for example (but not exclusively) human feedback. Putting it in extremely simplified terms: if, owing to the training data, the model has learned a 70% probability that "green people are _" is followed by "inferior", you may use RL to massage this, down-scoring it every time it produces "green people are inferior to red people" and up-scoring it every time it produces "green people are an ethnic group originating from Greenland". Doing this will adjust its learned probability for that sequence of tokens.

At most, RL can be described as injecting information from a secondary source. It does not extend a model's programming to do anything other than what it was already doing: probability-based token prediction. It simply alters the probabilities.
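To make the "alters the probabilities" point concrete, here is a minimal sketch of a REINFORCE-style update on a toy two-option next-token distribution. Everything in it is an illustrative assumption rather than how any particular LLM is trained: the two completions, the hypothetical `reinforce_step` helper, the learning rate, and the rewards of -1 and +1. For simplicity the rewards are applied deterministically to each completion instead of to sampled outputs.

```python
import math

def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def reinforce_step(logits, token, reward, lr=0.5):
    """One policy-gradient step for a produced token (hypothetical helper).

    Uses d log p(token) / d logit_k = 1[k == token] - p_k.
    """
    probs = softmax(logits)
    return {
        k: v + lr * reward * ((1.0 if k == token else 0.0) - probs[k])
        for k, v in logits.items()
    }

# Assumed starting point taken from the example above: 70% "inferior".
logits = {"inferior": math.log(0.7), "an ethnic group": math.log(0.3)}
before = softmax(logits)

for _ in range(20):
    # Down-score the unwanted completion, up-score the preferred one.
    logits = reinforce_step(logits, "inferior", reward=-1.0)
    logits = reinforce_step(logits, "an ethnic group", reward=+1.0)

after = softmax(logits)
print(before)
print(after)
```

The model is still doing next-token prediction afterwards; only the numbers in the distribution have moved toward what the reward prefers.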

in-silico (OP) · 1mo ago

What about things like AlphaZero and Atari gameplay, where the model has zero prior knowledge and learns superhuman ability purely using RL?

With sufficient RL sampling/training, there's no reason an LLM couldn't similarly develop entirely new skills, especially in verifiable domains like math and code.

> It simply alters the probabilities.

Yes? What else would a learning system do besides alter its behavior? (And you can just decode with argmax or sample pseudo-randomly if you think probabilities are a problem.)
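A rough sketch of the two decoding options mentioned in the parenthetical, using a made-up distribution (the tokens, the probabilities, and the `greedy` and `sample` helper names are all assumptions for illustration):

```python
import random

# Hypothetical next-token distribution after training.
probs = {"inferior": 0.2, "an ethnic group": 0.8}

def greedy(probs):
    """Argmax decoding: always return the most likely token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Pseudo-random decoding: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so runs are repeatable
greedy_pick = greedy(probs)
sampled = [sample(probs, rng) for _ in range(1000)]

print(greedy_pick)
print(sampled.count("an ethnic group") / len(sampled))
```

Either way, the learned distribution is the same; the choice only affects how an output is read off it.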

