RL adjusts the learned probabilities to conform to a secondary source other than the raw training data, for example (but not exclusively) human feedback. Putting it in extremely simplified terms: If, owing to the training data, the learned probability for "green people are _" is 70% to be followed by "inferior", you may use RL to massage this, de-scoring it every time it produces "green people are inferior to red people" and up-scoring it every time it produces "green people are an ethnic group originating from Greenland". Doing this will adjust its learned probability for that sequence of tokens.
At most, RL can be described as injecting information from a secondary source. It is not extending a model's programming to do anything other than what it was already doing, probability-based token prediction. It simply alters the probabilities.