> NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words).
They are trained to maximise this, yes.
> Their weights are a model of this distribution.
That doesn't really follow, but let's leave that.
> Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.
Here's the rub. If what you describe were all they're doing, a sequence of never-before-seen words would have no valid response: every next word would be equally likely. A single brand-new word would result in absolute gibberish following it, as there'd be nothing to go on.
Let's try:
Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.
Result: You now have two kjsdhlisrnj.
I would wager a solid amount that kjsdhlisrnj never appears in the training data. If it does, pick another one; it doesn't matter.
So we are learning something more general than the frequencies of sequences of tokens.
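For contrast, here's a rough sketch of what a literal frequency table would do with an unseen word. It's a toy bigram counter over a made-up three-sentence corpus, and the function names are mine, so treat it as an illustration of the "just frequencies" claim rather than anything a real LLM does. With no counts for the context, the best it can manage is a uniform guess:

```python
from collections import Counter, defaultdict

def train_bigram_counts(corpus):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev, vocab):
    """P(next | prev) from raw counts; uniform if prev was never seen."""
    seen = counts.get(prev)  # .get avoids creating an empty entry
    if not seen:
        # Nothing to go on: every word in the vocabulary is equally likely.
        return {w: 1 / len(vocab) for w in vocab}
    total = sum(seen.values())
    return {w: seen[w] / total for w in vocab}

# Toy corpus (an assumption for illustration only).
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the dog sat on the rug",
]
counts = train_bigram_counts(corpus)
vocab = sorted({w for s in corpus for w in s.split()})

known = next_word_distribution(counts, "the", vocab)
unseen = next_word_distribution(counts, "kjsdhlisrnj", vocab)

print(known)   # peaked on words that actually followed "the"
print(unseen)  # flat: the table has no information about this context
```

A pure lookup like this can't produce "two kjsdhlisrnj", which is the point: the network must be doing something other than reading off counts.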
I always end up pointing to this, but OthelloGPT is very interesting: https://thegradient.pub/othello/
While it's trained only on sequences of moves, what it does is more than just "sequence a,b,c is followed by d most often": the article reports that it appears to form an internal representation of the board state.