It depends on what "generalize beyond the training data" means. If I invent a new programming language, teach it to the model in-context, and the model is able to use it to solve many tasks, is it generalizing beyond the training data?
No. The way I'd look at it is that generalization, or specifically extrapolation, would mean that making a prediction (here, the next token) requires different features than those seen in the training data. Something like a made-up language could still result in the same patterns being relevant. That's why out-of-distribution research often uses mathematical extrapolation as a task.
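To make the mathematical-extrapolation point concrete, here is a minimal sketch of the kind of toy setup I have in mind. The specifics are my own assumptions for illustration: a degree-5 polynomial stands in for "the model", sin(x) is the target, training inputs come from [0, pi], and the out-of-range test inputs come from [2*pi, 3*pi]. None of this is taken from a particular paper.

```python
import numpy as np

def fit_and_evaluate(degree=5, seed=0):
    """Fit a polynomial to sin(x) on [0, pi], then compare error inside
    and outside the training range. All choices here are illustrative."""
    rng = np.random.default_rng(seed)

    # Training data: inputs drawn only from [0, pi].
    x_train = rng.uniform(0.0, np.pi, size=200)
    y_train = np.sin(x_train)

    # Least-squares polynomial fit (the stand-in "model").
    coeffs = np.polyfit(x_train, y_train, deg=degree)

    # In-range test points vs. points well outside the training range.
    x_in = np.linspace(0.0, np.pi, 100)
    x_out = np.linspace(2 * np.pi, 3 * np.pi, 100)

    in_err = np.mean((np.polyval(coeffs, x_in) - np.sin(x_in)) ** 2)
    out_err = np.mean((np.polyval(coeffs, x_out) - np.sin(x_out)) ** 2)
    return in_err, out_err

in_err, out_err = fit_and_evaluate()
print(f"MSE inside training range:  {in_err:.3e}")
print(f"MSE outside training range: {out_err:.3e}")
```

Inside the training range the fit should be close, because the features it learned (a local polynomial shape) are the ones that matter there. Outside the range the prediction would need periodicity, a feature the model never had to represent, so the error grows sharply.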
Can you provide a real-world example? Because this sounds like nonsense. As in, it's not a weakness of any particular architecture, just of the very concept of pattern matching.
What you might be asking for is a system that simply continually learns.