Working with these models every day, it's clear that they can certainly interpolate between points in latent space and generate sensible answers to unseen questions, but it's pretty clear that they don't generalize. I've seen far to many examples of models failing to display any sense of generalization to believe otherwise.
That's not to say that interpolation in a rich latent space of text isn't very useful. But it's not the same level of abstraction that comes from true generalization in this space.