Interestingly, I find that models generalize decently well as long as the "training" (in the human sense of the word) fits within a small enough context window. That is to say, "in-context learning" seems good enough for real use.
But of course, that's not quite "long-term" learning: nothing persists once the examples fall out of context.
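To make concrete what I mean by "in-context learning", here's a minimal sketch: the "training" is just a handful of worked examples packed into the prompt, and the model generalizes from them at inference time. The `complete()` function, the example reviews, and the labels are all stand-ins I've made up; any text-completion or chat endpoint would slot in.

```python
# In-context learning sketch: a few worked examples go into the prompt;
# no weights change, and nothing is retained after the call.

# Hypothetical few-shot examples (made up for illustration).
EXAMPLES = [
    ("great food, terrible service", "mixed"),
    ("best meal I've had all year", "positive"),
    ("cold fries and a rude waiter", "negative"),
]

def build_prompt(query: str) -> str:
    """Format the few-shot examples plus the new query into one prompt."""
    shots = "\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in EXAMPLES
    )
    return f"{shots}\nReview: {query}\nSentiment:"

def complete(prompt: str) -> str:
    # Placeholder for your model call (OpenAI, llama.cpp, etc.);
    # all the "learning" lives in the prompt, so any endpoint works.
    raise NotImplementedError("wire up your model API here")

if __name__ == "__main__":
    print(build_prompt("the soup was lukewarm"))
```

The caveat above is the flip side of this design: since the examples only exist in the prompt, the "learning" scales with (and is bounded by) the context window.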