I've given it descriptions of non-existent "franken-languages", composed by telling it to imagine taking programming language A and adding various features I want to explore, and then had it correctly reason symbolically about a program written in this hypothetical language that doesn't exist anywhere. So yeah, the notion that it doesn't generalize to at least some degree is nonsense. But note that this involved tests on a GPT-2 scale model, so it's not very surprising they had poor results.
That said, even GPT-4 certainly has pretty significant limitations on what it manages to reason about. But, setting aside how their capabilities compare in other respects, arguably so do most humans. We tend to force our way past those limitations by learning incrementally, doing things over and over. Current models don't get that luxury without complicated fine-tuning steps, so if anything, what should surprise us is how well they do with only context to act as short-term memory.