If you gave this puzzle to humans, I bet a not-insignificant proportion would respond as if it were the traditional puzzle as soon as they heard the words "cabbage", "lion", and "goat". It's not exactly surprising that a model trained on human output would make the same assumption. But that doesn't mean it can't reason about the puzzle properly once you point out that the assumption was incorrect.
With Bing, you don't even need to tell it what it assumed wrong - I just said it's not quite the same as the classic puzzle, and it responded by correctly identifying the difference and asking me if that's what I meant, but it forgot that the lion still eats the goat. When I pointed that out, it solved the puzzle correctly.
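Incidentally, this kind of puzzle is trivial to solve mechanically, which makes it a decent test of whether the model is pattern-matching or actually tracking the constraints. Here's a minimal brute-force sketch in Python - note that the EATS relation is my assumption, since the variant's exact rules aren't spelled out above (I'm assuming the lion-eats-goat constraint survives and the goat-eats-cabbage one is what was dropped):

```python
from collections import deque

# Assumed rules for the variant: the lion still eats the goat, and the
# classic goat-eats-cabbage constraint was dropped. Edit EATS to match
# the actual puzzle.
EATS = {("lion", "goat")}
ITEMS = frozenset({"cabbage", "goat", "lion"})

def safe(bank):
    """A bank is safe if no predator is left there alone with its prey."""
    return not any(pred in bank and prey in bank for pred, prey in EATS)

def solve():
    # State: (items still on the near bank, farmer's location).
    start = (ITEMS, "near")
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        (left, farmer), path = queue.popleft()
        if not left and farmer == "far":
            return path  # everything has crossed
        here = left if farmer == "near" else ITEMS - left
        # The farmer crosses alone (None) or with one item from his bank.
        for cargo in [None, *sorted(here)]:
            new_left = set(left)
            if cargo is not None:
                if farmer == "near":
                    new_left.remove(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            new_farmer = "far" if farmer == "near" else "near"
            # The bank the farmer just left must be safe without him.
            behind = new_left if new_farmer == "far" else ITEMS - new_left
            state = (new_left, new_farmer)
            if safe(behind) and state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo or "(nothing)"]))

# Under the assumed rules this prints a 5-step plan:
# ['goat', '(nothing)', 'cabbage', '(nothing)', 'lion']
print(solve())
```

The point isn't that an LLM should run BFS internally, just that the state space is tiny - any failure here is about tracking the constraints, not about search.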
Generally speaking, I think your point that "when solving the puzzle you might visualize" is correct, but that's orthogonal to an LLM's ability to reason in general. Rather, it has a hard time reasoning about things it doesn't understand well enough (i.e. things for which the internal model it built up during training is way off). This seems to be generally the case for anything having to do with spatial orientation - even fairly simple multi-step tasks involving concepts like "left" vs "right" or "on this side" vs "on that side" can go hilariously wrong.
But if you give it a different task, you can see reasoning in action. For example, have it play a guess-the-animal game with you while telling it to "think out loud".