but... it understands the meat-eating goat part just fine?
That it hasn't learned enough doesn't show that this approach can never learn, which seems to be the point you're making.
Its input dataset is many orders of magnitude bigger than the model itself - it can't "remember" all of its training data.
Instead, it collects statistics about how certain tokens tend to relate to other tokens - like learning that "goats" often "eat" "leafy greens". It also learns to group tokens together into meta-tokens, like understanding that "red light district" carries connotations that none of those words has individually.
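To make the meta-token part concrete, here's a toy sketch in Python, loosely in the spirit of byte-pair encoding: repeatedly fuse the most frequent adjacent pair of tokens into a single unit. The mini corpus and the merge loop are invented for illustration - this isn't how any real tokenizer or model is actually implemented:

```python
from collections import Counter

# Made-up corpus; in practice this would be billions of documents.
corpus = [
    "tourists avoid the red light district at night".split(),
    "the red light district is near the station".split(),
    "he ran a red light yesterday".split(),
]

def merge_most_frequent_pair(sentences):
    """Find the most common adjacent token pair and fuse it everywhere."""
    pair_counts = Counter()
    for sent in sentences:
        pair_counts.update(zip(sent, sent[1:]))
    (a, b), _ = pair_counts.most_common(1)[0]
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) == (a, b):
                out.append(a + "_" + b)  # fused into one meta-token
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged, (a, b)

for _ in range(3):
    corpus, pair = merge_most_frequent_pair(corpus)
    print("fused:", pair)
```

Run it and "red light" fuses into one token first, then grows into "the red light district" over the next merges - at which point the model can attach meaning to that unit that none of the individual words carries.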
Is this process of gathering connections between the different kinds of things we experience much different from how humans learn? We don't know for sure, but it seems to be pretty good at learning anything thrown at it. Nobody tells it how to make these connections; it just does, based on the input data.
A separate question, perhaps, is how some concepts would be much harder to understand if you were a general intelligence in a box that could only ever experience the world via written messages in and out, and how others would be much easier (one might imagine that language itself would come faster, given the lack of other stimulation). Things like "left" and "right" or "up" and "down" would be about as hard to understand properly as the minutiae of particle interactions (which humans can only experience in the abstract too).