Ok, here's a (hopefully) better-worded puzzle, again made up by myself right now.
There are 12 frogs: five green, three red, and four yellow. Two donkeys are counting the frogs; one donkey is yellow, the other green. Each donkey is unable to see frogs that are the same color as itself, and each donkey was careless and missed one frog while counting. How many frogs does the green donkey count?
GPT-4 answers 6 every time for me.
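For reference, the intended arithmetic can be sanity-checked in a couple of lines (this just encodes my reading of the puzzle, not anything GPT-4 does internally):

```python
# Frog counts from the puzzle.
total = 12
green, red, yellow = 5, 3, 4
assert green + red + yellow == total

# The green donkey can't see the 5 green frogs, and carelessly
# missed one more frog while counting.
green_donkey_count = (total - green) - 1
print(green_donkey_count)  # prints 6
```

So 6 is indeed the answer the puzzle is fishing for.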
My point is that GPT is capable of a certain amount of "reasoning" about puzzles that most certainly don't exist in its training data. Playing with it, it's clear that in this current generation the reasoning ability doesn't go very deep: change the above puzzle even slightly to make it more complicated and it breaks. The amazing thing isn't how good it is at reasoning, but that a computer can reason at all.