But the human (in the comment chain) here made exactly the same mistake!
In that sense this test doesn’t seem to be a good fit for testing the reasoning capabilities. Since it‘s also easy to get wrong for humans (and humans also don’t always reason about everything from first principles, especially if they have similar answers already cached in their memory).
It seems you would need novel puzzles that aren’t really common (even if in kind) and don’t really sound similar to existing puzzles to get a handle on its reasoning capabilities.