The difference between GPT-3.5 and GPT-4 is really interesting. GPT-4 is better able to reason through basically any problem I throw at it. Personally, I think a hypothetical system that can generate a sufficiently good response to ANY text input is AGI.
There aren't really any "gotcha" cases with this technology that I'm aware of where it just can't ever respond appropriately. Most clear failings of existing systems involve ever more contrived logic puzzles, which each successive generation manages to solve; eventually the required puzzle will be so dense that few humans could solve it either.
This isn't a case of "studying for the test" on popular internet examples either. I encourage you to invent your own gotchas for earlier versions, then try them on newer models. Change the wording and order of logic puzzles, or embed them within scenarios, to make sure it's not just responding to the format of the prompt.
There are absolutely cases of people overhyping it, or of it overfitting to training data (see the debacle about it passing whatever bar exam, university test, etc.). But despite the hype, there is an underlying level of intelligence that is building, and I use it to solve problems pretty much every day. At the moment I think of it as a 4-year-old that has inexplicably read every book ever written.