
0 points · sathish316 · 2mo ago · 0 comments

I can feel the AGI on this one :)

I ran extensive tests on this and variations of it across multiple models. Most models interpret 50 m as a short distance and struggle with spatial reasoning. Only Gemini and Grok correctly inferred in their thought stream that you would need to bring your car to get it washed, and incorporated that into the final answer. GPT-5.2, Kimi K2.5, and even Opus 4.6 failed in my tests - https://x.com/sathish316/status/2023087797654208896?s=46

What surprised me was how introducing a simple, seemingly unrelated context - such as comparing a 500 m distance to the car wash to a 1 km workout - confused nearly all the models. Only Gemini Pro passed my second test after I added this extra irrelevant context - https://x.com/sathish316/status/2023073792537538797?s=46

Most real-world problems are messy and won’t have the exact clean context that these models are expecting. I’m not sure why the major AI labs assume most real-world problems are simpler than this example, whose constraints (prerequisites, ordering, and contextual reasoning) already pose challenges for these bigger models.


K0balt · 2mo ago

To be fair, we all have holes in our reasoning if we don’t carefully consider things, and sometimes they are very surprising when they come to light. The dependency issue (needing the car at the car wash) is an easy one that often trips people up at first glance too. (Left my phone at work. Plan: take an Uber to the office. Walk to the couch and remember I don’t have my phone to call an Uber.)

Things like that are notorious points of failure in human reasoning. It’s not surprising that machines based on human behavior exhibit that trait as well; it would be surprising if they didn’t.

kenjackson · 2mo ago

Another simple example is using the flashlight on your phone to look for your phone.

K0balt · 2mo ago

Oh the cringe. Got me.

jansan · 2mo ago

> I can feel the AGI on this one :)

This was probably meant sarcastically, but isn't it impressive how you cannot push Gemini off track? I tried another prompt claiming that one of my cups does not work because it is closed at the top and open at the bottom. It kind of played along, giving me a funny technical explanation of how to solve that problem and finally asking me if that was a trick question.

In this case I can feel the AGI indeed.

sathish316 (OP) · 2mo ago

Gemini Fast and Thinking failed just like other models.

I found Gemini Pro to be more consistent in logical reasoning.
