It's actually hard to find short, simple, "plain English" failure cases like the above.
The "chain of reasoning" that modern models deploy before failing is funny too. This is GPT-4:
---
To determine the relationship between cherries and bananas based on your statements, let's break it down:
1. Apples are better than bananas.
2. Cherries are worse than apples.
From statement 1, we know apples rank higher than bananas. Statement 2 tells us cherries rank lower than apples. By this logic, since cherries are lower than apples, which are higher than bananas, it follows that cherries are also lower than bananas.
Therefore, based on these comparisons, cherries are not better than bananas.
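
For what it's worth, the gap in that reasoning can be checked mechanically. The following is only a rough sketch, assuming "better than" means a strict ranking with no ties; the `consistent` helper and the variable names are made up for illustration. It enumerates every ranking of the three fruits and keeps the ones that satisfy both statements:

```python
from itertools import permutations

fruits = ["apples", "bananas", "cherries"]

def consistent(ranking):
    # ranking lists fruits from best to worst, so a lower index means "better"
    pos = {fruit: i for i, fruit in enumerate(ranking)}
    apples_beat_bananas = pos["apples"] < pos["bananas"]    # statement 1
    cherries_below_apples = pos["cherries"] > pos["apples"]  # statement 2
    return apples_beat_bananas and cherries_below_apples

# Keep only the rankings that agree with both statements
valid = [r for r in permutations(fruits) if consistent(r)]
for r in valid:
    print(" > ".join(r))

# Collect the possible answers to "are cherries better than bananas?"
answers = {r.index("cherries") < r.index("bananas") for r in valid}
print("determined" if len(answers) == 1 else "not determined by the statements")
```

Two rankings survive, one with cherries above bananas and one with cherries below, so the two statements alone don't settle the question and the "it follows that" step in the quoted answer has nothing to stand on.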