And their poor performance on these tasks highlights deficits in exactly the kind of higher-order, off-the-page reasoning skills that the models are supposed to develop: reasoning not just about the apparent objects in the stream (the kiwis and the numbers in this case), but about the token stream itself ("okay, these tokens are important, but these others I can leave out"), efficiently and seamlessly, like humans do.
This whole "attention" business, as they're calling it.