In the other test, the perturbations aren’t particularly sophisticated: they just modify the problem according to a template. As the parent comment said, templated variations like that are pretty easy to generate test data for (and for the model to pattern match against), so maybe that is what they did.
A better test of “reasoning” would be to isolate the concept/algorithm and generate novel instances that are completely textually different from existing problems, to see whether the model is doing more than pattern matching. But we already know the answer to this, because the model can’t do things like arbitrary-length multiplication.
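To make the multiplication point concrete, here is a rough sketch of how you could generate fresh instances at increasing operand lengths and score them against exact integer arithmetic. The function names and the `solver` callable are made up for illustration; `solver` stands in for whatever wraps the model call and parses its reply into an int. Note that this only varies the numbers, not the wording, so it tests the algorithm and not textual novelty.

```python
import random

def make_multiplication_instance(num_digits, rng):
    """Build one random multiplication problem with exactly num_digits per operand."""
    low = 10 ** (num_digits - 1)
    high = 10 ** num_digits - 1
    a = rng.randint(low, high)
    b = rng.randint(low, high)
    # Python ints are arbitrary precision, so a * b is exact ground truth.
    return {"prompt": f"What is {a} * {b}?", "a": a, "b": b, "answer": a * b}

def accuracy_by_length(solver, digit_lengths, trials=20, seed=0):
    """Score a solver (hypothetical: prompt string -> int) at each operand length."""
    rng = random.Random(seed)
    results = {}
    for n in digit_lengths:
        correct = 0
        for _ in range(trials):
            inst = make_multiplication_instance(n, rng)
            if solver(inst["prompt"]) == inst["answer"]:
                correct += 1
        results[n] = correct / trials
    return results
```

If the model had really learned the algorithm, accuracy should stay roughly flat as `num_digits` grows. A curve that falls off with length would point toward pattern matching on short, familiar cases.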