Erm, no it wouldn't. Two addresses with identical index bits will still collide after you XOR those identical indexes with the (same) random pattern.
You would need to do some bit mixing. The simplest would be to multiply the non-offset bits by a prime. But a multiply is still a lot slower than physically wiring the index bits to the mux.
The problem is a little more nuanced than the article suggests. Notice that you don't actually need any of the data to stay in the cache during the loop—you read and write an address, then never return to it again. So the fact that you overflow your associativity doesn't specifically matter; you're happy to have it evict any or every entry.
I often get these things wrong, but I believe what's happening is that it's getting blocked on the writeback from the cache evictions (because they're now dirty from the fresh write). Which means that the 256 vs 257 distinction is a tiny bit fake; if you have a large enough array, both will eventually slow down to at best the speed of the next lower cache layer. The 257 stride will postpone the reckoning longer if the cache starts out partly clean, since the writes will just pile up in the cache for a while. (Shorter version: the 257 stride will indeed probably go faster, but it's probably better to think of the effective size of the cache rather than thinking of it as just faster vs slower.)
The article carefully sets N to 2*21 to demonstrate the behavior. (The overall conclusion is still correct!)