Hello, yes the approach is slower in my implementation, but I've the feeling that a suitable F (much simpler) and a low number of rounds could do the trick. However the highlight in the original post was that the
ports were not able to reproduce the effect. Given that the ports are AFAIK successive and use higher resolutions, I bet that the CPU was not an issue in that case.
EDIT: I just randomly checked that 4 rounds of F = ((r * 31) ^ (r >> 3)) & 0xff provide more or less the same result. Multiplying for 31 is the same as shifting 5 bits on the left and subtracting the number again, so it's just 4 rounds of bit shifting and xor.