This is good, what you need can be implemented significantly more efficiently than the CS version. As you mentioned, it doesn't help your application to be truly nondeterministic.
C rand() is about the worst one you can find. Mersenne Twister is great statistically, but it's quite expensive in terms of memory consumption.
As I understand GPU programming, you want to maximize parallelism. Ideally, you would have something small enough to have an independent generator for each processor. You should be able to generate perfectly random-looking numbers using 64 or 128 bits of state storage and a handful of instructions per 32-bits of output.
For example, the Skein hash function and CSPRNG can generate random data at 5 cycles per byte of output. But that's cryptographically secure, you can probably cut its state size and computational expense each in half to get statistical randomness.