Floating point associativity differences can lead to non-determinism with 0 temperature if the order of operations are non-deterministic.
Anyone with reasonable experience with GPU computation who pays attention knows that even randomness in warp completion times can easy lead to non-determinism due to associativity differences.
For instance:
https://www.twosigma.com/articles/a-workaround-for-non-deter...
It is very well known that CUDA isn't strongly deterministic due to these factors among practitioners.
Differences in batch sizes of inference compound these issues.
Edit: to be more specific, the non-determinism mostly comes from map-reduce style operations, where the map is deterministic, but the order that items are sent to the reduce steps (or how elements are arranged in the tree for a tree reduce) can be non-deterministic.