The solution I have in that issue adapts from the very helpful discussions in the original Pytorch issue [2]
`worker_init_fn=lambda id: np.random.seed((torch.initial_seed() + id) % 2**32)`
I will admit that this is *very* easy to mess up, as evidenced by the fact that examples in the official PyTorch tutorials and other well-known codebases suffer from it. In the PyTorch training framework I've helped develop at work, we've implemented a custom `worker_init_fn` as outlined in [1] that is the default for all "trainer" instances, which are responsible for instantiating DataLoaders in 99% of our training runs.
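A side note on why the snippet often quoted in this thread (`torch.initial_seed() // 2*32 + id`) is broken: `// 2*32` parses as `(seed // 2) * 32`, not `seed mod 2^32`, which is what NumPy's 32-bit seed limit actually requires. A minimal sketch of the difference:

```python
# Operator precedence pitfall: "seed // 2*32" means (seed // 2) * 32,
# not seed mod 2**32 -- and NumPy's legacy seed must be < 2**32.
seed = 2**63 - 1  # a typical 64-bit value like torch.initial_seed() returns

wrong = seed // 2 * 32   # still a huge number, far above 2**32
right = seed % 2**32     # always a valid NumPy seed in [0, 2**32)

print(wrong > 2**32)  # True: np.random.seed(wrong) would raise ValueError
print(right < 2**32)  # True
```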
Also, as an aside, Holy Clickbaity title Batman! Maybe I should have blogged about this 2 years ago. Heck, every 6 months or so, I think that, and then I realize that I'd rather spend time with my kids and on my hobbies when I'm not working on interesting ML stuff and/or coding. An added side benefit is not having to worry about making idiotic clickbaity titles like this to farm karma, or provide high-quality unpaid labor for Medium in order for my efforts to be actually seen by people. But it could also just be that I'm lazy :-)
torch.utils.data.get_worker_info().seed
So I guess something like the below (untested!) could work too:

`worker_init_fn=lambda id: np.random.seed(torch.utils.data.get_worker_info().seed % 2**32)`

(The modulo is needed because the worker seed is 64-bit while NumPy's legacy seed must fit in 32 bits.) It’s a little annoying to have to set and pass RNG state explicitly, but on the plus side you never hit these sorts of issues. Your code will also be completely reproducible, without any chance of spooky “action at a distance.” Once you’ve been burned by this a few times, you’ll never go back.
You might think that explicitly seeding the global RNG would solve reproducibility issues, but it really doesn’t. If you call into any code you didn’t write, it might also be using the same global RNG.
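As a sketch of what "passing RNG state explicitly" looks like with NumPy's `Generator` API (the `augment` function here is a hypothetical example, not from the article):

```python
import numpy as np

def augment(batch, rng):
    # Hypothetical augmentation that takes its RNG explicitly instead of
    # reaching into np.random's global state.
    return batch + rng.normal(size=len(batch))

out1 = augment(np.zeros(3), np.random.default_rng(1234))

# Library code reseeding or consuming the global RNG can no longer
# perturb our results:
np.random.seed(0)
np.random.random()

out2 = augment(np.zeros(3), np.random.default_rng(1234))
print((out1 == out2).all())  # True: the private stream is isolated
```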
The post just stresses that one should be careful when mixing random state and multiprocessing: either reseed after forking, or use a multiprocess/multithread-aware RNG API.
> I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem. It’s inside PyTorch’s official tutorial, OpenAI’s code, NVIDIA’s projects, etc. [1]
I know this bc I fixed the bug. And probably caused it. Hehe.
Also you don't just want to set your NumPy seed but also the native Python one and the torch one.
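As a sketch of what that looks like (the helper name and `base_seed` plumbing are illustrative; in a real DataLoader you'd wire this up as `worker_init_fn` and derive the base seed from `torch.initial_seed()`):

```python
import random
import numpy as np

def seed_all_rngs(worker_id, base_seed):
    # Derive one per-worker seed and push it into every RNG library your
    # transforms might touch, not just NumPy.
    seed = (base_seed + worker_id) % 2**32  # NumPy's legacy seed must be < 2**32
    random.seed(seed)      # Python's built-in RNG
    np.random.seed(seed)   # NumPy's legacy global RNG
    # torch.manual_seed(seed)  # plus the torch RNG when torch is in play

seed_all_rngs(0, 42)
a = (random.random(), np.random.random())
seed_all_rngs(1, 42)
b = (random.random(), np.random.random())
seed_all_rngs(0, 42)
c = (random.random(), np.random.random())
print(a == c, a != b)  # reproducible per worker, distinct across workers
```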
Coincidentally I find this article timely, as I was recently reviewing the PyTorch DataLoader docs regarding random number generator seeding. It’s the kind of thing unit tests don’t pick up, since it only occurs when you use separate worker processes.
The issue here is a little more subtle: if you fork 10 copies of your Python process, all 10 inherit the current RNG state, and will thereafter produce identical random number sequences. If you were manually forking, you might guess that was a potential problem, and re-seed the RNGs after forking. But PyTorch's data loaders fork a bunch of processes to do things in parallel, so users might not realize that they're using duplicate copies of their RNG state.
Python multiprocessing doesn’t use fork on Windows. It starts a new process and so shouldn’t be affected by this.
So to trigger this you need to have num_workers > 0 on your DataLoader and be running on a non-Windows platform.
1) this will help reproducibility a great deal, which is a pain so often.
2) forcing users to actually understand RNG seeding from the time they are novice programmers could help prevent bugs of the sort seen in this post, which I believe stem from having too much faith that RNGs will simply work out of the box as substitutes for ‘real’ random variables.
But you also need to watch out for certain known bad seeds, which destroy a PRNG's statistical properties. Most PRNGs have a few known bad seeds, yet hardly anyone guards against them. Same for hash functions.
I'm looking at some code that uses random.random() to randomly apply augmentations, I suspect that will have the same issue right?
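Yes — `random.random()` draws from Python's global Mersenne Twister, which is forked along with everything else. (Whether PyTorch reseeds Python's `random` in workers depends on the version, so don't rely on it.) You can simulate the inherited-state effect without forking at all, using `getstate`/`setstate`:

```python
import random

random.seed(0)
state = random.getstate()  # the state every forked child would inherit

child_a = random.Random()
child_a.setstate(state)
child_b = random.Random()
child_b.setstate(state)

# Both "workers" make identical augmentation decisions:
draws_a = [child_a.random() for _ in range(5)]
draws_b = [child_b.random() for _ in range(5)]
print(draws_a == draws_b)  # True
```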
For splitting up sequential ranges, a good rng typically has an advance function to advance the seed for each range. So you can get reproducibility.
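NumPy's newer `Generator` API has exactly this: `PCG64` supports `.advance()` and `.jumped()`, and `SeedSequence.spawn()` derives independent, reproducible child streams from one root seed. A sketch:

```python
import numpy as np

# One independent, reproducible stream per worker, derived from a root seed.
root = np.random.SeedSequence(12345)
rngs = [np.random.default_rng(s) for s in root.spawn(4)]
draws = [rng.random() for rng in rngs]
print(draws)  # four distinct values, fully determined by the root seed

# PCG64 can also deterministically skip ahead within a single stream:
bg = np.random.PCG64(0)
bg.advance(1_000_000)  # jump the state forward a million steps, in O(log n)
```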
A bit like Stockholm syndrome - "Python doesn't do threading" is so ingrained in its users (and I'm a user) minds that it's not even questioned as a potential source of problems.
(No one said it's easy to do. That's why language developers and implementers are a special breed even today.)
1) This is an issue from 2018 (https://github.com/pytorch/pytorch/issues/5059), which links to the closed numpy issue (https://github.com/numpy/numpy/issues/9248) which just says: seed your random numbers folk.
2) The documentation in pytorch covers this (https://pytorch.org/docs/stable/data.html#randomness-in-mult...), but it's not really highlighted specifically in, eg. tutorials. (but it is in the FAQ https://pytorch.org/docs/stable/notes/faq.html#dataloader-wo...)
3) It doesn't affect windows, which uses spawn instead of fork.
4) To quote the author:
> I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem.
^ No actual stats, just some vague hand waving; this just seems like nonsense.
So, I suppose... there's some truth to it being a documentation issue, but I guess the title + (1-3) kind of say to me: OP thought they discovered something significant... turns out, they didn't.
Oh well, spin it into some page views.
i had exactly the same thought - if they'd actually crawled github they'd have some nice plots to back up the claim.
Title seems pretty accurate to me!