The solution I have in that issue adapts from the very helpful discussions in the original Pytorch issue [2]
`worker_init_fn=lambda id: np.random.seed((torch.initial_seed() + id) % 2**32)`
I will admit that this is *very* easy to mess up, as evidenced by the fact that examples in the official PyTorch tutorials and other well-known codebases suffer from it. In the PyTorch training framework I've helped develop at work, we've implemented a custom `worker_init_fn` as outlined in [1] that is the default for all "trainer" instances, which are responsible for instantiating DataLoaders in 99% of our training runs.
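A side note on why the snippet often quoted in this thread (`torch.initial_seed() // 2*32 + id`) is broken: `// 2*32` parses as `(seed // 2) * 32`, not `seed mod 2^32`, which is what NumPy's 32-bit seed limit actually requires. A minimal sketch of the difference:

```python
# Operator precedence pitfall: "seed // 2*32" means (seed // 2) * 32,
# not seed mod 2**32 -- and NumPy's legacy seed must be < 2**32.
seed = 2**63 - 1  # a typical 64-bit value like torch.initial_seed() returns

wrong = seed // 2 * 32   # still a huge number, far above 2**32
right = seed % 2**32     # always a valid NumPy seed in [0, 2**32)

print(wrong > 2**32)  # True: np.random.seed(wrong) would raise ValueError
print(right < 2**32)  # True
```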
Also, as an aside, Holy Clickbaity title Batman! Maybe I should have blogged about this 2 years ago. Heck, every 6 months or so, I think that, and then I realize that I'd rather spend time with my kids and on my hobbies when I'm not working on interesting ML stuff and/or coding. An added side benefit is not having to worry about making idiotic clickbaity titles like this to farm karma, or provide high-quality unpaid labor for Medium in order for my efforts to be actually seen by people. But it could also just be that I'm lazy :-)
torch.utils.data.get_worker_info().seed
So I guess something like the below (untested!) could work too:

`worker_init_fn=lambda id: np.random.seed(torch.utils.data.get_worker_info().seed % 2**32)`

(The modulo is needed because the worker seed is 64-bit while NumPy's legacy seed must fit in 32 bits.) It’s a little annoying to have to set and pass RNG state explicitly, but on the plus side you never hit these sorts of issues. Your code will also be completely reproducible, without any chance of spooky “action at a distance.” Once you’ve been burned by this a few times, you’ll never go back.
You might think that explicitly seeding the global RNG would solve reproducibility issues, but it really doesn’t. If you call into any code you didn’t write, it might also be using the same global RNG.
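As a sketch of what "passing RNG state explicitly" looks like with NumPy's `Generator` API (the `augment` function here is a hypothetical example, not from the article):

```python
import numpy as np

def augment(batch, rng):
    # Hypothetical augmentation that takes its RNG explicitly instead of
    # reaching into np.random's global state.
    return batch + rng.normal(size=len(batch))

out1 = augment(np.zeros(3), np.random.default_rng(1234))

# Library code reseeding or consuming the global RNG can no longer
# perturb our results:
np.random.seed(0)
np.random.random()

out2 = augment(np.zeros(3), np.random.default_rng(1234))
print((out1 == out2).all())  # True: the private stream is isolated
```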
The post just stresses that one should be careful when mixing random state and multiprocessing: either reseed after forking, or use a multiprocess/multithread-aware RNG API.
> I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem. It’s inside PyTorch’s official tutorial, OpenAI’s code, NVIDIA’s projects, etc. [1]
I know this bc I fixed the bug. And probably caused it. Hehe.
Also you don't just want to set your NumPy seed but also the native Python one and the torch one.
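As a sketch of what that looks like (the helper name and `base_seed` plumbing are illustrative; in a real DataLoader you'd wire this up as `worker_init_fn` and derive the base seed from `torch.initial_seed()`):

```python
import random
import numpy as np

def seed_all_rngs(worker_id, base_seed):
    # Derive one per-worker seed and push it into every RNG library your
    # transforms might touch, not just NumPy.
    seed = (base_seed + worker_id) % 2**32  # NumPy's legacy seed must be < 2**32
    random.seed(seed)      # Python's built-in RNG
    np.random.seed(seed)   # NumPy's legacy global RNG
    # torch.manual_seed(seed)  # plus the torch RNG when torch is in play

seed_all_rngs(0, 42)
a = (random.random(), np.random.random())
seed_all_rngs(1, 42)
b = (random.random(), np.random.random())
seed_all_rngs(0, 42)
c = (random.random(), np.random.random())
print(a == c, a != b)  # reproducible per worker, distinct across workers
```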
Coincidentally I find this article timely, as I was recently reviewing the PyTorch DataLoader docs regarding random number generator seeding. It’s the kind of thing unit tests don’t pick up, since it only occurs when you use separate worker processes.
The issue here is a little more subtle: if you fork 10 copies of your Python process, all 10 inherit the current RNG state, and will thereafter produce identical random number sequences. If you were manually forking, you might guess that was a potential problem, and re-seed the RNGs after forking. But PyTorch's data loaders fork a bunch of processes to do things in parallel, so users might not realize that they're using duplicate copies of their RNG state.
Python multiprocessing doesn’t use fork on Windows. It starts a new process and so shouldn’t be affected by this.
So to trigger this you need to have num_workers > 0 on your DataLoader and be running on a non-Windows platform.
1) this will help reproducibility a great deal, which is a pain so often.
2) forcing users to actually understand RNG seeding from the time they are novice programmers could help prevent bugs of the sort seen in this post, which I believe stem from having too much faith that RNGs will simply work out of the box as substitutes for ‘real’ random variables.
But you also need to watch out for certain known bad seeds, which destroy a PRNG's statistical properties. Most PRNGs have a few known bad seeds, yet hardly anyone guards against them. Same for hash functions.
I'm looking at some code that uses random.random() to randomly apply augmentations, I suspect that will have the same issue right?
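Yes — `random.random()` draws from Python's global Mersenne Twister, which is forked along with everything else. (Whether PyTorch reseeds Python's `random` in workers depends on the version, so don't rely on it.) You can simulate the inherited-state effect without forking at all, using `getstate`/`setstate`:

```python
import random

random.seed(0)
state = random.getstate()  # the state every forked child would inherit

child_a = random.Random()
child_a.setstate(state)
child_b = random.Random()
child_b.setstate(state)

# Both "workers" make identical augmentation decisions:
draws_a = [child_a.random() for _ in range(5)]
draws_b = [child_b.random() for _ in range(5)]
print(draws_a == draws_b)  # True
```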
For splitting up sequential ranges, a good rng typically has an advance function to advance the seed for each range. So you can get reproducibility.
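NumPy's newer `Generator` API has exactly this: `PCG64` supports `.advance()` and `.jumped()`, and `SeedSequence.spawn()` derives independent, reproducible child streams from one root seed. A sketch:

```python
import numpy as np

# One independent, reproducible stream per worker, derived from a root seed.
root = np.random.SeedSequence(12345)
rngs = [np.random.default_rng(s) for s in root.spawn(4)]
draws = [rng.random() for rng in rngs]
print(draws)  # four distinct values, fully determined by the root seed

# PCG64 can also deterministically skip ahead within a single stream:
bg = np.random.PCG64(0)
bg.advance(1_000_000)  # jump the state forward a million steps, in O(log n)
```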
A bit like Stockholm syndrome - "Python doesn't do threading" is so ingrained in its users (and I'm a user) minds that it's not even questioned as a potential source of problems.
(No one said it's easy to do. That's why language developers and implementers are a special breed even today.)
1) This is an issue from 2018 (https://github.com/pytorch/pytorch/issues/5059), which links to the closed numpy issue (https://github.com/numpy/numpy/issues/9248) which just says: seed your random numbers folk.
2) The documentation in pytorch covers this (https://pytorch.org/docs/stable/data.html#randomness-in-mult...), but it's not really highlighted specifically in, eg. tutorials. (but it is in the FAQ https://pytorch.org/docs/stable/notes/faq.html#dataloader-wo...)
3) It doesn't affect windows, which uses spawn instead of fork.
4) To quote the author:
> I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem.
^ No actual stats, just some vague hand waving; this just seems like nonsense.
So, I suppose... there's some truth to it being a documentation issue, but I guess the title + (1-3) kind of say to me: OP thought they discovered something significant... turns out, they didn't.
Oh well, spin it into some page views.
i had exactly the same thought - if they'd actually crawled github they'd have some nice plots to back up the claim.
Title seems pretty accurate to me!