Agreed. I ran a few tests and similarly observed that threats didn't outperform other types of "incentives". I think it might be some sort of urban legend in the community.
Or the effect of these prompts might vary wildly depending on the model, in which case any study you do is basically useless for the near future, since the models keep evolving.