I think the situation would improve with better teaching of philosophy of science and statistics (this would educate better reviewers too).
On the other hand, you might expect that new discoveries by nature come with less data, since data is likely more expensive for brand-new research, and by extension are less likely to meet these sorts of stringent statistical requirements. Decreasing the p-value threshold may be counter-productive if we dismiss legitimate new discoveries due to essentially economic constraints on data gathering. That would make it harder to get funding to pursue the problem in more depth, thereby slowing the advance of discoveries.
I could see the reverse happening, where higher p-value standards lead to normalization of deviance in the form of worse p-hacking.
This is a classic tradeoff between exploration and exploitation in active learning.
If your view of the world is that there are only a very few hypotheses worth exploring, and you have a good lay of the scientific land, then requiring a higher bar of proof is probably good.
If it's a new field that's extremely complex and where very little is known of the governing principles, then requiring very high stats could severely slow progress and waste lots of research dollars.
I completely agree that rather than setting arbitrary barriers for significance, it would seem much better to let people actually understand what was found, at whatever significance it was. Even setting up the null model to get a p-value requires tons of assumptions. The better test is reproducibility and predictive models that can be validated or invalidated. That's where the science is, and not in the p.
I am not at all in favor of this proposal, but one thing it may do is stem the tidal wave of misinformation.
The very practical concern is that entire areas of research have been based on studies replicated and backed up entirely through p-hacking and selectively publishing only papers with positive results. This is a proven issue today. See https://en.wikipedia.org/wiki/Replication_crisis for more.
It may be that there is a pendulum that needs to swing a few times to get to a good tradeoff. But it is clear, now, which direction it needs to swing.
As a Psychology student, I can say this is a well-known initiative: https://cos.io/prereg/
(Though I can't confirm or deny its widespread usage.)
The publication bias is harder, and pre-registration won't solve this. But I think this is a separate issue, and it's important to address each issue in its own right.
I've seen the proposal from TFA before and with my very limited knowledge, I'm still fairly certain it will never come to pass in Psychology, as nearly half of all modern studies have reproducibility issues (!). It would be beneficial to our field, in the way that a band-aid is beneficial to a gaping wound, but it would require a lot more rigor than has evidently been displayed so far (and more rigor is more work, and time is limited).
So... Don't hold your breath.
(Sorry if my comment sounds pessimistic, I don't know much and I'm open to being corrected. I still have enough critical thought to be skeptical of some researchers' dedication to intellectual rigor.)
This is necessary, but not sufficient. What's needed is a way to know for sure that the hypothesis was not changed after data collection. I think predeclaring the hypothesis is the way to go.
Not that education can fix all of these (you can't prevent evil), but if reviewers, journals, and conferences started to accept more negative results, the incentive to lie would quickly decrease. And people would probably start to "disprove" interesting theories, instead of trying to "prove" niche results...
http://jaoa.org/article.aspx?articleid=2517494
For example, here the sample size is huge: the USA population shows a significantly increased risk, while the EU population does not. Mixing the two together would result in a smaller but still significant increased risk.
Given the size, it's quite clear that the USA population has many other confounding factors that cannot be eliminated by mathematics alone (there is no control).
Is this along the lines of what you were hoping to find? Here's more: http://andrewgelman.com/2016/12/13/bayesian-statistics-whats...
You can take a look at the three axioms people use to justify statistics. If you are willing to accept them, all else that relies on them (without using new axioms) must be true:
https://en.wikipedia.org/wiki/Probability_axioms
This same logic is used to justify development in pure mathematics: choose a set of axioms which you accept as ground truths, and prove things using them. As long as you are unable to prove your axioms are contradictory, and the axiom choice seems acceptable, then the work that you've done (with respect to them) is philosophically justified.
You mean you aren't happy with...probability?
This proposal is a great pragmatic step forward. Like they say in the paper, it doesn't solve all problems, but it would be an improvement with reasonable cost and tremendous benefits.
>Such an increase means that fewer studies can be conducted using current experimental designs and budgets. But Fig. 2 shows the benefit: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises.
So this proposal is really the opposite of pragmatic. Pragmatic would be requiring effect size estimates and confidence intervals in all published papers. It is surprising how many papers will talk about highly significant effects without actually discussing how large the estimated effect is thought to be, which gives authors a lot of leeway when exaggerating the importance of their findings.
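For what it's worth, the quoted "factors greater than two" claim is easy to sanity-check with a back-of-the-envelope calculation. This is only a sketch: the 1:10 prior odds and the 80% power are my assumptions, not numbers from the thread, and `false_positive_risk` is just an illustrative helper name.

```python
def false_positive_risk(alpha, power, prior_prob):
    """Fraction of 'significant' results that come from true nulls.

    alpha      -- significance threshold (false positive rate per true null)
    power      -- probability of detecting a real effect
    prior_prob -- assumed share of tested hypotheses that are actually true
    """
    false_positives = alpha * (1.0 - prior_prob)
    true_positives = power * prior_prob
    return false_positives / (false_positives + true_positives)


# Assumptions for illustration only: 1:10 prior odds and 80% power,
# with power held constant (i.e. sample size grows to compensate).
PRIOR_PROB = 1 / 11
POWER = 0.8

for alpha in (0.05, 0.005):
    risk = false_positive_risk(alpha, POWER, PRIOR_PROB)
    print(f"alpha={alpha}: false positive risk = {risk:.3f}")
```

Under those assumptions the share of significant findings that are false drops by well over a factor of two. That says nothing about effect sizes, though, which is the point being made above.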
1) The p-value filter leads to publication bias.
-You should publish your results anyway, or the study wasn't designed/performed correctly. The raw data and description of methods should be valuable.
2) The null hypothesis is (almost) always false anyway.
-Everything in bio/psych/etc has a real (not spurious) non-zero correlation with everything else, so the significance level just determines how much data needs to be collected to reject it.
3) Rejection or not of the null hypothesis does not indicate whether the theory/explanation of interest is correct, so is inappropriate for deciding whether a result is interesting to begin with.
-Usually the null hypothesis is very precise and the "alternative statistical hypothesis" that maps to the research hypothesis is very vague, so many alternative research hypotheses may explain the results.
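To make point 2 concrete, here is a rough sketch of how the threshold translates into "how much data do I need". It uses the standard Fisher z approximation for testing a zero correlation; the effect sizes and the 80% power are hypothetical values I picked for illustration.

```python
import math
from statistics import NormalDist


def n_to_detect_correlation(r, alpha, power=0.8):
    """Approximate sample size for a two-sided test of rho = 0
    to detect a true correlation r with the given power
    (Fisher z approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


# Hypothetical small-but-real correlations, at both thresholds.
for r in (0.1, 0.05, 0.01):
    for alpha in (0.05, 0.005):
        n = n_to_detect_correlation(r, alpha)
        print(f"r={r}, alpha={alpha}: n ~ {n}")
```

If the true correlation is non-zero, however small, some finite n rejects the null at either threshold. The stricter threshold only raises the price.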
Imagine psychology... done properly. The beginning of a science.
(I realize that "beginning" is too harsh, but psychology does have very serious problems with replicability. At the moment, it deserves its tarnished reputation.)
Of course, this system only worked because academia was a bastion of the male WASP elites that didn't have much pretense of serving the broader public. But at least you didn't have the torrent of mediocre papers that you see today.
Have things really changed? I suspect there are fewer males, but any job that demands 20 years of full-time concerted effort is likely to be dominated by men. Similarly, the western world is overwhelmingly caucasian, so again... the best predictor (now as then) is that white male professors will be represented disproportionately.
> at least you didn't have the torrent of mediocre papers that you see today.
That certainly is true. Stats for the humanities and social sciences are that 80% of the papers have zero citations. i.e. they have no contribution to the greater body of human work.
In Physics (my background), most papers have 2-3 citations, and only a small percentage have 1 or fewer.
I would say that if a discipline is dominated by uncited papers, then that discipline is probably a waste of time. And the professors who work in it are a net drain on society.
> In Physics (my background), most papers have 2-3 citations, and only a small percentage have 1 or fewer
Does that account for self-citations?
But then you get the problem of selecting those people without easily measured objective indicators. That's why it worked reasonably well when those were slightly low paying jobs restricted to a caste.
Similarly, the technical solution involves technology that does not require drivers and has no risk of human error anymore. The pragmatic solution is to just limit the acceptable speeds.
Edit: or rather, it would limit false positives that show up as a result of accidental p-value hacking, if not the process itself.
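A quick sketch of what "accidental" p-hacking looks like numerically, assuming you run a number of independent tests where every null is actually true (the independence and the choice of 20 tests are my assumptions):

```python
def chance_of_any_false_positive(alpha, n_tests):
    """Probability that at least one of n_tests independent tests of
    true null hypotheses comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# 20 independent looks at pure noise, at both thresholds.
for alpha in (0.05, 0.005):
    p_any = chance_of_any_false_positive(alpha, 20)
    print(f"alpha={alpha}: P(at least one false positive) = {p_any:.3f}")
```

With 20 tests this comes out to roughly 64% at 0.05 versus under 10% at 0.005, so the stricter threshold blunts the accident without stopping anyone who keeps testing until something sticks.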
Any game can be gamed. Bayesian statistics just isn't there yet.
I prefer the PDF version for print outs.
Good science requires a tension between hypothesis generation and skepticism. Perhaps if we rewarded the _debunking_ of findings as much as we do the discovery of findings, things would change.
The funding bodies etc., who want "quantitative" measures of research, look at publications. Why would we expect debunking papers to be published if they are debunking something interesting?
But whether some amount of evidence is "significant" or not is entirely dependent on your prior. If you believe something has about a 50:50 chance of being true to start with, then a factor 20 of evidence is quite enough. Now you believe it 20:1 likely to be true.
But for something like xkcd's "green jelly beans cause cancer", your prior should be something like 1 to 100,000 or even smaller. After all, there are a lot of possible foods and a lot of possible diseases. Unless you believe a significant number of them are dangerous, your prior for any specific food causing any specific disease must be pretty low. And then even a factor 200 of evidence is nowhere near enough to convince me that green jelly beans cause cancer.
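The arithmetic here is just posterior odds = prior odds × evidence factor. A minimal sketch using the numbers from the two paragraphs above (the function names are mine, purely for illustration):

```python
from fractions import Fraction


def posterior_odds(prior_odds, evidence_factor):
    """Bayes' rule in odds form: multiply prior odds by the evidence factor."""
    return prior_odds * evidence_factor


def odds_to_probability(odds):
    """Convert odds in favor (as a ratio) to a probability."""
    return odds / (1 + odds)


# 50:50 prior with a factor 20 of evidence.
coin_flip = posterior_odds(Fraction(1, 1), 20)
print(f"50:50 prior, factor 20 -> odds {coin_flip}:1 in favor")

# Green jelly beans: 1 to 100,000 prior with a factor 200 of evidence.
jelly_beans = posterior_odds(Fraction(1, 100_000), 200)
print(f"1:100,000 prior, factor 200 -> odds {jelly_beans} "
      f"(probability ~ {float(odds_to_probability(jelly_beans)):.4f})")
```

So even a factor 200 only moves the jelly bean claim to 1:500 odds, i.e. still about 500 to 1 against.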
Money saved by using a small sample size is wasted trying to replicate a false positive result and by groups around the world that rely on that false result.
Requiring larger sample sizes would mean fewer experiments are carried out but we will have more confidence in the positive results produced. The outcome is fewer experiments wasted on following up on false positives. None of this requires a change in funding.
For instance, in the field I work in you have to spend days to months waiting for tumors to grow and then go and treat the animals every day for a couple of weeks with an IV drug (weekends too!). That is a lot of work and at the end only tells you one piece of information about the drug: does it slow tumor growth in this one experimental model. It may in fact do that -- and you may get a really great p-value if you increase the number of mice -- but you still need to study the drug's pharmacokinetics, tissue distribution, in vivo mechanism of action (assuming you already know the in vitro mechanism of action). These are not just optional experiments that we require today to publish: this kind of work is essential to presenting a story about a new drug. It's not just about what it does, but how it works and how universalizable it is.
This seems unintuitive and the claim is unreferenced. Can anyone explain why this is the case (if true)?
Sounds scientific, doesn't it?
> [...] we judge to be reasonable.
And tomorrow someone else judges it differently?
Maybe they should not try to redefine significance but simply introduce something called 'well-reproducible' or so.