That's a really brave thing to do, and it deserves serious credit.
A note for people outside the academic machine learning field: NIPS is widely believed to be on a different level from the rest. During my PhD, my advisor used to say that a NIPS paper was on par, resume-wise, with a paper published in a good journal. The difference is especially striking if you've had the chance to attend other conferences (including, alas, IEEE-sponsored events), which are, with very few exceptions, fairly terrible from a scientific point of view.
You have limited speaking slots. You have to guess what the conference attendees will find interesting this year. You are biased by your own particular interests.
Also, some of the submitters are your friends or colleagues, and even if they haven't already told you what they're submitting (unlikely, since your relationship is based on talking about this stuff), you can tell a paper is theirs in less than 250 words...
However hard a conference tries to sell the fairness and objectivity of its process, you can't anonymize or double-blind these things away.
These are really great observations with deep implications. The same pattern might apply in other aspects of life, such as interviewing candidates, choosing a mate, or buying a shirt. In all these cases, a similar distribution might be at work.
I have often wondered why it is so hard to have less mediocrity in the world. Why isn't every book, t-shirt, or smartphone just great? One obvious reason is that a lot of the time people create something out of obligation, such as a demand from their job, rather than out of an urge to create. The follow-up question, then, is: if no one had any obligation to create, could the distribution above flip? In that scenario, would we have, say, 70% great papers, 5% mediocre ones, and the rest a coin toss?
Different people have different ideas of what "great" means. Not everyone thinks the Harry Potter books are great, while many do. We see the same thing in movies, where a film does poorly at the box office while the critics praise it.
The definition of greatness also changes over time: "It's a Wonderful Life", now considered one of the most critically acclaimed films ever made, earned only mediocre box-office revenue when it came out.
Greatness is sometimes situational: "Dan Brown ... is the undisputed king of airplane books — the not-too-heavy, not-too-long potboilers perfect for a long layover." If you don't fly, then perhaps there's no occasion when Brown's works would appeal.
Travel has its own category of "good enough." Visiting Germany once, I bought a book from the limited English selection not because it was great, but because it was something to read on the long train ride.
A lot of people watch sports, but surely not every game is great, so greatness can't be the only thing that holds someone's interest.
Since it's hard to predict greatness, people will test out ideas to see if there's a response. Sometimes this can lead to feedback and improvements. Sometimes this testing is through writing clubs. Sometimes (as with smartphone apps) this is with the market itself.
My favourite peer review story is from when I submitted one of my articles to the top journal in my field at the time (Applied and Environmental Microbiology). It came back with the usual trivial peer-review changes ("cite this irrelevant paper of mine", etc.), which I made (that's nearly always easier than arguing with the reviewers). The editor then made a mistake: instead of sending the updated manuscript back to the original reviewers, they sent it out to a new set. The funny part of the whole exercise was that the second set of reviewers called the first set idiots and told me to change everything back.
I'm sure more than a few people won't have any idea what "NIPS" stands for. (It's the Conference on Neural Information Processing Systems.)
From the point of view of an author, if you get a paper rejected that you know is worthwhile, you just have to make whatever improvements you can and then submit it again.
SIGMOD made an interesting move this year by accepting every paper that met its standards. However, not every accepted paper gets a presentation slot at the conference.
Of course, this assumes an objective standard for what constitutes a "good" paper. As others have pointed out, the only really meaningful standard is "how does this paper compare to other work being done in this field?" So it's also reasonable to think of NIPS's goal as simply presenting the best papers written in any given year, not as bestowing a strictly defined stamp of objective quality.
It could be that they had a first pass at a schedule, used that to set a first cut for the reviewers, then adjusted the schedule once they figured they needed to add another 42 papers.
Also, not being accepted does not mean a paper is poor. They used a ranking system, so it only means that other papers appeared to be better.
I've heard from lots of professors that a good conference gets a lot of "very good but not great" submissions, and the job of the program committee is to pick the best among these. I wouldn't be surprised at all if minor personal preferences (which from the outside look rather random) ended up having a big say in the fate of a particular paper. Maybe some reviewers are more forgiving of poorly written but technically strong papers, maybe some reviewers consider certain fields "dead" and so are biased against them, and reviewers hold wildly different standards on how extensive an experimental analysis has to be to be acceptable, ...
It doesn't take a "week off" to notice that a paper is gibberish, at the very least.