A Kuhnian gap between research publishing and academic success

There is a gap in research publishing and how it relates to academic success. On the one hand, there are scientists complaining of low funds, short-staffing, low-quality or absent equipment, suboptimal employment/tenure terms, bureaucratic incompetence and political interference. On the other, there are scientists who describe their success within academia in terms of being published in XYZ journals (with impact factors of PQR), having high h-indices, having so many papers to their names, etc.

These two scenarios – both very real in India and I imagine in most countries – don’t straightforwardly lead from one to the other. They require a bridge, a systemic symptom that makes both of them possible even when they’re incompatible with each other. This bridge is those scientists’ attitudes about what it’s okay to do in order to keep the two façades in harmonious coexistence.

What is it okay to do? For starters, keep the research-publishing machinery running in a way that allows them to evaluate other scientists on matters other than their scientific work. This way, lack of resources for research can be decoupled from scientists’ output in journals. Clever, right?

According to a study published a month ago, manuscripts that include a Nobel laureate’s name among the coauthors are six times more likely to be accepted for publication than those without a laureate’s name in the byline. This finding piles on to other problems with peer-review, including gender-related ones: women’s papers being accepted less often and men dominating the community of peer-reviewers. Nature News reported:

Knowledge of the high status of a paper’s author might justifiably influence a reviewer’s opinion: it could boost their willingness to accept a counterintuitive result, for example, on the basis of the author’s track record of rigour. But Palan’s study found that reviewers’ opinions changed across all six of the measures they were asked about, including the subject’s worthiness, the novelty of the information and whether the conclusions were supported. These things should not all be affected by knowledge of authorship, [Palan, one of the paper’s coauthors, said].

Palan also said the solution to this problem is for journals to adopt double-anonymised peer-review: the authors don’t know who the reviewers are and the reviewers don’t know who the authors are. The most common form of peer-review is the single-blind variety, where the reviewers know who the authors are but the authors don’t know who the reviewers are. FWIW, I prefer double-anonymised peer-review plus the journal publishing the peer-reviewers’ anonymised reports along with the paper.

Then again, modifying peer-review would still be localised to journals that are willing to adopt newer mechanisms, and thus be a stop-gap solution that doesn’t address the use of faulty peer-review mechanisms both inside journals and in academic settings. For example, given the resource-minimal context in which many Indian research institutes and universities function, hiring and promotion committees often decide whom to hire or promote based on which journals their papers have been published in and/or the number of times those papers have been cited.

What we need instead is systemic change that responds to all the problems with peer-review together, rather than one problem at a time in piecemeal fashion, by improving transparency, resources and incentives. Specifically: a) make peer-review more transparent, b) give scientists the resources – including time and freedom – to evaluate each other’s work on factors localised to the context of their research (including the quality of their work and the challenges in their way), and c) incentivise scientists to do so in order to accelerate change and ensure compliance.

The scientometric numbers, originally invented to facilitate the large-scale computational analysis of the scientific literature, have come to subsume the purpose of the scientific enterprise itself: that is, scientists often want to have good numbers instead of wanting to do good science. As a result, there is often an unusual delay – akin to magnetic hysteresis – between the resources for research being cut back and the resulting drop in productivity and quality showing in the researchers’ output. Perhaps more fittingly, it’s a Kuhnian response to paradigm change.

A conference’s peer-review was found to be sort of random, but whose fault is it?

It’s not a good time for peer-review. Sure, if you’ve been a regular reader of Retraction Watch, it’s never been a good time for peer-review. But aside from that, the process has increasingly been bearing the brunt of the blame for not being able to stem the publishing of results that – after publication – have been found to be the product of bad research practices.

The problem may be that the reviewers are letting the ‘bad’ papers through but the bigger issue is that, while the system itself has been shown to have many flaws – not excluding personal biases – journals rely on the reviewers and naught else to stamp accepted papers with their approval. And some of those stamps, especially from Nature or Science, are weighty indeed. Now add to this muddle the NIPS wrangle, where researchers may have found that some peer-reviews are just arbitrary.

NIPS stands for the Neural Information Processing Systems (Foundation), whose annual conference was held in the second week of December 2014, in Montreal. It’s considered one of the few main conferences in the field of machine-learning. Around that time, two attendees – Corinna Cortes and Neil Lawrence – performed an experiment to judge how arbitrary the conference’s peer-review could get.

Their modus operandi was simple. All the papers submitted to the conference were peer-reviewed before they were accepted. Cortes and Lawrence then routed a tenth of all submitted papers through a second peer-review stage, and observed which papers were accepted or rejected in the second stage. (According to Eric Price, NIPS ultimately accepted a paper if either group of reviewers accepted it.) Their findings were distressing.

About 57%* of all papers accepted in the first review were rejected during the second review. To be sure, each stage of the review was presumably equally competent – it wasn’t as if the second stage was more stringent than the first. That said, 57% is a very big number. More than five times out of 10, peer-reviewers disagreed on what could be published. In other words, in an alternate universe, the same conference with only the second group of reviewers in place would have generated different knowledge.

Lawrence was also able to eliminate a possibly redeeming confounding factor, which he described in a Facebook discussion on this experiment:

… we had a look through the split decisions and didn’t find an example where the reject decision had found a ‘critical error’ that was missed by the accept. It seems that there is quite a lot of subjectivity in these things, which I suppose isn’t that surprising.

It doesn’t bode well that the NIPS conference is held in some esteem among its attendees for having one of the better reviewing processes. Including the 90% of the papers that did not go through a second peer-review, the total predetermined acceptance rate was 22%, i.e. reviewers were tasked with accepting 22 papers out of every 100 submitted. Put another way, the reviewers were rejecting 78%. And this sheds light on the more troubling perspective of their actions.

If the reviewers had been rejecting papers at random, they would’ve done so at the tasked rate of 78%. At NIPS, one can only hope that they weren’t – which would mean the second group was purposefully rejecting 57% of the papers that the first group had accepted. In an absolutely non-random, logical world, this number should have been 0%. So the fact that 57% is closer to 78% than to 0% implies some of the rejection was random. Hmm.
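To make the two reference points concrete, here is a rough simulation sketch. It is not how the NIPS experiment was run: the pool of 100,000 papers is hypothetical (picked only to keep the noise small), the function and variable names are made up for illustration, and the 'random' reviewers are modelled as independent coin flips weighted to the 22% acceptance rate quoted above.

```python
import random

ACCEPT_RATE = 0.22  # acceptance rate quoted in the text


def rejected_after_accept(first, second):
    """Share of papers accepted by `first` that `second` rejected."""
    accepted = [i for i, ok in enumerate(first) if ok]
    return sum(1 for i in accepted if not second[i]) / len(accepted)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_papers = 100_000      # hypothetical pool, far larger than the real one

# Each group accepts a paper with probability 22%, ignoring its content.
group1 = [rng.random() < ACCEPT_RATE for _ in range(n_papers)]
random_group2 = [rng.random() < ACCEPT_RATE for _ in range(n_papers)]

# Purely random second group: lands near the 78% rejection rate.
random_rate = rejected_after_accept(group1, random_group2)
# Second group that agrees with the first on every paper: exactly 0.
consistent_rate = rejected_after_accept(group1, group1)

print(f"random second group:    {random_rate:.2f}")
print(f"identical second group: {consistent_rate:.2f}")
```

The observed 57% sits between these two ends, which is all the comparison in the paragraph above relies on.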

While this is definitely cause for concern, forging ahead on the basis of arbitrariness – which machine-learning theorist John Langford defines as the probability that the second group rejects a paper that the first group has accepted – wouldn’t be the right way to go about it. This is similar to the case with A/B-testing: we have a test whose outcome can be used to inform our consequent actions, but using the test itself as a basis for the solution wouldn’t be right. For example, the arbitrariness can be reduced to 0% simply by having both groups accept every nth paper – a meaningless exercise.
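As a small sketch of why a 0% score can be meaningless, the snippet below computes arbitrariness as described above (the share of group 1's accepted papers that group 2 rejects) for two groups that both mechanically accept every nth paper. The function names, the choice of n = 5 and the 166-paper pool are illustrative assumptions, not anything taken from Langford's post.

```python
def arbitrariness(first, second):
    """Share of papers accepted by `first` that `second` rejected."""
    accepted = [i for i, ok in enumerate(first) if ok]
    if not accepted:
        raise ValueError("the first group accepted no papers")
    return sum(1 for i in accepted if not second[i]) / len(accepted)


def accept_every_nth(num_papers, n):
    """Accept papers n, 2n, 3n, ... regardless of what they say."""
    return [(i + 1) % n == 0 for i in range(num_papers)]


# Both groups apply the same content-blind rule to the same submission order.
group1 = accept_every_nth(166, 5)
group2 = accept_every_nth(166, 5)

score = arbitrariness(group1, group2)
accepted_share = sum(group1) / len(group1)

print(f"arbitrariness:  {score:.0%}")
print(f"accepted share: {accepted_share:.0%}")
```

The two groups agree perfectly, yet neither has read a single paper, which is why the score can't double as the thing to optimise.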

Is our goal to reduce the arbitrariness to 0% at all? You’d say ‘Yes’, but consider the volume of papers being submitted to important conferences like NIPS and the number of reviewer-hours available to evaluate them. In the history of conferences, surely some judgments must have been arbitrary for the reviewer to have fulfilled his/her responsibilities to his/her employer. So you see the bigger issue: it’s not so much the reviewer alone as it’s also the so-called system that’s flawed.

Langford’s piece raises a similarly confounding topic:

Perhaps this means that NIPS is a very broad conference with substantial disagreement by reviewers (and attendees) about what is important? Maybe. This even seems plausible to me, given anecdotal personal experience. Perhaps small highly-focused conferences have a smaller arbitrariness?

Problems like these are necessarily difficult to solve because of the number of players involved. In fact, it wouldn’t be entirely surprising if we found that no person or institution was at fault, only the way they were all interacting with each other, and not just in fields like machine-learning. A study conducted in January 2015 found that minor biases during peer-review could result in massive changes in funding outcomes if the acceptance rate was low – such as with the annual awarding of grants by the National Institutes of Health. Even Nature is wary about the ability of its double-blind peer-review to solve the problems ailing normal ‘peer-review’.

Perhaps for the near future, the only takeaway is that ambitious young scientists will have to remember, first, that acceptance – just as much as rejection – can be arbitrary and, second, that the impact factor isn’t everything. On the other hand, it doesn’t seem possible in the interim to keep from lowering our expectations of peer-reviewing itself.

*The number of papers routed to the second group after the first was 166. The overall disagreement rate was 26%, so they would have disagreed on the fates of 43. And because they were tasked with accepting 22% – which is 37 or 38 – group 1 could be said to have accepted 21 that group 2 rejected, and group 2 could be said to have accepted 22 that group 1 rejected. Between 21/37 (56.8%) and 22/38 (57.9%) is 57%.
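For anyone who wants to retrace the footnote, the arithmetic can be sketched as below. Every input is taken from the footnote itself, including its estimate of 37 or 38 accepted papers per group; the variable names and the even split of the disagreements are illustrative assumptions.

```python
n_papers = 166            # papers reviewed by both groups
disagreement_rate = 0.26  # share of papers on which the groups disagreed
accepted_low, accepted_high = 37, 38  # footnote's estimate of accepts per group

# 166 * 0.26 = 43.16, i.e. about 43 papers with split decisions.
disagreements = round(n_papers * disagreement_rate)

# Split the disagreements as evenly as possible between the two directions.
group1_only = disagreements // 2           # accepted by group 1, rejected by group 2
group2_only = disagreements - group1_only  # accepted by group 2, rejected by group 1

low = group1_only / accepted_low
high = group2_only / accepted_high

print(f"split decisions: {disagreements} ({group1_only} + {group2_only})")
print(f"rejected-after-accept share: between {low:.1%} and {high:.1%}")
```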

Hat-tip: Akshat Rathi.