Monday, September 8, 2014

Experiments Hurt the Review Process

Experimentally validating (or falsifying) a theory is one of the fundamental aspects of science. As such I have always put a lot of emphasis on experiments, and of course empirical evidence is essential when designing systems. To pick a simple example: merge sort has better worst-case asymptotic behavior than quick sort (O(n log n) versus O(n²)), but in practice quick sort is usually faster. Wall-clock time is what matters.
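To make the wall-clock point concrete, here is a toy sketch (the post itself contains no code; the implementations and names below are mine) that times naive versions of both sorts on random data. In plain Python the constant factors differ greatly from tuned C implementations, so the winner may vary from run to run and machine to machine — which, in a way, is exactly the point about experiments.

```python
import random
import timeit

def merge_sort(a):
    # O(n log n) worst case, but allocates temporary lists while merging.
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def quick_sort(a):
    # O(n^2) worst case with unlucky pivots, but typically fast in
    # practice due to low overhead per element.
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quick_sort(less) + equal + quick_sort(greater)

if __name__ == "__main__":
    data = [random.randrange(10**6) for _ in range(10**4)]
    for name, fn in [("merge sort", merge_sort), ("quick sort", quick_sort)]:
        t = timeit.timeit(lambda: fn(data), number=5)
        print(f"{name}: {t:.3f}s")
```

Which algorithm wins here depends on the input distribution, the interpreter, and the machine — the asymptotic analysis alone would not have told you.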

However, while all that is true in general, in practice experiments can be quite harmful, particularly in the context of peer-reviewed papers. In the rest of this post I will try to illustrate why that is the case and what could potentially be done about it.
It took me a while to realize that experiments hurt the review process, but real papers that I have seen, and in particular input from my colleague Alfons Kemper, have convinced me that experiments in papers are a problem. Alfons suggested simply ignoring the experimental section of papers, which I find a bit extreme, but he has a point.

The first problem is that virtually all papers "lie by omission" in their evaluation section. The authors will include experiments where their approach behaved well, but they will not show results where it has problems. In Computer Science we consider that perfectly normal behavior; in fields like pharmacology people would go to jail for it.

Furthermore, this behavior interacts badly with the review process. Of course the reviewers know that they are shown only the good cases, so these cases have to be really good. If someone writes a paper about an approach that is nicer and cleaner than a previous one, but is 20% slower in experiments, they will have a hard time publishing it (even though the 20% might well stem from small implementation differences). Which is a shame. Reviewers are often too keen on good experimental results.

The easiest way to kill a paper is to state that the experiments are not conclusive enough. The authors will have a hard time arguing against that, as there is always another experiment that sounds plausible and important and on which the reviewer can insist. Authors therefore have an incentive to make their experiments slick and impressive, in the hope of guiding the reviewer in other directions.

Which brings me to the core of the problem with experiments in papers: there have been papers where the thin line between careful experimental setup and outright cheating has been touched, perhaps even crossed. And that is really bad for science. Now one can claim that this is all the fault of the authors, and that they should be tarred and feathered, etc., but that is too short-sighted. Of course it is the fault of the authors. But the current system is rigged in a way that makes it very, very attractive to massage experimental results. I once heard the statement that "the reviewers expect good numbers". And unfortunately, that is true.
And this affects not only the highly problematic cases. Even papers that stay "legal" and "just" create good-looking results through a very careful choice of experimental setup are quite harmful: in the best case we learn little, in the worst case we are misled.

Now what can we do about this problem? I don't really know, but I will propose a few alternatives. One extreme would be to largely ignore the experimental part of a paper and evaluate it purely based upon the ideas presented within. That is not the worst thing to do, and arguably it would be an improvement over the current system, but if we ignore actual experiments, the quick sort mentioned above might have had a hard time against the older merge sort.

The other extreme would be what the SIGMOD Repeatability Effort tried to achieve, namely that all experiments are validated. This validation should happen during reviewing (SIGMOD did it only after the fact): the reviewer would repeat the experiments, try out different settings, and fully understand and validate the pros and cons of the proposed approach.
In an ideal world that might actually be the best approach, but unfortunately it is not going to happen. First, authors will claim IP problems and all kinds of other excuses why their approach cannot be validated externally. Second, and more fundamentally, reviewers simply do not have the time to spend days repeating and validating experiments for each paper they review.

So what could a compromise look like? Perhaps a good mode would be to review papers primarily based upon the ideas presented therein and take only a cursory look at the experiments. The evaluation should look plausible, but that is it; it should not have much impact on the review decision. In particular, authors should not be expected to produce another miracle result: reporting honest numbers is preferable to a new performance record.
If the authors want to, they could optionally submit a repeatability package (including binaries, scripts, documentation, etc.) together with their paper, and that would give them bonus points during reviewing, in particular for performance numbers, since the reviewers could then verify the experiments if they want. There is no guarantee that they will, and papers should still be ranked primarily based upon ideas, but this would allow for more reasonable experimental results.

Experiments are great in principle, but in the competitive review process they have unintended consequences. In the long run, we have to do something about that.