Some ACM conferences, such as CHI, offer authors an opportunity to flag material misconceptions in reviewers’ perceptions of submitted papers before a final accept/reject decision is rendered. SIGIR is not one of them. Its reviewers are free from any checks on their accuracy from the authors, and, to judge by the reviews of our submission, from the program committee as well.
Consider this: We wrote a paper on a novel IR framework which we believe has the potential to greatly increase the efficacy of interactive Information Retrieval systems. The topic we tackled is (not surprisingly) related to issues we often discuss on this and on the IRGupf blog, including HCIR, Interactive IR, Exploratory Search, and Collaborative Search. In short, these are all areas that could be well served by an algorithmic framework that supports greater interactivity.
So in our paper, we chose to evaluate our framework through experiments that involved relevance feedback. Relevance feedback is a long-studied, well-accepted interaction paradigm: the user runs a query, judges a few documents for relevance, and any relevant documents found along the way are saved or marked and fed back into the system to produce even better results on subsequent queries. Our results showed that the proposed framework is not only more effective than a robust, well-understood baseline, but that the algorithms involved are up to an order of magnitude more efficient than traditional baselines. And speed is of utmost importance to interactive IR systems!
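For readers who want a concrete picture of that loop, here is a minimal, purely illustrative Rocchio-style sketch. It is our generic illustration, not the framework from the paper; the `rocchio_expand` helper, the weights, and the example terms are all assumptions made for the example.

```python
from collections import Counter

def rocchio_expand(query_terms, relevant_docs, alpha=1.0, beta=0.75, top_k=10):
    """Rocchio-style expansion: boost the original query with terms drawn
    from documents the user explicitly judged relevant."""
    expanded = Counter({t: alpha for t in query_terms})
    if relevant_docs:
        centroid = Counter()
        for doc in relevant_docs:        # each doc is a bag of terms
            centroid.update(doc)
        for term, count in centroid.most_common(top_k):
            expanded[term] += beta * count / len(relevant_docs)
    return expanded

# One round of explicit feedback: the user judges a handful of results,
# and the ones judged relevant shape the next query.
query = ["coral", "reef", "bleaching"]
judged_relevant = [["coral", "reef", "temperature", "bleaching", "ocean"],
                   ["reef", "acidification", "bleaching", "coral"]]
print(rocchio_expand(query, judged_relevant))
```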
We received three reviews…
The first review, after summarizing our contribution, read in its entirety:
The paper is well written and both the idea and the experimental part are sound.
This was accompanied by a 4/6 recommendation score. Not much help.
The second review’s worst criticism of the work was that the evaluation was incomplete:
The idea is new & interesting, especially that it can make use of non-text query logs. One drawback of the paper, in this reviewer’s opinion, is incompleteness: why pseudo-relevance-feedback not considered as well, which is easy to do? Asking a user to judge documents until one gets 5 relevant may not be realistic. Even if PRF does not work, paper should present the results. The impact would be small if one requires judged relevant docs.
This criticism is flawed on three counts:
- First, it was flat-out wrong. We were not asking people to find five relevant documents; we were asking them to make five judgments of relevance. This was made very clear in the paper. Furthermore, even if making explicit judgments is difficult, there are many techniques for eliciting implicit (but not pseudo!) judgments of relevance (e.g., see Kelly and Belkin, 2001).
- The second reason for not doing pseudo-relevance feedback is that it is more likely than explicit feedback to introduce noise and topic drift, since it blindly trusts whatever lands at the top of the ranking.
- The third reason for avoiding PRF entirely is that it is unnecessary for interactive systems. Indeed, if a user is reading or saving (marking) documents, i.e., giving explicit judgments of relevance, decades of research have already shown those judgments to be much more effective than pseudo-judgments. (A sketch contrasting the two approaches follows this list.)
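To make the contrast concrete, here is a hedged sketch reusing the illustrative `rocchio_expand` helper above. The ranked document lists and the user’s `judge` callback are assumptions for the example, not the paper’s system: PRF blindly treats the top-ranked documents as relevant, while explicit feedback expands only with what the user actually judged relevant.

```python
def pseudo_relevance_feedback(query_terms, ranked_docs, k=5):
    """PRF: blindly assume the top-k ranked documents are relevant.
    No user in the loop, so any off-topic document in the top k drags
    the expanded query off target (noise and topic drift)."""
    return rocchio_expand(query_terms, ranked_docs[:k])

def explicit_relevance_feedback(query_terms, ranked_docs, judge, k=5):
    """Explicit feedback: collect k judgments from the user and expand
    only with the documents actually judged relevant."""
    shown = ranked_docs[:k]   # five *judgments*, not five relevant documents
    relevant = [doc for doc in shown if judge(doc)]
    return rocchio_expand(query_terms, relevant)
```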
The argument that somehow an evaluation is incomplete or meaningless — or that the impact is small — if it does not involve pseudo-relevance feedback is offensive in its narrow-mindedness. What it reflects, we believe, is the current bias of the field as a whole toward non-interactive web search-like experiences. In the web IR world, the commonly held understanding is that users are too lazy to engage in explicit relevance feedback, or else are engaged in a type of information seeking activity, such as navigation, that does not require any feedback, pseudo-relevant or otherwise. But web information retrieval is not all of information retrieval.
This second reviewer gave us a 3/6 recommendation score.
The third reviewer, while stating that the work is novel, had two main concerns:
- Our chosen query expansion technique (selection and weighting of terms) was not convincing, because many other techniques were possible within our framework.
- We did not evaluate using pseudo-relevance feedback.
For the first point, we intentionally chose the simplest implementation of our framework, both to show its strength and to make the fairest comparison possible. If our naive approach beats a quite reasonable baseline (a 20-30% increase in effectiveness and a 10x speedup in efficiency), that should be enough; it is beyond the scope of a conference paper to exhaustively demonstrate effectiveness for every arbitrary scheme a reviewer might dream up. The naive approach worked. That’s a publishable result.
That brings us again to the second point, which actually sounded like the stronger criticism: pseudo-relevance feedback. Reading between the lines of the reviews, one gets the impression that the reviewers are well versed in traditional web search: they mention log mining (a minor aside in our paper), and they are obsessed with pseudo-relevance feedback. Those of us who were doing IR research before the late 1990s remember a time when intellectual efforts were not judged by standards applicable only to web search. The diversity of approaches, metrics, and applications of that era seems to have been reduced to the bleak outlines of precision-oriented, page-at-a-time results lists, where interactivity is looked upon as a burden rather than an opportunity.
It is ironic, then, to note that just two days ago, on the same day that our rejection reviews arrived, Google rolled out an interface that allows people to make explicit relevance judgments through bookmarking, which Google’s algorithms then use as a form of relevance feedback! The old web maxim of users being too lazy or unwilling or unengaged to mark documents for relevance — thus necessitating pseudo-relevance feedback at the expense of real relevance feedback — was busted by a major web search engine!
The third reviewer’s score was 4/6.
So what? What are we going to do about it?
Given the discussion on Twitter and in e-mail in the aftermath of this round of rejection decisions, I think it is safe to say that we are not alone in our dissatisfaction with the reviewing process. What will happen is what always happens: the paper will be resubmitted elsewhere and life will move on.
But what about the SIGIR conference, and the community it represents? Are we unhappy with the misreadings and misunderstandings, with the reviewer who could not tell the difference between “5 judgments of relevance” and “5 relevant documents”? Yes, of course. And a conference review system that allows for anonymous feedback from the author to the reviewers, as CHI’s does, could go a long way toward rectifying these misunderstandings. Misconceptions of one’s work are, to a certain extent, understandable. Even if a paper is written clearly, the reviewers have not grappled with the ideas anywhere near as much as the author(s) have.
But what about the more basic problem, the one of narrow thinking in the reviews themselves? The idea that a paper on interactivity and relevance feedback is not acceptable unless it also includes experiments on and evaluations of the non-interactive pseudo-relevance feedback approach is one that we have a difficult time accepting. Non-interactive approaches, and the web search world which thrives on them, are popular right now. Pseudo-relevance feedback epitomizes that non-interactivity, yet multiple reviewers suggested that the lack of PRF was the paper’s biggest weakness. We feel that it isn’t; PRF is a different problem, solving a different kind of need, in a different kind of scenario. Not all of information retrieval is web search. So what is one to do when a review not only mis-perceives a paper, but actively tries to impose its own values onto the paper, asking it to solve a different kind of problem than the one it is trying to solve?
Given the non-interactivity in the SIGIR review process, the inability to discuss and correct mis-perceptions and biases, one is tempted to label reviewer comments themselves as a form of pseudo-relevance feedback. I.e., contrary to appearances, no explicit judgments of relevance to the conference had actually been made. ;-)
Comments and lively discussion are welcome.