Timothy G. Armstrong, Alistair Moffat, William Webber, and Justin Zobel have written what will undoubtedly be a controversy- and discussion-inspiring paper for the upcoming CIKM 2009 conference. The paper compares over 100 studies of information retrieval systems based on various TREC collections, and concludes that not much progress has been made over the last decade in terms of Mean Average Precision (MAP). They also found that studies that use the TREC data outside the TREC competition tend to pick poor baselines, showing short-term improvements (which are publishable) without demonstrating long-term gains in system performance. This interesting analysis is summarized in a blog post by William Webber.
The authors explored the issue of baseline analyses further through a simulated combination of various techniques (stemming, stop words, query expansion, etc.) that have each been found to contribute to search effectiveness individually. The idea was to see whether combinations of these techniques actually improve performance. They found that on the collections tested, the more features were added, the better the overall performance in terms of MAP, and that query expansion and smoothing seemed to offer the most reliable improvements when used in combination with other features.
The authors have also created a public repository of evaluation results that can be used to standardize both performance measurements and collections. This site, www.evaluatIR.org, allows researchers to understand the state of the art in terms of performance on standardized test collections, to upload new results, to compare results within the database, and to share uploaded results with others. More info on this tool can be found in another blog post by William. I hope that this repository receives due attention both from people writing papers and from reviewers considering the merits of conference submissions. I certainly intend to use it when reviewing papers.
Finally, it is proper to point out that the lack of cumulative progress in ad hoc retrieval effectiveness may be due to the nature of the approach. Reducing an exploratory task to single-shot interaction can only get you so far. Perhaps a more interactive approach characteristic of HCIR is more appropriate. Another aspect to consider is the utility of MAP as a measure of system performance. MAP tends to penalize results that don't put relevant documents in the top few slots of a ranked list. While this may be an appropriate metric for some kinds of search, it is not necessarily useful in recall-oriented scenarios such as medical, legal, or other exploratory search. Thus it would be interesting to extend the scope of www.evaluatIR.org to include other measures of performance.
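MAP's bias toward early-ranked relevant documents follows directly from its definition: each query's Average Precision is the mean of the precision values at the ranks where relevant documents appear. Here is a minimal sketch in Python (illustrative only, not the trec_eval implementation; the document IDs are made up):

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k at each rank k holding a relevant doc."""
    hits = 0
    precisions = []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: the average of per-query AP scores over a set of (ranking, relevant-set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two rankings retrieving the same single relevant document d1:
# placing it first scores AP = 1.0, placing it fifth scores AP = 0.2.
high = average_precision(["d1", "d2", "d3", "d4", "d5"], {"d1"})  # 1.0
low = average_precision(["d2", "d3", "d4", "d5", "d1"], {"d1"})   # 0.2
```

The toy comparison at the bottom shows the point made above: both rankings eventually retrieve the relevant document, but the late retrieval is scored five times worse, which is exactly the behavior that matters less in recall-oriented settings.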