In his comment on an earlier post, Miles Efron reiterated the usefulness of the various TREC competitions in fostering IR research. I agree with him (and with others) that TREC has certainly been a good incubator, both in its annual competition and in follow-on studies that use its data in other ways. And, as Miles points out, we have seen a proliferation of collections: everything from the original newspaper articles to blogs, video, large corpora, etc.
But TREC's major limitation is its reliance on recall and precision—that is, on an a priori gold standard—to measure performance. These metrics allow systems to be compared in a lab, but they are difficult to transfer to real-world situations. While algorithms may transfer from the lab to the web (or elsewhere), an evaluation methodology based on these metrics doesn't. By focusing researchers' attention so closely on recall and precision (and on mean average precision, MAP, in particular), I argue that TREC has in a real sense discouraged innovation in evaluation.
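To make the critique concrete, here is a minimal sketch of what these gold-standard metrics compute. The function names are illustrative (not from any library); each score is derived purely from a ranked result list and a pre-judged set of relevant documents, which is exactly the lab-bound dependency described above:

```python
def precision_recall(ranked, relevant):
    """Precision and recall of a ranked result list, given the set of
    document ids judged relevant a priori (the gold standard)."""
    hits = sum(1 for doc in ranked if doc in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant
    document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: the mean of average precision over a set of queries.
    runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Note that nothing in these formulas asks whether anyone used the system or was satisfied with it; everything hinges on the fixed relevance judgments.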
In the recently published book “Information Retrieval: A Health and Biomedical Perspective,” William Hersh lists several questions that try to get at the evaluation of IR systems in the wild:
- Was the system used?
- For what was the system used?
- Were the users satisfied?
- How well did they use the system?
- What factors were associated with successful or unsuccessful use of the system?
- Did the system have an impact?
This seems like a great starting point for exploring evaluation in real-world settings. While the answers to these questions surely depend on the domain, and will differ for medical, legal, and other disciplines, by collecting such answers in a systematic way we may be able to identify useful generalizations that can inform the design and evaluation of many systems. For example, one way to assess the impact of a document is to measure its re-use in certain contexts, as suggested by Jon Elsas for comments on a forum, or as we did in a study of how law students write from sources.
Perhaps a way to go forward without burning bridges is to extend TREC's mandate to fill in this important missing aspect. We could honor the Text REtrieval Conference by transforming it into the Text Retrieval and Evaluation Conference.