We’ve been dabbling in the world of Mechanical Turk, looking for ways to collect relevance judgments for TREC documents. TREC assessors’ coverage is spotty, since it is based on pooled (and sometimes sampled) documents identified by the small number of systems that participated in a particular workshop. When evaluating our research systems against TREC data from prior years, we found that many of the identified documents had never received any judgment (relevant or non-relevant) from TREC assessors. Thus we turned to Mechanical Turk for answers.
We have not yet run full experiments (collecting relevance judgments for large numbers of documents), but we have run a couple of exploratory studies to assess the effects of presentation and to get some first-hand experience with using Mechanical Turk. In one set of experiments covering a couple of topics, we found that Turkers engaged with the subject matter rather than answering the questions we posed. This was particularly true of a topic about the role of women in Parliament.
The topic description (topic 321, for those familiar with TREC topics) reads:
Pertinent documents relating to this issue will discuss the lack of representation by women, the countries that mandate the inclusion of a certain percentage of women in their legislatures, decreases if any in female representation in legislatures, and those countries in which there is no representation of women.
A number of Turkers judged as relevant documents that did not address the topic directly but rather talked about other aspects of injustice toward women.
For another topic (411, recovering sunken treasure), a Turker reported as non-relevant a document (LA041589-0050) that TREC assessors had marked as relevant. While the document did address sunken ships, it did not discuss the recovery of treasure.
Inter-rater reliability of TREC judgments is known to be low. For example, Voorhees (2000) reports overlaps in assessor ratings between 0.421 and 0.494 for pairs of assessors, and 0.301 on average for three assessors, on sets of 200 relevant and 200 non-relevant-but-similar documents over a range of topics. That analysis also found, however, that different sets of relevance judgments did not have a strong effect on the relative ranking of the systems being compared against that ground truth.
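For concreteness, Voorhees’ overlap measure is the size of the intersection of two assessors’ relevant sets divided by the size of their union. A minimal sketch, assuming judgments are stored as dicts mapping document IDs to booleans (the function name and data layout are mine):

```python
def overlap(judgments_a, judgments_b):
    """Overlap between two assessors: |relevant_a & relevant_b| / |relevant_a | relevant_b|.

    judgments_a, judgments_b: {doc_id: bool} relevance judgments.
    """
    rel_a = {doc for doc, rel in judgments_a.items() if rel}
    rel_b = {doc for doc, rel in judgments_b.items() if rel}
    union = rel_a | rel_b
    # If neither assessor marked anything relevant, treat them as in full agreement.
    return len(rel_a & rel_b) / len(union) if union else 1.0
```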
On the other hand, inaccurate judgments could certainly affect absolute performance, particularly for sparse topics. Thus in principle, it would be useful to identify and correct errors in the sets of relevant documents associated with each query. It is possible that a Mechanical Turk-based approach that solicits explanatory comments and uses judgments of multiple people on each document might be used to identify inaccuracies in the TREC ground truth.
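A minimal sketch of that multiple-judgment idea, assuming per-document Turker votes and the official TREC labels are available (all names and the minimum-vote threshold are illustrative):

```python
from collections import Counter

def flag_suspect_judgments(turker_votes, trec_labels, min_votes=3):
    """Flag documents where the majority of Turkers disagree with the
    official TREC label -- candidates for manual re-examination.

    turker_votes: {doc_id: [bool, ...]}  independent relevance votes
    trec_labels:  {doc_id: bool}         official TREC judgment
    """
    suspects = []
    for doc_id, votes in turker_votes.items():
        if len(votes) < min_votes:
            continue  # too few independent judgments to trust a majority
        majority = Counter(votes).most_common(1)[0][0]
        if majority != trec_labels.get(doc_id):
            suspects.append(doc_id)
    return suspects
```

In practice the explanatory comments solicited alongside the votes would be reviewed for each flagged document before overriding the official label.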
We tested this further in another set of small pilot tests for topics 333 (Antibiotics Bacteria Disease), 343 (Police Deaths), 367 (piracy), and 370 (food/drug laws). We had gotten a rather slow response to the original set of tasks, which we attributed to low payment and a 98% acceptance-rate requirement. We doubled the payments and reduced the rate requirement to 96% for the second set of tasks. The tasks still took a while to complete, and we did get quite a bit of noise in the system. There were some obvious attempts to game the system, and at least three Turkers performed well below chance level when assessing relevance of documents. (Some, however, had very good accuracy.)
For the four topics, we obtained rates of 70%, 90%, 90%, and 50% agreement between Turkers and TRECers; with the inaccurate (much less than 50% right) Turkers’ results removed, the numbers changed to 75%, 85%, 90%, and 65%, respectively. My sense is that topic 370 yielded such poor agreement because a number of its allegedly relevant documents are, in fact, not relevant. Thus Mechanical Turk seems to be a useful mechanism for identifying potentially problematic TREC topics.
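The filtering step can be sketched as follows; the chance-level threshold and all names are illustrative, a rough cut rather than a principled reliability model:

```python
def worker_accuracy(votes, trec_labels):
    """Fraction of one worker's judgments that match the TREC labels.

    votes: {doc_id: bool}; only documents with an official label are scored.
    Returns None if the worker judged no officially-labeled documents.
    """
    scored = [(doc, rel) for doc, rel in votes.items() if doc in trec_labels]
    if not scored:
        return None
    hits = sum(1 for doc, rel in scored if rel == trec_labels[doc])
    return hits / len(scored)

def reliable_workers(all_votes, trec_labels, threshold=0.5):
    """Drop workers performing below chance before computing agreement.

    all_votes: {worker_id: {doc_id: bool}}
    """
    kept = {}
    for worker, votes in all_votes.items():
        acc = worker_accuracy(votes, trec_labels)
        if acc is not None and acc >= threshold:
            kept[worker] = votes
    return kept
```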
We are still working out the kinks in our presentation strategy and Turker vetting, but we are heartened by these preliminary results. One of the challenges in dealing with Mechanical Turk when handling large numbers of responses is the tradeoff between using the simple HTML interface and a more customized, API-driven solution. A related issue is how to streamline the acceptance or rejection of work based on Turker performance.
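One possible shape for that streamlining, assuming gold-standard documents with known labels are embedded in each batch of HITs; the thresholds and names below are illustrative, and in an API-driven setup the returned decision would drive the MTurk ApproveAssignment/RejectAssignment operations:

```python
def review_assignment(gold_accuracy, n_gold_checked,
                      accept_threshold=0.6, min_gold=5):
    """Decide whether to approve or reject a worker's batch, based on
    accuracy over embedded gold-standard documents.

    gold_accuracy:  fraction of gold documents the worker judged correctly
    n_gold_checked: how many gold documents the worker actually judged
    """
    if n_gold_checked < min_gold:
        return "approve"  # too little evidence to reject fairly
    return "approve" if gold_accuracy >= accept_threshold else "reject"
```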