Turk vs. TREC


We’ve been dabbling in the world of Mechanical Turk, looking for ways to collect judgments of relevance for TREC documents. TREC raters’ coverage is spotty, since it is based on pooled (and sometimes sampled) documents identified by a small number of systems that participated in a particular workshop. When evaluating our research systems against TREC data from prior years, we found that many of the identified documents had not received any judgments (relevant or non-relevant) from TREC assessors. Thus we turned to Mechanical Turk for answers.

We have not yet run full experiments (looking for relevance judgments for large numbers of documents), but we have run a couple of exploratory studies to assess the effects of presentation and to get some first-hand experience with Mechanical Turk. In one set of experiments using a couple of topics, we found that Turkers engaged with the subject matter rather than answering the questions we posed. This was particularly true of a topic about the role of women in Parliament.

The topic description (topic 321, for those familiar with TREC topics) reads:

Pertinent documents relating to this issue will discuss the lack of representation by women, the countries that mandate the inclusion of a certain percentage of women in their legislatures, decreases if any in female representation in legislatures, and those countries in which there is no representation of women.

A number of Turkers judged as relevant documents that did not address the topic directly but instead discussed other aspects of injustice to women.

For another topic (411, recovering sunken treasure), a Turker reported as non-relevant a document (LA041589-0050) that TREC assessors had marked as relevant. While the document did address sunken ships, the article did not contain a discussion of recovery of treasure.

Inter-rater reliability of TREC judgments is known to be low. For example, Voorhees (2000) reports overlaps in assessor ratings between 0.421 and 0.494 for pairs of assessors, and 0.301 on average for three assessors, on sets of 200 relevant and 200 non-relevant but similar documents over a range of topics. That analysis nonetheless found that different sets of relevance judgments did not have a strong effect on the relative ranking of the systems being compared against that ground truth.
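For readers unfamiliar with the statistic, the overlap Voorhees reports is the size of the intersection of two assessors' relevant sets divided by the size of their union. A minimal sketch (document IDs here are made up for illustration):

```python
def overlap(relevant_a, relevant_b):
    """Overlap between two assessors' sets of relevant documents:
    |intersection| / |union|. Returns 0.0 when both sets are empty."""
    a, b = set(relevant_a), set(relevant_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Toy example: the assessors agree on one of three distinct documents.
print(overlap({"d1", "d2"}, {"d2", "d3"}))  # 0.3333...
```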

On the other hand, inaccurate judgments could certainly affect absolute performance, particularly for sparse topics. Thus in principle, it would be useful to identify and correct errors in the sets of relevant documents associated with each query.  It is possible that a Mechanical Turk-based approach that solicits explanatory comments and uses judgments of multiple people on each document might be used to identify inaccuracies in the TREC ground truth.
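The error-detection idea above — collect several Turker judgments per document and flag documents where the Turker majority contradicts the official label — can be sketched as follows (the vote labels and data shapes are assumptions, not our actual pipeline):

```python
from collections import Counter

def flag_suspect_judgment(turker_votes, trec_label):
    """Given a list of Turker votes ('rel' / 'nonrel') for one document
    and the TREC assessor label, return the Turker majority vote and
    whether it conflicts with the official judgment."""
    majority, _ = Counter(turker_votes).most_common(1)[0]
    return majority, majority != trec_label

# Three Turkers, two of whom disagree with the TREC label:
print(flag_suspect_judgment(["nonrel", "nonrel", "rel"], "rel"))
# ('nonrel', True) -> a candidate error in the TREC ground truth
```

Documents flagged this way (especially those with explanatory comments attached) would then be candidates for manual re-review rather than automatic relabeling.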

We tested this further in another set of small pilot tests for topics 333 (Antibiotics Bacteria Disease), 343 (Police Deaths), 367 (piracy), and 370 (food/drug laws). We had gotten a rather slow response to the original set of tasks, which we attributed to low payment and the 98% acceptance-rate requirement. We doubled the payments and reduced the required acceptance rate to 96% for the second set of tasks. The tasks still took a while to complete, and we did get quite a bit of noise in the system. There were some obvious attempts to game the system, and at least three Turkers performed well below chance level when assessing relevance of documents. (Some, however, had very good accuracy.)

For the four topics, we obtained rates of 70%, 90%, 90%, and 50% agreement between Turkers and TREC assessors; with the inaccurate (much less than 50% right) Turkers’ results removed, the numbers changed to 75%, 85%, 90%, and 65%, respectively. My sense is that topic 370 yielded such poor agreement because a number of its allegedly relevant documents are, in fact, not relevant. Thus Mechanical Turk seems to be a useful mechanism for identifying potentially problematic TREC topics.
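The filtering step we applied — score each Turker against the TREC labels, drop those well below chance, and recompute topic-level agreement — can be sketched like so (the Turker names and toy judgments are illustrative, not our data):

```python
# Hypothetical TREC labels and per-Turker judgments for one topic.
trec = {"d1": "rel", "d2": "nonrel", "d3": "rel", "d4": "nonrel"}

per_turker = {
    "t1": {"d1": "rel", "d2": "nonrel", "d3": "rel", "d4": "nonrel"},  # accurate
    "t2": {"d1": "nonrel", "d2": "rel", "d3": "nonrel", "d4": "rel"},  # below chance
}

def accuracy(judgments):
    """Fraction of this Turker's judgments that match the TREC labels."""
    return sum(trec[d] == v for d, v in judgments.items()) / len(judgments)

# Keep only Turkers at or above chance (0.5 for binary judgments).
kept = {t: j for t, j in per_turker.items() if accuracy(j) >= 0.5}
print(sorted(kept))  # ['t1'] -- only the accurate Turker remains
```

In practice the threshold matters: a Turker at exactly chance contributes noise, so a stricter cutoff (or calibration against known-relevant/known-non-relevant documents, as in our second experiment) is probably warranted.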

We are still working out the kinks in our presentation strategy and Turker vetting, but are heartened by these preliminary results. One of the challenges in dealing with Mechanical Turk when handling large numbers of responses is the tradeoff between using the simple HTML interface vs. a more customized, API-driven solution. One of the issues is how to streamline the acceptance or rejection of work based on turker performance.


  1. Two points. First, the Cranfield methodology (and the TREC collections in particular) is _only_ valid for comparative evaluation; the scores in isolation are not meaningful.

    Second, are your agreement scores over all documents judged or over TREC-relevant documents? If the former I suspect your high agreements are dominated by nonrelevant documents. You don’t give enough experiment information to ask good questions ;-)

  2. Twitter Comment

    RT @HCIR_GeneG finds that using Turk to cover unjudged TREC docs not so easy. [link to post] #mturk

    Posted using Chat Catcher

  3. Twitter Comment

    Posted “Turk vs. TREC” [link to post] #mturk

    Posted using Chat Catcher

  4. Omar says:

    Great summary. I think it is worth mentioning that the topics used were not that trivial. And I really like using justifications/comments for getting more data back.

  5. Twitter Comment

    RT @HCIR_GeneG: Posted “Turk vs. TREC” [link to post] #mturk

    Posted using Chat Catcher

  6. @ian points of clarification: For the first test, we had some known relevant, known non-relevant, and unknown documents; for the second experiment, we had equal numbers of known relevant and known non-relevant, so the test was geared only at calibrating the effectiveness of eliciting relevance judgments from turkers.

    With respect to using TREC data only for comparison, from a statistical reliability point of view, I think it’s useful to show not only the ranking, but the magnitude of the difference to get a feel for the robustness of the effect.

  7. @omar One of the tests we need to run is to look at the correlation between accuracy of judgment and the likelihood of comment. I can tell you that the topic related to women in Parliament elicited a lot of comments, most of which related to unequal treatment of women in general rather than to the specifics of the topic. But the data will speak!

  8. Omar says:

    @gene Another thing to try would be to see if controversial topics generate more comments.

  9. We’ve found the single most useful thing for Turker vetting is a tutorial qualifying task.

    I’ve heard it also helps to personalize the request by telling people what it’s for, providing feedback as users go (if you have any gold standard data or other means of online evaluation), and offering bonuses for good work.

    There’s lots of experience out there in search eval on Turk from PowerSet and Microsoft. You might want to talk to the folks at Dolores Labs, who also have a lot of experience in this area.
