Crowdsourcing relevance


Amazon’s Mechanical Turk is increasingly being used to obtain judgments of relevance that can be used to establish gold standards with which to evaluate information seeking experiments. The attraction is clear: for a few bucks, in a few days, you can obtain data that is every bit as useful for evaluating simulations and other off-line experiments as data collected in the lab from “live” participants, and may be a good substitute for TREC assessors’ judgments. And of course the scale of the enterprise is such that you can run complex multi-factor experiments and still retain enough power. If you’re not up to doing this yourself, companies such as Dolores Labs will do it for you.

There are a number of open issues, of course, including:

  • Issues related to IRB approval (discussed earlier in the year by Panos Ipeirotis). IRB approval may be warranted if research hypotheses involve testing demographic or other personal data.
  • Moral issues: people doing these tasks don’t get paid a lot of money, which is why this approach is so attractive. And while Turkers certainly benefit from this work, an argument can be made that the effort they expend is worth more to the experimenter than the few cents per judgment it currently costs. Insert your fair-trade coffee argument here.
  • Data quality: care must be taken when soliciting data to make sure that you’re getting good results.
  • Interface design issues: designing an effective Human Intelligence Task (HIT) is not trivial, and poorly designed experiments can result in incomplete data collection or inappropriate inferences.
  • Follow up: you cannot easily ask follow-up questions of participants, which may not be important in some experiments but can be useful in other cases, particularly with follow-on studies.
  • Experimental design: it is difficult to design between-subjects experiments because it is hard to guarantee who will see which HIT. Similarly, repeated-measures designs have to rely on random assignment, since forcing people to perform many tasks before they get paid is unlikely to be a successful strategy. See Kittur et al. for a discussion of using Mechanical Turk for HCI experiments.
  • Reliability: judgments on topics of common interest are more likely to be reliable than judgments on complex medical issues, for example, so the nature of the collection you’re trying to evaluate may affect your costs and the scale of experiments that are practical.

Despite all these caveats, Mechanical Turk remains an exciting option for collecting large amounts of data that can enable experimentation that is not otherwise feasible.


  1. Omar Alonso says:

    Excellent summary!

  2. I think almost all these issues are quantitative. You can’t get as reliable judgements from the Turk, so you collect more of them: 10 slightly less reliable judgements pooled are more reliable than 2 or 3 more reliable judgements.

    You can’t do complete panel designs, so I and others have built hierarchical models of the data annotation process rather than relying on paired inter-annotator statistics. A big advantage is the pooling available in hierarchical models and the implicit modeling of multiple comparisons.

    You can ask follow ups on mech Turk either immediately on the same task, later through e-mail, or later through another task. We’ve had good luck following up with Turkers. The problem is drop out, which is also common in other social science or medical studies of any size.

    Data quality and interface are an issue either way.

    With mech Turk, you have good control over how you present the data. You can even have people download an app to use if you can convince them to do it. Or use AJAX or servlets. Dolores Labs has some slick wrappers, like CrowdFlower.

    I found better results from mech Turk than existing gold standard data prepared by professionals (or at least their students). Alonso and Mizzaro report the same thing in the paper of theirs you cite above.

  3. You’re right — I just wanted to point out that the approach to experimental design using MTurk needs to be different from that of conventional experiments. With respect to pooling judgments, a voting mechanism makes sense as long as you believe that there is no systematic bias. If the bias may be caused by lack of domain knowledge, a qualification test should help, but it will, by definition, reduce your participant pool and increase the amount of time required to perform the experiment. These are interesting trade-offs to explore.
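
The pooling argument in this exchange can be made concrete with a small calculation. The sketch below is only an illustration under assumptions that are not in the thread: every judgment is binary (relevant or not), every annotator is correct with the same probability `p`, annotators err independently, and ties are broken by a coin flip. The function name `majority_vote_accuracy` and the accuracy values are hypothetical, chosen for the example.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority vote of n annotators is correct.

    Assumes (for illustration only) binary labels, independent errors,
    and the same per-annotator accuracy p. A tie, which can only happen
    for even n, is counted as correct half of the time.
    """
    total = 0.0
    for k in range(n + 1):
        # Probability that exactly k of the n annotators are correct.
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:
            total += prob_k          # clear majority is correct
        elif 2 * k == n:
            total += 0.5 * prob_k    # tie, broken by a coin flip
    return total


if __name__ == "__main__":
    # Hypothetical accuracies: 0.7 for pooled Turkers, 0.8 for a small expert panel.
    print("10 annotators at 0.7:", round(majority_vote_accuracy(0.7, 10), 3))
    print(" 3 annotators at 0.8:", round(majority_vote_accuracy(0.8, 3), 3))
    # Annotators who are wrong more often than right on this kind of item.
    print("11 annotators at 0.4:", round(majority_vote_accuracy(0.4, 11), 3))
```

With these made-up numbers, ten annotators at 0.7 come out slightly ahead of three at 0.8, which is the sense in which more judgments can compensate for less reliable ones. The last call shows the caveat about systematic bias: if individual accuracy falls below 0.5, for instance because annotators lack the domain knowledge, adding votes makes the pooled answer worse rather than better.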
