Amazon’s Mechanical Turk is increasingly being used to obtain relevance judgments from which to establish gold standards for evaluating information-seeking experiments. The attraction is clear: for a few bucks, in a few days, you can obtain data that is every bit as useful for evaluating simulations and other off-line experiments as data collected in the lab from “live” participants, and that may be a good substitute for TREC assessors’ judgments. And of course the scale of the enterprise is such that you can run complex multi-factor experiments and still retain enough statistical power. If you’re not up to doing this yourself, companies such as Dolores Labs will do it for you.
There are a number of open issues, of course, including:
- Issues related to IRB approval (discussed earlier in the year by Panos Ipeirotis). IRB approval may be warranted if research hypotheses involve collecting demographic or other personal data.
- Moral issues: people doing these tasks don’t get paid much, which is part of what makes this approach so attractive. And while Turkers certainly benefit from this work, an argument can be made that the effort they expend is worth more to the experimenter than the few cents per judgment it currently costs. Insert your fair-trade coffee argument here.
- Data quality: care must be taken when soliciting judgments to make sure you’re getting good results; common safeguards include collecting redundant judgments for each item and seeding tasks with items whose answers are already known.
- Interface design issues: designing an effective Human Intelligence Task (HIT) is not trivial, and poorly designed experiments can result in incomplete data collection or inappropriate inferences.
- Follow-up: you cannot easily ask follow-up questions of participants; this may not matter in some experiments but can be useful in others, particularly for follow-on studies.
- Experimental design: It is difficult to design between-subjects experiments because it is hard to control who will see which HIT. Similarly, repeated-measures designs have to rely on random assignment, since forcing people to perform many tasks before they get paid is unlikely to be a successful strategy. See Kittur et al. for a discussion of using Mechanical Turk for HCI experiments.
- Reliability: judgments on topics of common interest are more likely to be reliable than judgments on complex medical issues, for example, so the nature of the collection you’re trying to evaluate may affect your costs and the scale of experiments that are practical.
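To make the data-quality point above concrete, here is a minimal sketch of one common safeguard: screen out unreliable workers using “gold” items with known answers, then aggregate the remaining redundant judgments by majority vote. All names, labels, and the accuracy threshold are hypothetical illustrations, not part of the Mechanical Turk API.

```python
from collections import Counter, defaultdict

# Hypothetical raw data: (worker_id, item_id, label) triples.
# "gold1" is a trap item whose correct answer we know in advance.
judgments = [
    ("w1", "doc1", "relevant"), ("w2", "doc1", "relevant"), ("w3", "doc1", "not"),
    ("w1", "gold1", "relevant"), ("w2", "gold1", "relevant"), ("w3", "gold1", "not"),
]
gold = {"gold1": "relevant"}

def screen_workers(judgments, gold, min_accuracy=0.7):
    """Return the set of workers whose accuracy on gold items meets the threshold."""
    hits, total = Counter(), Counter()
    for worker, item, label in judgments:
        if item in gold:
            total[worker] += 1
            hits[worker] += (label == gold[item])
    return {w for w in total if hits[w] / total[w] >= min_accuracy}

def majority_vote(judgments, gold, trusted):
    """Aggregate non-gold judgments from trusted workers by simple majority."""
    votes = defaultdict(Counter)
    for worker, item, label in judgments:
        if worker in trusted and item not in gold:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

trusted = screen_workers(judgments, gold)          # w3 fails the gold check
labels = majority_vote(judgments, gold, trusted)   # doc1 -> "relevant"
```

In practice one would collect more judgments per item and might weight votes by each worker’s gold-item accuracy rather than filtering outright, but the basic redundancy-plus-screening pattern is the same.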
Despite all these caveats, Mechanical Turk remains an exciting option for collecting large amounts of data that can enable experimentation that is not otherwise feasible.