In her recent CIKM 2010 keynote address, Sue Dumais emphasized the importance of time to understanding how to structure search over web collections. It was a provocative and inspiring talk, but one that left me with a sense of futility: how does a research group that doesn’t have access to a large-scale search engine engage in this kind of research?
Static collections are obviously not appropriate, but simulated dynamism is likely to miss the mark as well. Indeed, I think it’s not possible to do the kinds of experiments that Sue described on synthetic collections at all, which may make it difficult for those without access to large numbers of users doing real searches over long periods of time to contribute to this research. Volunteer efforts to collect query logs and such are probably not going to succeed (see the UMass Lemur Study, for example, which over a year collected as much data as is available to Google in about six seconds).
Thus the research community is left with a few options: to rely on historical datasets collected by crawling the web, to use the resources of companies that have access to query logs and other data (via internships, visiting positions, etc.), or perhaps to create an open-source search engine project that would provide an ad-free search environment in exchange for the rights to use that data for research purposes.
I wonder how feasible this last option really is, though. The issue is not the technology; I think that with a moderate amount of effort, a medium-scale web search engine could be built using mostly open-source components. It’s not clear, however, what the costs of running such a search engine (the crawler, the index, etc.) would be, nor what an appropriate funding model would be.
Would a consortium of smaller organizations be able to collect enough funds to pay for the required resources? What would the buy-in have to be? Would 100 organizations each contributing $10K/year, plus some open-source development, be sufficient? How would policy issues around experiments be managed? In some ways, this is related to my post about the ACM Digital Library as a collection for IR research. The pragmatic differences are largely due to scale.
It is ironic (or perhaps consistent) that at SIGIR 2009, Sue gave a keynote address about the Living Laboratory, which evoked a similar reaction in me: great idea, Sue. We should definitely do this. How do we fund it?
I am curious to hear what people think: am I tilting at windmills here, with no hope of this kind of research being done outside the YGB club, or is there some practical way to make this possible?