A platform for interactive search research


In her recent CIKM 2010 keynote address, Sue Dumais emphasized the importance of time to understanding how to structure search over web collections. It was a provocative and inspiring talk, but one that left me with a sense of futility: how does a research group that doesn’t have access to a large-scale search engine engage in this kind of research?

Static collections are obviously not appropriate, but simulated dynamism is likely to miss the mark as well. Indeed, I think it’s not possible to do the kinds of experiments that Sue described on synthetic collections at all, which may make it difficult for those without access to large numbers of users doing real searches over long periods of time to contribute to this research. Volunteer efforts to collect query logs and such are probably not going to succeed (see the UMass Lemur Study, for example, which over a year collected about as much data as is available to Google in about six seconds).

Thus the research community is left with a few options: to rely on historical datasets collected by crawling the web, to use the resources of companies that have access to query logs and to other data (via internships, visiting positions, etc.), or perhaps to create an open-source search engine project that would provide an ad-free search environment in exchange for the rights to use that data for research purposes.

I wonder how feasible this latter option really is, though. The issue is not the technology; I think that with a moderate amount of effort, a medium-scale web search engine could be built using mostly open-source components. It’s not clear, however, what the costs of running such a search engine (the crawler, the index, etc.) would be, nor what an appropriate funding model would be.

Would a consortium of smaller organizations be able to collect enough funds to pay for the required resources? What would the buy-in have to be? Would 100 organizations each contributing $10K/year, plus some open-source development, be sufficient? How would policy issues around experiments be managed? In some ways, this is related to my post about the ACM Digital Library as a collection for IR research. The pragmatic differences are largely due to scale.

It is ironic (or perhaps consistent) that at SIGIR 2009, Sue gave a keynote address about the Living Laboratory, which evoked a similar reaction in me: great idea, Sue. We should definitely do this. How do we fund it?

I am curious to hear what people think: am I tilting at windmills here, with no hope of this kind of research being done outside the YGB club, or is there some practical way to make this possible?

9 Comments

  1. […] This post was mentioned on Twitter by josek_net, Gene Golovchinsky. Gene Golovchinsky said: Posted "A platform for interactive search research" http://palblog.fxpal.com/?p=4876 […]

  2. gdupont says:

    Well, Jimbo Wales tried to build this kind of “open” search engine. The idea was not research-oriented, but they put quite an effort into it and were able to crawl and update a 50M-page index… They may have good numbers on the funding that would be needed.

    There is also the YaCy experiment (a P2P search engine), which is still alive and growing… Again, it’s more of an engineer-driven project than a research one, but it has sufficient potential.

    But what’s the target? What scale is sufficient? And, as you said, how do we organize and balance between experiments? Maybe an organization like Mozilla could act as the referee, like they do with the “Firefox Test Pilot” initiative…

  3. I think there are two big issues here: how to fund a web-scale effort, and how to manage it to make it a viable research platform.

    In terms of funding, I still don’t have a good sense of how much this might cost, although perhaps people at Ask.com (which just announced the closure of its search engine) might have some useful data on this.

    In terms of managing the research, one approach I could imagine is to make available an API to research organizations through which search results could be piped into whatever experimental interface you want. The API should be capable of returning data from which a SERP could be built, and also should expose collection statistics for more sophisticated uses.
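
    To make that concrete, here is one possible shape such a response might take. It is purely illustrative: the field names, the statistics exposed, and the idea of bundling them into a single response are my assumptions, not a spec.

    ```python
    # Purely illustrative sketch of a response from such a research API:
    # enough to build a SERP, plus collection statistics for more
    # sophisticated uses. All field names are assumptions.
    example_response = {
        "query": "interactive search",
        "results": [
            {
                "url": "http://example.org/doc1",
                "title": "An example document",
                "snippet": "...text from which a SERP entry could be built...",
                "rank": 1,
                "score": 12.7,
            },
            # ...more ranked results...
        ],
        "collection_stats": {
            "num_docs": 250_000_000,  # size of the indexed collection
            "doc_freq": {"interactive": 1_200_000, "search": 48_000_000},
            "avg_doc_length": 910,
        },
    }
    ```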

  4. “create an open-source search engine project that would provide an ad-free search environment in exchange for the rights to use that data for research purposes.”
    Why not change the problem a little and look to libraries and library collections? There’s a fair bit of query logging, but not a huge amount of analysis. And less competition than in web search.

    I think the hardest part of creating a new search engine is getting people to use it–it has to be far better than what’s out there, or appeal to people ideologically. Smaller search engines might be a good place to start asking for data. I’m not sure if Duck Duck Go (say) is logging the queries (there’s lots of stuff they don’t log): https://duckduckgo.com/privacy.html

  5. Just floating this idea; I’m not sure how relevant it is to the blog post. Rather than institutions/organisations having to buy dedicated machines for crawling and indexing, would a distributed system, equivalent to the SETI@home project, be feasible?

    Machines that aren’t otherwise being used could perform crawling and indexing and relay the results to a central dedicated repository, possibly cutting costs massively. Each institution could then feed off the central core index via an API, presenting a custom UI over that data and collecting its own experimental data (a rough sketch of what I mean is at the end of this comment).

    In addition, any organisation or individual who uses the API would have to agree to provide a copy of any data collected or created as part of the project, for example by sharing it via P2P services.

    Sorry this may have gone a little off topic.
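
    To make it a bit more concrete anyway, here is a very rough sketch of the kind of volunteer node I have in mind; the coordinator URL, the endpoints, and the payload fields are all invented for illustration.

    ```python
    import json
    import urllib.request

    # Hypothetical SETI@home-style volunteer crawler node. The coordinator
    # URL, its endpoints, and the payload fields are made up for this sketch.
    COORDINATOR = "https://crawl-coordinator.example.org"

    def run_once():
        # 1. Ask the central repository for a small batch of URLs to fetch.
        with urllib.request.urlopen(COORDINATOR + "/assignments?size=10") as resp:
            urls = json.load(resp)

        # 2. Fetch each page using otherwise idle local resources.
        pages = []
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=10) as page:
                    pages.append({"url": url,
                                  "html": page.read().decode("utf-8", "replace")})
            except OSError:
                continue

        # 3. Relay the fetched content back to the central repository/index.
        req = urllib.request.Request(
            COORDINATOR + "/pages",
            data=json.dumps(pages).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).close()
    ```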

  6. @Jodi, I think studying library search is an interesting topic in its own right, but it has some fundamental differences from web-based search. For example, if you want to understand how people search for blog content, a library won’t be the best venue to study that.

    In terms of getting people to use the search engine, that can be unpacked into (at least) two separate issues:
    1) whether the search engine has access to click-through data to optimize rankings, and
    2) whether people can access the search engine through some public interface.

    My thought was that much of the access may be through purpose-built, experimental interfaces, only some of which will be available persistently to the public. With respect to point 1, I think a suitably-structured API can still be used to collect user feedback (including explicit feedback!).
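
    As a sketch of what I mean by point 1, a feedback event posted back through the same API might carry something like the following; the endpoint and field names are hypothetical.

    ```python
    # Hypothetical feedback event an experimental interface could post back
    # through the API; all field names are illustrative, not a spec.
    feedback_event = {
        "session_id": "anon-7f3a",        # anonymous experimental session
        "query": "interactive search",
        "clicked_url": "http://example.org/doc1",
        "rank": 1,                        # position of the clicked result
        "dwell_time_sec": 42,             # implicit feedback
        "explicit_judgment": "relevant",  # explicit feedback
    }
    ```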

  7. @Jon I think you’re spot-on with your comment — that might be a good way to reduce the need for some centralized resources, and to allow people to contribute in kind as well as financially.

    There’s been some work done in maintaining P2P distributed indexes; I wonder if that would be sufficient for this purpose, or if a more compute-intensive central core (or cores) to serve results would still be required.

  8. I think the answer is Wikipedia — a very large (and very lively) collection of documents where the complete revision history for each article is easily accessible via an API.
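
    For example, a minimal query against the standard MediaWiki API returns revision metadata for an article; the article title below is just an example.

    ```python
    import json
    import urllib.parse
    import urllib.request

    # Fetch the last 10 revisions (timestamp, user, edit comment) of one
    # article from the MediaWiki API; the title is just an example.
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": "Information retrieval",
        "rvprop": "timestamp|user|comment",
        "rvlimit": 10,
        "format": "json",
    })
    req = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "revision-history-example/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    for page in data["query"]["pages"].values():
        for rev in page.get("revisions", []):
            print(rev["timestamp"], rev["user"], rev.get("comment", ""))
    ```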

  9. Sérgio, I think the Wikipedia is a nice place to start, but it doesn’t support the richness and variety of information needs that exist with respect to the web at large. Thus conclusions drawn from research on the Wikipedia, while useful, may not apply directly to the web. I agree that Wikipedia is worth studying, perhaps as a way of testing algorithms, but it isn’t a particularly good approximation of what happens on the web, both with respect to the way content changes, and with respect to users’ information needs.
