Sue Dumais of MSR gave an excellent keynote address at CIKM last week, in which she emphasized the temporal nature of collections used for information retrieval and of the way people access information on the web. Of the talks I attended at the conference, this was by far the most user-oriented, and a refreshing change from the vast array of machine learning papers in the rest of the program.
The slides from the talk will be available on her site, but are substantially similar to those from her ECDL 2010 keynote. In short, Sue described how collections and documents change over time, and how people’s patterns of visiting web sites change in response to content evolution. She also introduced a new browser plugin for Internet Explorer called Diff-IE that helps people understand changes to the web sites they visit.
The goal of the talk was to describe the nature of change in web collections, and then to suggest how this dynamism can be modeled and leveraged in the design of systems. Sue’s list of dynamic qualities of collections includes:
- The rate of introduction of new documents
- The rate at which document content changes
- The evolving nature of relevance (e.g., Hurricane Earl in 1998 versus Hurricane Earl in 2010)
- Changes in query volume
- Changes in interaction over time
Sue cited a paper by Adar et al. that described the nature of web content change. If I have my math right, 33% of 55,000 sampled web pages changed in five weeks, and about the same number changed every hour. This is considerable churn, and it is not well captured in the web test collections used in research.
She also reported an interesting analysis of page content changes at both the DOM level and the term level. The term-level analysis looked at the terms associated with pages: some terms seemed to be fundamental to a page, while other terms on the same page were ephemeral.
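The distinction between fundamental and ephemeral terms can be illustrated with a toy example. The sketch below is my own illustration, not the analysis from the talk: it assumes a page is available as a list of plain-text snapshots, and it measures each term's "staying power" as the fraction of snapshots in which the term appears. The function names and the 0.8 threshold are hypothetical choices made for this example.

```python
import re
from collections import Counter


def term_staying_power(snapshots):
    """Return, for each term, the fraction of snapshots in which it appears."""
    counts = Counter()
    for text in snapshots:
        # Count each term at most once per snapshot.
        counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return {term: n / len(snapshots) for term, n in counts.items()}


def split_terms(snapshots, threshold=0.8):
    """Split terms into (stable, ephemeral) sets using an assumed threshold."""
    power = term_staying_power(snapshots)
    stable = {t for t, p in power.items() if p >= threshold}
    ephemeral = set(power) - stable
    return stable, ephemeral


# Three made-up snapshots of the same hypothetical page.
snapshots = [
    "seattle weather forecast rain today",
    "seattle weather forecast sunny today",
    "seattle weather forecast snow warning",
]
stable, ephemeral = split_terms(snapshots)
print(sorted(stable))
print(sorted(ephemeral))
```

In this toy data, the terms that describe what the page is about persist across every snapshot, while the terms that describe the day's conditions come and go. A real analysis would need to handle markup, stemming, and much longer crawl histories.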
Sue then tied an analysis of revisitation patterns to document change patterns. MSR ran a study based on toolbar and query logs to understand practices and patterns around revisitation and refinding, and coupled this with a survey to get at some of the motivation for people’s behavior. They found revisitation rates in the 60-80% range, and four major patterns of revisitation based on page structure and type. They also found that about a third of the queries are repeats, and about 39% of clicked-on documents are re-visits.
They also found (not surprisingly) that the more a web page changes, the more unique visitors it attracts. This led to another analysis of rates of content change on a page. Because different parts of a page change at different rates, they tend to attract different kinds of visitors. She described this as a resonance between rates of change and visitation patterns, which I thought was a great analogy.
Some of this data comes from an internal beta deployment of Diff-IE, a tool that shows users how a page has changed since their last visit. It’s a great tool to have, and MSR deserves credit for developing it and for planning to release it quite soon. (Jeremy would want me to point out that he and Laurent Denoue filed a patent application on a related idea back in December of 2007.)
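The basic idea of showing what changed since a previous visit can be sketched in a few lines. This is only an illustration using Python's standard `difflib` module on plain text, and says nothing about how Diff-IE itself is implemented; the `changed_lines` helper and the sample page text are made up for the example.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return (added, removed) lines between two text snapshots of a page."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    added, removed = [], []
    for line in diff:
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " are unchanged; "? " lines are ndiff hints.
    return added, removed


# Hypothetical cached copy from the last visit, and the page as it is now.
old = "Headlines\nStorm approaches coast\nContact us"
new = "Headlines\nStorm makes landfall\nContact us"
added, removed = changed_lines(old, new)
print("Added:", added)
print("Removed:", removed)
```

A browser tool would presumably work on the rendered page rather than on raw lines of text, but the same compare-against-a-cached-copy pattern applies.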
In short, it was a rich and thought-provoking talk, particularly when it comes to evaluation: it is not clear how to best capture the kind of churn characteristic of the web, and how to set up interactive experiments that simulate it.