The fourth HCIR workshop was held this past weekend at Rutgers University in conjunction with the IIiX 2010 conference. This was, in my opinion, the best workshop of the four so far. Part of its strength was the range of presentations: more mature work in traditional 30-minute talks, a poster and demo session, and, new this year, reports from the HCIR search challenge.
From the web site:
The aims of the challenge are to encourage researchers and practitioners to build and demonstrate information access systems satisfying at least one of the following:
- Not only deliver relevant documents, but provide facilities for making meaning with those documents.
- Increase user responsibility as well as control; that is, the systems require and reward human effort.
- Offer the flexibility to adapt to user knowledge / sophistication / information need.
- Are engaging and fun to use.
Participants would be given access to the New York Times annotated corpus, which consists of 1.8 million articles published in the Times between 1987 and 2007, and they would be expected to do something interesting in searching or browsing this collection.
Several teams competed in the event, and their entries were judged by the workshop participants. The entries (available as part of the proceedings) were:
- Search for Journalists: New York Times Challenge Report
Corrado Boscarino, Arjen P. de Vries, and Wouter Alink
(Centrum Wiskunde & Informatica)
- Exploring the New York Times Corpus with NewsClub
Christian Kohlschütter (Leibniz Universität Hannover)
- Searching Through Time in the New York Times
Michael Matthews, Pancho Tolchinsky, Roi Blanco, Jordi Atserias, Peter Mika, and
Hugo Zaragoza (Yahoo! Labs)
- News Sync: Three Reasons to Visualize News Better
V.G. Vinod Vydiswaran (University of Illinois),
Jeroen van den Eijkhof (University of Washington),
Raman Chandrasekar (Microsoft Research), Ann Paradiso (Microsoft Research),
and Jim St. George (Microsoft Research)
- Custom Dimensions for Text Corpus Navigation
Vladimir Zelevinsky (Endeca Technologies)
- A Retrieval System Based on Sentiment Analysis
Wei Zheng and Hui Fang (University of Delaware)
I liked Vladimir Zelevinsky’s system, which constructed a faceted browsing interface on the fly: given a search term, it generated facets using WordNet relations, and then used the terms obtained this way to populate the facets. He got great performance from the Endeca back end, which was able to compute facet value counts quickly enough to populate multiple facets in a few seconds. Search results themselves were presented in a two-column newspaper-like layout which appealed to me with its clean look. Then again, I am somewhat partial to newspaper layouts for presenting search results.
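To make the facet idea concrete, here is a minimal sketch of grouping terms into facets and counting documents per facet value. It is not how the Endeca-based system works: the `BROADER` table is a tiny hand-made stand-in for WordNet relations, and `build_facets` and the sample documents are hypothetical names and data invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for WordNet: maps a term to a broader category.
# A real system would look these relations up in WordNet instead.
BROADER = {
    "senator": "politician",
    "mayor": "politician",
    "governor": "politician",
    "baseball": "sport",
    "tennis": "sport",
}


def build_facets(documents, query):
    """Group terms from documents matching `query` into facets with counts."""
    facets = defaultdict(Counter)
    for text in documents:
        words = set(text.lower().split())
        if query.lower() not in words:
            continue  # document does not match the search term
        for word in words:
            facet = BROADER.get(word)
            if facet is not None:
                # `words` is a set, so this counts documents, not occurrences.
                facets[facet][word] += 1
    return {name: dict(counts) for name, counts in facets.items()}


docs = [
    "The senator met the mayor before the election",
    "A governor spoke about the election and baseball",
    "Tennis results from the weekend",
]
facets = build_facets(docs, "election")
```

The third document does not contain the query term, so "tennis" never appears as a facet value. The expensive part in a real system is computing these counts over millions of articles, which is the work the post credits to the back end.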
Microsoft’s entry performed automatic clustering of matching articles into several groups, and offered a pleasant UI for browsing the results. It was a much heavier system whose success depends on the quality of the clustering algorithm. In demos, it performed well for the most part, although some of the smaller category labels were a bit odd.
The winner of the competition was the entry from Yahoo, which used NLP techniques to identify references to time in the articles. These references were used to construct a variety of timeline visualizations intended to help journalists make sense of the data. I didn’t get to play with the system, but judging from the presentation, it looked like a solid piece of work. It also introduced me to OpenNLP, a set of NLP tools written in Java. Apparently it’s quite useful for doing things like POS tagging and named entity extraction.
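As a rough illustration of the idea of indexing articles by the times they mention, the sketch below pulls four-digit years out of article text and buckets article titles by year. This is an assumption-laden toy: a plain regular expression stands in for the NLP-based extraction the Yahoo team used, and the `timeline` function and sample articles are made up for the example.

```python
import re
from collections import defaultdict

# Matches four-digit years from 1900 to 2099. A crude stand-in for real
# temporal expression extraction, which also handles dates and phrases
# such as "next spring".
YEAR = re.compile(r"\b(19\d{2}|20\d{2})\b")


def timeline(articles):
    """Map each referenced year to the titles of articles that mention it."""
    buckets = defaultdict(list)
    for title, text in articles:
        # Use a set so an article is listed once per year it mentions.
        for year in sorted(set(YEAR.findall(text))):
            buckets[int(year)].append(title)
    return dict(sorted(buckets.items()))


articles = [
    ("Budget talks", "Lawmakers expect a deficit in 1995 and a surplus by 2001."),
    ("Olympic bid", "The city hopes to host the games in 2012."),
    ("Retrospective", "The 1995 season is remembered fondly."),
]
result = timeline(articles)
```

Note that the key is the year an article talks about, not the year it was published, so an article can land on a point in the timeline that lies after its publication date.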
These presentations highlighted a successful attempt to put HCIR techniques into practice in an open-ended manner. The challenge was also a success in terms of press coverage, thanks to Daniel Tunkelang’s efforts: the Yahoo entry was covered by Technology Review in an article titled “A Search Service that Can Peer into the Future,” which was also picked up by Techmeme.