Jamie Callan of CMU gave an interesting and thought-provoking keynote talk at CIKM 2010. While traditionally search engines have been used in a more or less direct manner to identify useful documents that the user would then (manually) incorporate into other tasks, Jamie suggested a new class of applications that would use search engines for the purposes of identifying documents or parts of documents in some collection, but then would apply this information in pursuit of some other, more specialized, task.
While the notion of using a search engine as a component of another system is not particularly novel, the kinds of requirements that his proposed use imposes on search engines would certainly push the envelope.
He started by motivating his talk with three applications: computer-assisted language learning, question answering, and the Never-Ending Language Learner (NELL). Such applications are currently constructed by running some query, extracting information from the results, and throwing away the junk. Queries are typically keyword searches or simple patterns, but these often do not meet the true requirements of the applications that consume the results.
What he was proposing was a general solution that allows the search engine to know as much as possible about the application’s information need and the document contents. His claim is that current structured queries can handle simple document structure, but are too brittle and cannot handle more complex structures. His examples centered on using various NLP techniques to parse document structure and the kinds of failures that might break traditional approaches to indexing structured documents.
The best example of an alternative approach that Jamie described was work on indexing PubMed that used a relational schema combining elements of a traditional document (author, title, abstract, journal, etc.) with more meta-level information such as gene-to-gene relationships. (Unfortunately, I wasn’t able to catch the reference to this work; stay tuned.)
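To make the idea concrete, here is a minimal sketch of what such a hybrid schema might look like, with bibliographic fields alongside a mined-relations table. All table and column names are my own invention, not from the work Jamie cited.

```python
import sqlite3

# Hypothetical schema: traditional document fields plus a table of
# meta-level relations (e.g. gene-to-gene) mined from the text and
# linked back to the documents that support them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        doc_id   INTEGER PRIMARY KEY,
        title    TEXT,
        authors  TEXT,
        journal  TEXT,
        abstract TEXT
    );
    CREATE TABLE gene_relations (
        gene_a   TEXT,
        gene_b   TEXT,
        relation TEXT,
        doc_id   INTEGER REFERENCES documents(doc_id)
    );
""")
conn.execute("INSERT INTO documents VALUES "
             "(1, 'BRCA1 and TP53 interaction', 'Smith et al.', 'J. Example', '...')")
conn.execute("INSERT INTO gene_relations VALUES "
             "('BRCA1', 'TP53', 'interacts_with', 1)")

# A single query can now mix bibliographic and relational constraints.
rows = conn.execute("""
    SELECT d.title, r.relation
    FROM documents d JOIN gene_relations r ON d.doc_id = r.doc_id
    WHERE r.gene_a = 'BRCA1'
""").fetchall()
```

The point is that the "index" is no longer just terms and postings: the application can ask for documents by what they assert, not only by what words they contain.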
He concluded with a call for research on this new class of applications that combine multiple forms of knowledge and language analysis and metadata and structure of varying reliability. These applications pose many interesting unsolved core IR problems, require diverse information resources to exploit, and create opportunities for new retrieval models.
While I found the talk interesting and inspiring, I think some of the kinds of indexing and ranking algorithms that he wants to see already exist. Two examples come to mind: Ancestry.com and PowerSet.
Ancestry.com’s search interface implements a rich and flexible schema that incorporates both exact matching and a flexible best-match approach. While this is not exactly what Jamie suggested, since it solves a direct search problem, it does address some of the infrastructure requirements for flexible search.
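A toy version of that exact-plus-best-match idea might look like the following; the record fields, weights, and similarity measure are all my own illustrative choices, not Ancestry.com’s actual scoring.

```python
from difflib import SequenceMatcher

# Hypothetical genealogy-style records.
records = [
    {"surname": "Johnson", "birth_year": 1872},
    {"surname": "Jonsson", "birth_year": 1871},
    {"surname": "Smith",   "birth_year": 1872},
]

def score(record, surname, birth_year):
    # An exact field match scores highest; otherwise fall back to a
    # fuzzy string similarity so near-misses (e.g. spelling variants
    # common in genealogical data) still rank reasonably.
    name_score = (1.0 if record["surname"] == surname
                  else SequenceMatcher(None, record["surname"], surname).ratio())
    year_score = 1.0 if record["birth_year"] == birth_year else 0.0
    return 0.7 * name_score + 0.3 * year_score  # illustrative weights

ranked = sorted(records, key=lambda r: score(r, "Johnson", 1872), reverse=True)
```

Here an exact "Johnson" born 1872 outranks the near-miss "Jonsson", which in turn outranks an unrelated surname, without any query failing outright.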
As I understand it, PowerSet’s approach is to extract (subject, verb, object) triples from documents and then index them in a more traditional manner. This seems to be a step in the right direction, and fallback strategies for compensating for parser failures should make it more robust. Its incorporation into Bing for Wikipedia search is further indication of the viability of this approach.
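The triple-indexing idea can be sketched in a few lines; this is a minimal inverted-index-style illustration of my own, not PowerSet’s implementation, and it assumes the triples have already been produced by a parser.

```python
from collections import defaultdict

# Hypothetical (subject, verb, object) triples, as a parser might emit.
triples = [
    ("Einstein", "developed", "relativity"),
    ("Einstein", "won", "Nobel Prize"),
    ("Curie", "won", "Nobel Prize"),
]

# Index each slot separately, so a query can constrain any combination
# of subject, verb, and object, much like terms in a standard index.
index = {"subject": defaultdict(set),
         "verb": defaultdict(set),
         "object": defaultdict(set)}
for i, (s, v, o) in enumerate(triples):
    index["subject"][s].add(i)
    index["verb"][v].add(i)
    index["object"][o].add(i)

def query(**slots):
    """Return triples matching all specified slots, e.g. query(verb='won')."""
    hits = set(range(len(triples)))
    for slot, value in slots.items():
        hits &= index[slot][value]
    return [triples[i] for i in sorted(hits)]
```

Once triples are in this form, the rest of the machinery (postings, intersection, ranking) looks like ordinary retrieval, which is presumably the appeal.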