Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, the index stores a list of the documents in which that term occurs, along with the term's frequency in each document. Retrieval algorithms then use these term frequencies, alongside other collection statistics, to identify the documents that match a query.
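As a minimal sketch (whitespace tokenization and lowercasing are simplifying assumptions, not part of any real system's analysis chain), an inverted index is just a map from each term to its postings:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the documents containing it and the
    term's frequency within each document."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive tokenization
            index[term][doc_id] += 1
    return index

docs = {1: "the cat sat", 2: "the dog sat on the mat"}
index = build_inverted_index(docs)
print(dict(index["the"]))  # {1: 1, 2: 2}
```

A production index also stores positions, skip lists, and compressed postings, but this doc-id-to-frequency mapping is the part the rest of the post builds on.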
In a paper to be published at CIKM 2010, Jeremy Pickens, Matt Cooper, and I describe a way of using the inverted index to associate document ids with the queries that retrieve them. Our approach combines the inverted index with the notion of retrievability to create an efficient query expansion algorithm that is useful for a number of applications, including relevance feedback. We call this kind of index a reverted index because, rather than mapping terms onto documents, it maps document ids onto the queries that retrieve those documents.
Here’s how it works:
First, we build a regular inverted index.
Then we create a collection of queries, which we call basis queries. Basis queries can be created from query logs, or they can be synthesized in some manner. One naive approach that works reasonably well is to use terms from the inverted index that meet a minimum document frequency (DF) threshold.
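The naive approach above can be sketched in a few lines; the `min_df` threshold here is an assumed tuning knob, not a value from the paper:

```python
def select_basis_queries(index, min_df=2):
    """Turn vocabulary terms into one-term basis queries, keeping
    only terms whose document frequency (DF) meets a threshold."""
    return sorted(t for t, postings in index.items() if len(postings) >= min_df)

# Toy inverted index: term -> {doc_id: term frequency}
index = {"cat": {1: 1, 2: 1}, "dog": {2: 1}, "sat": {1: 1, 2: 1, 3: 1}}
print(select_basis_queries(index))  # ['cat', 'sat']
```

Dropping the lowest-DF terms keeps the basis-query set (and thus the number of retrieval runs) manageable while still covering most of the collection.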
We evaluate each basis query against the inverted index using some ranking algorithm. (Pick one!) For each query, we record the ranked list of retrieved document ids in the reverted index. In other words, we create a virtual document whose id is the basis query and whose "terms" are the ids of the documents that query retrieved. Putting all of this into the reverted index (which is just a second instance of an inverted index) creates a new structure that can be queried with the document ids of documents from the inverted index, and will produce a ranked list of queries that are effective at retrieving the specified documents.
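The construction step might look like the following sketch. The `search` callable stands in for whatever ranking algorithm you picked, and the 1/rank weighting is an illustrative choice, not the scoring used in the paper:

```python
from collections import defaultdict

def build_reverted_index(basis_queries, search):
    """Run each basis query through search() (any ranking function
    returning doc ids, best first) and post the query under every
    document it retrieves, weighted here by reciprocal rank."""
    reverted = defaultdict(dict)
    for q in basis_queries:
        for rank, doc_id in enumerate(search(q), start=1):
            reverted[doc_id][q] = 1.0 / rank  # top-ranked docs weigh more
    return reverted

# Toy ranking function: a canned ranked list per basis query
ranked = {"cat": [1, 2], "sat": [2, 3, 1]}
reverted = build_reverted_index(["cat", "sat"], ranked.get)
# reverted[1] holds the queries that retrieved document 1, with weights
```

Note the symmetry: the result has exactly the shape of the inverted index in the first step, with document ids playing the role of terms and basis queries playing the role of documents.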
Once you’ve identified these queries, they can be used for a variety of purposes: offering term suggestions, expanding queries through pseudo-relevance feedback or through user-controlled relevance feedback, and so on.
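For instance, in a relevance-feedback setting you might take the documents a user has marked relevant and look them up in the reverted index; summing the weights is one simple way to combine the evidence (an assumption on my part, not the paper's exact scoring):

```python
from collections import defaultdict

def suggest_queries(reverted, relevant_doc_ids, k=3):
    """Score each basis query by how strongly it retrieved the given
    (e.g. user-judged relevant) documents; the top k are candidates
    for term suggestion or query expansion."""
    scores = defaultdict(float)
    for d in relevant_doc_ids:
        for q, w in reverted.get(d, {}).items():
            scores[q] += w
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy reverted index: doc_id -> {basis query: weight}
reverted = {1: {"cat": 1.0, "sat": 0.3}, 2: {"cat": 0.5, "sat": 1.0}}
print(suggest_queries(reverted, [1, 2]))  # ['cat', 'sat']
```

The returned queries (or their terms) can then be appended to the user's original query before re-running the search.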
How well does this work? Read the paper for the full details, but the short answer is that our query expansion technique outperforms the PL2 and Bose-Einstein expansion models (as implemented in Terrier) by 15-20% on several TREC collections. This is just a first stab at implementing and evaluating this indexing approach, but we are quite excited by the results.