Reverted Indexing

on

Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its relative frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify the matching documents for a query.

In a paper to be published at CIKM 2010, Jeremy Pickens, Matt Cooper and  I describe a way of using the inverted index to associate document ids with the queries that retrieve them. Our approach combines the inverted index with the notion of retrievability to create an efficient query expansion algorithm that is useful for a number of applications, including relevance feedback. We call this kind of index a reverted index because rather than mapping terms onto documents, it maps document ids onto queries that retrieved the associated documents.

Here’s how it works:

First, we build a regular inverted index.

Then we create a collection of queries, which we call basis queries. Basis queries can be created from query logs, or they can be synthesized in some manner. One naive approach that works reasonably well is to use terms from the inverted index that have a minimum DF.

We evaluate each basis query against the inverted index using some ranking algorithm. (Pick one!) For each query, we record the ranked list of retrieved document ids in the reverted index. In other words, we create a virtual document with id=basis query that contains for its terms ids of documents that were retrieved by that query. Putting all this into the reverted index (which is just a second instance of an inverted index) creates a new structure that can be queried with document ids of documents from the inverted index, and will create a ranked list of queries that are effective at retrieving the specified documents.

Once you’ve identified these queries, they can be used for any other purpose such as offering term suggestions, expanding queries through pseudo-relevance feedback or through user-controlled relevance feedback, etc.

How well does this work? Read the paper for the full details, but the short answer is that our query expansion technique outperforms PL2 and Bose-Einstein algorithms (as implemented in Terrier) by 15-20% on several TREC collections. This is just a first stab at implementing and evaluating this indexing, but we are quite excited by the results.

8 Comments

  1. Twitter Comment


    RT @HCIR_GeneG: Posted “Reverted indexing” [link to post] #cikm2010

    Posted using Chat Catcher

  2. Twitter Comment


    RT @iadh: Interesting paper using Terrier RT @HCIR_GeneG Posted “Reverted indexing” #cikm2010 – [link to post]

    Posted using Chat Catcher

  3. Twitter Comment


    Interesting paper using Terrier RT @HCIR_GeneG Posted “Reverted indexing” #cikm2010 – [link to post]

    Posted using Chat Catcher

  4. Twitter Comment


    Posted “Reverted indexing” [link to post] #cikm2010

    Posted using Chat Catcher

  5. Twitter Comment


    “ReIndex” …A type of query index, looking forward to the paper [link to post] /via @HCIR_GeneG

    Posted using Chat Catcher

  6. Twitter Comment


    Reverted Index! link to paper on blog RT @christangrant “ReIndex” …A type of query index [link to post] /via @HCIR_GeneG

    Posted using Chat Catcher

  7. […] gives an extended summary of the paper on the FXPAL blog. In brief, a reverted index maps from documents to the queries which retrieve those documents, […]

  8. […] slides from our talk at CIKM 2010 last week. More details on reverted indexing can be found in an earlier post and on the FXPAL site, the full paper is available here, and the previous post describes why the […]

Comments are closed.