On term selection in reverted indexing

on

Jeremy Pickens contributed to this post.

Jeremy did a great job of presenting our Reverted Indexing paper, but the short session made it difficult to answer all questions and comments thoroughly. For example, William Webber wrote up a post summarizing our work, in which he observed

The authors surmise that the reverted index is more effective because it suggests more selective expansion terms, and they reproduce example term sets as evidence. This explanation is convincing enough as far as it goes; but what is not explained is why the reverted index’s expansion terms are more selective. The reason is not obvious. A single-term reverted index is not much more than a weighted direct index, mapping from documents to the terms that occur in them

I would like to address his comments because this is a key aspect of Reverted Indexing.

The key to understanding how a Reverted Index assigns weights is the following mapping:

  • The number of queries that retrieved a particular document becomes the document frequency statistic in the reverted index.
  • The sum of the retrieval scores of all docids retrieved by a Basis Query becomes the reverted document length

In general, basis queries that have a high probability/score for retrieving a particular docid are preferred.  But if those basis queries retrieve a lot of other documents as well, the relative contribution of that score is lowered (due to a higher reverted document length).  Therefore, docids that are retrieved by more specific basis queries (with lower DF) get preferred by the ranking algorithm. Terms with high DF (such as Africa and African in the example from the paper) tend not to get selected by the expansion algorithm because of the cutoff that William mentioned in his post, but also because of the reverted document length component.  The cutoff is applied to a list of expansion terms (basis queries) ranked by their weights. Increasing the cutoff would serve only to increase the reverted document length, and therefore lower the relative contribution of these documents.

In short, the more specific expansion terms are selected over the more general terms by leveraging existing weighting frameworks that prefer “shorter” documents in the reverted index, which are roughly a function of lower document frequencies in the standard index, i.e.,

rev_tf / rev_doclength <-> score / sum_of_scores

Note that there is also a third factor at play in the reverted indexes: Reverted IDF.

What is reverted IDF?  It is the transform of the number of basis query “reverted documents” in which a docid is found.  Docids that are retrieved by a lot of different basis queries will receive a lower “term” weight when doing a reverted query.  Therefore, terms that come from those documents will not be weighted as highly.  On the other hand, docids that are retrieved by relatively few basis queries will receive a higher term weight.
Now, maybe this is a bit of a stretch, because this isn’t the only reason why a docid would have fewer basis queries that retrieve it, but one important reason why a docid would have fewer basis queries that retrieve it is because the original document was shorter, with relatively few unique terms.  And maybe this is a bit of handwaving, and shouldn’t be believed without some sort of specific empirical study, but shorter documents a moderate number of unique terms tend to be more specific, more focused documents.
So you have this double whammy.  Basis queries that retrieve a lot of documents tend to be downweighted by the reverted doclength factor, whereas docids that are retrieved by a lot of basis queries tend to be downweighted by the reverted IDF factor. The combination of these two factors yields a strong, natural reason that the reverted queries prefer low DF terms.

Finally, these more specific terms also result in efficiency gains because the inverted lists associated with more specific expansion terms are shorter.

4 Comments

  1. Thanks for the clarification, Gene (and Jeremy). You’re right; I was confused about the source of IDF in the reverted index. I’ll clarify my understanding and blog again in the next couple of days.

  2. […] FXPAL Blog On technology and beyond! « On term selection in reverted indexing […]

  3. […] Gene and Jeremy have disputed two assertions about index reversion made in my previous post. The first is that (1,10) normalization strips term occurences of their IDF factor. The second is that the cutoff depth used in creating the reverted corpus sets an upper bound on reverted term DF. Let me say immediately that my second assertion is wrong, and was due to confusion on my part: cutoff depth constrains reverted document length, not reverted DF; the latter depends (as Gene points out) on how many terms a document is highly retrievable by. […]

  4. […] By william, on November 4th, 2010 Copied with permission from IREvalEtAl GD Star Ratingloading…Gene and Jeremy have disputed two assertions about index reversion made in my previous post. The first […]

Comments are closed.