Jeremy Pickens contributed to this post.
Jeremy did a great job of presenting our Reverted Indexing paper, but the short session made it difficult to answer all questions and comments thoroughly. For example, William Webber wrote up a post summarizing our work, in which he observed
The authors surmise that the reverted index is more effective because it suggests more selective expansion terms, and they reproduce example term sets as evidence. This explanation is convincing enough as far as it goes; but what is not explained is why the reverted index’s expansion terms are more selective. The reason is not obvious. A single-term reverted index is not much more than a weighted direct index, mapping from documents to the terms that occur in them
I would like to address his comments because this is a key aspect of Reverted Indexing.
The key to understanding how a Reverted Index assigns weights is the following mapping:
- The number of queries that retrieved a particular document becomes the document frequency statistic in the reverted index.
- The sum of the retrieval scores of all docids retrieved by a Basis Query becomes the reverted document length
In general, basis queries that have a high probability/score for retrieving a particular docid are preferred. But if those basis queries retrieve a lot of other documents as well, the relative contribution of that score is lowered (due to a higher reverted document length). Therefore, docids that are retrieved by more specific basis queries (with lower DF) get preferred by the ranking algorithm. Terms with high DF (such as Africa and African in the example from the paper) tend not to get selected by the expansion algorithm because of the cutoff that William mentioned in his post, but also because of the reverted document length component. The cutoff is applied to a list of expansion terms (basis queries) ranked by their weights. Increasing the cutoff would serve only to increase the reverted document length, and therefore lower the relative contribution of these documents.
In short, the more specific expansion terms are selected over the more general terms by leveraging existing weighting frameworks that prefer “shorter” documents in the reverted index, which are roughly a function of lower document frequencies in the standard index, i.e.,
rev_tf / rev_doclength <-> score / sum_of_scores
Note that there is also a third factor at play in the reverted indexes: Reverted IDF.
Finally, these more specific terms also result in efficiency gains because the inverted lists associated with more specific expansion terms are shorter.