Recall vs. Precision


Stephen Robertson’s talk at the CIKM 2011 Industry Event caused me to think about recall and precision again. Over the last decade, precision-oriented searches have become synonymous with web searches, while recall has been relegated to narrow verticals. But is precision@5 or NDCG@1 really the right way to measure the effectiveness of interactive search? If you’re doing a known-item search, looking up a common factoid, etc., then perhaps it is. But for most searches, even ones that might be classified as precision-oriented, the searcher might need several attempts to get at the answer. Dan Russell’s A Google a Day lists exactly those kinds of challenges: find a fact that’s hard to find.
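Since the argument keeps returning to these per-query metrics, here is a minimal sketch of how precision@k and NDCG@k are typically computed from a single ranked result list (the relevance grades below are made up for illustration):

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (rels: 0/1 or graded list)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG normalized by the ideal (sorted-descending) ordering of the same gains."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

ranked = [3, 0, 2, 0, 1]          # graded relevance of results 1..5
print(precision_at_k(ranked, 5))  # 0.6 -- three of the five results are relevant
print(round(ndcg_at_k(ranked, 5), 3))  # 0.921 -- penalized for burying the grade-2 doc
```

Both metrics look only at a single ranked list, which is exactly why they fit some information needs and not others.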

So how should we think about evaluating the kinds of searches that take more than one query, ones we might term session-based searches?

This session-oriented search suggests that we need to distinguish between queries and information needs when we talk about interactive search. Thus we have recall- and precision-oriented information needs. What about the queries? Wouldn’t you always want to have precision-oriented queries, giving the person the best information high in the results list?

In an exploratory search task that involves learning about the topic, not every query is intended to find a specific item; some queries are run to understand the field, to learn about the vocabulary, to test hypotheses, to get a sense for the scope of the collection, etc. While these queries may not retrieve any pertinent documents, they may help people understand which queries to run that will in fact find such documents. It might be better, therefore, to have these queries achieve high diversity rather than high precision.
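One way to make “high diversity” measurable is subtopic coverage: score a result list by how many distinct aspects of the topic its top-k results touch. A rough sketch, where the subtopic labels are hypothetical and a real system would need judged aspect assignments:

```python
def subtopic_recall_at_k(result_subtopics, all_subtopics, k):
    """Fraction of the known subtopics covered by the top-k results.

    result_subtopics: list of sets, the subtopics each ranked result touches.
    all_subtopics: the full set of subtopics for this information need.
    """
    covered = set()
    for subtopics in result_subtopics[:k]:
        covered |= set(subtopics)
    return len(covered & set(all_subtopics)) / len(all_subtopics)

# Hypothetical query about a product: three aspects the searcher might care about.
results = [{"history"}, {"history"}, {"pricing"}, {"reviews"}]
print(subtopic_recall_at_k(results, {"history", "pricing", "reviews"}, 3))  # 2/3
```

A redundant but precise list scores poorly here, which is the point: for these learning-oriented queries, coverage matters more than ranking every pertinent document first.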

Recall-oriented exploratory search shares the exploratory characteristics of learning and uncertainty with precision-oriented exploratory search, thereby requiring at least some high-diversity queries. In addition, it may benefit from both high-recall and high-precision queries for the various steps along the way.

In short, we have precision-oriented needs that might be satisfied by single high-precision queries (the head of the web search curve), precision-oriented needs that are satisfied through a combination of high-diversity and high-precision queries, and recall-oriented information needs that are satisfied through high-precision, high-diversity, and high-recall queries. Of course we also have straight recall-oriented info needs, which can be tackled one query at a time, or en masse.

We can summarize this as follows: the table below shows two dimensions of information need (certainty and scope) and a third dimension, the type of query.

| Certainty   | Scope     | High-precision | High-diversity | High-recall |
|-------------|-----------|----------------|----------------|-------------|
| Known       | Precision | X              |                |             |
| Exploratory | Precision | X              | X              |             |
| Known       | Recall    |                |                | X           |
| Exploratory | Recall    |                | X              | X           |

The design challenge for HCIR systems, then, is to diagnose or elicit the kind of information need a searcher has, and to bring to bear the appropriate ranking algorithms to identify results that help the person with the task.

The correlated challenge is how to evaluate session search in a way that doesn’t penalize queries that increase a searcher’s knowledge, but does reward continued progress toward the searcher’s goal. In terms of Marchionini’s keynote chart, we would like to have learning be represented by the green monotonic line, while we would be happy with marginal pertinence jumping around. Increased learning might come from high-precision queries in a known-item information need, but is more likely to result from high diversity and high recall queries.

This analysis suggests (to me!) that while NDCG@3 or precision@5 may be useful metrics for known-item, high-precision information needs, other metrics may be more appropriate for other scenarios. In a multi-query session, the contribution of individual queries is not as important as the evolution of the search process or the final outcome. We should resist the temptation to compute a Mean Average Precision (where the mean is computed over the queries in the session) or some other similar metric to assess the quality of the interaction.
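To see why the per-query mean misleads, consider a toy session (the relevance judgments are invented) in which two learning queries retrieve nothing pertinent before an informed final query succeeds:

```python
def average_precision(rels):
    """Average precision for one ranked list of binary relevance judgments."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels):
        if r:
            hits += 1
            total += hits / (i + 1)
    return total / hits if hits else 0.0

session = [
    [0, 0, 0, 0, 0],  # query 1: vocabulary probe, nothing relevant retrieved
    [0, 0, 0, 0, 0],  # query 2: hypothesis test, nothing relevant retrieved
    [1, 1, 1, 0, 0],  # query 3: informed query, top results all relevant
]
per_query = [average_precision(q) for q in session]
print(sum(per_query) / len(per_query))  # ≈ 0.333 -- the mean over queries
print(per_query[-1])                    # 1.0 -- the outcome the searcher cares about
```

Averaging over the three queries yields 1/3, even though the session ended in complete success; the early queries did their job by teaching the searcher what to ask.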

Instead, we should look to measures of learning (which, admittedly, may be hard to capture in the real world), and use set-based measures related to saved or bookmarked documents to assess the quantity of information that was retrieved. It’s possible that a system design that encourages people to mark useful or pertinent documents may yield log data that is more useful in assessing system (and user) performance compared to designs that only measure click-through rates. (For a nice discussion of one set of such possibilities, see Lad & Tunkelang (2011).)

We can then revisit the information needs described above, with an eye toward metrics:

| Certainty   | Scope     | Metrics                                                   |
|-------------|-----------|-----------------------------------------------------------|
| Known       | Precision | Prec@n, NDCG@n, MRR, etc.                                 |
| Exploratory | Precision | Time to completion, completion rate, bad abandonment rate |
| Known       | Recall    | Recall, # docs saved                                      |
| Exploratory | Recall    | # docs saved, measures of learning                        |
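As a rough illustration of the set-based, session-level side of this table, here is a sketch of recall computed over saved documents, alongside the reciprocal rank used for known-item needs (document IDs and judgments are hypothetical):

```python
def session_recall(saved_ids, relevant_ids):
    """Fraction of all relevant documents the searcher ended up saving."""
    relevant = set(relevant_ids)
    return len(set(saved_ids) & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(rels):
    """1 / rank of the first relevant result; the per-query component of MRR."""
    for i, r in enumerate(rels):
        if r:
            return 1.0 / (i + 1)
    return 0.0

print(session_recall({"d1", "d4"}, {"d1", "d2", "d4", "d7"}))  # 0.5
print(reciprocal_rank([0, 0, 1, 0]))                           # ≈ 0.333
```

Note that `session_recall` ignores which queries retrieved the saved documents, which is exactly the property a session-level, outcome-oriented measure should have.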

These are by no means complete, but might be an interesting place to start unpacking recall and precision.
