Finding vs. retrieving


Having stumbled onto the IR Museum on the SIGIR web site, I decided to investigate why I had not come across it before. I missed the SIGIR announcement about the museum when it was made in 2008, and since that item was no longer on the first page of the web site, I didn’t find it by browsing. Site search might have helped, but there wasn’t any.

So I tried searching the web.

For the query IR museum, Yahoo! returned a bunch of hits on Iran. I looked through the top 200 hits, and found no mention of information retrieval or of the term infra-red. Bing also tried to tell me about Iran, with somewhat more diverse results than Yahoo! did. On the 3rd page it even produced a link to my earlier post. Of course it showed me another link to the same post on the 4th page as well. So much for duplicate detection. Google had reasonably diverse results on the first page, listing my blog post at position 12, but not producing anything else relevant in the top 200 hits. It did find an IR museum, but not the right one.

For the query “IR museum” (in quotes), Google came up with my earlier post at positions 6 and 25, but did not find the SIGIR site in the top 200 results. Yahoo! results still focused predominantly on Iran; nothing about information retrieval was retrieved in the top 200 documents. Bing found my post at rank 14, a tweet referencing my post indexed into someone’s blog at positions 46 and 47, and a PDF file from a 2004 NKOS workshop with a spurious conjunction of the terms IR and museum at ranks 79 and 102.

Each of these search engines returned the correct document at rank #1 for the query SIGIR Museum.
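The bookkeeping behind this kind of check (at which ranks does a given page show up, and does an engine list the same page twice?) is easy to script. Here is a minimal sketch, assuming you already have an engine's ranked results as a plain list of URLs; how you collect that list is left out. The function names, the crude URL normalization, and the example URLs are all made up for illustration, not taken from the searches described above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Crude normalization: drop scheme and fragment, lowercase host, strip trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.netloc.lower()}{path}{query}"


def positions_of(results, target):
    """Return the 1-based ranks at which `target` appears in a ranked list of URLs."""
    wanted = normalize(target)
    return [i for i, url in enumerate(results, start=1) if normalize(url) == wanted]


def duplicates(results):
    """Map each normalized URL that appears more than once to all of its 1-based ranks."""
    seen = defaultdict(list)
    for i, url in enumerate(results, start=1):
        seen[normalize(url)].append(i)
    return {url: ranks for url, ranks in seen.items() if len(ranks) > 1}


# Hypothetical result list, for illustration only.
example_results = [
    "http://example.com/iran",
    "http://example.org/post/",
    "http://example.net/x",
    "https://Example.org/post",
]

print(positions_of(example_results, "http://example.org/post"))  # [2, 4]
print(duplicates(example_results))  # {'example.org/post': [2, 4]}
```

Real duplicate detection is harder than this, since the same content can live at unrelated URLs, but even a check this simple would flag the same post listed on two consecutive result pages.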

So what have I learned from this?

  1. My results were clearly not sufficiently biased by prior search activity to help any one of the search engines disambiguate the keyword IR (Iran vs. infra-red vs. Information Retrieval vs. Interventional Radiology).
  2. The results were not particularly diverse. Even if the sites supported explicit relevance feedback, I would have had to examine too many documents to find anything useful.
  3. The search engines return really different stuff for somewhat off-beat queries. Trying multiple search engines may be useful in these cases.
  4. The term IR has very low retrievability, in the sense used by Azzopardi and Vinay.
  5. The internet keeps being changed by writing about it, so experiments are difficult to repeat. By the time you read this, each search engine may well produce completely different results, and if you link to this post (please!) that may affect rankings even more.
  6. Relying on an internet search engine to help users find content in an otherwise obscure site is not a winning strategy.
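For readers who have not met the retrievability idea mentioned in point 4: as I understand Azzopardi and Vinay's cumulative measure, a document's score is the (optionally weighted) number of queries for which it is ranked at or above some cutoff c. Here is a rough sketch under that reading. The function name, the cutoff values, and the toy rankings are my own illustrative assumptions; the document ids are invented and are not the results reported above.

```python
def retrievability(rankings, cutoff=10, weights=None):
    """Cumulative retrievability: for each document, sum the weights of the
    queries that rank it within the top `cutoff` positions.

    rankings: dict mapping a query string to a ranked list of document ids.
    weights:  optional dict mapping a query to its weight (default 1.0 each).
    """
    scores = {}
    for query, ranked in rankings.items():
        weight = 1.0 if weights is None else weights.get(query, 1.0)
        # set() so a document listed twice for one query is counted once
        for doc in set(ranked[:cutoff]):
            scores[doc] = scores.get(doc, 0.0) + weight
    return scores


# Toy data, invented for illustration only.
example_rankings = {
    "ir museum": ["iran-page-a", "blog-post"],
    '"ir museum"': ["blog-post", "iran-page-b"],
    "sigir museum": ["sigir-museum", "blog-post"],
}

scores = retrievability(example_rankings, cutoff=2)
print(scores["blog-post"])     # 3.0
print(scores["sigir-museum"])  # 1.0
```

In this toy setup the museum page is reachable through only one of the three queries, which is the kind of imbalance the measure is meant to expose.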

3 Comments

  1. […] This post was mentioned on Twitter by Gene Golovchinsky, Tatsumi Kobayashi. Tatsumi Kobayashi said: interesting analysis RT @HCIR_GeneG: Posted “Finding vs. Retrieving” http://palblog.fxpal.com/?p=3791 […]

  2. The IR Museum site is not really linked anywhere, is it? Orphaned pages are obviously more difficult to rank nowadays.

    As far as Google goes, the fact that the page takes an eternity to load on my modest connection could be a factor. Does it make a difference that most (if not all) of the interesting content on the IR Museum page is in Flash format?

  3. Neel, I had thought of adding a bit about linkage, but vacation got in the way :-) Yahoo does show that there are 19 (as of today) inbound links to the museum. Perhaps the problem is collective obscurity.
