Boolean illogic

I am trying to understand how Google patent search works, and am encountering some quite odd behavior. I am not talking about the inventor search bug (which is still unfixed), but about Boolean logic.

If I run the query [“information retrieval”], the system retrieves 323 documents. Similarly, [“dynamic hypertext”] retrieves 368 documents. The combination, [“information retrieval” “dynamic hypertext”] yields 16. Putting a plus in front of either quoted phrase does not affect the results. So far, this seems reasonable.

This seems reasonable until you ask the following questions:

So what is the ground truth? How many matching documents are there, really? I decided to cross-check these numbers with USPTO searches. USPTO seems to use a Boolean search system, so I figured I would compare the results. I searched the USPTO for the two phrases in the title (TTL), abstract (ABST), description (SPEC), and claims (ACLM) fields like this:


TTL/"dynamic hypertext" or SPEC/"dynamic hypertext" or ABST/"dynamic hypertext" or ACLM/"dynamic hypertext"
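For anyone repeating the experiment, the field-by-field query above can be generated rather than typed by hand. This is a minimal Python sketch: the function names are hypothetical, and the parenthesized "and" form for combining two phrases is my assumption about the USPTO query syntax, not something verified here.

```python
# Field codes used in the query above: title, description, abstract, claims.
FIELDS = ("TTL", "SPEC", "ABST", "ACLM")


def uspto_phrase_query(phrase, fields=FIELDS):
    """Build a query matching a quoted phrase in any of the given fields."""
    return " or ".join(f'{field}/"{phrase}"' for field in fields)


def uspto_all_phrases_query(phrases, fields=FIELDS):
    """Require every phrase to appear in at least one field.

    Assumption: parentheses group the per-phrase clauses and "and"
    joins them. Check this against the search system before relying on it.
    """
    return " and ".join(f"({uspto_phrase_query(p, fields)})" for p in phrases)


if __name__ == "__main__":
    print(uspto_phrase_query("dynamic hypertext"))
    print(uspto_all_phrases_query(["dynamic hypertext", "information retrieval"]))
```

The first call reproduces the query shown above; the second is the hypothetical combined form.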

The table below summarizes what I found:

Query                                              USPTO count   Google count
“dynamic hypertext”                                        221            368
“information retrieval”                                   7489            323
“dynamic hypertext” and “information retrieval”             10             16
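The counts in the table can be checked mechanically. The Python sketch below uses hypothetical names and treats the USPTO counts as the reference point, which is an assumption on my part. It applies two tests: under Boolean semantics a conjunction can never match more documents than either of its terms, and two engines searching the same collection should report counts in roughly the same proportion for every query.

```python
# Counts copied from the table above.
COUNTS = {
    "dynamic hypertext": {"uspto": 221, "google": 368},
    "information retrieval": {"uspto": 7489, "google": 323},
    "both": {"uspto": 10, "google": 16},
}


def conjunction_is_plausible(count_a, count_b, count_both):
    """Boolean AND cannot return more documents than either operand."""
    return count_both <= min(count_a, count_b)


def google_to_uspto_ratio(row):
    """Google's count relative to the USPTO count for one query."""
    return row["google"] / row["uspto"]


if __name__ == "__main__":
    for engine in ("uspto", "google"):
        ok = conjunction_is_plausible(
            COUNTS["dynamic hypertext"][engine],
            COUNTS["information retrieval"][engine],
            COUNTS["both"][engine],
        )
        print(f"{engine}: conjunction within bounds = {ok}")

    for query, row in COUNTS.items():
        print(f"{query}: Google/USPTO = {google_to_uspto_ratio(row):.2f}")
```

Each engine passes the first test on its own numbers. The second test is where the trouble shows: Google reports about two thirds more documents than the USPTO for one phrase and the conjunction, but only a few percent of the USPTO count for the other phrase.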

For typical precision-oriented web search, this sort of funny math doesn’t matter: you’re not interested in all the results, just in one, any one, that matches your information need. Patent search is different. It has a significant recall-oriented component, where it is in fact important to find every single document that matches your criteria. In such situations, apparently, one might not wish to rely on Google’s algorithms, given their lack of transparency.

Note: If you repeat this experiment, you are likely to get slightly different counts due to the varying availability of different parts of the index over time. Nonetheless, this should either not account for the logical inconsistencies in the result sets, or it should be considered a bug for recall-oriented tasks on collections such as the patent database.

5 Comments

  1. […] This post was mentioned on Twitter by Gene Golovchinsky, TheSource Newsletter. TheSource Newsletter said: Boolean knowledge from Gene Golovchinsky: FXPAL Blog » Blog Archive » Boolean illogic http://bit.ly/9lzCWU #sourcecon […]

  2. Nice analysis. I found a reproducible bug in their stemmer a while back, which they eventually fixed. The page counts have always been a joke, but it looks like you actually clicked through to get the real counts (bravo).

    Ironically, a Google recruiter called my wife in for an interview based on finding her online resume a few years back. The Google recruiter said they were hiring for a QA position. She used to do QA and thought working at Google might be fun.

    She went to the interview in NY, the gist of which was, “We’re too smart to need QA. Why are you here?”. She never did figure out why they called her in the first place. Nor why they’ve called her back every so often since then to ask if she might’ve changed her mind about being interested in a QA position.

  3. I would be happy if someone in the know would explain the rationale behind this behavior. The lack of QA may be attributed to the “beta” status, but this seems more like a fundamental flaw in their conception of how retrieval should work, rather than a simple lack of unit testing.

  4. Hi Gene,
    I can’t offer any explanation…but do agree with you regarding the need for recall in patent searching. In my previous role many years ago working with a patent database supplier the end users were (and still may be) the most demanding users for the “right” answer from their IR systems; a wrong answer could be very expensive.

  5. I wonder then if anyone uses Google Patent search as the sole source for patent research, or if people use other (commercial) services or the USPTO site. Based on my little exploration, I would be reluctant to trust Google for search, although once you have the patent number, using their PDFs is a lot nicer than the USPTO image viewer.
