I am trying to understand how Google patent search works, and am encountering some quite odd behavior. I am not talking about the inventor search bug (which is still un-fixed), but about Boolean logic.
If I run the query [“information retrieval”], the system retrieves 323 documents. Similarly, [“dynamic hypertext”] retrieves 368 documents. The combination, [“information retrieval” “dynamic hypertext”] yields 16. Putting a plus in front of either quoted phrase does not affect the results. So far, this seems reasonable.
But it stops looking reasonable once you ask the following questions:
- Why does the query [“information retrieval” OR “dynamic hypertext”] return only 294 documents? You would think it should produce 323 + 368 – 16 = 675 results.
- Why does the query [“information retrieval” -“dynamic hypertext”] return 326 results when you might expect 323 - 16 = 307?
- How did the number of results go up when a more restrictive clause was added?
- Why does transposing the terms ([-“dynamic hypertext” “information retrieval”]) return 324 documents?
- The query [-“information retrieval” “dynamic hypertext”] returns 352, which is consistent (368 - 16 = 352). Why, then, does transposing the terms ([“dynamic hypertext” -“information retrieval”]) return 353, again adding one document?
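The arithmetic behind these questions can be checked mechanically. The sketch below is only an illustration: it takes the three base counts reported above at face value, derives what ordinary set algebra would predict for each compound query, and compares that with what Google returned. The labels `A` and `B` and the function name `expected_counts` are my own shorthand, not anything Google exposes.

```python
# Counts reported in this post; A = "information retrieval", B = "dynamic hypertext".
A = 323        # documents matching A alone
B = 368        # documents matching B alone
A_AND_B = 16   # documents matching both phrases

# What Google actually returned for each compound query.
observed = {
    "A OR B": 294,
    "A -B": 326,
    "-B A": 324,
    "-A B": 352,
    "B -A": 353,
}

def expected_counts(a, b, a_and_b):
    """Counts implied by set algebra, assuming the three base counts are right."""
    return {
        "A OR B": a + b - a_and_b,  # inclusion-exclusion
        "A -B": a - a_and_b,
        "-B A": a - a_and_b,        # term order should not matter
        "-A B": b - a_and_b,
        "B -A": b - a_and_b,
    }

expected = expected_counts(A, B, A_AND_B)
discrepancies = {q: observed[q] - expected[q] for q in observed}

for q, diff in discrepancies.items():
    print(f"{q:7s} expected {expected[q]:3d}  observed {observed[q]:3d}  diff {diff:+d}")

# Two bounds that hold regardless of how large the overlap really is:
union_ok = observed["A OR B"] >= max(A, B)   # a union cannot be smaller than either set
minus_ok = observed["A -B"] <= A             # a difference cannot be larger than the set
print("union bound holds:", union_ok)
print("difference bound holds:", minus_ok)
```

The last two checks do not depend on the reported overlap of 16 at all: the OR query returns fewer documents than either phrase alone, and the first exclusion query returns more documents than the phrase it restricts.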
So what is the ground truth? How many matching documents are there, really? I decided to cross-check these numbers with USPTO searches. USPTO seems to use a Boolean search system, so I figured I would compare the results. I searched the USPTO for the two phrases in the title (TTL), abstract (ABST), description (SPEC), and claims (ACLM) fields like this:
TTL/"dynamic hypertext" or SPEC/"dynamic hypertext" or ABST/"dynamic hypertext" or ACLM/"dynamic hypertext"
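If you want to repeat this for other phrases, the query string is easy to generate. This is a minimal sketch that only assembles text in the form shown above; the helper name `any_field_query` is hypothetical, and it does not send anything to the USPTO site.

```python
# Field codes used in the query above: title, description, abstract, claims.
FIELDS = ("TTL", "SPEC", "ABST", "ACLM")

def any_field_query(phrase, fields=FIELDS):
    """Build a query matching the quoted phrase in any of the given fields."""
    return " or ".join(f'{field}/"{phrase}"' for field in fields)

query = any_field_query("dynamic hypertext")
print(query)
```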
The table below summarizes what I found:
|Query||USPTO count||Google count|
|“dynamic hypertext” and
For typical precision-oriented web search, this sort of funny math doesn’t matter: you’re not interested in all the results, just in one, any one, that matches your information need. That is not the case for patent search, which has a significant recall-oriented component: it can be important to find every single document that matches your criteria. In such situations, apparently, one might not wish to rely on Google’s algorithms or on its lack of transparency.
Note: If you repeat this experiment, you are likely to get slightly different counts due to the varying availability of different parts of the index over time. Nonetheless, this variation either should not account for the logical inconsistencies in the result sets, or it should be considered a bug for recall-oriented tasks on collections such as the patent database.