Parsing patents


Since Google announced its distribution of patents, I have been poking around the data trying to understand what’s in there and starting to index it for retrieval. The first challenge I’ve had to deal with is data formats. The second is how to display documents to users efficiently.

The full text of the patents is available in ZIP files, one file per week, based on the date the patents were granted. The files cover patents issued from 1976 to (as of this writing) the first week of 2010. In addition to the text, they contain all manner of metadata, such as when the patent was filed, who the inventors and assignees were, etc. Interestingly, the zipped-up files are in two different formats: patents from 2001 on are in XML, while earlier ones are in a funky ad hoc text format.


The XML format was easy to parse using standard tools, whereas the proprietary format required some specialized code. Despite that, the XML format proved the more problematic in some ways:

  • The DTD wasn’t really available in machine-readable form; the closest I could find was this page on the USPTO site
  • While both formats indicated the distinction between the patent’s summary and its details, the XML format chose to do it through processing instructions rather than via containment. Unfortunately, standard implementations of the DefaultHandler (such as the XmlSlurper I was using) drop processing instructions on the floor. Using that data would have required a complete rewrite of the handler. In the end, I decided to fudge it and added a pre-processing step that scanned for the specific processing instructions that bracketed the summary and details sections of the patent and replaced them with elements that the XML parser would represent properly. This hack cost me DTD compliance, however, so I had to turn validation off.
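The pre-processing step described above can be sketched as a simple textual substitution before the XML is handed to the parser. This is a minimal illustration, not the original code, and the processing-instruction names used here are assumptions for the sake of the example (the actual USPTO PI names differ):

```python
import re

# Map each section-bracketing processing instruction to a real element.
# NOTE: the PI names below are illustrative placeholders, not the actual
# names used in the USPTO XML.
PI_MAP = {
    r'<\?summary\s+begin\?>':              '<summary>',
    r'<\?summary\s+end\?>':                '</summary>',
    r'<\?detailed-description\s+begin\?>': '<details>',
    r'<\?detailed-description\s+end\?>':   '</details>',
}

def preprocess(xml_text):
    """Replace section-bracketing PIs with elements so a standard SAX/DOM
    handler will represent the sections via containment. Since the new
    elements are not in the DTD, validation must be turned off."""
    for pattern, element in PI_MAP.items():
        xml_text = re.sub(pattern, element, xml_text)
    return xml_text
```

The trade-off mentioned above falls out of this directly: the substituted elements are not declared in the DTD, which is why validation has to be disabled.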

For those curious about the text format, it contains stuff like this:

WKU  D02428814
SRC  5
APN  611301&
APT  4
ART  292
APD  19750908
TTL  Diver's helmet
ISD  19770104
NCL  1
ECL  1
EXP  Feifer; Melvin B.
NDR  2
NFG  6
TRM  14
NAM  Jones; Richard F.
CTY  Santa Barbara

The file is organized into field codes with associated values (e.g., TTL is the title). Values longer than a single line continue on the next line, which starts with a few spaces to mark it as a continuation. Field codes are grouped into sections identified by four-character codes (e.g., INVT for the inventor section, whose NAM and CTY sub-fields appear above), with three-character codes for sub-fields. In short, it’s pretty easy to parse. In addition, it clearly sets out the summary (BSUM) and the details (DETD) fields.
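To make the "easy to parse" claim concrete, here is a minimal sketch (not the actual parser used for the project) of reading the code/value lines with continuation handling:

```python
def parse_fields(lines):
    """Parse the pre-2001 text format: each line is a 3-4 character field
    code, padding, then a value; continuation lines begin with spaces and
    extend the previous field's value."""
    records = []
    for line in lines:
        if line.startswith(' '):            # continuation of previous value
            if records:
                code, value = records[-1]
                records[-1] = (code, value + ' ' + line.strip())
        elif line.strip():
            code, _, value = line.partition('  ')
            records.append((code.strip(), value.strip()))
    return records
```

Section markers like INVT come through as codes with empty values, so grouping fields into sections is a straightforward second pass.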

The reason this is important is that Xue and Croft found that the summary field appears to be the most effective for relevance feedback searches on patents.

The meaning of the codes is documented here, although the example is in SGML, rather than in the format described above. This SGML example does not have a one-to-one correspondence between the field codes and the element names, but the associated documentation lists the field codes and their definitions.

One interesting (aka bizarre) aspect of both formats is that the patent number itself is encoded in the WKU field in a manner that does not correspond directly to the value you see on an actual patent! I reverse-engineered the actual patent number with the following regex:


which looks for an optional prefix, then a zero, then the 5-7 digit patent number, and then another digit (which seems to be a checksum of some kind, and which I discard). The prefix and the patent number, concatenated, produce the patent number that the USPTO search engine recognizes. Of course, this doesn’t handle patent applications; those need to be processed separately.
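The regex itself did not survive in this copy of the post. As a hedged reconstruction from the description above (the pattern and helper name here are mine, not necessarily the original), something like this matches the WKU in the sample data:

```python
import re

# Reconstruction from the prose description: optional letter prefix,
# a literal zero, the 5-7 digit patent number, then a discarded check digit.
WKU_RE = re.compile(r'^([A-Z]{0,2})0(\d{5,7})\d$')

def wku_to_patent_number(wku):
    """Convert a WKU field value into the prefix + number form that the
    USPTO search engine recognizes; returns None if the value doesn't match."""
    m = WKU_RE.match(wku.strip())
    if m is None:
        return None
    prefix, number = m.groups()
    return prefix + number          # check digit is discarded
```

For the sample above, `D02428814` yields `D242881`, the form a design patent number takes in USPTO search.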


My goal in indexing these patents is to experiment with retrieval algorithms and user interfaces, but I don’t want to actually host all the data. While I can store and display the text associated with each patent, I don’t really want to store all the PDFs or drawings or page images, as that will consume all those terabytes Google was so proud of. Unfortunately, I have not yet figured out how to get Google or the USPTO to serve up a PDF or a page image for a specific patent that I can identify with a patent number.

While Google does provide a JavaScript API for searching its patents (among other collections), the API is rather limited, and I am not sure how to get it to return a PDF of a specific patent reliably.

It would be a terrible waste of resources to host that stuff and to serve it through a single point in the network. That leaves me with a few possibilities, including figuring out a way to obtain documents from a third party based on the patent number, or joining some kind of consortium that would host the documents (for research purposes only!).

I wonder if NIST is willing to do this.


  1. Twitter Comment

    Posted “Parsing patents” [link to post]

    Posted using Chat Catcher

  2. Twitter Comment

    RT @HCIR_GeneG: Posted “Parsing patents” [link to post]

    Posted using Chat Catcher

  3. […] poking around on the USPTO and Google to try to figure out how to get single PDF documents for my indexing project, I discovered that the Google advanced search interface won’t retrieve any documents based on […]

  4. […] Working on parsing and indexing the patent collection that Google made available has been an interesting education in just how noisy allegedly clean data really is, and in the scale of the collection. I am by no means done; in fact, I’ve had to start over a couple of times. I have learned a few things so far, in addition to my earlier observations. […]

  5. mpatton says:

    I have been involved with automated d as a patent attorney/hobbyist for some years. If you are still working on the project I could help you generate URLs to fetch individual pages from the PTO or from the European Patent Office website.

    Please email me if you are interested or would like to compare notes, as I am just starting work on my own parser. I worked on a similar parser for the PTO’s trademark data 10-12 years ago at my old law firm.

Comments are closed.