Since Google announced its distribution of patents, I have been poking around the data trying to understand what’s in there and starting to index it for retrieval. The first challenge I’ve had to deal with is data formats. The second is how to display documents to users efficiently.
The full text of the patents is available in ZIP files, one file per week, based on the date patents were granted. The files cover patents issued from 1976 to (as of this writing) the first week of 2010. In addition to the text, they contain all manner of metadata such as when the patent was filed, who the inventors and assignees were, etc. Interestingly, the zipped up files are in two different formats: patents from 2001 on are in XML, while earlier ones are in a funky ad hoc text format.
The XML format was easy to parse using standard tools, whereas the older text format required some specialized code. Despite that, the XML format proved more problematic in some ways:
- The DTD wasn’t really available in machine-readable form; the closest I could find was this page on the USPTO site
- While both formats distinguish the patent’s summary from its details, the XML format does so through processing instructions rather than containment. Unfortunately, standard implementations of the DefaultHandler (such as the XmlSlurper I was using) drop processing instructions on the floor, and using that data would have required a complete rewrite of the handler. In the end, I decided to fudge it: I added a pre-processing step that scans for the specific processing instructions that bracket the summary and details sections of the patent and replaces them with elements that the XML parser represents properly. Unfortunately, this hack cost me DTD compliance, so I had to turn validation off.
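The pre-processing hack can be sketched roughly as follows. This is a Python illustration rather than the code I actually ran, and it rests on assumptions: the target names `BRFSUM` and `DETDESC`, the `end="lead"`/`end="tail"` attribute convention, and the sample document are all placeholders that should be checked against the real files. It also assumes the bracketing instructions nest cleanly, so that swapping them for tags leaves the document well-formed.

```python
import re
import xml.etree.ElementTree as ET

# Assumed instruction targets that bracket the summary and details sections.
SECTION_TARGETS = ("BRFSUM", "DETDESC")

# Matches <?TARGET ... end="lead"?> or <?TARGET ... end="tail"?> (assumed shape).
_PI = re.compile(r'<\?(?P<target>\w+)\b[^?]*?\bend="(?P<end>lead|tail)"[^?]*\?>')


def pis_to_elements(xml_text, targets=SECTION_TARGETS):
    """Replace bracketing processing instructions with open/close tags."""
    def swap(match):
        target = match.group("target")
        if target not in targets:
            return match.group(0)  # leave unrelated instructions alone
        if match.group("end") == "lead":
            return f"<{target}>"
        return f"</{target}>"
    return _PI.sub(swap, xml_text)


# Made-up sample document, for illustration only.
sample = (
    '<description>'
    '<?BRFSUM description="Brief Summary" end="lead"?>'
    '<p>A helmet for divers.</p>'
    '<?BRFSUM description="Brief Summary" end="tail"?>'
    '<?DETDESC description="Detailed Description" end="lead"?>'
    '<p>The helmet comprises a shell.</p>'
    '<?DETDESC description="Detailed Description" end="tail"?>'
    '</description>'
)

# Parse without validation, since the rewritten text no longer matches the DTD.
root = ET.fromstring(pis_to_elements(sample))
summary = "".join(root.find("BRFSUM").itertext())
details = "".join(root.find("DETDESC").itertext())
```

After the rewrite, the summary and details are ordinary child elements, so any tree-based parser can reach them by name.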
For those curious about the text format, it contains stuff like this:
PATN
WKU D02428814
SRC 5
APN 611301&
APT 4
ART 292
APD 19750908
TTL Diver's helmet
ISD 19770104
NCL 1
ECL 1
EXP Feifer; Melvin B.
NDR 2
NFG 6
TRM 14
INVT
NAM Jones; Richard F.
CTY Santa Barbara
STA CA
The file is organized into field codes with associated values (e.g., TTL is the title). Values that are longer than a single line continue on the next line which starts with a few spaces to distinguish it as a continuation. Field codes are grouped into sections identified by four-character codes (e.g., INVT, above) with three-character codes for sub-fields. In short, it’s pretty easy to parse. In addition, it clearly sets out the summary (BSUM) and the details (DETD) fields.
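To make the parsing rules concrete, here is a minimal Python sketch of a reader for one record. It is not my actual parser. It assumes that a section header is a four-character code alone on its line, that field codes are separated from their values by whitespace, and that any line starting with whitespace continues the previous value. The `PAR` code and the summary text in the sample are invented for illustration.

```python
def parse_record(lines):
    """Parse one patent record into a list of (section, [(code, value), ...])."""
    sections = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Continuation line: append to the most recent field's value.
            if sections and sections[-1][1]:
                code, value = sections[-1][1][-1]
                joined = (value + " " + line.strip()).strip()
                sections[-1][1][-1] = (code, joined)
            continue
        code, _, value = line.rstrip().partition(" ")
        if len(code) == 4 and not value.strip():
            # Four-character code with no value: start a new section.
            sections.append((code, []))
        else:
            if not sections:
                sections.append(("", []))  # field seen before any section header
            sections[-1][1].append((code, value.strip()))
    return sections


# Partly made-up sample: the BSUM section below is invented for illustration.
sample = """PATN
WKU D02428814
TTL Diver's helmet
ISD 19770104
INVT
NAM Jones; Richard F.
CTY Santa Barbara
STA CA
BSUM
PAR The ornamental design for a diver's helmet,
    as shown and described.
""".splitlines()

record = parse_record(sample)
```

Fields are kept as ordered pairs rather than a dictionary because both sections (several inventors) and field codes (several paragraphs) can repeat within one record.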
This matters because Xue and Croft found that the summary field appears to be the most effective for relevance feedback searches on patents.
The meaning of the codes is documented here, although the example is in SGML, rather than in the format described above. This SGML example does not have a one-to-one correspondence between the field codes and the element names, but the associated documentation lists the field codes and their definitions.
One interesting (aka bizarre) aspect of both formats is that the patent number itself is encoded in the WKU field in a manner that does not correspond directly to the value you see on an actual patent! I reverse-engineered the actual patent number with the following regex:
which looks for an optional prefix, then a zero, then the 5-7 digit patent number, and then another digit (which seems to be a checksum of some kind, which I discard). The prefix and the patent number, concatenated, produce the patent number that the USPTO search engine recognizes. Of course, this doesn’t handle the patent applications; those need to be processed separately.
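For illustration, the decoding just described can be approximated in Python as below. This is a reconstruction from the prose description, not necessarily the exact expression I used. The assumption that the prefix is one or two uppercase letters is mine, and the function name is a placeholder.

```python
import re

# Optional letter prefix (assumed: 1-2 uppercase letters), a zero,
# the 5-7 digit patent number, then one trailing digit that is discarded.
WKU_PATTERN = re.compile(r"^([A-Z]{1,2})?0(\d{5,7})\d$")


def wku_to_patent_number(wku):
    """Return the patent number encoded in a WKU value, or None if no match."""
    match = WKU_PATTERN.match(wku.strip())
    if match is None:
        return None
    prefix, digits = match.groups()
    return (prefix or "") + digits


print(wku_to_patent_number("D02428814"))  # → D242881
```

Applied to the WKU value from the sample record above, the pattern drops the zero after the prefix and the final digit, leaving D242881.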
My goal in indexing these patents is to experiment with retrieval algorithms and user interfaces, but I don’t want to actually host all the data. While I can store and display the text associated with each patent, I don’t really want to store all the PDFs or drawings or page images, as that will consume all those terabytes Google was so proud of. Unfortunately, I have not yet figured out how to get Google or the USPTO to serve up a PDF or a page image for a specific patent that I can identify with a patent number.
It would be a terrible waste of resources to host that stuff and to serve it through a single point in the network. That leaves me with a few possibilities, including figuring out a way to obtain documents from a third party based on the patent number, or joining some kind of consortium that would host the documents (for research purposes only!).
I wonder if NIST is willing to do this.