Working on parsing and indexing the patent collection that Google made available has been an interesting education in just how noisy allegedly clean data really is, and in the scale of the collection. I am by no means done; in fact, I’ve had to start over a couple of times. I have learned a few things so far, in addition to my earlier observations.
Data is messy
The data that the USPTO provided to Google comes in several formats. Older patents (pre-2002) are stored in a text format that I described in an earlier post. The 2002 ZIP archive has two files, one with a v2.4 DTD and one with a v2.5 DTD. The first file has an .sgm extension, the second .xml; the two files seem to contain the same data, however. In any case, the DTDs for the 2002-2004 archives are a crazy set of mostly numbered, deeply nested elements, such as SDOBI/B100/B110/DNUM/PDAT, which leads to the patent number, and SDOBI/B700/B720/B721/PARTY-US/NAM/SNM/STEXT/PDAT, which is an inventor’s last name.
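To make those paths concrete, here is a minimal sketch of pulling both fields out of a record with Python's standard-library ElementTree. The sample document is a hypothetical, heavily trimmed stand-in (the PATDOC root, the patent number, and the name are assumptions for illustration); real files carry a DOCTYPE, entities, and many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed record in the 2002-2004 layout. The root element name
# and all values are illustrative assumptions, not copied from a real file.
SAMPLE = """<PATDOC>
  <SDOBI>
    <B100><B110><DNUM><PDAT>01234567</PDAT></DNUM></B110></B100>
    <B700><B720><B721><PARTY-US><NAM>
      <SNM><STEXT><PDAT>Doe</PDAT></STEXT></SNM>
    </NAM></PARTY-US></B721></B720></B700>
  </SDOBI>
</PATDOC>"""

# The two element paths mentioned in the text, relative to the document root.
PATENT_NUMBER = "SDOBI/B100/B110/DNUM/PDAT"
INVENTOR_LAST_NAME = "SDOBI/B700/B720/B721/PARTY-US/NAM/SNM/STEXT/PDAT"


def extract(root):
    """Return (patent number, list of inventor last names) for one record."""
    number = root.findtext(PATENT_NUMBER)
    # A patent can list several inventors, so collect every match.
    last_names = [node.text for node in root.findall(INVENTOR_LAST_NAME)]
    return number, last_names


if __name__ == "__main__":
    print(extract(ET.fromstring(SAMPLE)))
```

Keeping the paths in named constants at least confines the cryptic numbered tags to one place, so the rest of the code can talk about patent numbers and inventors.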
The DTD for the year 2005 (“us-patent-grant-v40-2004-12-02.dtd”) is a variant on the current one, with just a few annoying differences. For example, the name of the element that contains the “field of search” codes is different: the 2005 version uses “field-of-search” while later ones use “us-field-of-classification-search”. 2006 uses v41, but I have not figured out what the differences between the v41 and v42 DTDs are. In fact, I was not aware that the DTD for 2006 was different until quite recently.
Finally, the years 2007-2010 share the v42 DTD (“us-patent-grant-v42-2006-08-23.dtd”).
2002 | SGML? | PUBLIC “-//USPTO//DTD ST.32 US PATENT GRANT V2.4 2000-09-20//EN”
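One way to absorb a rename like the “field of search” element is to try each known name in turn instead of branching on the DTD version. The sketch below is only an illustration: the two sample documents are hypothetical and trimmed to almost nothing, and the helper name find_field_of_search is made up for this example.

```python
import xml.etree.ElementTree as ET

# Element names used for the "field of search" codes, per the text:
# the 2005 (v40) DTD first, then the later DTDs.
FIELD_OF_SEARCH_TAGS = ("field-of-search", "us-field-of-classification-search")

# Hypothetical, trimmed documents; the real elements have child structure
# that is not reproduced here.
DOC_2005 = "<us-patent-grant><field-of-search>A</field-of-search></us-patent-grant>"
DOC_2007 = (
    "<us-patent-grant>"
    "<us-field-of-classification-search>B</us-field-of-classification-search>"
    "</us-patent-grant>"
)


def find_field_of_search(root):
    """Return the first element matching any known name, or None."""
    for tag in FIELD_OF_SEARCH_TAGS:
        node = root.find(f".//{tag}")
        if node is not None:
            return node
    return None


if __name__ == "__main__":
    for doc in (DOC_2005, DOC_2007):
        node = find_field_of_search(ET.fromstring(doc))
        print(node.tag, node.text)
```

Returning None for an unknown layout, instead of raising, lets a batch run log the patent and move on, which matters when the next surprise may be hours into the dataset.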
These small differences are annoying because they make data extraction an uncertain process. Did I actually get all the differences, or will something crop up later that I hadn’t seen before? As things stand, for example, there are still two samples (from 2002 and from 2009) that failed to get parsed; one produced the following startling error from the SAX parser:
[Fatal Error] :1:1: The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.
The upshot is that I guess I won’t know whether I’ve been able to parse everything until I actually run the entire dataset.
The classification codes are another hard-to-parse part of the data. I am sure they are documented somewhere, but their sheer variety makes parsing and formatting them in a consistent and recognizable way a challenge. I still don’t know what to do with this one, for example: Patent 04795540 for class 204 has a subclass code that reads “243 R-247” in the data, but I could not figure out from the USPTO documentation what that means or how to render it. Google shows it as “204/243R” (search for patent:4795540), the PDF file has it as “204/243 R-247”, and the USPTO summary has “204/243 R-247/” with a trailing slash, which seems inconsistent with their convention of separating the class from the subclass with the slash.
Even such simple things as country names for patent assignees are messy: some patents will add an X to the end of a country abbreviation, producing ‘JPX’ for Japan, ‘DEX’ for Germany, etc. Others use ‘JP’ and ‘DE’. For US patents, some specify the country name, others don’t. And by far the most surprising assignee for 62 of the 55K patents I have sampled is listed as “The United States of America as represented by the Secretary of the Army, Washington, DC”. I wonder what that means: I thought that Federal employees were not allowed to patent inventions, but it’s seemingly OK for others to assign their inventions to the government. Odd.
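A small normalization step can paper over the country-code variants. This is a sketch under two assumptions that I have not verified against the whole collection: that a trailing X on a three-letter code is always padding, and that a missing country means US. The function name normalize_country is made up for this example.

```python
def normalize_country(code, default="US"):
    """Map variants such as 'JPX' and 'DEX' to two-letter codes.

    Assumptions (not verified against the full dataset): a trailing 'X' on a
    three-letter code is padding, and a missing or blank code means `default`.
    """
    if code is None or not code.strip():
        return default
    code = code.strip().upper()
    if len(code) == 3 and code.endswith("X"):
        return code[:2]
    return code


if __name__ == "__main__":
    for raw in ("JPX", "DEX", "JP", "de", None):
        print(repr(raw), "->", normalize_country(raw))
```

Note that any genuine three-letter code ending in X would be mangled by this rule, so it is worth counting the distinct raw values in the data before trusting it.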
And there is a lot of it
I’ve been trying to do test-driven development on this project, both to learn how and to cope with code that keeps evolving as my understanding of the data improves. Overall things have gone pretty well, and I have accumulated a bunch of tests for the various document types and field values. But just pulling out a few examples was inadequate, as digging deeper into the dataset produced more and more variations.
I’ve taken to running my parser on a sample of documents from the past 22 years (one week’s worth per year) to get decent coverage, but that still means processing 675 MB of zipped-up data, which takes a few hours on my reasonably capable Unix box.
This has been an interesting (if somewhat painful) learning experience. My initial optimism that I could have enough done to write up a submission for the PAIR workshop led me to take a few shortcuts that cost me quite a bit of time, and a chance at actually writing up the work. But I have certainly learned a lot about working with largish datasets. I expect there will be other unpleasant surprises in store, but at least I won’t be surprised when my code runs for hours and hours without getting anywhere.
I would love to hear from anyone else who has looked at the data or tried parsing it.