Parsing patents, take 2


Working on parsing and indexing the patent collection that Google made available has been an interesting education in just how noisy allegedly clean data really is, and in just how big the collection is. I am by no means done; in fact, I’ve had to start over a couple of times. Still, I have learned a few things so far, in addition to my earlier observations.

Data is messy

The data that the USPTO provided to Google comes in several formats. Older patents (pre-2002) are stored in a text format that I described in an earlier post. The 2002 ZIP archive has two files, one with a v2.4 DTD and one with a v2.5 DTD; the first file has an .sgm extension, the second .xml. The two files seem to contain the same data, however. In any case, the DTDs for the 2002-2004 archives define a crazy set of mostly numbered, deeply nested elements, such as SDOBI/B100/B110/DNUM/PDAT, which leads to the patent number, and SDOBI/B700/B720/B721/PARTY-US/NAM/SNM/STEXT/PDAT, which is an inventor’s last name.
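
To give a concrete picture of what extraction from this shape of document looks like, here is a minimal sketch using the DOM and XPath support built into the JDK. The class name is mine, it assumes a single grant document per file, and it skips fetching the external DTD, which only works for documents that don’t depend on DTD-defined entities:

    import java.io.File;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;

    public class St32Fields {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Skip fetching the external DTD; this sketch only needs the element
            // tree. Documents that rely on DTD-defined entities will not parse.
            dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = dbf.newDocumentBuilder().parse(new File(args[0]));

            XPath xp = XPathFactory.newInstance().newXPath();
            // The two element paths mentioned above; evaluate() returns the text
            // of the first match, so this picks up only the first inventor.
            System.out.println("patent number: "
                    + xp.evaluate("//SDOBI/B100/B110/DNUM/PDAT", doc));
            System.out.println("inventor last name: "
                    + xp.evaluate("//SDOBI/B700/B720/B721/PARTY-US/NAM/SNM/STEXT/PDAT", doc));
        }
    }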

The DTD for the year 2005 (“us-patent-grant-v40-2004-12-02.dtd”) is a variant of the current one, with just a few annoying differences. For example, the name of the element that contains the “field of search” codes is different: the 2005 version uses “field-of-search” while later ones use “us-field-of-classification-search”. 2006 uses v41, but I have not figured out what the differences between the v41 and v42 DTDs are. In fact, I was not aware that the DTD for 2006 was different until quite recently.

Finally, the years 2007-2010 share the v42 DTD (“us-patent-grant-v42-2006-08-23.dtd”).

Years      Format  DTD
-2001      text    n/a
2002       SGML?   PUBLIC “-//USPTO//DTD ST.32 US PATENT GRANT V2.4 2000-09-20//EN”
2002-2004  XML     ST32-US-Grant-025xml.dtd
2005       XML     us-patent-grant-v40-2004-12-02.dtd
2006       XML     us-patent-grant-v41-2005-08-25.dtd
2007-2010  XML     us-patent-grant-v42-2006-08-23.dtd
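
The field-of-search renaming mentioned above is representative of how I expect to cope with differences across the versions in this table: probe for the newer element name and fall back to the older one. A sketch (the class and helper names are mine):

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class VersionQuirks {
        // The 2005 (v40) DTD calls this element "field-of-search"; the v41/v42
        // DTDs renamed it "us-field-of-classification-search". Try the newer
        // name first, then fall back to the older one.
        public static NodeList fieldOfSearch(Document doc) {
            NodeList nodes = doc.getElementsByTagName("us-field-of-classification-search");
            if (nodes.getLength() == 0) {
                nodes = doc.getElementsByTagName("field-of-search");
            }
            return nodes;
        }
    }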

These small differences are annoying because they make data extraction an uncertain process. Did I actually get all the differences, or will something crop up later that I haven’t seen before? As things stand, for example, there are still two samples (one from 2002 and one from 2009) that have failed to parse; one produced the following startling error from the SAX parser:

[Fatal Error] :1:1: The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.

The upshot is that I won’t really know whether I can parse everything until I actually run the entire dataset.
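
In the meantime, that run can at least record failures rather than dying on the first one. A rough sketch of such a sweep (it assumes the individual documents have already been split out of the weekly archives, and that the DTDs they reference are resolvable locally):

    import java.io.File;

    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.helpers.DefaultHandler;

    public class ParseSweep {
        public static void main(String[] args) throws Exception {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            int failed = 0;
            for (String name : args) {
                try {
                    // DefaultHandler ignores all content; the point here is only
                    // to find out whether each document parses at all.
                    factory.newSAXParser().parse(new File(name), new DefaultHandler());
                } catch (Exception e) {
                    failed++;
                    System.err.println(name + ": " + e.getMessage());
                }
            }
            System.err.println(failed + " of " + args.length + " documents failed to parse");
        }
    }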

Another set of hard-to-parse data is the classification codes. I am sure they are documented somewhere, but their sheer variety makes parsing and formatting them in a consistent and recognizable way a challenge. I still don’t know what to do with this one, for example: patent 04795540, class 204, has a subclass code that reads “243 R-247” in the data, but I could not figure out from USPTO documentation what that is or how to render it. Google shows it as “204/243R” (search for patent:4795540), the PDF file has it as “204/243 R-247”, and the USPTO summary has “204/243 R-247/” with a trailing slash, something that seems inconsistent with their convention of separating the class from the subclass with the slash.

Even such simple things as country names for patent assignees are messy: some patents add an X to the end of a country abbreviation, producing ‘JPX’ for Japan, ‘DEX’ for Germany, etc., while others use plain ‘JP’ and ‘DE’. For US patents, some specify the country name and others don’t. And the most surprising assignee by far, listed on 62 of the 55K patents I have sampled, is “The United States of America as represented by the Secretary of the Army, Washington, DC”. I wonder what that means: I thought that Federal employees were not allowed to patent inventions, but it is seemingly OK for others to assign their inventions to the government. Odd.
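
For now I normalize the country codes with a guess at the convention, since nothing I have seen actually documents the trailing X; something along these lines:

    import java.util.Locale;

    public class AssigneeCountry {
        // Collapse variants like 'JPX' and 'JP' to the two-letter code. The
        // trailing X appears to be a decoration on older records; this is a
        // guess at the convention, not a documented USPTO rule.
        public static String normalize(String raw) {
            String code = raw.trim().toUpperCase(Locale.ROOT);
            if (code.length() == 3 && code.endsWith("X")) {
                code = code.substring(0, 2);
            }
            return code;
        }
    }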

And there is a lot of it

I’ve been trying to do test-driven development on this project, partly to learn the technique and partly to cope with code that keeps evolving along with my understanding of what I am doing. Overall, things have gone pretty well, and I have accumulated a bunch of tests for the various document types and field values. But pulling out just a few examples was inadequate, as digging deeper into the dataset kept producing more and more variations.

I’ve taken to running my parser on a sample of documents from the past 22 years (one week’s worth per year) to get decent coverage, but that still means processing 675 MB of zipped-up data, which takes a few hours on my reasonably capable Unix box.

In short

This has been an interesting (if somewhat painful) learning experience. My initial optimism that I could get enough done to write up a submission for the PAIR workshop led me to take a few shortcuts that cost me quite a bit of time, and the chance to actually write up the work. But I have certainly learned a lot about working with largish datasets. I expect there will be other unpleasant surprises in store, but at least I won’t be surprised when my code runs for hours and hours without getting anywhere.

I would love to hear from anyone else who has looked at the data or tried parsing it.

8 Comments

  1. Sarah Vogel says:

    Interesting observations, Gene.

    As an FYI, the US government is very active in patenting and by some estimates has title to over 30,000 patents. The following recent report will give you an idea of which US agencies have patents: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/us_gov.pdf. (Note that they mention having to clean the data to create the report.)

    My understanding is that the government is patenting inventions to encourage technology transfer. If you’re interested in the topic, you can get more info in the following 2002 report from the GAO: http://www.gao.gov/new.items/d0347.pdf.

    Good luck!

  2. Thanks for the pointers! I will definitely follow up.

  3. For the record, the XML parser error I saw can be prevented by giving the JVM more memory and telling the parser to increase the entity limit, both through command-line arguments to the JVM. I used

    -Xmx1024M -DentityExpansionLimit=100000

    which was sufficient for the patent I had problems with.

    Thanks to Ismail Fahmi, who wrote the page Loading ontology into Sesame (error: exceeded), for the solution.

    I suppose the initial default was inspired by Bill Gates: 64K entities should be good enough for anyone.

  4. Devon Baumgarten says:

    Gene,

    I have been working with this data for about a month and a half. I agree with you that the XML is messy. I have made a database to store patent assignments, and have imported a few thousand. I am interested in doing the same for patent grants and trademark assignments and applications.

    Do you have any insight you can offer? The patent assignments weren’t too bad, but I generated graphs of the DTDs for the others to see what I was in for. Ouch.

    I would appreciate hearing from you, if you have a spare minute.

    If you can see my email address, feel free to contact me that way. If you cannot, I can send it to you in a twitter direct message.

  5. Hi Devon,

    So far, I’ve dabbled in applications a bit (enough to discover some insidious differences between applications and granted patents), but have not looked at trademarks at all.

    Would be happy to compare notes, odd formatting cases, etc.

  6. Pedram says:

    Gene,

    Thanks for your insightful notes; they are very useful. I am also in the process of parsing the XML files. How do you view DTD files? I have a Mac and an Ubuntu machine. Any special viewer? Do you have a link to the USPTO site where they have these DTD files?

  7. Pedram,

    The DTDs I mentioned are found on the USPTO site on its redbook page.

    As to viewing them, any decent XML tool should handle them; have you tried the XML viewer in Eclipse? It works OK for the XML documents, but I haven’t tried looking at the DTDs with it. On the other hand, most of the DTD info is already on the USPTO site in human-readable form, which should get you started.

  8. Ravi Keecheril says:

    Nice article. I’m glad I stumbled onto it, because I was just about to write code to parse the first XML patent file I downloaded yesterday.
    The classification code 204/243R is scary. I had assumed that at least the classification codes would be accurate. I already have all the classes and subclasses entered into an SQL database, and I don’t see any 204/243R at all.
    I queried various combinations; the closest thing there is 204/243.1 (Fused bath).
    I think that R must be something they temporarily assigned, to create a category or something…
