10 Terabytes for Radicals


Last night I watched Carl Malamud’s fascinating, inspiring, and informative WWW2010 address in which he discussed his 10 Rules for Radicals, a strategy for working with (or against) bureaucracies. I won’t summarize here; watch the video.

I will point out, however, that using very much the same rhetoric that Carl articulated so eloquently, Google has announced that it is making available for free download about 10 terabytes of patent and trademark data. This is great news for those interested in doing their own patent searches in ways that publicly available systems (e.g., Google Patent search or the USPTO site) cannot support. It’s also great news for information retrieval and data-mining researchers, who can now access a lot of data in a straightforward manner.


  1. I have 1.2TB at home.
    Even if I had 10TB, each XML file is about 230MB after unpacking and I can’t even see the format.
    Is there any way to use this from home?

  2. This data may be at a scale for which the disclaimer “Don’t try this at home” applies. On the other hand, it should be possible to set up some research servers through which subsets of the data could be examined. Seems like a good application for some cloud computing infrastructure.

  3. My informal sample of the patent data Google makes available suggests that if you’re interested in full-text search (rather than the page images and trademark data), there isn’t SO much data there: for the nine years of full-text data that’s available, each week’s documents are stored in a separate file. Uncompressed, each file is about 500MB (conservative estimate based on a sample of 4 files), and there are a total of about 9*52 such files, for a total of about 250GB of XML data. That should fit on your home disk just fine.
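The back-of-the-envelope estimate in the last comment can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `estimate_corpus_gb` is made up for illustration, the 500MB-per-file figure is the commenter's rough estimate from 4 files, and it assumes decimal units (1GB = 1000MB).

```python
def estimate_corpus_gb(years: int, files_per_year: int, mb_per_file: float) -> float:
    """Rough total size in GB, assuming 1 GB = 1000 MB."""
    return years * files_per_year * mb_per_file / 1000


# Figures from the comment: nine years, one file per week, ~500MB per file.
n_files = 9 * 52
total_gb = estimate_corpus_gb(years=9, files_per_year=52, mb_per_file=500)

print(n_files)   # 468
print(total_gb)  # 234.0
```

That works out to 468 files and roughly 234GB, which matches the "about 250GB" figure once you allow for rounding and for the per-file size being an estimate.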

