The field of information retrieval is inherently (some might say pathologically) data-driven. We need datasets to test algorithms, to compare systems, etc. This is all good. It’s particularly good to have data that are meaningful and relevant, because it makes it easier to motivate users and to generalize findings to data that people care about.
I expect that in the next few cycles of conference submissions, we will see a number of papers analyzing the “cable” data leaked by Bradley Manning to Wikileaks. The dataset is large and topical enough that it is sure to attract all sorts of analyses, much as the Enron email dataset did in 2004.
But there are some important differences.
Enron executives were charged with a number of crimes, including bank and securities fraud. The data was collected through a discovery process, and was used by prosecutors to build a case against the defendants.
The Wikileaks “cable” data consists of messages sent to the US State Department by ambassadors and other employees during the course of their regular, legal, duties. The data was stolen and then made public to expose alleged misdeeds by the US government in its implementation of its foreign policy.
Presumably the reason for exposing these documents is to affect the foreign policy of the US. Whereas a case for publicity could be made for the earlier leaks of military documents from Afghanistan and Iraq by people who disagreed with the conduct of those wars, no such logic applies to the documents in question. Here we find disclosures of diplomatic channels at work, of people trying to achieve mutual understanding rather than waging war. The release of these documents has not only harmed some of the people who were fostering that peaceful communication, but has also made it less likely that others will engage with diplomats in the future.
If we believe that diplomacy is better than war at resolving many issues of foreign relations, thwarting that exercise by stealing and publishing secret documents is not only illegal but also immoral and stupid.
It seems to me, therefore, that using this ill-gotten information for one’s research condones and legitimizes such behavior. Thus I encourage other researchers to reject the temptation to analyze this data, and program committees to discourage submissions based on that data.
We as a research community should not stoop to trafficking in stolen property. Let’s not sully ourselves with it.