Whither data privacy?


On Friday, Netflix canceled the sequel to its Netflix Prize due to privacy concerns. The announcement of the cancellation has had a mixed reception from both researchers and the public. Narayanan and Shmatikov, the researchers who exposed the privacy issues in the original Netflix Prize competition data, write, “Today is a sad day. It is also a day of hope.”

The Netflix Prize data set is probably the third most famous example of de-anonymization of data that was released with the explicit claim that it had been anonymized. These examples differ from the privacy breaches discussed by Maribeth Back in her post on ChatRoulette or the issues with Google Buzz discussed in Gene Golovchinsky’s post “What’s private on the Web?”. Those examples made sensitive information available directly. In the case of the following three de-anonymization attacks, the data itself was “anonymized,” but researchers, with the addition of publicly available auxiliary information, were able to de-anonymize much of it.

The most famous example is the identification of then-Massachusetts Governor Weld’s health records in “anonymized” Group Insurance Commission (GIC) data that included date of birth (day and year), gender, and five-digit zip code. Using publicly available voter rolls, Sweeney found the one person who had his particular date of birth, gender, and zip code, and was thus able to identify the Governor’s health record in the GIC data. She estimated that date of birth, gender, and zip code uniquely identify 87% of the population (an estimate that has since been revised down to 63%), which means that a majority of the GIC records could likely be de-anonymized in this way.
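
To make the linkage concrete, here is a minimal sketch of this kind of re-identification attack. The file names and column names are hypothetical; the real GIC and voter-roll schemas differ.

```python
# Hypothetical linkage attack: join an "anonymized" dataset against a public
# one on the quasi-identifiers they share (date of birth, gender, ZIP code).
import pandas as pd

health = pd.read_csv("gic_records.csv")  # columns: dob, gender, zip, diagnosis (no names)
voters = pd.read_csv("voter_rolls.csv")  # columns: name, dob, gender, zip (public)

quasi_ids = ["dob", "gender", "zip"]

# Keep only voters whose (dob, gender, zip) combination is unique in the rolls;
# for those people the quasi-identifiers pin down a single named individual.
unique_voters = voters[~voters.duplicated(subset=quasi_ids, keep=False)]

# Joining the "anonymized" health records against those voters attaches a name
# to every record whose quasi-identifiers appear in the rolls exactly once.
reidentified = health.merge(unique_voters, on=quasi_ids, how="inner")

print(f"Re-identified {len(reidentified)} of {len(health)} health records")
```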

The second most famous example is the identification of people from “anonymized” search logs released by AOL. See “You are what you search” or the Wikipedia article on the scandal.

While perhaps a lesser privacy breach, the Netflix de-anonymization was the most surprising. All Netflix published were movie ratings and the dates of those ratings for a sample of roughly 1/8th of its subscribers, each identified only by a unique ID. On the face of it, such a release seems pretty harmless. The first thing Narayanan and Shmatikov recognized was that a set of 8 movie ratings with dates falling within a 14-day window uniquely identified 99% of the individuals. Still, how harmful could that be, since we still don’t know who these people are? Narayanan and Shmatikov then introduced publicly available auxiliary information in the form of movie ratings from the Internet Movie Database (IMDb). They cross-correlated a small sample of IMDb profiles with the Netflix data under the hypothesis that when a user rated a given movie on both IMDb and Netflix, the user would give roughly the same rating in both places, at around the same time. They found some very strong matches. In this way, from ratings users had posted publicly on IMDb, they were able to uncover preferences the users had revealed only in the private ratings inside their Netflix accounts.
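
The cross-correlation idea can be illustrated with a much-simplified sketch. The data below is invented, and the actual Narayanan–Shmatikov algorithm additionally down-weights popular movies and requires the best match to stand out clearly from the second best.

```python
# Simplified cross-correlation of a public IMDb record with an "anonymous"
# Netflix record: count movies rated similarly at similar times.
from datetime import date

# Each record maps movie title -> (star rating, date of rating). Invented data.
netflix_record = {
    "Movie A": (4, date(2005, 3, 1)),
    "Movie B": (2, date(2005, 3, 5)),
    "Movie C": (5, date(2005, 4, 10)),
}
imdb_record = {
    "Movie A": (4, date(2005, 3, 2)),
    "Movie C": (5, date(2005, 4, 12)),
    "Movie D": (3, date(2005, 6, 1)),
}

def match_score(public, private, max_rating_diff=1, max_days=14):
    """Count movies rated within max_rating_diff stars and max_days days in both records."""
    score = 0
    for movie, (rating, when) in public.items():
        if movie in private:
            p_rating, p_when = private[movie]
            if (abs(rating - p_rating) <= max_rating_diff
                    and abs((when - p_when).days) <= max_days):
                score += 1
    return score

# Scoring one public IMDb profile against every anonymous Netflix ID and
# keeping a clear winner is the essence of the attack.
print(match_score(imdb_record, netflix_record))  # -> 2
```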

The proposed release of Netflix data for the sequel prompted lawsuits, as mentioned in the Netflix announcement. Narayanan and Shmatikov, in their open letter to Netflix, regret that Netflix canceled the sequel and hope that Netflix will work with privacy researchers to enable a competition that respects privacy. They make two specific suggestions: an opt-in policy and the use of differential privacy.

Their letter is just the most recent addition to an ever-widening dialog on data privacy issues in general. To what extent should privacy be protected, and how should that be accomplished? How do we even define privacy? These are hot topics within legal, policy, and research circles. The week of February 22 I attended the IPAM workshop on data privacy, aimed at establishing a coherent “foundation for research on data privacy.” I’ll talk about differential privacy and the IPAM workshop in a future post. Next week, CodeX (The Stanford Center of Computers and Law) will hold an “Intelligent Information Privacy Management Symposium.” There will be many more such meetings as we as a society wrestle with these complex issues.

3 Comments

  1. Very interesting stuff; thanks for posting this.

    It seems that one cannot release a large dataset without knowing in advance what other data might supplement it, as in the way that Narayanan and Shmatikov were able to use IMDb to de-anonymize the data. Apparently, no dataset is an island, entire of itself.

    So here is a question: has anyone tried to formulate a standard for anonymization in the same way that the Government has standards for encryption? Studies like this may show that such a standard is necessary.

    I suppose one could protect the anonymity of the people feeding into a dataset by somehow adding null data that would “widen the bell curve” and make it harder to identify individuals. The problem is that this would render the dataset useless for any other kind of statistical analysis. But from a privacy perspective, I could live with that. :-)

  2. […] success of de-anonymization efforts, as discussed here, suggests that older anonymization methods no longer work, especially in light of the large amount […]

  3. Sheldon,

    Instead of answering your question here, I wrote a post that discusses the leading effort to define privacy and provide mechanisms for obtaining it. If successful, such an effort could be adopted as a standard. Some of the people at the IPAM workshop I attended serve on committees, some sponsored by the government, that are trying to formulate standards for privacy-sensitive information release, at least for certain types of data. Interestingly, some of the most active researchers in this field are cryptographers, so the current leading definition, differential privacy, has the feel of a cryptographic definition. It was put forward by the cryptographer Cynthia Dwork, now at Microsoft Research. And yes, you are right that most of the mechanisms for attaining privacy add noise in one way or another, and that it is tricky to do so in a way that preserves the utility of the released data while obtaining a significant level of privacy. Differential privacy has been able to attain a reasonable trade-off in some instances; a rough sketch of the noise-adding idea appears below. I talk a little more about differential privacy here.
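
    For what it’s worth, here is a minimal sketch of that noise-adding idea, using the Laplace mechanism on a single count. The parameter values are illustrative only.

    ```python
    # Laplace mechanism sketch: release a count with noise of scale 1/epsilon.
    import numpy as np

    def noisy_count(true_count, epsilon):
        """Adding or removing one person changes a count by at most 1 (its
        "sensitivity"), so Laplace noise with scale 1/epsilon gives
        epsilon-differential privacy for this single count."""
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    # Smaller epsilon -> more noise -> stronger privacy, lower utility.
    print(noisy_count(1000, epsilon=0.1))   # very noisy
    print(noisy_count(1000, epsilon=10.0))  # nearly exact
    ```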
