On Friday Netflix canceled the sequel to its Netflix Prize due to privacy concerns. The announcement of the cancellation has had a mixed reception from researchers and the public alike. Narayanan and Shmatikov, the researchers who exposed the privacy issues in the original Netflix Prize competition data, write, “Today is a sad day. It is also a day of hope.”
The Netflix Prize data set is probably the third most famous example of de-anonymization of data that was released with the explicit claim that it had been anonymized. These examples differ from the privacy breaches discussed by Maribeth Back in her post on ChatRoulette, or the issues with Google Buzz discussed in Gene Golovchinsky’s post “What’s private on the Web?”. Those examples made sensitive information available directly. In the de-anonymization attacks described below, the data itself was “anonymized,” but researchers were able, with the addition of publicly available auxiliary information, to de-anonymize much of it.
The most famous example is the identification of then-Massachusetts Governor William Weld’s health records in “anonymized” Group Insurance Commission (GIC) data that included date of birth (day and year), gender, and five-digit zip code. Using publicly available voter rolls, Sweeney found exactly one person with his particular date of birth, gender, and zip code, and was thus able to identify the Governor’s health record in the GIC data. She estimated that date of birth, gender, and zip code uniquely identify 87% of the population (that estimate has since been revised down to 63%), meaning that a majority of the GIC records could likely be de-anonymized in this way.
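The linkage step behind this attack can be sketched in a few lines of Python. The records and names below are invented for illustration; the point is simply that joining the “anonymized” table to a public one on the quasi-identifier triple (date of birth, gender, zip code) re-identifies any record whose triple is unique in the public data:

```python
# Sketch of a quasi-identifier linkage attack.
# All records below are fabricated for illustration, not actual GIC or voter data.

# "Anonymized" health records: names removed, but quasi-identifiers remain.
health_records = [
    {"dob": "1945-07-31", "gender": "M", "zip": "02138", "diagnosis": "..."},
    {"dob": "1962-03-14", "gender": "F", "zip": "02139", "diagnosis": "..."},
]

# Public voter roll: names listed alongside the same quasi-identifiers.
voter_roll = [
    {"name": "W. Weld", "dob": "1945-07-31", "gender": "M", "zip": "02138"},
    {"name": "J. Doe",  "dob": "1962-03-14", "gender": "F", "zip": "02139"},
]

def link(records, roll):
    """Re-identify records whose (dob, gender, zip) triple is unique in the roll."""
    matches = []
    for rec in records:
        key = (rec["dob"], rec["gender"], rec["zip"])
        hits = [v for v in roll if (v["dob"], v["gender"], v["zip"]) == key]
        if len(hits) == 1:  # a unique match is a re-identification
            matches.append((hits[0]["name"], rec))
    return matches

for name, rec in link(health_records, voter_roll):
    print(name, "->", rec["diagnosis"])
```

No clever statistics are needed: if the triple is unique in both tables, the join alone links a name to a health record.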
While perhaps a lesser privacy breach, the Netflix data de-anonymization was the most surprising. All Netflix published was the rating and the date of review for a sample of roughly 1/8th of its subscribers, each given a unique ID. On the face of it, such a release seems pretty harmless. The first thing Narayanan and Shmatikov recognized was that a set of 8 movie ratings with dates within a 14-day period uniquely identified 99% of the individuals. Still, how harmful could that be, since we still don’t know who these people are? Narayanan and Shmatikov then introduced publicly available auxiliary information in the form of movie ratings from the Internet Movie Database (IMDb). They cross-correlated a small sample from IMDb with the Netflix data under the hypothesis that when a user rated a given movie on both IMDb and Netflix, the user would give roughly the same rating in both places, around the same date. They found some very strong matches. In this way, from ratings users had posted publicly on IMDb, they were able to discover preferences revealed only in the ratings those users had made within their private Netflix accounts.
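A much simplified version of this matching idea can be sketched as follows. The ratings, dates, and scoring rule below are invented for illustration, and the score is far cruder than Narayanan and Shmatikov’s actual statistical method (which, among other things, weights rare movies more heavily):

```python
# Toy sketch of linking public IMDb-style ratings to "anonymized" Netflix-style
# records. All users, movies, ratings, and dates are fabricated for illustration.
from datetime import date

# "Anonymized" records: opaque ID -> {movie: (rating, review date)}
netflix = {
    "user_17": {"Brazil": (5, date(2005, 3, 1)), "Heat": (4, date(2005, 3, 4))},
    "user_42": {"Brazil": (2, date(2005, 6, 9)), "Alien": (5, date(2005, 6, 10))},
}

# Public ratings posted under a real name.
imdb = {
    "Jane Roe": {"Brazil": (5, date(2005, 3, 2)), "Heat": (4, date(2005, 3, 5))},
}

def similarity(a, b, max_days=14):
    """Count movies rated similarly (within 1 star) within max_days of each other."""
    score = 0
    for movie in a.keys() & b.keys():
        rating_a, date_a = a[movie]
        rating_b, date_b = b[movie]
        if abs(rating_a - rating_b) <= 1 and abs((date_a - date_b).days) <= max_days:
            score += 1
    return score

# Link each public profile to the best-matching anonymized record.
for name, profile in imdb.items():
    best = max(netflix, key=lambda uid: similarity(profile, netflix[uid]))
    print(name, "->", best)
```

Even this crude score links “Jane Roe” to the right record, and once that link is made, every rating in her “anonymous” Netflix history is attributed to her by name.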
The proposed release of Netflix data for the sequel prompted lawsuits, as mentioned in the Netflix announcement. Narayanan and Shmatikov, in their open letter to Netflix, regret that Netflix canceled the sequel and hope that Netflix will work with privacy researchers to enable a competition that respects privacy. They make two specific suggestions: an opt-in policy and the use of differential privacy.
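Differential privacy, their second suggestion, works by adding carefully calibrated noise to query answers rather than releasing raw records. As a minimal sketch (the data and the choice of ε below are invented for illustration), a count query, whose sensitivity is 1 because adding or removing one person changes it by at most 1, can be protected with Laplace noise of scale 1/ε:

```python
# Minimal sketch of the Laplace mechanism for a count query.
# The data and epsilon value are fabricated for illustration.
import math
import random

def private_count(values, predicate, epsilon):
    """Release a count with Laplace(1/epsilon) noise (sensitivity of a count is 1)."""
    true_count = sum(1 for v in values if predicate(v))
    # Sample Laplace(0, 1/epsilon) noise via the inverse-CDF method.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

ratings = [5, 4, 2, 5, 3, 1, 4]
print(private_count(ratings, lambda r: r >= 4, epsilon=0.5))
```

Smaller ε means more noise and stronger privacy; the point is that the released number is useful in aggregate while revealing little about any one individual’s presence in the data.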
Their letter is just the most recent addition to an ever-widening dialog on data privacy issues in general. To what extent should privacy be protected, and how should that be accomplished? How do we even define privacy? These are hot topics within legal, policy, and research circles. During the week of Feb 22 I attended the IPAM workshop on data privacy, aimed at establishing a coherent “foundation for research on data privacy.” I’ll talk about differential privacy and the IPAM workshop in a future post. Next week, CodeX (The Stanford Center of Computers and Law) will be holding an “Intelligent Information Privacy Management Symposium.” There will be many more such meetings as we as a society wrestle with these complex issues.