Almost 10 years ago, L. Sweeney published an analysis of summary census data showing that 87% of the U.S. population could be uniquely identified using only their ZIP code, gender, and date of birth, data that we all think of (and the census treats) as relatively anonymous. At about the same time, I visited a friend at a large consulting firm who demonstrated data mining software that combined data from multiple sources and was able to discover many facts about people that, while not particularly revealing individually, painted a much more complete picture when federated. Now comes the news (thanks Daniel) that a group at MIT was able to make better-than-chance predictions about people’s sexual orientation using Facebook friends as training data. Whereas the census analysis and the data mining tools could be considered academic exercises on datasets to which most people don’t have access, the MIT results have much more immediate and potentially damaging implications.
While the researchers took steps to prevent their specific data from being leaked, by performing this research they have publicly announced that this kind of analysis is possible. Whether their results generalize, and whether their prediction rates are sufficiently accurate, are all but irrelevant. What they have done is proclaim, clearly and loudly, that anyone interested in outing gay people has a new tool. Although the techniques are not new, and similar inference approaches have been suggested for other forms of social data (as discussed in the Boston.com article), the fact that this analysis has been done on such widely available, de facto public data means that anyone with modest technical skills (or with the money to hire such a technical person) can repeat it.
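To see why "modest technical skills" suffice, here is a minimal sketch of the general class of technique involved: homophily-based attribute inference, where a hidden attribute is guessed from the labels that a person's friends declare publicly. This is not the MIT group's actual method (their analysis was more sophisticated); all names and data below are hypothetical.

```python
# Hypothetical sketch of homophily-based attribute inference: predict a
# person's undeclared attribute as the majority label among the attributes
# their friends have publicly declared. Not the researchers' actual method.
from collections import Counter

def infer_attribute(friends, declared):
    """Guess an attribute from the majority label among friends.

    friends:  list of friend identifiers for the target person
    declared: dict mapping identifier -> publicly declared label
              (people who declare nothing are simply absent)
    Returns the most common declared label among friends, or None
    if no friend has declared anything.
    """
    labels = [declared[f] for f in friends if f in declared]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]

# Hypothetical example: most of the target's friends declare label "A",
# so the majority vote predicts "A" for the target as well.
friends = ["u1", "u2", "u3", "u4"]
declared = {"u1": "A", "u2": "A", "u3": "B"}
print(infer_attribute(friends, declared))  # prints "A"
```

The point is precisely how little machinery this requires: a friend list, a handful of public profile fields, and a few lines of code.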
Why is this a problem? People may choose to conceal their sexual orientation for a number of reasons, some of which have to do with social stigma involving their families, their communities or their places of work. Providing a tool that those interested in discriminating against gay people can use to attempt to discover (or believe they have discovered) this information is not a socially responsible act. In fact, it is disturbing that the ethics committee approved this research in the first place.
The best way to keep a secret is not to share it with anyone. Given the utility of social networking applications, however, not using them may not be an option for some people. A challenge to technologists interested in protecting online privacy: can tools be developed that suggest to users that their actions with respect to data sharing on SNSs may cause unintended consequences? Can we build tools that analyze a person’s Facebook page for known vulnerabilities, flagging information that was not meant to be revealed?
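One version of such a tool could be sketched as follows: rather than inferring anything, it warns a user when enough of their friends publicly declare some attribute that an observer could plausibly run the kind of majority-vote inference described earlier. The function, threshold, and data here are all hypothetical illustrations, not an existing product or API.

```python
# Hypothetical privacy "linter": warn a user when the publicly declared
# attributes of their friends make a majority-vote inference about the
# user plausible. Names, threshold, and data are illustrative only.
from collections import Counter

def inference_warnings(friends, declared, threshold=0.5):
    """Return warning strings if a friend-based inference looks feasible.

    friends:   list of the user's friend identifiers
    declared:  dict mapping identifier -> publicly declared label
    threshold: fraction of friends sharing one label that triggers a warning
    """
    labels = [declared[f] for f in friends if f in declared]
    warnings = []
    if labels:
        label, count = Counter(labels).most_common(1)[0]
        if count / len(friends) >= threshold:
            warnings.append(
                f"{count} of your {len(friends)} friends publicly declare "
                f"'{label}'; observers may infer the same about you."
            )
    return warnings

# Hypothetical example: three of four friends declare "A", which exceeds
# the default threshold, so the tool emits one warning.
friends = ["u1", "u2", "u3", "u4"]
declared = {"u1": "A", "u2": "A", "u3": "A"}
print(inference_warnings(friends, declared))
```

Such a checker would necessarily be heuristic, which is exactly the spirit of the next paragraph: imperfect warnings can still prompt reflection.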
The analysis doesn’t have to be (and cannot possibly be) perfect, but some warning that information you publish about yourself can be used in specific ways by the unscrupulous may cause at least some people to reflect on their decisions. Online privacy (or the lack thereof) is something we as a society have not yet come to grips with, but examples like this remind us that we need to take the issue seriously. Perhaps the MIT students who did this work might consider writing a Facebook App to educate people about the potential consequences of their actions. They should do it quickly. How would it look if someone from Harvard beat them to it?