It's not what you know, it's whom you know


Almost 10 years ago, L. Sweeney published an analysis of summary census data that was used to identify 87% of respondents based only on their ZIP code, gender, and date of birth, data that we all think of (and the census treats) as relatively anonymous. At about the same time, I visited a friend at a large consulting firm who demonstrated data mining software that combined data from multiple sources and was able to discover many facts about people that, while not particularly revealing individually, painted a much more complete picture when federated. Now comes the news (thanks, Daniel) that a group at MIT was able to make better-than-chance predictions about people’s sexual orientation using Facebook friends as training data. Whereas the census analysis and the data mining tools could be considered academic exercises on datasets to which most people don’t have access, the MIT results have much more immediate and potentially damaging implications.
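To make the ZIP code, gender, and date of birth point concrete, here is a minimal sketch of how one might measure how identifying such a combination is. It is not the method used in the census analysis; the function name `unique_fraction`, the field names, and the four records are all made up for illustration.

```python
from collections import Counter

def unique_fraction(records, keys=("zip", "gender", "dob")):
    """Return the fraction of records whose combination of `keys` is unique.

    A record that is the only one with its (zip, gender, dob) combination
    can be singled out by anyone who knows those three facts.
    """
    if not records:
        return 0.0
    # Count how many records share each combination of the chosen fields.
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Hypothetical toy data, invented for this example only.
records = [
    {"zip": "02139", "gender": "F", "dob": "1980-03-14"},
    {"zip": "02139", "gender": "F", "dob": "1980-03-14"},
    {"zip": "02139", "gender": "M", "dob": "1975-07-02"},
    {"zip": "94301", "gender": "F", "dob": "1991-11-30"},
]

print(unique_fraction(records))                 # all three fields
print(unique_fraction(records, keys=("zip",)))  # ZIP code alone
```

In this toy data, ZIP code alone singles out one record in four, while the three fields together single out two in four. Each field looks harmless by itself; the combination is what identifies.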

While the researchers took steps to prevent their specific data from being leaked, by performing this research they have publicly announced that it is possible to do this kind of analysis. Whether their results can be generalized, and whether their prediction rates are sufficiently accurate, is all but irrelevant. What they have done is proclaim, clearly and loudly, that anyone interested in outing gay people has a new tool. Although the techniques are not new, and similar inference approaches have been suggested for other forms of social data (as discussed in the article), the fact that this analysis has been done on such widely available, de facto public data means that anyone with modest technical skills (or with the money to hire such a technical person) can repeat it.

Why is this a problem? People may choose to conceal their sexual orientation for a number of reasons, some of which have to do with social stigma involving their families, their communities, or their places of work. Giving those interested in discriminating against gays a tool they can use to attempt to discover (or believe they have discovered) this information is not a socially responsible act. In fact, it is disturbing that the ethics committee approved this research in the first place.

The best way to keep a secret is not to share it with anyone. Given the utility of social networking applications, however, not using them may not be an option for some people. A challenge to technologists interested in protecting online privacy: can tools be developed that suggest to users that their actions with respect to data sharing on SNSs may cause unintended consequences? Can we build tools that analyze a person’s Facebook page for known vulnerabilities? Is it revealing information that was not meant to be revealed?

The analysis doesn’t have to be (and cannot possibly be) perfect, but some warnings that information you publish about yourself can be used in specific ways by the unscrupulous may cause at least some people to reflect on their decisions. Online privacy (or the lack thereof) is something we as a society have not yet come to grips with, but examples like this remind us that we need to take this issue seriously. Perhaps the MIT students who did this work might consider writing a Facebook App to educate people about potential consequences of their actions. They should do it quickly. How would it look if someone from Harvard beat them to it?


  1. Megan A. Winget says:

    Hi Gene:

    Do you know Fred Stutzman at UNC (he’s a doctoral student colleague – one of Gary’s students)? He works on these issues. You can find his stuff at: – he does lots of interesting work.

    Also, I’m a grammar wonk (forgive me): I think that “whom” is used as the direct object of a prepositional phrase: “of whom, for whom, from whom…” etc.


  2. Winter Mason says:

    It’s not as much of a problem as your blog implies. The Boston Globe article says the researchers “downloaded” the data, but it’s not so easy to do that. Essentially you either have to build an application that people choose to add and permit to see their data, have access to a popular ‘network’ (such as Dallas-Ft. Worth, which it sounds like the researchers used), or get data directly from Facebook. While the first two of these are not extremely high barriers, they do strongly limit the information that could be gained (and the damage that could be done) using this technique.

  3. @Gene: The U.S. HIPAA laws about privacy prevent the release even of year of birth, much less date of birth:

    This is a big problem for pediatricians and geriatricians, where age is a critical component of care.

    Of course, if I know a bunch of your medical conditions, even redacted medical records can identify you.

    @Gene II: “Ethics committees”? Do you mean the internal review boards (IRB) that oversee human subjects experiments in bio/psych departments? When I was a prof a decade ago, we didn’t submit undergrad projects to the IRB!

    @Megan It’s “who” for “he/she” and “whom” for “him/her”. With relativizers, you look at the elided position where the pronoun would have been in a declarative sentence. Because it’s “you know him”, the “correct” relative forms are “whom you know” and “who knows him”. I didn’t pass my qualifier in syntactic theory for nothing!

    Of course, using “whom” in the U.S. just sounds pretentious.

  4. @Megan Thanks for the pointer to Fred’s web site. His work does indeed sound interesting.

    @Winter My guess is that it’s not so difficult to build an app that requires people to share their network to operate. If you provide minimal value through the app, you should be able to get enough network data to start mining it. While you can declare that you’re using network data for “research” purposes, my guess is that people will agree to share their data with you. It’s just awfully hard for people to understand consequences of their sharing behavior.

    @Bob I think the danger of data federation lies not in revealing any IID, but in its ability to capitalize on data leakage to triangulate on identifying information. And even if the identity is not perfect, if it’s good enough that may still be undesirable.

    As far as “ethics committees” goes, that was the terminology in the newspaper article. I agree with you that IRB is the more common term. Given that the focus of the course was on ethics, I am not surprised that a review was held; rather, what is surprising is that they chose such a problematic aspect to investigate. They could have just as easily tried to discover Republicans among the student body.
