The success of de-anonymization efforts, as discussed here, suggests that older anonymization methods no longer work, especially in light of the large amount of publicly available data that can serve as auxiliary information. The quest to find suitable replacements for these methods is ongoing. As one starting point in this broader quest, we need useful definitions of privacy.
It has proven surprisingly difficult to find pragmatic definitions of privacy: definitions that capture a coherent aspect of privacy, that are workable in the sense that privacy so defined can actually be protected, and that are sufficiently formal to let us determine whether a method protects this type of privacy and, if so, how well.
The best attempt to date is the notion of differential privacy. Informally, differential privacy ensures that adding your data to a database changes the mechanism's output distribution only negligibly, so participating costs you little privacy. More formally, suppose K is a randomized mechanism for information release. In other words, K is a randomized function from a set of databases D to a set of possible information releases S. For example, D might be the set of all possible databases containing weight information on people, and K might take the average of all of the weights contained in a database d in D, add noise, and release this noisy average. A mechanism K is said to be differentially private if, for every subset s of S, the probability that K outputs an element of s on input of a database d is close to the probability that K outputs an element of s on input of any database d’ that differs from d in a single row. Specifically, for every subset s of S and for all databases d and d’ differing in one row, the ratio of Pr[K(d) in s] to Pr[K(d’) in s] is bounded by e^ε, where ε is a privacy parameter. A smaller ε provides greater differential privacy.
An important aspect of this definition is that it merely defines a notion of privacy; it does not specify a means of attaining it. A variety of mechanisms for attaining differential privacy are now known, and developing further methods remains an active area of research.
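To make the noisy-average example from the definition concrete, here is a minimal sketch of the classic Laplace mechanism applied to that query. The function name `noisy_average` and the clipping range are my own illustrative choices, not part of the definition: the key idea is that if each weight is clipped to a known range, replacing one row changes the average by at most (hi - lo)/n, and adding Laplace noise with scale sensitivity/ε yields ε-differential privacy for this query.

```python
import math
import random

def noisy_average(weights, epsilon, lo=0.0, hi=300.0):
    """Release an average with Laplace noise calibrated to epsilon.

    Illustrative sketch: weights are clipped to [lo, hi], so swapping
    one row shifts the true average by at most (hi - lo) / n. Adding
    Laplace(0, sensitivity / epsilon) noise then bounds the ratio of
    output probabilities on neighboring databases by e^epsilon.
    """
    n = len(weights)
    clipped = [min(max(w, lo), hi) for w in weights]
    true_avg = sum(clipped) / n
    sensitivity = (hi - lo) / n
    scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon
    # Sample Laplace(0, scale) by inverse transform sampling.
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_avg + noise
```

Note the trade-off the definition implies: a smaller ε means a larger noise scale, so stronger privacy comes at the cost of a less accurate released average.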
Another significant property of this definition is that it makes no claim about whether a person’s privacy will be harmed by the release itself. It only guarantees that her privacy won’t be harmed much more by contributing her data than by refusing to do so.
This definition nicely captures one type of privacy. That it has led to the discovery of a variety of mechanisms for preserving this type of privacy means that it is also a useful definition. These mechanisms, and the notion of differential privacy no matter how it is attained, put stringent requirements on the amount and type of information that can be released. At the IPAM data privacy workshop I attended, opinions ranged from a belief that differential privacy was completely inadequate and would never be adopted in real-world cases to a belief that differential privacy was not only the only reasonable definition we currently have, but also the only reasonable one we are ever likely to have. I’ll probably ramble on in a future post about my thoughts on this subject. In the meantime, I’d be curious to hear initial reactions to this definition of privacy from readers of this blog.