Data Liberation: What Do You Own?


Recently Google announced a new initiative, the Data Liberation Front:

The Data Liberation Front is an engineering team at Google whose singular goal is to make it easier for users to move their data in and out of Google products. We do this because we believe that you should be able to export any data that you create in (or import into) a product. We help and consult other engineering teams within Google on how to “liberate” their products. This is our mission statement: Users should be able to control the data they store in any of Google’s products. Our team’s goal is to make it easier for them to move data in and out.

This is a fantastically worthy goal, and I wholeheartedly applaud it. However, I am beginning to wonder: What data is yours to own in the first place?

For example, consider web searching. Google’s Data Liberation Front lets you extract your Web History, which consists of (1) all the query strings that you ran and (2) the URLs of any pages that you clicked as a result of those queries. But does Google let you extract the URLs of pages that you didn’t click? You still interacted with those pages, not by clicking, but by not clicking. You read the snippet, made a relevance judgment, and decided not to visit. Might you not want to know, in the future, which pages you (implicitly) decided were non-relevant? Isn’t that decision also part of your search data? So shouldn’t Google let you extract that information as well, in case you want to use it in the future, for example to compare the results of the same query at different points in time?
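To make the idea concrete, here is a minimal sketch in Python of what such an export might let you do. The record layout, the field names (`shown`, `clicked`), and the `compare_results` helper are all my own assumptions for illustration; nothing here reflects an actual Google export format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SearchRecord:
    """One query in an exported search history (hypothetical schema)."""
    query: str
    timestamp: str                 # ISO 8601 string; format is an assumption
    shown: List[str]               # every result URL displayed, in rank order
    clicked: Set[str] = field(default_factory=set)

    def unclicked(self) -> List[str]:
        """URLs that were shown but not clicked, in rank order."""
        return [url for url in self.shown if url not in self.clicked]


def compare_results(earlier: SearchRecord, later: SearchRecord) -> Dict[str, List[str]]:
    """Diff the result sets of the same query issued at two different times."""
    if earlier.query != later.query:
        raise ValueError("records must be for the same query")
    before, after = set(earlier.shown), set(later.shown)
    return {
        "dropped": [u for u in earlier.shown if u not in after],
        "added": [u for u in later.shown if u not in before],
        "kept": [u for u in earlier.shown if u in after],
    }
```

The point of the sketch is that `unclicked()` and `compare_results()` both depend on the `shown` list. With only the clicked URLs in hand, neither question can be answered.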

Or is there a question of ownership of the set of results to your query?  Does Google feel that it owns the result set as a whole, even though you also had a part in constructing that set via your query?

Certainly no one would dispute that Google owns the algorithms that produced the set. But does Google own the set itself? There is potentially a lot of value in that set: you could extract it yourself, then reuse and remix it in the future in any way that you see fit. So it makes sense that Google might want to control its distribution. But if you are also an owner or co-owner of that set, Google shouldn’t attempt that control. So the big question is: Are you a (co-)owner?

Here is an analogy by way of Adobe Photoshop.  Suppose you open one of your images in the online (webapp) version of Photoshop, apply the Gaussian Blur (soft focus) filter to the image, and then save that result out again.  It’s clear that you own the input (it’s your photo), that Adobe owns the Gaussian Blur algorithm (or at least the implementation of it), and that you own the resulting image.  Adobe doesn’t lay ownership claim to the output of the algorithm, even though it was their algorithm that produced the output.

So how is this different from a web search? You own the input (the query string that you type). Google owns the algorithm that transforms that input into a list of results. So wouldn’t you also then own the output of that transformation? Not the algorithm, but the output of the algorithm, i.e., the result set. Just like you own the output in Photoshop.

It will be interesting to see whether or not Google will be open enough to allow you to extract this particular form of your data.  Currently, they do not.



  1. Your analogy to Photoshop is compelling. But there’s a difference worth keeping an eye on if we pursue the photoshop -> photo and algo -> search_results symmetry. Specifically, an entity like Google makes (or at least could make) direct use of your clickthrough data. That is, my click history provides feedback: data that informs the algorithmic development.

    Clickthrough data is capital.

    So is Google renting my data? If that’s a fair characterization, what does that imply about the social contract between the searcher and the search engine?

  2. Google would probably argue (and this may be in the terms of service) that your clickstream is how you pay for getting the documents in the first place.

    The point that Jeremy is making, I think, is that once you’ve paid for the documents with your clickstream, you should retain the right to use that information for your own purposes.

  3. @Gene: Yes, I think Google would make that argument. You might still own your own clickstream, but you at least give Google a license to use it in whatever way they see fit. You retain the right to use your own clickstream as you see fit, but so do they.

    @Miles: Clickthrough data is indeed capital. And I think they aren’t renting it so much as you’ve agreed to give them a perpetual, non-exclusive license to use it. Still, your point raises an interesting thought: Do I, the user, have any control over the terms of the license? What if I, the user, were to release my clickthrough data under a GPL license? We know from von Neumann that code = data, data = code. So under a GPL license on my clickthrough data, Google would be free to modify my clickthrough “source code”, but if they wanted to release any software (or service) based on that source, they would have to do so under a GPL license. In other words, they would have to open-source their ranking algorithm, if that ranking algorithm relied on my open-source, GPL-licensed user data. Right?

    Interesting, interesting.
