On data ownership in a networked world

Every piece of content has a creator and owner (in this post, I will assume they are by default the same entity). I do not mean ownership in the traditional sense of, e.g., stashing a piece of paper in a drawer, but in the metaphysical sense that each artifact is forever associated with one or more “creators.”

This is certainly true of the end-products of intellectual labor, such as the article you are reading. However, it is also true of more mundane things, such as checkbook register entries or credit card activity. Whenever you pay a bill or purchase an item, you implicitly “create” a piece of content: the associated entry in your statement.  This has two immediately identifiable “creators”: the payer (you) and the payee.  The same is true for, e.g., your email, your IM chats, your web searches, etc. Interesting tidbit: over 20% of search terms entered daily in Google are new, which would imply roughly 20 million new pieces of content per day, or over 7 billion (over twice the earth’s population) per year—all this from just one activity on one website.

When I spend a few weeks working on, say, a research paper, I have certain expectations and demands about my rights as a “creator.” However, I give almost no thought to my rights on the trail of droppings (digital or otherwise) that I “create” each day, by searching the web, filling up the gas tank, getting coffee, going through a toll booth, swiping my badge, and so on.  However, with the increasing ease of data collection and distribution in digital form, we should re-think our attitudes towards “authorship”.

Unique identity

People call me “Spiros”, my identity documents list me as “Spyridon Papadimitriou” and on most online sites I’m registered as spapadim.  However, sometimes I’m s_papadim or spiros_papadimitriou, and so on.  Like most people, I lost track of all my accounts a time ago.  Vice versa, I’m not the only “Spiros Papadimitriou” in the real world.  For example, I occasionally get confused with my cousin, and receive comments about my interesting architectural designs!  Nor am I the only spapadim on the net.

A framework and mechanisms that allow (but do not enforce) asserting and verifying which of those labels (i.e., names, userids, etc) refer to the same entity (i.e., me) is missing. However, this is a prerequisite: how can we talk about data ownership and tackle portability, transparency and accountability, if we have to jump through countless hoops just to prove identity?

Some people, especially in the US, may object or even outright panic at the thought of such a global identifier.  In Greece, and in much of Europe, we’ve had national identity cards for decades.  Which is fine, as long as you know they exist and what are permissible uses-in other words, as long as transparency is ensured.  Furthermore, the illusion of privacy should not be confused with privacy itself—if in doubt, I suggest reading “Database Nation” (official site).  Its examples are largely US-centric, but the lessons are not.

OpenID (despite some shortcomings) and OAuth are emerging as open standards for authentication and authorization.  OpenID allows reuse of authentication credentials from one site on others: I can reuse, say, my Google username and password to log in to other sites (e.g., to leave a comment on this blog), without having to create yet another account from scratch.  OAuth resembles Kerberos’s ticket granting service but for the web, permitting other web services to ask for access to a subset of personal information: I could allow Facebook to access only my Google addressbook and not, potentially, all of my data on any Google service.  OpenID and OAuth can, at least in principle, work together.

Both high-profile individual developers and major companies are involved in these efforts.  For example, Yahoo! already supports OpenID and plans to support OAuth as well, while Google supports OAuth directly and OpenID indirectly in various ways.  Wide adoption of these standards would be a major step forwards for data portability and web interoperability.  However, I suspect they fall slightly short of providing a truly permanent and global personal identity.  What if, for any reason, my Yahoo! account disappears, either because I decided to shut it down or because Yahoo! went bust?

I was going to suggest a DNS-based solution and I was surprised when I found that the generic top-level domain .name has been instituted since 2001 to provide URIs for personal identities. You can register for a free three-month trial on FreeYourID (after that, it’s $11/year). What’s more, their service already provides OpenID authentication. In principle, this should allow easy switching of authentication and authorization service providers. Just as I can still keep the “label” for this site even if I move to a different web host, I can still keep my personal “label” no matter who I choose to manage my personal information.  So, now my universal username is spiros.papadimitriou.name, any emails sent to spiros@papadimitriou.name will find their way to me, you can call me on Skype using spiros.papadimitriou.name/call, and so on.

With such a unique identity tied to authorization and authentication services, the Giant Global Graph and its materializations would be one step closer to becoming really useful. If I want to use my identity to log and controll access to my data, I should be able to prove my claims.  Currently, FOAF and XFN allow assertion of relationshipt but provide no way to verify them.

Data portability

The point of this mental exercise so far is the following: A unique identity that can be verifiably associated with each and every data item that I produce is a prerequisite for making data ownership claims. Subsequently, we need to ask what fundamental rights should be associated with data ownership.  The first is the right to keep my information with me or, in other words, “data portability”. Just as I can freely move my money from one financial institution to another, I should be able to move any of my information from one data warehouse to another.

For example, consider my web search history. I don’t think I need to argue about the importance of historical information to improve search quality. If I decide for any reason to move to another search provider, I should be able to carry along all the information that’s directly associated with me.  This should include my search keyword history, as well as any additional information I may have contributed.

The actual details, however, may not be that straightforward.  Take, say, the third hit on a Google search.  Who is the “creator”?  Me by entering the search keywords, Google by producing the search results in response to those keywords, or the person who wrote the web page that contains them in the first place?  Similarly, when I buy gas, who is the “creator” of the transaction entry: me, Mobil, or American Express?

Even though intuition can often be wrong, my intuitive response to the Google search example would be that both I and Google have an ownership claim on this particular search, which includes the query keywords as well as a ranking of URLs.  On the other hand, the person who wrote the contents of, say, the third URL has ownership claims only on those, and not the search results.  Furthermore, the thousands of people that provided feedback to Google’s ranking algorithms by clicking on this URL on similar searches have ownership claims on those searches, but not on mine.

Finally, those two ownership claims (on keywords and on rankings) should probably not be treated the same.  If they were, then, say, MS Live could effectively copy Google by getting many users to move.  It seems reasonable to have the right to move my search history, but not the actual search results. However, I can imagine that some form of ownership claim on the rankings may be useful for other personal rights.

This is a highly idealized example and I’m not sure what an appropriate litmus test for ownership is, but some form of legal consensus must be in place.

Transparency

The second fundamental right is that I should know who is using my personal information and how. For example, if an insurance company accesses my credit history to give me a rate quote, I can find this out. It may not be a completely painless process but it is certainly possible today, with a regulatory framework that ensures this.  Similar regulations should be instituted to cover any and all forms of access to personal information.

Data access should be fully transparent to all parties involved. If the an insurance company accesses my medical records, I should know this.  If the government does a background check on me, I should know this too.  Transparency is a prerequisite for accountability. Otherwise, individuals have very limited power to protect themselves from improper uses of their personal information.

Concluding remarks

Much of the privacy research in computer science seems to assume that we can keep the existing legal and regulatory frameworks intact. Computer scientists taking such a position is even sadder than lawyers doing so; we have no excuse of failing to understand the technical issues. We cannot and should not make this assumption. Technical solutions should be subsidiary to new regulations.  But that doesn’t mean technologists cannot lead.  We should work towards supporting full transparency (for both individuals, as well as governments and corporations) rather than opacity and I’m currently in favor of a “shoot first, ask questions later” approach (and help lawmakers figure out the answers). After all, if there is anything that the DRM wars have taught us, it’s that information really wants to be free. Why do we think it’s technically hard (to say the least) to prevent copying of music, movies and software but we still think it may be possible to prevent copying of personal information? As I pointed out in an older post, it’s usually the use and not the possession of information that’s the problem.

My point in this post is simple: we should not fight the wrong war. Instead, we need an easy way to make data ownership claims, and use this to enforce at least two fundamental rights: the ability to keep any personal data with us, and the ability to know who is using this data and how.

Postscript. This post was wallowing for a while as a draft (originally separated from this post, then forgotten).  Since then, a recent MIT TR article discusses some aspects of data ownership.  Even better, I have since found an excellent short piece in the same issue by Esther Dyson, with which I could not agree more.

Update. After posting this last night, I did some further Googling and found another piece by Esther Dyson in the Scientific American. If you’ve read through my ramblings so far, then I’d urge you to read her article; she’s a much better writer than me, and has apparently been thinking about these issues for almost a decade, way before many people even knew what the Internet is. I should probably follow her more closely myself, as I agree disturbingly often with what I’ve read from her so far.