On data ownership in a networked world

Every piece of content has a creator and owner (in this post, I will assume they are by default the same entity). I do not mean ownership in the traditional sense of, e.g., stashing a piece of paper in a drawer, but in the metaphysical sense that each artifact is forever associated with one or more “creators.”

This is certainly true of the end-products of intellectual labor, such as the article you are reading. However, it is also true of more mundane things, such as checkbook register entries or credit card activity. Whenever you pay a bill or purchase an item, you implicitly “create” a piece of content: the associated entry in your statement.  This has two immediately identifiable “creators”: the payer (you) and the payee.  The same is true for, e.g., your email, your IM chats, your web searches, etc. Interesting tidbit: over 20% of search terms entered daily in Google are new, which would imply roughly 20 million new pieces of content per day, or over 7 billion (over twice the earth’s population) per year—all this from just one activity on one website.

When I spend a few weeks working on, say, a research paper, I have certain expectations and demands about my rights as a “creator.” However, I give almost no thought to my rights on the trail of droppings (digital or otherwise) that I “create” each day, by searching the web, filling up the gas tank, getting coffee, going through a toll booth, swiping my badge, and so on.  However, with the increasing ease of data collection and distribution in digital form, we should re-think our attitudes towards “authorship”.

Unique identity

People call me “Spiros”, my identity documents list me as “Spyridon Papadimitriou” and on most online sites I’m registered as spapadim.  However, sometimes I’m s_papadim or spiros_papadimitriou, and so on.  Like most people, I lost track of all my accounts a time ago.  Vice versa, I’m not the only “Spiros Papadimitriou” in the real world.  For example, I occasionally get confused with my cousin, and receive comments about my interesting architectural designs!  Nor am I the only spapadim on the net.

A framework and mechanisms that allow (but do not enforce) asserting and verifying which of those labels (i.e., names, userids, etc) refer to the same entity (i.e., me) is missing. However, this is a prerequisite: how can we talk about data ownership and tackle portability, transparency and accountability, if we have to jump through countless hoops just to prove identity?

Some people, especially in the US, may object or even outright panic at the thought of such a global identifier.  In Greece, and in much of Europe, we’ve had national identity cards for decades.  Which is fine, as long as you know they exist and what are permissible uses-in other words, as long as transparency is ensured.  Furthermore, the illusion of privacy should not be confused with privacy itself—if in doubt, I suggest reading “Database Nation” (official site).  Its examples are largely US-centric, but the lessons are not.

OpenID (despite some shortcomings) and OAuth are emerging as open standards for authentication and authorization.  OpenID allows reuse of authentication credentials from one site on others: I can reuse, say, my Google username and password to log in to other sites (e.g., to leave a comment on this blog), without having to create yet another account from scratch.  OAuth resembles Kerberos’s ticket granting service but for the web, permitting other web services to ask for access to a subset of personal information: I could allow Facebook to access only my Google addressbook and not, potentially, all of my data on any Google service.  OpenID and OAuth can, at least in principle, work together.

Both high-profile individual developers and major companies are involved in these efforts.  For example, Yahoo! already supports OpenID and plans to support OAuth as well, while Google supports OAuth directly and OpenID indirectly in various ways.  Wide adoption of these standards would be a major step forwards for data portability and web interoperability.  However, I suspect they fall slightly short of providing a truly permanent and global personal identity.  What if, for any reason, my Yahoo! account disappears, either because I decided to shut it down or because Yahoo! went bust?

I was going to suggest a DNS-based solution and I was surprised when I found that the generic top-level domain .name has been instituted since 2001 to provide URIs for personal identities. You can register for a free three-month trial on FreeYourID (after that, it’s $11/year). What’s more, their service already provides OpenID authentication. In principle, this should allow easy switching of authentication and authorization service providers. Just as I can still keep the “label” for this site even if I move to a different web host, I can still keep my personal “label” no matter who I choose to manage my personal information.  So, now my universal username is spiros.papadimitriou.name, any emails sent to spiros@papadimitriou.name will find their way to me, you can call me on Skype using spiros.papadimitriou.name/call, and so on.

With such a unique identity tied to authorization and authentication services, the Giant Global Graph and its materializations would be one step closer to becoming really useful. If I want to use my identity to log and controll access to my data, I should be able to prove my claims.  Currently, FOAF and XFN allow assertion of relationshipt but provide no way to verify them.

Data portability

The point of this mental exercise so far is the following: A unique identity that can be verifiably associated with each and every data item that I produce is a prerequisite for making data ownership claims. Subsequently, we need to ask what fundamental rights should be associated with data ownership.  The first is the right to keep my information with me or, in other words, “data portability”. Just as I can freely move my money from one financial institution to another, I should be able to move any of my information from one data warehouse to another.

For example, consider my web search history. I don’t think I need to argue about the importance of historical information to improve search quality. If I decide for any reason to move to another search provider, I should be able to carry along all the information that’s directly associated with me.  This should include my search keyword history, as well as any additional information I may have contributed.

The actual details, however, may not be that straightforward.  Take, say, the third hit on a Google search.  Who is the “creator”?  Me by entering the search keywords, Google by producing the search results in response to those keywords, or the person who wrote the web page that contains them in the first place?  Similarly, when I buy gas, who is the “creator” of the transaction entry: me, Mobil, or American Express?

Even though intuition can often be wrong, my intuitive response to the Google search example would be that both I and Google have an ownership claim on this particular search, which includes the query keywords as well as a ranking of URLs.  On the other hand, the person who wrote the contents of, say, the third URL has ownership claims only on those, and not the search results.  Furthermore, the thousands of people that provided feedback to Google’s ranking algorithms by clicking on this URL on similar searches have ownership claims on those searches, but not on mine.

Finally, those two ownership claims (on keywords and on rankings) should probably not be treated the same.  If they were, then, say, MS Live could effectively copy Google by getting many users to move.  It seems reasonable to have the right to move my search history, but not the actual search results. However, I can imagine that some form of ownership claim on the rankings may be useful for other personal rights.

This is a highly idealized example and I’m not sure what an appropriate litmus test for ownership is, but some form of legal consensus must be in place.

Transparency

The second fundamental right is that I should know who is using my personal information and how. For example, if an insurance company accesses my credit history to give me a rate quote, I can find this out. It may not be a completely painless process but it is certainly possible today, with a regulatory framework that ensures this.  Similar regulations should be instituted to cover any and all forms of access to personal information.

Data access should be fully transparent to all parties involved. If the an insurance company accesses my medical records, I should know this.  If the government does a background check on me, I should know this too.  Transparency is a prerequisite for accountability. Otherwise, individuals have very limited power to protect themselves from improper uses of their personal information.

Concluding remarks

Much of the privacy research in computer science seems to assume that we can keep the existing legal and regulatory frameworks intact. Computer scientists taking such a position is even sadder than lawyers doing so; we have no excuse of failing to understand the technical issues. We cannot and should not make this assumption. Technical solutions should be subsidiary to new regulations.  But that doesn’t mean technologists cannot lead.  We should work towards supporting full transparency (for both individuals, as well as governments and corporations) rather than opacity and I’m currently in favor of a “shoot first, ask questions later” approach (and help lawmakers figure out the answers). After all, if there is anything that the DRM wars have taught us, it’s that information really wants to be free. Why do we think it’s technically hard (to say the least) to prevent copying of music, movies and software but we still think it may be possible to prevent copying of personal information? As I pointed out in an older post, it’s usually the use and not the possession of information that’s the problem.

My point in this post is simple: we should not fight the wrong war. Instead, we need an easy way to make data ownership claims, and use this to enforce at least two fundamental rights: the ability to keep any personal data with us, and the ability to know who is using this data and how.

Postscript. This post was wallowing for a while as a draft (originally separated from this post, then forgotten).  Since then, a recent MIT TR article discusses some aspects of data ownership.  Even better, I have since found an excellent short piece in the same issue by Esther Dyson, with which I could not agree more.

Update. After posting this last night, I did some further Googling and found another piece by Esther Dyson in the Scientific American. If you’ve read through my ramblings so far, then I’d urge you to read her article; she’s a much better writer than me, and has apparently been thinking about these issues for almost a decade, way before many people even knew what the Internet is. I should probably follow her more closely myself, as I agree disturbingly often with what I’ve read from her so far.

Comments

First thoughts on Android

T-Mobile G1I recently upgraded to a T-Mobile G1 (aka. HTC Dream), running Android.  The G1 is a very nice and functional device. It’s also compact and decent looking, but perhaps not quite a fashion statement: unlike the iPhone my girlfriend got last year, which was immediately recognizable and a stare magnet, I pretty much have to slap people on the face with the G1 to make them look at it.  Also, battery life is acceptable, but just barely.  But this post is not about the G1, it’s about Android, which is Google’s Linux-based, open-source mobile application platform.

I’ll start with some light comments, by one of the greatest entertainers out there today: Monkey Boy made fun of the iPhone in January, stating that “Apple is selling zero phones a year“. Now he’s making similar remarks about Android, summarized by his eloquent “blah dee blah dee blah” argument.  Less than a year after that interview, the iPhone is ahead of Windows Mobile in worldwide market share of smartphone operating systems (7M versus 5.5M devices). Yep, this guy sure knows how entertain—even if he makes a fool of himself and Microsoft.

Furthermore, Monkey Boy said that “if I went to my shareholder meeting [...] and said, hey, we’ve just launched a new product that has no revenue model! [...] I’m not sure that my investors would take that very well. But that’s kind of what Google’s telling their investors about Android.”  Even if this were true, perhaps no revenue model is better than a simian model.

Anyway, someone from Microsoft should really know better—and quite likely he does, but can’t really say it out loud. There are some obvious parallels between Microsoft MS-DOS and Google Android:

  • Disruptive technology: In the 80s, it was the personal computer.  Today, many think it is “cloud computing” (or “services”, or “ubiquitous computing”, or “utility computing”, or whatever else you want to call it).
  • Commodity infrastructure: In the 80s, PC-compatibles became a commodity through standardization of the hardware platform and fierce competition that drove prices (and profit margins) down. Today, network infrastructure (the Internet at the core, and mobile devices on the fringes) as well as systems software (LAMP) are facing similar pressures.
  • Common software platform: MS-DOS was the engine that fueled the growth of the personal computer.  For cloud computing, there is still some way to go (which Android hopes to help pave).
  • Revenue model: Microsoft made a profit out of every PC sold. In today’s networked world, profit should come from services offered over the network and accessed via a multitude of devices (including mobile phones), rather than from selling software licenses.

An executive once said that money is really made by controlling the middleware platform. Lower levels of the stack face high competition and have low profit margins.  Higher levels of the stack (except perhaps some key applications) are too special-purpose and more of a niche.  The sweet-spot lies somewhere in the middle. This is where MS-DOS was and where Android wants to be.

Technology stacks

Microsoft established itself by providing the platform for building applications on the “revolution” of its day, the personal computer. MS-DOS became the de-facto standard, much more open than anything else at that time. Subsequently, Microsoft took a cut of the profits out of each PC sold ever since. Taiwanese “PC-compatibles” helped fuel Microsoft’s (as well as Intel’s) growth. The rest is history.

In “cloud” computing, the ubiquitous, commodity infrastructure is the network.  This enables access to applications and information from any networked device. Even though individual components matter, it is common standards, rather than a single, common software platform, which further enable information sharing. If you believe that the future will be the same as the past, i.e., selling shrink-wrapped applications and software licenses, then Android not only has no revenue model, but has no hope of ever coming up with one. Ballmer would be absolutely right.  But if there is a shift towards network-hosted data and applications, money can be made whenever users access those.  There are plenty of obvious examples which could be profitable: geographically targeted advertising, smart shopping broker/assistant (see below), mobile office and add-on services, online games (location based or not), and so on. It’s not clear whether Google plans to get directly involved in those (I would doubt it), or just stay mostly on the back end and provide an easy-to-use “cloud infrastructure” for application developers.

The services provided by network operators are becoming commodities. This is nothing new. A quote I liked is that “ISPs have nothing to offer other than price and speed“.  I wouldn’t really include security in their offerings, as it is really an end-to-end service. As for devices, there is already evidence that commoditization similar to that of PC-compatibles may happen. Just one month after Android was open-sourced, Chinese manufacturers have started deploying it on smartphones. Even big manufacturers are quickly getting in the game; for example, Huawei recently announced an Android phone. Most cellphones are already manufactured in China anyway.  The iPhone is assembled in Shenzhen, where Huawei’s headquarters are also located (coincidence?). The Chinese already have a decent track record when it comes to building hardware and it’s only a matter of time until they fully catch up.

So, it’s quite simple: Android wants to be for ubiquitous services as MS-DOS was for personal computers. But Microsoft in the 80s did not really start out by saying “our revenue model is this: we’ll build a huge user base at all costs, which will subsequently allow us to get $200 out of each and every PC sold”?  Not really.  Similarly, Google is not going to say that “we want to build a user base, so we can make a profit from all services hosted on the [our?] cloud and accessed via mobile devices [and set-top boxes, and cars, and...].”  Such an announcement would be premature, and one of the surest ways to scare off your user base: unless Google first provides more evidence that it means no evil, the general public will tend to assume the worst.

The most interesting feature of Android is it’s component-based architecture, as pointed out by some of the more insightful blog posts. Components are like iGoogle gadgets, only Android calls them “activities.” Applications themselves are built using a very browser-like metaphor: a “task” (which is Android-speak for running applications) is simply a stack of activites, which users can navigate backwards and forwards. The platform already has a set of basic activities that handle, e.g., email URLs, map URLs, calendar URLs, Flickr URLs, Youtube URLs, photo capture, music files, and so on. Any application can seamlessly invoke any of these reusable activities, either directly or via a registry of capabilities (which, roughly speaking, are called “intents”). The correspondence between a task and an O/S process is not necessarily one-to-one. Processes are used behind the scenes, for security and resource isolation purposes. Activities invoked by the same task may or may not run in the same process.

In addition to activities and intents, Android also supports other types of components, such as “content providers” (to expose data sources, such as your calendar or todo list, via a common API), “services” (long-running background tasks, such as a music player, which can be controlled via remote calls) and “broadcast receivers” (handlers for external events, such as receiving an SMS).

I think that Google is really pushing Android because it needs a component-based platform, and not so much to avoid the occasional snafu. If embraced by developers, this is the major ace up Android’s sleeve.  Furthermore, the open source codebase is the strongest indication (among several) that Google has no intention to tightly regulate application frameworks like Apple, or to leverage it’s position to attack the competition like Microsoft has done in the past.  Google wants to give itself enough leverage to realize it’s cloud-based services vision. If others benefit too, so much the better—Google is still too young to be “evil“.  After all, as Jeff Bezos said, “like our retail business, [there] is not going to be one winner. [...] Important industries are rarely made by single companies.” I find the comparison to retail interesting. In fact, it is quite likely that many “cloud services” themselves will also become commodities.

I’d wager that really successful Android applications won’t be just applications, but components with content provided over the network. A shopping list app is nice. It was exciting in the PalmPilot era, a decade ago. But a shopping list component, accessible from both my laptop and my cellphone, able to automatically pull good deals from a shopping component, and allow a navigation component to alert me that the supermarket I’m about to drive by has items I need—well, that would be great! Android is built with that vision in mind, even though it’s not quite there yet.

It’s kind of disappointing, but not surprising, that many app developers do not yet think in terms of this component-based architecture. In fairness, there are already efforts, such as OpenIntents, to build collections of general-purpose intents. Furthermore, the sync APIs are not (yet) for the faint of heart. Even Google-provided services could perhaps be improved. For example, Google Maps does not synchronize stored locations with the web-based version. When I recently missed a highway exit on the way to work and needed to get directions, I had to pull over and re-type the full address. Neither does it expose those locations via a data provider. When I installed Locale, I had to manually re-enter most of “My Locations” from the web version of Google Maps. So, there are clearly some rough edges that I’m sure will be smoothed out.  After all, there have been other rough edges, such as forgotten debugging hooks, something I find more amusing than alarming or embarrassing and certainly not the “Worst. Bug. Ever.

Android has a lot of potential, but it still needs work and Google should move fast. The top two items on my wish list would be:

  1. Release a “signature” device (or two), like the Motorola Razr was a couple of years ago and the Apple iPhone was last year. The G1 is really nice, but not enough.  A device that people desire may be neither a necessary nor a sufficient condition for success, but it will sure help as a vehicle to move Android forward in market share.
  2. Expand the set of available activities and content providers, and release an easy-to-use data sync service and API. In principle, everything that is an iGoogle gadget should also be an Android activity, sharing the same data sources. This is at the core of what “cloud computing” is about.  After all, you could think of Android as a glorified modern browser for devices with small screens, intermittent network connectivity, location sensors, and so on.

I suspect it might not be that hard to build a Google gadget container for Android.  Google Gears is already there and some form of interaction with the local device via Javascript is already allowed.  Many gadgets don’t need that much screen real estate anyway, so this may be an interesting hack to try out.

But not many people will buy an Android device for what it could do some day. Google has created a lot of positive buzz, backed by a few actual features. Now it needs some sexy devices and truly interesting apps, to really jumpstart the necessary network effect. Building the smart shopping list app should be as easy as building the dumb one. In the longer run, the set of devices on which Android is deployed should be expanded.  Move beyond cell phones, to in-car computers, set-top boxes, and so on (Microsoft Windows does both cars and set-top boxes already, but with limited success so far)—in short, anything that can be used to access network-hosted data and applications.

Comments

Randy Pausch in CACM

The September issue of CACM has a one-page, seven-question interview with Randy Pausch. It is definitely worth reading, so I’ll give you a sneak peek (unfortunately, CACM is not “open access”):

What about advice for CS teachers and professors?

That it’s time for us to start being more honest with ourselves about what our field is and how we should approach teaching it. Personally, I think that if we had named the field “Information Engineering” as opposed to “Computer Science,” we would have had a better culture for the discipline. For example, CS departments are notorious for not instilling concepts like testing and validation the way many other engineering disciplines do.

Is there anything you wish someone had told you before you began your own studies?

Just that being technically strong is only one aspect of an education.

[...]

Alice has proven phenomenally successful at teaching young women, in particular, to program. What else should we be doing to get more women engaged in computer science?

Well, it’s important to note that Alice works for both women and men. I think female-specific “approaches” can be dangerous for lots of reasons, but approaches like Alice, which focus on activities like storytelling, work across gender, age, and cultural background. It’s something very fundamental to want to tell stories. And Caitlin Kelleher’s dissertation did a fantastic job of showing just how powerful that approach is.

The interview was conducted a few weeks before his death. I’ll just say that, somehow, I suspect someone not in his position would never have said at least one of these things.  It’s a sad thought, but Randy’s message is, as always, positive.

Comments

We’ve moved!

Early this week we moved from White Plains to Manhattan.  So far, we’ve decorated the apartment using organic landscape elements, in harmony with the surrounding environment.  Here is what I mean:

Urban decoration

On the left is the view outside the window and on the right is what you see inside.

Comments (1)

NYC initiation: rental application

We recently signed a lease to rent in UES. Besides the usual credit check, most places in NYC ask for a slew of personal information: bank statements (with balances and account numbers), federal tax return and W-2 copies, letter of employment stating yearly salary, and three character reference letters.  (As for the landlord, I only know her name)

I’m told that managed buildings may skip some of these, but the apartment we found is in a condominium. Even though the landlord had already approved us, our broker prepared all the paperwork to a tee for the upcoming condo board review.

He even sent us some anonymized character reference letter samples.  Some were quite amusing.  For example (emphasis mine):

[...] I have always found him to be serious and responsible about his works [sic] and his private life. His home life is extremely quiet, and I would think ideal for his neighbors. Virtually all of his social gatherings are conducted in restaurants. He travels throughout nine months of the year and would probably be at home for only short periods of time between those trips. And quite frankly, his time at home is usually spent resting as part of his recovery from his traveling and preparation for his next trip. He is just the kind of quiet, unobtrusive neighbor that I would like to have.

I couldn’t help but wonder how many boards went through letters like this one. For a moment or two, I entertained the thought of asking a friend to write a pithy one-liner instead:

Spiros = corpse – odor + money    ⇒    Spiros = dream tenant !

but I eventually decided that the “⇒” notation might be too much and dropped the idea altogether.

I just hope those sample letters do not really reflect life in NYC!

Comments (1)

Data harvesting with MapReduce

Combine harvesters
(original image source)

“The combine harvester, [...] is a machine that combines the tasks of harvesting, threshing and cleaning grain crops.” If you have acres upon acres of wheat and want to separate the grain from the chaff, a group of combines is what you really want. If you have a bonsai tree and want to trim it, a harvester may be less than ideal.

MapReduce is like a pack of harvesters, well-suited for weeding through a huge volumes of data, residing on a distributed storage system. However, a lot of machine learning work is more akin to trimming bonsai into elaborate patterns. Vice versa, it’s not uncommon to see trimmers used to harvest a wheat field. Well-established and respected researchers, as recently as this year write in their paper “Planetary Scale Views on a Large Instant-messaging Network“:

We gathered data for 30 days of June 2006. Each day yielded about 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pipeline of four threads that parse the data in parallel, collapse the session join/leave events into sets of conversations, and save the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes per day. Processing the data took an additional 4 to 5 hours per day.

Doing the math, that’s five full days of processing to parse and compress the data on a beast of a machine. Even more surprisingly, I found this exact quote singled out among all the interesting results in the paper! Let me make clear that I’m not criticizing the study; in fact, both the dataset and the exploratory analysis are interesting in many ways. However, it is somewhat surprising that, at least among the research community, such a statement is still treated more like a badge of honor rather than an admission of masochism.

The authors should be applauded for their effort. Me, I’m an impatient sod. Wait one day for the results, I think I can do that. Two days, what the heck. But five? For an exploratory statistical analysis? I’d be long gone before that. And what if I found a serious bug half way down the road? That’s after more than two days of waiting, in case you weren’t counting. Or what if I decided I needed a minor modification to extract some other statistic? Wait another five days? Call me a Matlab-spoiled brat, but forget what I said just now about waiting one day. I changed my mind already. A few hours, tops. But we need a lot more studies like this. Consequently, we need the tools to facilitate them.

Hence my decision to frolic with Hadoop. This post focuses on exploratory data analysis tasks: the kind I usually do with Matlab or IPython/SciPy scripts, which involve many iterations of feature extraction, data summarization, model building and validation. This may be contrary to Hadoop’s design priorities: it is not intended for quick turnaround or interactive response times with modestly large datasets. However, it can still make life much easier.

Scale up on large datasets

First, we start with a very simple benchmark, which scans a 350GB text log. Each record is one line, consisting of a comma-separated list of key=value pairs. The job extracts the value for a specific key using a simple regular expression and computes the histogram of the corresponding values (i.e., how many times each distinct value appears in the log). The input consists of approximately 500M records and the chosen key is associated with about 130 distinct values.

Scalability: histogram

The plot above shows aggregate throughput versus number of nodes. HDFS and MapReduce cluster sizes are always equal, with HDFS rebalanced before each run. The job uses a split size of 256MB (or four HDFS blocks) and one reducer. All machines have a total of four cores (most Xeon, a few AMD) and one local disk. Disks range from ridiculously slow laptop-type drives (the most common type), to ridiculously fast SAS drives. Hadoop 0.16.2 (yes, this post took a while to write) and Sun’s 1.6.0_04 JRE were used in all experiments.

For such an embarrassingly parallel task, scaleup is linear. No surprises here, but it’s worth pointing out some numbers. As you can see from the plot, extracting simple statistics from this 350GB dataset took less than ten minutes with 39 nodes, down from several hours on one node. Without knowing the details of how the data were processed, if we assume similar throughput, then processing time of the raw instant messaging log could be roughly reduced from five days to just a few hours. In fact, when parsing a document corpus (about 1TB of raw text) to extract a document-term graph, we witnessed similar scale-up, going down from well over a day on a beast of a machine, to a couple of hours on the Hadoop cluster.

Hadoop is also reasonably simple to program with. It’s main abstraction is natural, even if your familiarity with functional programming concepts is next to none. Furthermore, most distributed execution details are, by default, hidden: if the code runs correctly on your laptop (with a smaller dataset, of course), then it will run correctly on the cluster.

Single core performance

Linear scaleup is good, but how about absolute performance? I implemented the same simple benchmark in C++, using Boost for regex matching. For a rough measure of sustained sequential disk throughput, I simply cat the same large file to /dev/null.

I collected measurements from various machines I had access to: (i) a five year old Mini-ITX system I use with my television at home, running Linux FC8 for this experiment, (ii) a two year old desktop at work, again with FC8, (iii) my three year old Thinkpad running Windows XP and Cygwin, and (iv) a recent IBM blade running RHEL4.

Single core performance

The hand-coded version in C++ is about 40% faster on the older machines and 33% faster on the blade [Note: I'm missing the C++ times for my laptop and it's drive crashed since then -- I was too lazy to reload the data and rerun everything, so I simply extrapolated from single-thread Hadoop assuming a 40% improvement, which seems reasonable enough for these back-of-the-envelope calculations]. Not bad, considering that Hadoop is written in Java and also incurs additional overheads to process each file split separately.

Perhaps I’m flaunting my ignorance but, surprisingly, this workload was CPU-bound and not I/O-bound—with the exception of my laptop, which has a really crappy 2.5″ drive (and Windows XP). Scanning raw text logs is a rather representative workload for real-world data analysis (e.g., AWK was built at AT&T for this purpose).

The blade has a really fast SAS drive (suspiciously fast, except perhaps if it runs at 15K RPM) and the results are particularly instructive. The drive reaches 120MB/sec sustained read throughput. Stated differently, the 3GHz CPU can only dwell on each byte for 24 cycles on average, if it’s to keep up with the drive’s read rate. Even on the other machines, the break-even point is between 30-60 cycles [Note: The laptop drive seems to be an exception, but I wouldn't be so sure that at least part of the inefficiency isn't due to Cygwin].

On the other hand, the benchmark throughput translates into 150-500 cycles per byte, on average. If I get the chance, I’d like to instrument the code with PAPI, validate these numbers and perhaps obtain a breakdown (into average cycles for regex state machine transition per byte, average cycles for hash update per record, etc). I would never have thought the numbers to be so high and I still don’t quite believe it. In any case, if we believe these measurements, at least 4-6 cores are needed to handle the sequential read throughput from a single drive!

The common wisdom in algorithms and databases textbooks, as far as I remember, was that when disk I/O is involved, CPU cycles can be more or less treated as a commodity. Perhaps this is an overstatement, but I didn’t expect it to be so off the mark.

This also raises another interesting question, which was the original motivation for measuring on a broad set of machines: what would be the appropriate cost-performance balance between CPU and disk for a purpose-built machine? I thought one might be able to get away with a setup similar to active disks: a really cheap and power-efficient Mini-ITX board, attached to a couple of moderately priced drives. For example, see this configuration, which was once used in the WayBack machine (I just found out that the VIA-based models have apparently been withdrawn, but the pages are still there for now). This does not seem to be the case.

The blades may be ridiculously expensive, perhaps even a complete waste of money for a moderately tech-savvy person. However, you can’t just throw together any old motherboard and hard disk, and magically turn them into a “supercomputer.” This is common sense, but some of the hype might have you believe the opposite.

Performance on smaller datasets

Once the original, raw data is processed, the representation of the features relevant to the analysis task typically occupies much less space. In this case, a bipartite graph extracted from the same 350GB text logs (the details don’t really matter for this discussion) takes up about 3GB, or two orders of magnitude less space.

Scalability: coclustering iteration

The graph shows aggregate throughput for one iteration of an algorithm similar to k-means clustering. This is fundamentally very similar to computing a simple histogram. In both cases, the output size is very small compared to the input size: the histogram has size proportional to the number of distinct values, whereas the cluster centers occupy space proportional to k. Furthermore, both computations iterate over the entire dataset and perform a hash-based group-by aggregation. For k-means, each point is “hashed” based on its distance to the closest cluster center, and the aggregation involves a vector sum.

Nothing much to say here, except that the linear scaleup tapers off after about 10-15 nodes, essentially due to lack of data: the fixed per-split overheads start dominating the total processing time. Hadoop is not really built to process datasets of modest size, but fundamentally I see nothing to prevent MapReduce from doing so. More importantly, when the dataset becomes really huge, I would expect Hadoop to scale almost-linearly with more nodes.

Hadoop can clearly help pre-process the raw data quickly. Once the relevant features are extracted, they may occupy at least an order of magnitude less space. It may be possible to get away with single-node processing on the appropriate representation of the features, at least for exploratory tasks.  Sometimes it may even be better to use a centralized approach.

Summary

My focus is on exploratory analysis of large datasets, which is a pre-requisite for the design of mining algorithms. Such tasks typically involve (i) raw data pre-processing and feature extraction stages, and (ii) model building and testing stages. Distributed data processing platforms and, in particular, Hadoop are well-suited for such tasks, especially the feature extraction stages.  In fact, tools such as Sawzall (which is akin to AWK, but on top of Google’s MapReduce and protocol buffers), excel at the feature extraction and summarization stages.

The original, raw data may reside in a traditional database, but more often than not they don’t: packet traces, event logs, web crawls, email corpora, sales data, issue-tracking ticket logs, and so on. Hadoop is especially well-suited for “harvesting” those features out of the original data. In its present form, it can also help in model building stages, if the dataset is really large.

In addition to reducing processing time, Hadoop is also quite easy to use. My experience is that the programming effort compares very favorably to the usual approach of writing my own, quick Python scripts for data pre-processing. Furthermore, there are ongoing efforts for even further simplification (e.g., Cascading and Pig).

I was somewhat surprised with the CPU vs I/O trade-offs for what I would consider real-world data processing tasks. Perhaps also influenced by the original work on active disks (one of the inspirations for MapReduce), which suggested using the disk controller to process data. However, there is a cross-over point for the performance of active disks versus centralized processing; I was way off with my initial guess on how much CPU power it takes for a reasonably low cross-over point (which is workload-dependent, of course, and any results herein should be treated as indicative and not conclusive).

Footnote: For what it’s worth, I’ve put up some of the code (and hope to document it sometime). Also, thanks to Stavros Harizopoulos for pointing out the simple cycles-per-byte metric.

Comments

Bad planning

Some visions do not really translate into plans.  Becoming rich, established or happy, for example.  It’s like saying “I want to be a superhero!” How do you go about that?

How to become Spiderman

Nope.  Not much of a plan.

Comments (1)

“Beyond Relational Databases”

The article “Beyond Relational Databases” by Margo Seltzer in the July 2008 issue of CACM claims that “there is more to data access than SQL.”  Although this is a fairly obvious statement, the article is well-written and worth a read.  The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author makes a call for solutions based on a modular approach, using open components.

However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source for inspiration, as well as target of polemic by prominent database researchers.

In database terminology, MapReduce is an execution engine, largely unconcerned about data models and storage schemes.  In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS) but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it

  • Provides data de-clustering and replication across many machines, and
  • Allows computations to execute on local copies of the data.

Arguably, MapReduce is powerful both for the features it provides, as well as for the features it omits, in order to provide a clean and simple programming abstraction, which facilitates improved usability, efficiency and fault-tolerance.

Most of the fundamental ideas for distributed data processing are not new.  For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma“—and he’s probably right.  Reading the original Google papers, none make a claim to fundamental discoveries.  Focusing on “academic novelty” (whatever that may mean) is irrelevant.  Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point.  The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision, on clusters of commodity hardware.

Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?

Data types: In database speak, these are known as “schemas.” Google’s protocol buffers the underlying API for data storage and exchange.  This is also nothing radically new; in essence, it is a binary XML representation,  somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct).  In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc), it would be the lack of such a common data representation format.  Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on).  Interestingly, the rcc record compiler component (which seems to have fallen in disuse) was once called Jute with possibly plans grander than what came to be.  So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little too late.

Data access: In the beginning there was BigTable, which has been recently followed by HBase and Hypertable.  It started fairly simple, as a “is a sparse, distributed, persistent multidimensional sorted map” to quote the original paper.  It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1 was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far.  While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. Turns out this has already been done (JIRA entry and wiki entries).  I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).

Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language which tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiminig to be the equivalent of Make or Ant for distributed processing of huge datasets.  Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy).  As I have used neither Cascading nor Pig at the moment, I will reserve any further comparisons.  It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components, such as HBase. Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.

Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic.  I suspect that IR-style indices, such as Lucene, have much to offer (something that has not gone unnoticed)—more on this in another post.

A number of other projects are also worth keeping an eye on, such as CouchDB, Amazon’s S3, Facebook’s Hive, and JAQL (and I’m sure I’m missing many more).  All of them are, of course, open source.

Comments

A dog’s life

I recently returned from two weeks in Greece, which included four days in Santorini.  Even the dogs sunbathe, enjoying the scenery.  For the most part, I gladly followed their example.

Dog\'s life in Santorini

Upon coming back to New York, somewhat to my surprise, it was the US that felt comparatively dinky.  A few years ago, it used to be the other way around.  However, the contrast between new developments around the 2004 Olympics and New York’s crumbling infrastructure was at least noticeable this time.

Comments

The Fall of CAPTCHAs - really?

I recently saw a Slashdot post dramatically titled “Fallout From the Fall of CAPTCHAs“, citing an equally dramatic article about “How CAPTCHA got trashed“.  Am I missing something? Ignoring their name for a moment, CAPTCHAs are computer programs, following specific rules, and therefore they are subject to the same cat-and-mouse games that all security mechanisms go through. Where exactly is the surprise? So Google’s or Yahoo’s current versions were cracked.  They’ll soon come up with new tricks, and still newer ones after those are cracked, and so on.

In fact, I was always confused about one aspect of CAPTCHAs. I thought that a Turing test is, by definition, administered by a human, so a “completely-automated Turing-test” is an oxymoron, something like a “liberal conservative”. An unbreakable authentication system based on Turing tests should rely fully on human computation: humans should also be at the end that generates the tests. Let humans come up with questions, using references to images, web site content, and whatever else they can think of.  Then match these to other humans who can gain access to a web service by solving the riddles. Perhaps the tests should also be somehow rated, lest the simple act of logging in turns into an absurd treasure hunt. I’m not exactly sure if and how this could be turned into an addictive game, but I’ll leave that to the experts.  The idea is too obvious to miss anyway.

CAPTCHAs, even in their current form, have led to numerous contributions.  A non-exclusive list, in no particular order:

  1. They have a catchy name. That counts a lot. Seriously. I’m not joking; if you don’t believe me, repeat out loud after me: “I have no idea what ‘onomatopoeia’ is—I’d better MSN-Live it” or “… I’d better Yahoo it.”  Doesn’t quite work, does it?
  2. They popularized an idea which, even if not entirely new, was made accesible to webmasters the world over, and is now used daily by thousands if not millions of people.  What greater measure of success can you think of for a technology?
  3. Sowed the seeds for Luis von Ahn’s viral talk on human computation, which has featured in countless universities, companies and conferences.  Although not professionally designed, the slides’ simplicity matches their content in a Jobs-esque way. As for delivery and timing, Steve might even learn something from this talk (although, in fairness, Steve Jobs probably doesn’t get the chance to introduce the same product hundreds of times).

So is anyone really surprised that the race for smarter tests and authentication mechanisms has not ended, and probably never will? (Incidentally, the lecture video above is from 2006, over three years after the first CAPTCHAs were succesfully broken by another computer program—see also CVPR 2003 paper—.  There are no silver bullets, no technology is perfect, but some are really useful. Perhaps CAPTCHAs are, to some extent, victim of their own hype which, however, is instrumental and perhaps even necessary for the wide adoption of any useful technology.  I’m pretty sure we’ll see more elaborate tests soon, not less.

Comments

« Previous entries