My godfather is a Markov model

Before the memory is completely lost in the dust of time, I’d like to document how I ended up with this domain name. It all started last summer, when I decided to start a personal site. Of course, both my first and last names were already taken, even in TLDs I’d never heard of before.  But using my name would have been too easy anyway.  Challenge is good.

Politically-correct and totally un-sarcastic as I am, I originally wanted to go with some combination of “principled anarchy”.  Now, that was available! Apparently, nobody wanted to touch it with a ten-foot pole, not even cybersquatters; which kind of gave me a hint.  Wouldn’t want to, say, end up on a three-letter-agency watchlist, at least not while in the US on H1B.  They might not share my sense of humor.

So, armed with online thesauri, dictionaries, the internet anagram server, and things like that, I set out on a name quest.  I don’t remember anymore what I tried; “coredump” (which, in case you didn’t know, has “code rump” as an anagram—still available, if you’re interested), “segfault”, “brainfart”, “farout”, and pretty much anything else I could think of: all taken.   Even these names as well as these are taken (thank god!).

At some point I was naïve enough to hope that a Tolkien name would be free.  No luck of course, anything semi-pronounceable was taken.  You’d have to go as far as, say, “gulduin” (which, by the way, means “magic river” in Elvish) to find something available. Good luck getting people to remember that!  Oh well, at least I had a reason to actually read some of the Silmarillion; if you’ve tried this and you’re not a religiously devoted Tolkien fan, you know what I’m talking about.

After the first week of searching, I think I even got temporarily banned from Yahoo! whois search. In desperation, I finally turned to one of many domain name generators.  I asked omniscient Google to give me one and, as always, it obliged.  By now I had decided that I wanted a name as free of any connotations as possible (say, like Google or Slashdot, not like Facebook or YouTube).  I went through things like “fractors”, “naphead”, “magnarchy”, “aniarchy”, “mallock”, “hexndex”, “squilt”, “terable”, and so on. It’s amazing how several weeks of searching in frustration temper one’s standards of quality. Anyway, one day “bitquill” popped up: neutral, inoffensive, bland, unusual, and a composite which is short and almost pronounceable!  I couldn’t ask for much more, so I registered it.

That, and “clusterhack”.  Sorry.  I couldn’t resist.


On data ownership in a networked world

Every piece of content has a creator and owner (in this post, I will assume they are by default the same entity). I do not mean ownership in the traditional sense of, e.g., stashing a piece of paper in a drawer, but in the metaphysical sense that each artifact is forever associated with one or more “creators.”

This is certainly true of the end-products of intellectual labor, such as the article you are reading. However, it is also true of more mundane things, such as checkbook register entries or credit card activity. Whenever you pay a bill or purchase an item, you implicitly “create” a piece of content: the associated entry in your statement.  This has two immediately identifiable “creators”: the payer (you) and the payee.  The same is true for, e.g., your email, your IM chats, your web searches, etc. Interesting tidbit: over 20% of search terms entered daily in Google are new, which would imply roughly 20 million new pieces of content per day, or over 7 billion (more than the earth’s population) per year, all this from just one activity on one website.

When I spend a few weeks working on, say, a research paper, I have certain expectations and demands about my rights as a “creator.” However, I give almost no thought to my rights on the trail of droppings (digital or otherwise) that I “create” each day, by searching the web, filling up the gas tank, getting coffee, going through a toll booth, swiping my badge, and so on.  However, with the increasing ease of data collection and distribution in digital form, we should re-think our attitudes towards “authorship”.

Unique identity

People call me “Spiros”, my identity documents list me as “Spyridon Papadimitriou” and on most online sites I’m registered as spapadim.  However, sometimes I’m s_papadim or spiros_papadimitriou, and so on.  Like most people, I lost track of all my accounts a long time ago.  Vice versa, I’m not the only “Spiros Papadimitriou” in the real world.  For example, I occasionally get confused with my cousin, and receive comments about my interesting architectural designs!  Nor am I the only spapadim on the net.

What is missing is a framework, with supporting mechanisms, that allows (but does not enforce) asserting and verifying which of those labels (i.e., names, userids, etc.) refer to the same entity (i.e., me). However, this is a prerequisite: how can we talk about data ownership and tackle portability, transparency and accountability, if we have to jump through countless hoops just to prove identity?

Some people, especially in the US, may object or even outright panic at the thought of such a global identifier.  In Greece, and in much of Europe, we’ve had national identity cards for decades.  Which is fine, as long as you know they exist and what the permissible uses are; in other words, as long as transparency is ensured.  Furthermore, the illusion of privacy should not be confused with privacy itself; if in doubt, I suggest reading “Database Nation” (official site).  Its examples are largely US-centric, but the lessons are not.

OpenID (despite some shortcomings) and OAuth are emerging as open standards for authentication and authorization.  OpenID allows reuse of authentication credentials from one site on others: I can reuse, say, my Google username and password to log in to other sites (e.g., to leave a comment on this blog), without having to create yet another account from scratch.  OAuth resembles Kerberos’s ticket granting service but for the web, permitting other web services to ask for access to a subset of personal information: I could allow Facebook to access only my Google addressbook and not, potentially, all of my data on any Google service.  OpenID and OAuth can, at least in principle, work together.

Both high-profile individual developers and major companies are involved in these efforts.  For example, Yahoo! already supports OpenID and plans to support OAuth as well, while Google supports OAuth directly and OpenID indirectly in various ways.  Wide adoption of these standards would be a major step forwards for data portability and web interoperability.  However, I suspect they fall slightly short of providing a truly permanent and global personal identity.  What if, for any reason, my Yahoo! account disappears, either because I decided to shut it down or because Yahoo! went bust?

I was going to suggest a DNS-based solution, and I was surprised to find that the generic top-level domain .name has existed since 2001 to provide URIs for personal identities. You can register for a free three-month trial on FreeYourID (after that, it’s $11/year). What’s more, their service already provides OpenID authentication. In principle, this should allow easy switching of authentication and authorization service providers. Just as I can still keep the “label” for this site even if I move to a different web host, I can still keep my personal “label” no matter who I choose to manage my personal information.  So, now my universal username is spiros.papadimitriou.name, any emails sent to spiros@papadimitriou.name will find their way to me, you can call me on Skype using spiros.papadimitriou.name/call, and so on.

With such a unique identity tied to authorization and authentication services, the Giant Global Graph and its materializations would be one step closer to becoming really useful. If I want to use my identity to log and control access to my data, I should be able to prove my claims.  Currently, FOAF and XFN allow assertion of relationships but provide no way to verify them.

Data portability

The point of this mental exercise so far is the following: A unique identity that can be verifiably associated with each and every data item that I produce is a prerequisite for making data ownership claims. Subsequently, we need to ask what fundamental rights should be associated with data ownership.  The first is the right to keep my information with me or, in other words, “data portability”. Just as I can freely move my money from one financial institution to another, I should be able to move any of my information from one data warehouse to another.

For example, consider my web search history. I don’t think I need to argue about the importance of historical information to improve search quality. If I decide for any reason to move to another search provider, I should be able to carry along all the information that’s directly associated with me.  This should include my search keyword history, as well as any additional information I may have contributed.

The actual details, however, may not be that straightforward.  Take, say, the third hit on a Google search.  Who is the “creator”?  Me by entering the search keywords, Google by producing the search results in response to those keywords, or the person who wrote the web page that contains them in the first place?  Similarly, when I buy gas, who is the “creator” of the transaction entry: me, Mobil, or American Express?

Even though intuition can often be wrong, my intuitive response to the Google search example would be that both I and Google have an ownership claim on this particular search, which includes the query keywords as well as a ranking of URLs.  On the other hand, the person who wrote the contents of, say, the third URL has ownership claims only on those, and not the search results.  Furthermore, the thousands of people that provided feedback to Google’s ranking algorithms by clicking on this URL on similar searches have ownership claims on those searches, but not on mine.

Finally, those two ownership claims (on keywords and on rankings) should probably not be treated the same.  If they were, then, say, MS Live could effectively copy Google by getting many users to move.  It seems reasonable to have the right to move my search history, but not the actual search results. However, I can imagine that some form of ownership claim on the rankings may be useful for other personal rights.

This is a highly idealized example and I’m not sure what an appropriate litmus test for ownership is, but some form of legal consensus must be in place.

Transparency

The second fundamental right is that I should know who is using my personal information and how. For example, if an insurance company accesses my credit history to give me a rate quote, I can find this out. It may not be a completely painless process but it is certainly possible today, with a regulatory framework that ensures this.  Similar regulations should be instituted to cover any and all forms of access to personal information.

Data access should be fully transparent to all parties involved. If an insurance company accesses my medical records, I should know this.  If the government does a background check on me, I should know this too.  Transparency is a prerequisite for accountability. Otherwise, individuals have very limited power to protect themselves from improper uses of their personal information.

Concluding remarks

Much of the privacy research in computer science seems to assume that we can keep the existing legal and regulatory frameworks intact. Computer scientists taking such a position is even sadder than lawyers doing so; we have no excuse for failing to understand the technical issues. We cannot and should not make this assumption. Technical solutions should be subsidiary to new regulations.  But that doesn’t mean technologists cannot lead.  We should work towards supporting full transparency (for both individuals, as well as governments and corporations) rather than opacity and I’m currently in favor of a “shoot first, ask questions later” approach (and help lawmakers figure out the answers). After all, if there is anything that the DRM wars have taught us, it’s that information really wants to be free. Why do we think it’s technically hard (to say the least) to prevent copying of music, movies and software but we still think it may be possible to prevent copying of personal information? As I pointed out in an older post, it’s usually the use and not the possession of information that’s the problem.

My point in this post is simple: we should not fight the wrong war. Instead, we need an easy way to make data ownership claims, and use this to enforce at least two fundamental rights: the ability to keep any personal data with us, and the ability to know who is using this data and how.

Postscript. This post was wallowing for a while as a draft (originally separated from this post, then forgotten).  Since then, a recent MIT TR article discusses some aspects of data ownership.  Even better, I have since found an excellent short piece in the same issue by Esther Dyson, with which I could not agree more.

Update. After posting this last night, I did some further Googling and found another piece by Esther Dyson in the Scientific American. If you’ve read through my ramblings so far, then I’d urge you to read her article; she’s a much better writer than me, and has apparently been thinking about these issues for almost a decade, way before many people even knew what the Internet is. I should probably follow her more closely myself, as I agree disturbingly often with what I’ve read from her so far.


Data harvesting with MapReduce

Combine harvesters
(original image source)

“The combine harvester, [...] is a machine that combines the tasks of harvesting, threshing and cleaning grain crops.” If you have acres upon acres of wheat and want to separate the grain from the chaff, a group of combines is what you really want. If you have a bonsai tree and want to trim it, a harvester may be less than ideal.

MapReduce is like a pack of harvesters, well-suited for weeding through huge volumes of data residing on a distributed storage system. However, a lot of machine learning work is more akin to trimming bonsai into elaborate patterns. Vice versa, it’s not uncommon to see trimmers used to harvest a wheat field. Well-established and respected researchers, as recently as this year, write in their paper “Planetary Scale Views on a Large Instant-messaging Network”:

We gathered data for 30 days of June 2006. Each day yielded about 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pipeline of four threads that parse the data in parallel, collapse the session join/leave events into sets of conversations, and save the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes per day. Processing the data took an additional 4 to 5 hours per day.

Doing the math, that’s five full days of processing to parse and compress the data on a beast of a machine. Even more surprisingly, I found this exact quote singled out among all the interesting results in the paper! Let me make clear that I’m not criticizing the study; in fact, both the dataset and the exploratory analysis are interesting in many ways. However, it is somewhat surprising that, at least among the research community, such a statement is still treated more like a badge of honor rather than an admission of masochism.

The authors should be applauded for their effort. Me, I’m an impatient sod. Wait one day for the results, I think I can do that. Two days, what the heck. But five? For an exploratory statistical analysis? I’d be long gone before that. And what if I found a serious bug half way down the road? That’s after more than two days of waiting, in case you weren’t counting. Or what if I decided I needed a minor modification to extract some other statistic? Wait another five days? Call me a Matlab-spoiled brat, but forget what I said just now about waiting one day. I changed my mind already. A few hours, tops. But we need a lot more studies like this. Consequently, we need the tools to facilitate them.

Hence my decision to frolic with Hadoop. This post focuses on exploratory data analysis tasks: the kind I usually do with Matlab or IPython/SciPy scripts, which involve many iterations of feature extraction, data summarization, model building and validation. This may be contrary to Hadoop’s design priorities: it is not intended for quick turnaround or interactive response times with modestly large datasets. However, it can still make life much easier.

Scale up on large datasets

First, we start with a very simple benchmark, which scans a 350GB text log. Each record is one line, consisting of a comma-separated list of key=value pairs. The job extracts the value for a specific key using a simple regular expression and computes the histogram of the corresponding values (i.e., how many times each distinct value appears in the log). The input consists of approximately 500M records and the chosen key is associated with about 130 distinct values.
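As a minimal sketch of the job just described, here is the same logic in plain Python, written in a map/reduce shape but without any Hadoop API. The key name `status`, the sample records, and all function names are hypothetical; the post does not say which key was used or how the actual job was coded.

```python
import re
from collections import Counter

KEY = "status"  # hypothetical key name; the post does not name the real one

# Match "KEY=value" at the start of a record or right after a comma.
PATTERN = re.compile(r"(?:^|,)" + re.escape(KEY) + r"=([^,\n]*)")

def map_record(line):
    """Map step: emit (value, 1) if the record contains the chosen key."""
    match = PATTERN.search(line)
    if match:
        yield (match.group(1), 1)

def reduce_counts(pairs):
    """Reduce step: sum the counts for each distinct value."""
    hist = Counter()
    for value, count in pairs:
        hist[value] += count
    return hist

def histogram(lines):
    """Run the map step over every record, then reduce to a histogram."""
    return reduce_counts(pair for line in lines for pair in map_record(line))

sample = [
    "user=a,status=ok,bytes=10",
    "status=fail,user=b",
    "user=c,bytes=7",          # no matching key: contributes nothing
    "user=d,status=ok\n",
]
result = histogram(sample)
```

With only about 130 distinct values, the reduce output is tiny compared to the input, which is why a single reducer is enough for this kind of job.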

Scalability: histogram

The plot above shows aggregate throughput versus number of nodes. HDFS and MapReduce cluster sizes are always equal, with HDFS rebalanced before each run. The job uses a split size of 256MB (or four HDFS blocks) and one reducer. All machines have a total of four cores (most Xeon, a few AMD) and one local disk. Disks range from ridiculously slow laptop-type drives (the most common type), to ridiculously fast SAS drives. Hadoop 0.16.2 (yes, this post took a while to write) and Sun’s 1.6.0_04 JRE were used in all experiments.

For such an embarrassingly parallel task, scaleup is linear. No surprises here, but it’s worth pointing out some numbers. As you can see from the plot, extracting simple statistics from this 350GB dataset took less than ten minutes with 39 nodes, down from several hours on one node. Without knowing the details of how the data were processed, if we assume similar throughput, then processing time of the raw instant messaging log could be roughly reduced from five days to just a few hours. In fact, when parsing a document corpus (about 1TB of raw text) to extract a document-term graph, we witnessed similar scale-up, going down from well over a day on a beast of a machine, to a couple of hours on the Hadoop cluster.
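The back-of-the-envelope extrapolation above can be spelled out as follows. This is only a sketch under loud assumptions: it takes the "less than ten minutes" figure as exactly ten minutes, treats the 4.5 terabytes of compressed instant-messaging logs as if they could be scanned at the same aggregate rate as the 350GB text log, and uses decimal units throughout.

```python
def aggregate_throughput_gb_per_s(dataset_gb, minutes):
    """Aggregate scan rate of the whole cluster, in GB per second."""
    return dataset_gb / (minutes * 60.0)

def hours_to_scan(dataset_gb, throughput_gb_per_s):
    """Time to scan a dataset at a given aggregate rate, in hours."""
    return dataset_gb / throughput_gb_per_s / 3600.0

# 350GB in (at most) ten minutes on 39 nodes, as reported above.
cluster_rate = aggregate_throughput_gb_per_s(350, 10)

# Assumption: the 4.5TB instant-messaging log scans at the same rate.
im_log_hours = hours_to_scan(4500, cluster_rate)
```

Under these assumptions the estimate comes out at a little over two hours, consistent with "just a few hours".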

Hadoop is also reasonably simple to program with. Its main abstraction is natural, even if your familiarity with functional programming concepts is next to none. Furthermore, most distributed execution details are, by default, hidden: if the code runs correctly on your laptop (with a smaller dataset, of course), then it will run correctly on the cluster.

Single core performance

Linear scaleup is good, but how about absolute performance? I implemented the same simple benchmark in C++, using Boost for regex matching. For a rough measure of sustained sequential disk throughput, I simply cat the same large file to /dev/null.

I collected measurements from various machines I had access to: (i) a five year old Mini-ITX system I use with my television at home, running Linux FC8 for this experiment, (ii) a two year old desktop at work, again with FC8, (iii) my three year old Thinkpad running Windows XP and Cygwin, and (iv) a recent IBM blade running RHEL4.

Single core performance

The hand-coded version in C++ is about 40% faster on the older machines and 33% faster on the blade [Note: I'm missing the C++ times for my laptop and its drive crashed since then -- I was too lazy to reload the data and rerun everything, so I simply extrapolated from single-thread Hadoop assuming a 40% improvement, which seems reasonable enough for these back-of-the-envelope calculations]. Not bad, considering that Hadoop is written in Java and also incurs additional overheads to process each file split separately.

Perhaps I’m flaunting my ignorance but, surprisingly, this workload was CPU-bound and not I/O-bound—with the exception of my laptop, which has a really crappy 2.5″ drive (and Windows XP). Scanning raw text logs is a rather representative workload for real-world data analysis (e.g., AWK was built at AT&T for this purpose).

The blade has a really fast SAS drive (suspiciously fast, except perhaps if it runs at 15K RPM) and the results are particularly instructive. The drive reaches 120MB/sec sustained read throughput. Stated differently, the 3GHz CPU can only dwell on each byte for 24 cycles on average, if it’s to keep up with the drive’s read rate. Even on the other machines, the break-even point is between 30-60 cycles [Note: The laptop drive seems to be an exception, but I wouldn't be so sure that at least part of the inefficiency isn't due to Cygwin].

On the other hand, the benchmark throughput translates into 150-500 cycles per byte, on average. If I get the chance, I’d like to instrument the code with PAPI, validate these numbers and perhaps obtain a breakdown (into average cycles for regex state machine transition per byte, average cycles for hash update per record, etc). I would never have thought the numbers to be so high and I still don’t quite believe it. In any case, if we believe these measurements, at least 4-6 cores are needed to handle the sequential read throughput from a single drive!
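The cycles-per-byte arithmetic is simple enough to write down. The sketch below assumes that "MB" means 2**20 bytes and that work divides perfectly across cores. Pairing the low end of the measured range (150 cycles per byte) with the blade's budget is also an assumption made for illustration, since the per-machine figures are not listed here.

```python
MIB = 2 ** 20  # assumption: "MB" in the text means 2**20 bytes

def cycle_budget_per_byte(clock_hz, disk_bytes_per_sec):
    """Cycles one core can spend per byte while keeping pace with the disk."""
    return clock_hz / disk_bytes_per_sec

def cores_needed(workload_cycles_per_byte, budget_cycles_per_byte):
    """Cores' worth of cycles consumed per byte, assuming perfect scaling."""
    return workload_cycles_per_byte / budget_cycles_per_byte

# Blade: 3GHz CPU, 120MB/sec sustained read throughput.
blade_budget = cycle_budget_per_byte(3e9, 120 * MIB)   # roughly 24 cycles

# Illustrative pairing: 150 cycles per byte measured against that budget.
blade_cores = cores_needed(150, blade_budget)          # a bit over 6
```

The same two functions applied to a 30-60 cycle budget and a slower benchmark figure give ratios in the same ballpark, which is where the multi-core estimate comes from.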

The common wisdom in algorithms and databases textbooks, as far as I remember, was that when disk I/O is involved, CPU cycles can be more or less treated as a commodity. Perhaps this is an overstatement, but I didn’t expect it to be so off the mark.

This also raises another interesting question, which was the original motivation for measuring on a broad set of machines: what would be the appropriate cost-performance balance between CPU and disk for a purpose-built machine? I thought one might be able to get away with a setup similar to active disks: a really cheap and power-efficient Mini-ITX board, attached to a couple of moderately priced drives. For example, see this configuration, which was once used in the WayBack machine (I just found out that the VIA-based models have apparently been withdrawn, but the pages are still there for now). This does not seem to be the case.

The blades may be ridiculously expensive, perhaps even a complete waste of money for a moderately tech-savvy person. However, you can’t just throw together any old motherboard and hard disk, and magically turn them into a “supercomputer.” This is common sense, but some of the hype might have you believe the opposite.

Performance on smaller datasets

Once the original, raw data is processed, the representation of the features relevant to the analysis task typically occupies much less space. In this case, a bipartite graph extracted from the same 350GB text logs (the details don’t really matter for this discussion) takes up about 3GB, or two orders of magnitude less space.

Scalability: coclustering iteration

The graph shows aggregate throughput for one iteration of an algorithm similar to k-means clustering. This is fundamentally very similar to computing a simple histogram. In both cases, the output size is very small compared to the input size: the histogram has size proportional to the number of distinct values, whereas the cluster centers occupy space proportional to k. Furthermore, both computations iterate over the entire dataset and perform a hash-based group-by aggregation. For k-means, each point is “hashed” based on its distance to the closest cluster center, and the aggregation involves a vector sum.
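To make the parallel with the histogram concrete, here is a rough sketch of one iteration of plain k-means written as a group-by aggregation. It is not the algorithm actually benchmarked (which is only described as similar to k-means), and the function names and sample points are made up for illustration.

```python
def closest_center(point, centers):
    """The grouping key: index of the center nearest to the point."""
    def squared_distance(i):
        return sum((p - c) ** 2 for p, c in zip(point, centers[i]))
    return min(range(len(centers)), key=squared_distance)

def kmeans_iteration(points, centers):
    """One pass over the data: group points by nearest center, then average."""
    dim = len(centers[0])
    sums = {}  # cluster index -> (running vector sum, point count)
    for point in points:
        k = closest_center(point, centers)              # "hash" the point
        vec, n = sums.get(k, ([0.0] * dim, 0))
        sums[k] = ([a + b for a, b in zip(vec, point)], n + 1)  # vector sum
    new_centers = list(centers)  # clusters with no points keep their old center
    for k, (vec, n) in sums.items():
        new_centers[k] = tuple(v / n for v in vec)
    return new_centers

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers = [(0, 1), (9, 9)]
updated = kmeans_iteration(points, centers)
```

As with the histogram, the aggregated state (one vector sum and one count per cluster) stays small no matter how many points are scanned.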

Nothing much to say here, except that the linear scaleup tapers off after about 10-15 nodes, essentially due to lack of data: the fixed per-split overheads start dominating the total processing time. Hadoop is not really built to process datasets of modest size, but fundamentally I see nothing to prevent MapReduce from doing so. More importantly, when the dataset becomes really huge, I would expect Hadoop to scale almost-linearly with more nodes.

Hadoop can clearly help pre-process the raw data quickly. Once the relevant features are extracted, they may occupy at least an order of magnitude less space. It may be possible to get away with single-node processing on the appropriate representation of the features, at least for exploratory tasks.  Sometimes it may even be better to use a centralized approach.

Summary

My focus is on exploratory analysis of large datasets, which is a pre-requisite for the design of mining algorithms. Such tasks typically involve (i) raw data pre-processing and feature extraction stages, and (ii) model building and testing stages. Distributed data processing platforms and, in particular, Hadoop are well-suited for such tasks, especially the feature extraction stages.  In fact, tools such as Sawzall (which is akin to AWK, but on top of Google’s MapReduce and protocol buffers), excel at the feature extraction and summarization stages.

The original, raw data may reside in a traditional database, but more often than not they don’t: packet traces, event logs, web crawls, email corpora, sales data, issue-tracking ticket logs, and so on. Hadoop is especially well-suited for “harvesting” those features out of the original data. In its present form, it can also help in model building stages, if the dataset is really large.

In addition to reducing processing time, Hadoop is also quite easy to use. My experience is that the programming effort compares very favorably to the usual approach of writing my own, quick Python scripts for data pre-processing. Furthermore, there are ongoing efforts for even further simplification (e.g., Cascading and Pig).

I was somewhat surprised by the CPU vs I/O trade-offs for what I would consider real-world data processing tasks. Perhaps I was also influenced by the original work on active disks (one of the inspirations for MapReduce), which suggested using the disk controller to process data. However, there is a cross-over point for the performance of active disks versus centralized processing; I was way off with my initial guess on how much CPU power it takes for a reasonably low cross-over point (which is workload-dependent, of course, and any results herein should be treated as indicative and not conclusive).

Footnote: For what it’s worth, I’ve put up some of the code (and hope to document it sometime). Also, thanks to Stavros Harizopoulos for pointing out the simple cycles-per-byte metric.


The Fall of CAPTCHAs – really?

I recently saw a Slashdot post dramatically titled “Fallout From the Fall of CAPTCHAs”, citing an equally dramatic article about “How CAPTCHA got trashed”.  Am I missing something? Ignoring their name for a moment, CAPTCHAs are computer programs, following specific rules, and therefore they are subject to the same cat-and-mouse games that all security mechanisms go through. Where exactly is the surprise? So Google’s or Yahoo’s current versions were cracked.  They’ll soon come up with new tricks, and still newer ones after those are cracked, and so on.

In fact, I was always confused about one aspect of CAPTCHAs. I thought that a Turing test is, by definition, administered by a human, so a “completely-automated Turing-test” is an oxymoron, something like a “liberal conservative”. An unbreakable authentication system based on Turing tests should rely fully on human computation: humans should also be at the end that generates the tests. Let humans come up with questions, using references to images, web site content, and whatever else they can think of.  Then match these to other humans who can gain access to a web service by solving the riddles. Perhaps the tests should also be somehow rated, lest the simple act of logging in turns into an absurd treasure hunt. I’m not exactly sure if and how this could be turned into an addictive game, but I’ll leave that to the experts.  The idea is too obvious to miss anyway.

CAPTCHAs, even in their current form, have led to numerous contributions.  A non-exclusive list, in no particular order:

  1. They have a catchy name. That counts a lot. Seriously. I’m not joking; if you don’t believe me, repeat out loud after me: “I have no idea what ‘onomatopoeia’ is—I’d better MSN-Live it” or “… I’d better Yahoo it.”  Doesn’t quite work, does it?
  2. They popularized an idea which, even if not entirely new, was made accessible to webmasters the world over, and is now used daily by thousands if not millions of people.  What greater measure of success can you think of for a technology?
  3. They sowed the seeds for Luis von Ahn’s viral talk on human computation, which has featured in countless universities, companies and conferences.  Although not professionally designed, the slides’ simplicity matches their content in a Jobs-esque way. As for delivery and timing, Steve might even learn something from this talk (although, in fairness, Steve Jobs probably doesn’t get the chance to introduce the same product hundreds of times).

So is anyone really surprised that the race for smarter tests and authentication mechanisms has not ended, and probably never will? (Incidentally, the lecture video above is from 2006, over three years after the first CAPTCHAs were successfully broken by another computer program; see also the CVPR 2003 paper.)  There are no silver bullets, no technology is perfect, but some are really useful. Perhaps CAPTCHAs are, to some extent, victims of their own hype which, however, is instrumental and perhaps even necessary for the wide adoption of any useful technology.  I’m pretty sure we’ll see more elaborate tests soon, not fewer.


Web science: what and how?

From the article “Web Science: An Interdisciplinary Approach to Understanding the Web” in the July issue of CACM (which, by the way, looks quite impressive after the editorial overhaul!):

At the micro scale, the Web is an infrastructure of artificial languages and protocols; it is a piece of engineering. [...] The macro system, that is, the use of the micro system by many users interacting with one another in often-unpredicted ways, is far more interesting in and of itself and generally must be analyzed in ways that are different from the micro system. [...] The essence of our understanding of what succeeds on the Web and how to develop better Web applications is that we must create new ways to understand how to design systems to produce the effect we want.  The best we can do today is design and build in the micro, hoping for the best, but how do we know if we’ve built in the right functionality to ensure the desired macroscale effects? How do we predict other side effects and the emergent properties of the macro? [...] Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology and economics.

This is a noble goal indeed. The Wikipedia article on sociology sounds quite similar in many respects:

Sociologists research macro-structures and processes that organize or affect society [...] And, they research micro-processes [...] Sociologists often use  quantitative methods—such as social statistics or network analysis—to investigate the structure of a social process or describe patterns in social relationships. Sociologists also often use qualitative methods—such as focused interviews, group discussions and ethnographic methods—to investigate social processes.

First, we have to keep in mind that the current Western notion of “science” is fairly recent.  Furthermore, it has not always been the case that technology follows science. As an example, in the book “A People’s History of Science” by Clifford Conner, one can find the following quotation from Galileo’s Two New Sciences, about Venice’s weapons factory (the Arsenal):

Indeed, I myself, being curious by nature, frequently visit this place for the mere pleasure of observing the work of those who, on account of their superiority over other artisans, we call “first rank men.” Conference with them has often helped me in the investigation of certain effects, including not only those which are striking, but also those which are recondite and almost incredible.

Later on, Conner says (p.284), again quoting Galileo himself from the same source:

[Galileo] demonstrated mathematically that “if projectiles are fired … all having the same speed, but each having a different elevation, the maximum range … will be obtained when the elevation is 45°: the other shots, fired at angles greater or less will have a shorter range.” But in recounting how he arrived at that conclusion, he revealed that his initial inspiration came from discussions at the Arsenal: “From accounts given by gunners, I was already aware of the fact that in the use of cannons and mortars, the maximum range, that is the one in which the shot goes the farthest, is obtained when the elevation is 45°.” Although Galileo’s mathematical analysis of the problem was a valuable original contribution, it did not tell workers at the Arsenal anything they had not previously learned by empirical tests, and had little effect on the practical art of gunnery.

In any case, facilitating “technology” or “engineering” is certainly not the only good reason to pursue scientific knowledge. Conversely, although “pure science” certainly has an important role, it is not the only ingredient of technological progress (something I’ve alluded to in a previous post about, essentially, the venture capital approach to research).  Furthermore, some partly misguided opinions about the future of science have brightly shot through the journalistic sphere.

However, if, for whatever reason, we decide to go the way of science (a worthy pursuit), then I am reminded of the following interview of Richard Feynman by the BBC in 1981 (full programme):

Privacy concerns notwithstanding, the web gives us unprecedented opportunities to collect measurements in quantities and levels of detail that simply were not possible when the venerable state-of-the-art involved, e.g., passing around written notes among a few people. So, perhaps we can now check hypotheses more vigorously and eventually formulate universal laws (in the sense of physics).  Perhaps the web will allow us to prove Feynman wrong.

I’m not entirely convinced that it is possible to get quantitative causal models (a.k.a. laws) of this sort. But if it is, then we need an appropriate experimental apparatus for large-scale data analysis to test hypotheses: what would be, say, the LHC-equivalent for web science?  (After all, pure science seems to have an increasing need for powerful apparatuses.) I’ll write some initial thoughts and early observations on this in another post.

I’m pretty sure that my recent posts have been doing circles around something, but I’m not quite sure yet what that is.  In any case, all this seems an interesting direction worth pursuing.  Even though Feynman was sometimes a very harsh critic, we should perhaps remember his words along the way.


Research and new media: the academic clowd

I have a little secret: Slashdot may have lost its lustre now, but back in 2001, shortly after returning from my refreshing internship at Almaden, I posted a question to “Ask Slashdot” for the first and last time. I posed the question rather poorly and was ignored. Although I could not find exactly what I wrote back then, it was something along the lines of “why aren’t academic venues more like SourceForge?” You have to remember that this was the early 2000s, when large and transparent user communities existed only in the technical sphere, and things like SourceForge were the prototypical sites for focused online communities. So why couldn’t academia and the research community open things up a bit more, and leverage new media to set up virtual forums for lively world-wide discussions and collaborations?

Fast-forward seven years. I got a feeling of déjà vu when I saw two recent blog posts and a Slashdot post. The first two question specific aspects of current publishing practices, while the “Ask Slashdot” post wonders whether academic journals are obsolete. The technologies and media have changed dramatically since 2001, but the essence remains the same.

Going over the comments on Slashdot, even though there are some surprisingly (for Slashdot) insightful ones, there is also one fundamental misconception. I was genuinely surprised at its prevalence. Many commenters seem to identify the general notion of “peer evaluation” with the specific mechanisms currently employed to do it. Is the current way of doing things so deeply entrenched that people are blind to other possibilities?

Quoting a random vicious comment: “The purpose of restricting published work to that which has passed peer review is to ensure that results do not become obsolete. They must uphold the same quality standards that we expect from all scientific disciplines—not blog-style fads that have become popular and at some stage will cease to be popular.” I wonder if the commenter has ever written a blog himself, or whether he has even just taken a look at, say, Technorati: there are over four million blogs out there and 99% have just one reader (the author). Very few blogs are popular (i.e., actually read by a significant number of people). An explosion in the quantity of published content does not imply a proportional explosion in its consumption; quite the contrary. If anything, there is more competition for attention, not less.

Another commenter said that “there isn’t any direct communication between reviewers and submitters.” Not so. Take a look at Julian Besag’s “On the Statistical Analysis of Dirty Pictures” (unfortunately JSTOR is restricted-access, but maybe your institution has a subscription), published in the Journal of the Royal Statistical Society as recently as 1986. The actual paper is 21 pages, while the other 23 pages are devoted to an open discussion. This looks oddly familiar (déjà vu again): it looks like very popular blogs, which often have comment sections larger than the original posts. A free and open discussion of ideas has always been an organic part of the research process. A few centuries ago, scientific articles appeared with a date on which they were “read” to the community (just take a look at, e.g., an issue of the Philosophical Transactions of the Royal Society).

Research on the web

Reaching far out into the long tail of ideas, which I also discussed in a previous post, should arguably be a top priority for research. In other endeavors it is an important means to success (financial or otherwise), but in academia and the research community it is usually an end in itself. The web itself was originally conceived as a venue for the exchange of scientific ideas, but even its creators probably did not envision the full potential nor realize all the implications of democratizing publication.

Modern technology allows more researchers (whether they work for startups, academic institutions, or large corporations) to try out more ideas. In other words, the production of research output is scaling up to unprecedented levels. However, I strongly suspect that traditional ways of evaluating research will not scale for much longer, being unable to keep up with the explosive growth in the rate of new ideas.

The typical process for evaluating and disseminating research, at least in computer science, with which I am familiar, seems to be the following (with perhaps a few exceptions). First you come up with an interesting idea. Next, you build a story around it and do the minimal work to support that story. If everything works out, you write it up and submit it to a conference or, more rarely, a journal. On average, three people (chosen largely at random) review your work, making some comments in private. Once your work is published, you move on to the next paper.

I would simply name two artifacts as the main “products” of computer science research: papers and software. The latter is often overlooked, but it’s at least as important as the former. Anyway, what might be the state-of-the-art media for each of those artifacts?

There are some well-known efforts to use the web for the former. For example, there is arXiv for physics and other sciences, CoRR for computer science, and PLOS for life sciences. There is also VideoLectures for open access to some talks. All of these, however, largely mirror the established ways of doing things: they are still built using the paradigm of a “library”. Although very important steps in the right direction, they perhaps play second fiddle to traditional media (there is a reason that arXiv is called a “pre-print server”) and thus fail to fully realize the potential offered by the rapidly emerging social media.

Things are perhaps a little more advanced for software artifacts. There are SourceForge, Google Code, and countless other similar sites for hosting source code, tracking issues and holding online discussions. There are also Freshmeat, Ohloh, and other project directories, as well as source code search engines such as Koders. However, none of these (or, as far as I know, anything similar) has been widely embraced by the research community.

Enough about today. It is more interesting to try and imagine how all these things, and more, may come together in the future.

The academic clowd scenario

Shamelessly copying this post, let’s imagine the academic clowd (cloud + crowd).

You have a great new idea and decide to try it out. You write a proof-of-concept implementation and run it on the cloud, using large datasets that also live out there. The implementation itself is available to the clowd, which can analyze the revision control logs and find out who really worked on what.

Your idea works and you decide to write a research article about it. The clowd knows what papers you wrote, who your co-authors are and which conferences and journals you publish in (cf. DBLP). It also knows the content of your papers (cf. CiteSeer). So, when you publish your new article, it compares it with the existing literature and finds the most relevant experts (in terms of content, co-citations, venues of publication, etc.) to evaluate your work. It knows who your close friends and relatives are (from Facebook) and automatically excludes them from the list of potential reviewers. It also excludes your co-authors from the past three years. Then, it solicits reviews from those experts. Of course, it also allows others who are interested to participate in the discussion.
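Just to make the reviewer-selection step a little more concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the `Researcher` record, the `pick_reviewers` function, the use of a simple topic overlap (Jaccard similarity) as a stand-in for "relevance", and the reading of "the past three years" as a difference of at most three calendar years. A real clowd would obviously draw on much richer signals (co-citations, venues, and so on).

```python
from dataclasses import dataclass, field


@dataclass
class Researcher:
    """Hypothetical record the clowd might keep about one person."""
    name: str
    topics: set
    friends: set = field(default_factory=set)
    # Co-author name -> most recent year of a joint paper.
    last_joint_paper: dict = field(default_factory=dict)


def topic_overlap(a, b):
    """Jaccard similarity of two topic sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_reviewers(author, paper_topics, candidates, current_year, k=3, window=3):
    """Return the names of the k most relevant candidates without a conflict."""
    eligible = []
    for person in candidates:
        if person.name == author.name or person.name in author.friends:
            continue  # the author, close friends and relatives are excluded
        last = author.last_joint_paper.get(person.name)
        if last is not None and current_year - last <= window:
            continue  # co-author within the (assumed) three-year window
        eligible.append(person)
    # Most similar first; ties broken by name so the result is deterministic.
    eligible.sort(key=lambda p: (-topic_overlap(paper_topics, p.topics), p.name))
    return [p.name for p in eligible[:k]]
```

The filtering happens before the ranking on purpose: conflicts of interest are a hard constraint, whereas relevance is only a preference.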

In addition to the original paper, all review comments are public and can be moderated (say, similar to Digg or to Slashdot, but perhaps in a more principled and civilized manner). Thus, the review comments are ranked for their correctness, originality and usefulness. These rankings propagate to the papers they refer to.

You present your work in public and the video of your lecture is on the clowd, exposing you to a much larger audience. Anyone can also comment on it and respond to it. The videos are linked to each other, as well as to the articles and to the implementations. They are organized into thousands of “virtual research tracks” with several tens of talks in each. “Best of” virtual conference compilations appear on the clowd.

Rising papers and their authors get introduced to each other by the clowd. You can easily find ten potential new collaborators with mutual interests. You try out more things together, write more articles, and so on …until one day you all save the world together (well, maybe not, but it would be nice! :-).

So, what will the future really look like?

Well, who knows? I’m pretty sure the above scenario will seem as ridiculous in ten years, as the SourceForge ideal looks today (what was I thinking then?). Nonetheless, I believe it should be part of the current vision for research. I don’t think that the web and social media will lead to less selection via peer evaluation. Quite the contrary. Nor do I think that they will lead to less elitism. This follows from simple math. Taking the simplistic but common measure of “acceptance ratio”, the numerator cannot grow much, because people’s capacity to absorb information will not grow that much. But, if the potential to produce published content makes the denominator grow to infinity, then the ratio has to approach zero. Methods for evaluating research output need to scale up to this level of filtering, and I simply don’t think that the current way of evaluating research can achieve this.
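The "simple math" above fits in a few lines of Python. This is only a toy illustration: the function name `acceptance_ratio` and the numbers are invented, and the one assumption that matters is that the amount of work people can actually absorb stays fixed while the number of submissions grows.

```python
def acceptance_ratio(attention_capacity, submissions):
    """Fraction of submissions that can be accepted when readers can only
    absorb `attention_capacity` items in total."""
    if submissions <= 0:
        raise ValueError("submissions must be positive")
    return min(1.0, attention_capacity / submissions)


if __name__ == "__main__":
    capacity = 100  # hypothetical: items per year the community can absorb
    for submissions in (100, 1_000, 10_000, 100_000, 1_000_000):
        ratio = acceptance_ratio(capacity, submissions)
        print(f"{submissions:>9,} submitted -> acceptance ratio {ratio:.4%}")
```

With a fixed numerator, every tenfold increase in submissions cuts the ratio by a factor of ten, which is all the argument needs.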


The long tail of ideas

It’s not clear how you measure the size of an idea. Is it billions of generated revenue? Is it number of papers published? Number of citations? Brain-ounces (whatever that is)? But, let’s say that based on any or all of these measures, some ideas are big and some are small. The long tail applies here too: a few ideas make it big, but most of them remain small. But how do we find these big ideas?

I’ve heard the following piece of advice several times over the past few years, and more recently in a talk by one VP. It goes something like this: “Find an idea and ask yourself: is the potential market worth at least one billion [sic] dollars? If not, then walk away.” This is very similar to something I read in one of Seth Godin’s riffs, about a large consulting company recommending to a large book publisher that they should “only publish bestsellers.” They would, if they knew in advance which books would be bestsellers. But, in reality, this advice is simply absurd.

For example, who thought back in the 90s that search would be so important, with search marketing worth about 10 billion and expected to exceed 80 billion within 10 years? Nobody, and perhaps following the above advice, projects such as CLEVER and its follow-up (which put a “business intelligence” spin on search), WebFountain, went nowhere. The only ones who went somewhere were the researchers; they moved to Google, Yahoo! and Microsoft.

When I once talked to someone from WebFountain, two things he said struck me and I still remember them after several years. First, the cost of crawling a reasonable fraction of the web and maintaining an index was quite small (a handful of machines, a T1 pipe and maybe one sysadmin). Second, it turns out that back then WebFountain received some flak for “starting small.”

In other words, the engineers seemed to recognize they were starting at the far end of the tail, and decided to put some wheels on their idea and see how to move it up towards the head, growing along the way. But management wanted more to justify the project. Four wheels is just a car, but how about four hundred? “Is this something really big or isn’t it? If it is, then why 10 machines and not 300? Why 5 people and not 50? Why just make 3 features that work, instead of designing and advertising 30 or 50?” As far as I can tell as an outsider, this is what happened, and such over-planning (combined, perhaps, with rather poor execution on the development side) did not lead to the expected results.

Perhaps such a mentality would have made sense a few decades ago, when computing power was far from a commodity, barriers to entry were large, and supercomputing was thriving. These days however, instead of asking “how much is this idea worth”, it’s better to ask “how much does it cost to try this idea?” and strive to make the answer “almost zero.” You don’t know in advance what will be big—that was always true. You should not start big because it’s not cost-efficient—that was not always possible, but it is today. Start big and you will likely end up small (like countless startups from the bubble-era); but start small and you may end up big. Google just appeared one day and did only one thing (search) for many years; Amazon sold only books; and so on.

Barriers to entry should not be made artificially high. Some companies seem to recognize this better than others (although this may be changing as they grow), and strive to provide an infrastructure, environment and culture that makes it easy to try out many new things by starting small and cheap. And other companies are enabling the masses to do the same.

I’m not saying that you should try every idea, even if it seems clearly unpromising. I’m also not saying that any idea can become big or that, once an idea becomes big, it will still cost zero to scale it up. But, technology, if properly used and combined with the right organizational structures, allows more ideas from the long tail to be tried out, at minimal cost. You’re expected to fail most of the time, but if the cost to try is near-zero, it doesn’t matter.
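A back-of-the-envelope way to see why cheap tries matter: fix a budget and ask how likely it is that at least one attempt pays off. The sketch below is purely illustrative. The function name `chance_of_a_hit` and the numbers are invented, and it makes the (strong) assumptions that every try succeeds independently and with the same small probability, regardless of how much it costs.

```python
def chance_of_a_hit(budget, cost_per_try, p_success):
    """Probability that at least one try succeeds, given how many
    independent tries the budget can pay for."""
    tries = int(budget // cost_per_try)
    return 1.0 - (1.0 - p_success) ** tries


if __name__ == "__main__":
    budget = 100.0  # hypothetical units
    p = 0.05        # hypothetical chance that any single idea works out
    for cost in (100.0, 10.0, 1.0):
        print(f"cost per try {cost:>5}: chance of a hit {chance_of_a_hit(budget, cost, p):.3f}")
```

Under these assumptions, one expensive bet succeeds about 5% of the time, while a hundred cheap ones almost certainly produce at least one hit, even though each individual try is still expected to fail.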


The shift from private to public channels of information

Many discussions about privacy these days obsess over the shifting balance between public and private channels of information, while missing the real issues and opportunities.

The information landscape is unquestionably changing. We are experiencing the emergence and rapid proliferation of social media, such as instant messaging (e.g., IRC, Jabber et al., AIM, MSN, Skype), sharing sites (e.g., Flickr, Picasa, YouTube, Plaxo), blogs (e.g., Blogger, WordPress, LiveJournal) and forums (e.g., Epinions), wikis (e.g., Wikipedia, PBWiki), microblogs (e.g., Twitter, Jaiku), social networks (e.g., MySpace, Facebook, Ning), and so on. Also, much financial information (e.g., your bank’s website or Quicken) as well as health records are or soon will be online.

A rather obvious distinction is between public vs. private channels of information or content:

  • In public channels, the default policy on data sharing is “opt-in”.
  • In private channels, the default is “opt-out” (along with some, hopefully enforceable, guarantees that this is the case).

Most people, at least of a certain age, take the latter for granted. However, this is changing. Just a couple of decades back, schoolchildren would keep journals (you know, those with a lock and “Hello Kitty” or “Transformers” on the cover). These days they are on MySpace and Twitter, and they do not assume “opt-out” is the default. Quoting from the article “The Talk of Town: You” (subscriber-only access) in the MIT Technology Review:

New York’s reporter made a big deal about how “the kids” made her “feel very, very old.” Not only did they casually accept that the record of their lives could be Googled by anyone at any time, but they also tended to think of themselves as having an audience. Some even considered their elders’ expectations about privacy to be a weird, old-fogey thing—a narcissistic hang-up.

Said differently, an increasing fraction of content is produced in public rather than private channels, and “opt-in” is becoming the norm rather than the exception. Social aggregation sites, such as Profilactic, are a step towards easy access to this corpus. Despite some alarmism about blogs, Twitter, MySpace profiles, etc., all this information is, by definition, in public channels. Perhaps soon 99% of information will be in public channels.

So, which information channels should be perceived as public? Many people have a knee-jerk reaction when it comes to thinking of what should be private. For example, this blog is clearly a public channel. But how about your health records? In an interesting opinion about making health records public, most commenters expressed a fear of being denied health coverage by an insurance company. However, this is more an indication of a broken healthcare system than of a problem with making this data public. Most countries (the U.S. included) are behind in this area, but others (such as the Scandinavians or Koreans) are making important steps forward. Now, how about your financial records? For example, credit reporting already relies on aggregation and analysis of publicly available data. How about your company’s financial records? Or how about your phone call records? Or your images captured by surveillance cameras? The list can go on forever.

We should avoid that knee-jerk reaction and carefully consider what can be gained by moving to public channels, as well as what technology and regulation are required to make this work. The benefits can be substantial; for example, the success of the open source movement is largely due to switching to public, transparent channels of communication, as well as open standards. Openness is usually a good thing.

Even in the enterprise world of grownups, tools such as SmallBlue (a.k.a. Atlas) are effectively changing the nature of intra-company email from a private to a (partially) public channel. The alternative would be to establish new public channels and favor their use over the older, “traditional” (and usually private) channels. Both approaches are equivalent.

Moreover, how should we deal with the information in private channels? The danger with private channels arises when privacy is breached. If that happens, not only do you get a false sense of security when you have none, but you may also have a very hard time proving that it happened. However, the notion itself of a “breach” in public channels is clearly meaningless. In that sense, public channels are a safer option and should be carefully considered.

Even when the data itself is private, who is accessing it and for what purpose should be public information. The MIT TR article goes on to mention David Brin’s opinion that

“[...] our only real choice is between a society that offers the illusion of privacy, by restricting the power of surveillance to those in power, and one where the masses have it too.”

The need for full transparency on data and how they are used is more pressing than ever. Ensuring that individuals’ rights are not violated requires less secrecy, not more. A recent CACM article by a gang of CS authority figures makes a similar case (although their proposal for an ontology-based heavyweight scheme for all data out there is somewhat dubious; it might make sense for the 1% niche of sensitive data, though). Interestingly, one of their key examples is essentially about health records and they also come to the same conclusion, i.e., that the problem is inappropriate use of the data.

I actually look forward to the day I’ll be able to type “creator:spapadim@bitquill.net” on Google (as well as any other search engine) and find all the content that I ever produced. And going one step beyond that, also find the “list of citations” (i.e., all the content that referenced or used my data), like I can find for my research papers on Google Scholar, or for posts on this blog with trackbacks. Although I cannot grasp all the implications, it would at least mean we’ve addressed most of these issues and the world is a more open, democratic place. McLuhan’s notion of the global village is more relevant than ever, but his doom and gloom is largely misplaced; let’s focus on the positive potential instead.
