Data Mining: “I’m feeling lucky”?

In an informal presentation on MapReduce that I recently gave, I included the following graphic, to summarize the “holy grail” of systems vs. mining:

Systems vs. Data mining

This was originally inspired by a quote that I read some time ago:

Search is more about systems software than algorithms or relevance tricks.

How often do you click the “lucky” button instead of “search”? Incidentally, I would be very interested in finding some hard numbers on this (I couldn’t find any), but that button must exist for a good reason, so a number of people must be using it. Anyway, I believe it’s a safe assumption that most people click “search” more often than “lucky”. And when you click “search”, you almost always expect to get something relevant, even if not perfectly so.

In machine learning or data mining, the holy grail is to invent algorithms that “learn from the data” or that “discover the golden nugget of information in the massive rubble of data”. But how often have you taken a random learning algorithm, fed it a random dataset, and expected to get something useful? I’d venture a guess: not very often.

So it doesn’t quite work that way. The usefulness of the results is a function of both the data and the algorithm. That’s common sense: drawing any kind of inference involves both (i) making the right observations, and (ii) using them in the right way. I would argue that in most successful applications, it’s the data that takes center stage, rather than the algorithms. Furthermore, mining aims to develop the analytic algorithms, but systems development is what enables running those algorithms on the appropriate and, often, massive data sets. So I do not see how the former makes sense without the latter. In research, however, we sometimes forget this and simply pick our favorite hammer and clumsily wield it in the air, ignoring both (i) the data collection and pre-processing step, and (ii) the systems side.

It may be that “I’m feeling lucky” often hits the target (try it, you may be surprised). However, in machine learning and mining research, we sometimes shoot the arrow first, and paint the bullseye around it. There are many reasons for this, but perhaps one stands out. A well-known European academic (from way up north) once said that his government’s funding agency had criticized him for succeeding too often. Now, that’s something rare!