Mobile OCR input: “Fully automatic” and reality

Recently I’ve been toying around with WordSnap OCR (project page, source code, app on Android Market), an app for OCR-based camera input on Android. In the process, I found out a few things about “smart” versus “fast”.

At least in data mining, “fully automatic” is an often unquestioned holy grail.  There are certainly several valid reasons for this, such as if you’re trying to scan huge collections of books such as this, or index images from your daily life like this.  In this case, you use all the available processing power to make as few errors as possible (i.e., maximize accuracy).

However, if the user is sitting right in front of your program, watching your algorithms and their output, things are a little different. No matter how smart your algorithm is, some errors will occur. This tends to annoy users. In that sense, actively involved users are a liability. However, they can also be an asset: since they’re sitting there anyway, waiting for results, you may as well get them really involved. If you have cheap but intelligent labor ready and willing, use it! The results will be better or, at the very least, no worse.  Also, users tend to remember the failures. So, even if end results were similar on average, allowing users to correct failures as early as possible will make them happier.

Instead of making algorithms as smart as possible, the goal now is to make them as fast as possible, so that they produce near-realtime results that don’t have to be perfect; they just shouldn’t be total garbage. When I started playing with the idea for WordSnap, I was thinking how to make the algorithms as smart as possible.  However, for the reasons above, I soon changed tactics.

The rest of this post describes some of the successful design decisions but,  more importantly, the failures in the balance between “automatic” and “realtime guidance”. The story begins with the following example image:

Original image

Incidentally, this image was the inspiration for WordSnap: I wanted to look up “inimical” but I was too lazy to type. Also, for the record, WordSnap uses camera preview frames, which are semi-planar YUV data at HVGA resolution (480×320). This image is a downsampled (512×384) full-resolution photograph taken with the G1 camera (2048×1536); most experiments here were performed before WordSnap existed in any usable form. Finally, I should point out that OCR isn’t really my area; what I describe below is based on common sense rather than knowledge of prior art, although just before writing this post I did try a quick review of the literature.

Binarization

A basic operation for OCR is binarization: mapping grayscale intensities between 0 and 255 to just two values: black (0) and white (1).  Only then can we start talking about shapes (lines, words, characters, etc).  One of the most widely used binarization algorithms is Otsu’s method.  It picks a single, global threshold so that it maximizes the within-class (black/white) variance, or equivalently maximizes the across-class variance. This is very simple to implement, very fast and works well for flatbed scans, which have uniform illumination.

However, camera images are not uniformly illuminated. The example image may look fine to human eyes, but it turns out that even for this image no global threshold is suitable (click on image for animation showing various global thresholds):

Binarization with global threshold

If you looked at the animation carefully, you might have noticed that at some point, at least the word of interest (“inimical”) is correctly binarized in this picture.  However, if the lighting gradient were steeper, this would not be possible. Incidentally, ZXing uses Otsu’s method for binarization, because of it is fast. So, if you wondered why barcode scanning sometimes fails, now you know.

So, a slightly smarter approach is needed: instead of using one global threshold, the threshold should be determined individually for each pixel (i,j). A natural threshold t(i,j) is the mean intensity μw(i,j) of pixels within a w×w neighborhood around pixel (i,j).  The key operation here is mean filtering: convolving the original image with a w×w matrix with constant entries 1/w2.

The problem is that, using pure Java running on Dalvik, mean filtering is prohibitively slow.  First, Dalvik is fully interpreted (no JIT, yet). Firthermore, the fact that Java bytes are always signed doesn’t help: casting to int and masking off the 24 most significant bits almost doubles running time.

Method Dalvik (msec) JNI (msec) Speedup
Naïve 109,882 ± 4,813 1,712 ± 261 64×
Sliding 2,435 ± 141 71 ± 19 34×

JNI to the rescue. The table above shows speedups for two implementations. The naïve approach uses a triple nested loop and has complexity O(w2mn), where m and n is the image height and width, respectively (m = 348, n = 512 in this example). The 1-D equivalent would simply be:

for i = 0 to N-1:
   s = 0
   for j = max(i-r,0) to min(i+r,N-1):
      s += a[j]

where w = 2r+1 is the window size. The second implementation updates the sums incrementally, based on the values of adjacent windows. The complexity now is just O(mn). An interesting aside is the relative performance of two implementations for sliding window sums (where w = 2r+1 is the window size). The first checks border conditions inside each iteration:

Initialize s = sum(a[0]..a[r])
for i = 1 to N-1:
   if i > r:
      s -= a[i-r-1]
   if i < N-r:
      s += a[i+r]

The second moves the border condition checks outside the loop which, if you think about it for a second, amounts to:

Initialize s = sum(a[0]..a[r])
for i = 1 to r:
   s += a[i+r]
for i = r+1 to N-r-1:
   s -= a[i-r-1]
   s += a[i+r]
for i = N-r to N-1:
   s -= a[i-r-1]

Among these two, the first one is faster, at least on a laptop running Sun’s JVM with JIT (I didn’t time Dalvik or JNI). I’m guessing that the second one messes loop unrolling, but I haven’t checked my guess.

It turns out that there is a very similar approach in the literature, called Sauvola’s method. Furthermore, there are efficient methods to compute it, using integral images. These are simply the 2-D generalization of partial sums. In 1-D, if partial sums are pre-computed, window sums can be estimated in O(1) time using the simple observation that sum(i…j) = sum(1..j) – sum(1..i-1).

Savuola’s method also computes local variance σw(i,j), and uses a relative threshold t(i,j) = μw(i,j)(1 + λσw(i,j)/127). WordSnap uses the global variance and an additive threshold t(i,j) = μw(i,j) + λσglobal, but after doing a contrast stretch of the original image (i.e., linearly mapping minimum intensity to 0 and maximum to 255). Doing floating point math or 64-bit integer arithmetic is much more expensive, hence the additive threshold. Furthermore, WordSnap does not use integral images because the same runtime can be achieved without the need to allocate a large buffer. Memory allocation on a mobile device is not cheap: the time needed to allocate a 480×320 buffer of 32-bit integers (about 600KB total) varies significantly depending on how much system memory is available, whether the garbage collector is triggered and so on, but on average it’s about half a second on the G1. Even though most buffers can be allocated once, startup time is important for this application: if it takes more than 2-3 seconds to start scanning, the user might as well have typed the result.

Anyway, here is the final result of locally adaptive thresholding:

Binarization with local mean filter

Conclusion: In this case we needed the slightly smarter approach, so we invested the time to implement it efficiently. WordSnap currently uses a 21×21 neighborhood.  Altogether, binarization takes under 100ms.

Skew

Another problem is that the orientation of the text lines may not be aligned with image edges.  This is called skew and makes recognition much harder.

Initially, I set out to find a way to correct for skew.  After a few searches on Google, I came across the Hough transform.  The idea is simple.  Sayyou want to detect a curve desribed by a set of parameters. E.g., for a line, those would be distance ρ from origin and slope θ. For each black pixel, find the parameter values for all possible curves to which this pixel may belong. For a line, that’s all angles θ from 0 to 180 degrees, and all distances ρ from 0 to sqrt(m2+n2).  Then, compute the density distribution of parameter tuples.  If a line (ρ00) is present in the image, then the parameter density distribution should have a local maximum at (ρ00).

If we apply this approach to our example image, the first maximum is detected at an angle of 20 degrees. Here is the image counter-rotated by that amount:

After rotating by angle detected using Hough transform

Success!  However, computing the Hough transform is too slow!  Typical implementations bucketize the parameter space. This would require a buffer of about 180×580 32-bit integers (for a 480×320 image), or about 410KB. In addition, it would require trigonometric operations or lookups to find the buckets for each pixel, not to mention counter-rotation. There are obvious optimizations one can try, such as computing histograms at multiple resolutions to progressively prune the parameter space.  Still, the cost implied by back-of-the envelope calculations put me off from even trying to implement this on the phone. Instead, why not just try to use the users:

Finder alignment guides

Conclusion: Simple approach with help from user wins, and the computer doesn’t even have to do anything to solve the problem! Incidentally, the guideline width is determined by the size of typical newsprint text at the smallest distance that the G1′s camera can focus.

Font size

Next, we need to detect individual words.  The approach WordSnap uses is to dilate the binary image with a rectangular structuring element (in the following image, the size 7×7), and then expand a rectangle (shown in green) until it covers the connected component which, presumably, is one word.

Dilation with 7x7 rectangle

However, the size of the structuring element should really depend on the inter-word spacing, which in turn depends on the typeface as well as the distance of the camera from the text.  For example, if we use a 5×5 element, we would get the following:

Dilation with 5x5 rectangular element

I briefly toyed with two ideas for font size detection.  The first is to do a Fourier transform.  Presumably the first spatial frequency mode would correspond to inter-word and/or inter-line spacing and the second mode to inter-character spacing. But that assumes we apply Fourier to a “large enough” portion of the image, and things start becoming complicated.  Not to mention computationally expensive.

The second approach (which also appears to be the most common?) is to to hierarchical grouping. First expand rectangles to cover individual letters (or, sometimes, ligatures), then compute histogram of horizontal distances and re-group into word rectangles, and so on.  This is also non-trivial.

Instead, WordSnap uses a fixed dilation radius.  The implementation is optimized to allow near-realtime annotation of the detected word extent.  This video should give you an idea:

Conclusion: Simple wins again, but this time we have to do something (and let the user help with the rest). But, instead of trying to be smart and find the best parameters given the camera position, we try to be fast: fix the parameters and let the user find the camera position that works given the parameters. WordSnap uses a 5×5 rectangular structuring element, although you can change that to 3×3 or 7×7 in the preferenfces screen. Altogether, word extent detection takes about 150-200ms, although it could be significantly optimized, if necessary, by using only JNI only, instead of a mix of pure Java and JNI calls.


I’m now looking into the possibility of moving OCR into the “live” loop: as you move the camera, the phone shows not only the word extent rectangle, but also the recognized word.  Perhaps as a hyperlink to Google, or along with Google Translate results.  Then I can justifiably use the buzzword of the day, “augmented reality”!  It looks that it might just be possible, but let me get back to you in a week or two.  :)

Postscript: Some of the papers referenced were pointed out to me by Hideaki Goto, who started and maintains WeOCR. Also, skew detection and correction experiments are based on this quick-n-dirty Python script (needs OpenCV and it ain’t pretty!). Update (9/2): Fixed really stupid mistake in parametrization of line.

Comments (13)

Randy Pausch in CACM

The September issue of CACM has a one-page, seven-question interview with Randy Pausch. It is definitely worth reading, so I’ll give you a sneak peek (unfortunately, CACM is not “open access”):

What about advice for CS teachers and professors?

That it’s time for us to start being more honest with ourselves about what our field is and how we should approach teaching it. Personally, I think that if we had named the field “Information Engineering” as opposed to “Computer Science,” we would have had a better culture for the discipline. For example, CS departments are notorious for not instilling concepts like testing and validation the way many other engineering disciplines do.

Is there anything you wish someone had told you before you began your own studies?

Just that being technically strong is only one aspect of an education.

[...]

Alice has proven phenomenally successful at teaching young women, in particular, to program. What else should we be doing to get more women engaged in computer science?

Well, it’s important to note that Alice works for both women and men. I think female-specific “approaches” can be dangerous for lots of reasons, but approaches like Alice, which focus on activities like storytelling, work across gender, age, and cultural background. It’s something very fundamental to want to tell stories. And Caitlin Kelleher’s dissertation did a fantastic job of showing just how powerful that approach is.

The interview was conducted a few weeks before his death. I’ll just say that, somehow, I suspect someone not in his position would never have said at least one of these things.  It’s a sad thought, but Randy’s message is, as always, positive.

Comments

The bless of dimensionality

The cover story in Wired by Chris Anderson, “The End of Theory” relies on a silent assumption, which may be obvious but is still worth stating. The reason that such a “petabyte approach” works is that reality occupies only a tiny fraction of the space of all possibilities.  For example, the human genome consists of about three billion base pairs.  However, not every billion-lengths string of four symbols corresponds to a viable organism, much less an existing one or a human individual.  In other words, the intrinsic dimensionality of your sample (the human population) is much smaller than the raw dimensionality of the possibilities (about 4^3,2000,000 strings).

I won’t try to justify “traditional” models. But I also wouldn’t go so far as to say that models will disappear, just that many will be increasingly statistical in nature. If you can throw the dice a large enough number of times, it doesn’t matter whether “God” plays them or not.  The famous quote by Einstein suggests that quantum mechanics was originally seen as a cop-out by some: we can’t find the underlying “truth”, so we settle with probability distributions for position and momentum.  However, this was only the beggining.

Still, we need models.  Going back to the DNA example, I suspect that few people models the genome as a single, huge, billion-length string.  That is not a very useful random variable.  Chopping it up into pieces with different functional significance and coming up with the appropriate random variables, so one can draw statistical inferences, sounds very much like modeling to me.

Furthermore, hypothesis testing and confidence intervals won’t go away either.  After all, anyone who has taken a course in experimental physics knows that repeating a measurement and calculating confidence intervals based on multiple data points is a fundamental part of the process (and also the main motivating force in the original development of statistics).  Now we can collect petabytes of data points.  Maybe there is a shift in balance between theory (in the traditional, Laplacian sense, which I suspect is what the article really refers to) and experiment.  But the fundamental principles remain much the same.

So, perhaps more is not fundamentally different after all, and we still need to be careful not to overfit.  I’ll leave you with a quote from “A Random Walk down Wall Street” by Burt Malkiel (emphasis mine):

[...] it’s sometimes possible to correlate two completely unrelated events.  Indeed, Mark Hulbert reports that stock-market researcher David Leinweber found that the indicator most closely correlated with the S&P 500 Index is the volume of butter production in Bangladesh.

Dimensionality may be a bless, but it can still be a curse sometimes.

Comments (1)

Research and new media: the academic clowd

I have a little secret: Slashdot may have lost its lustre now, but back in 2001, shortly after returning from my refreshing internship at Almaden, I posted a question to “Ask Slashdot” for the first and last time. I posed the question rather poorly and was ignored. Although I could not find exactly what I wrote back then, it was something along the lines of “why aren’t academic venues more like SourceForge?” You have to remember that this was the early 2000′s, when large and transparent user communities existed only in the technical sphere, and things like SourceForge were the prototypical sites for online focused communities. So why couldn’t academia and the research community open things up a bit more, and leverage new media to set up virtual forums for world-wide lively discussions and collaborations?

Fast-forward seven years. I got a feeling of deja-vu when I saw two recent blog posts and a Slashdot post. The first two question specific aspects of current publishing practices, while the “Ask Slashdot” post wonders whether academic journals are obsolete. The technologies and media have changed dramatically since then, but the essence remains the same.

Going over the comments on Slashdot, even though there are some surprisingly (for Slashdot) insightful ones, there is also one fundamental misconception. I was genuinely surprised at its prevalence. Many commenters seem to identify the general notion of “peer evaluation” with the specific mechanisms currently employed to do it. Is the current way of doing things so deeply entrenched, that people are blind to other possibilities?

Quoting a random vicious comment: “The purpose of restricting published work to that which has passed peer review is to ensure that results do not become obsolete. They must uphold the same quality standards that we expect from all scientific disciplines—not blog-style fads that have become popular and at some stage will cease to be popular.” I wonder if commenter has ever written a blog himself, or whether he even just taken a look at, say, Technorati: there are over four million blogs out there and 99% have just one reader (the author). Very few blogs are popular (i.e., the actually read by a significant number of people). An explosion in quantity of published content does not imply a proprtional explosion in its consumption; quite the contrary. If anything, there is more competition for attention, not less.

Another commenter said that “there isn’t any direct communication between reviewers and submitters.” Not so. Take a look at Julian Besag’s “On the Statistical Analysis of Dirty Pictures” (unfortunately JSTOR is restricted-access, but maybe your institution has a subscription), published in the Journal of the Royal Statistical Society as recently as 1986. The actual paper is 21 pages, while the other 23 pages are devoted to an open discussion. This looks oddly familiar (deja vu again): it looks like very popular blogs, which often have comment sections larger than the original posts. A free and open discussion of ideas has always been an organic part of the research process. A few centuries ago, scientific articles appeared with a date on which they were “read” to the community (just take a look at, e.g., the an issue of the Philosophical Transactions of the Royal Society).

Research on the web

Reaching far out into the long tail of ideas, which I also discussed in a previous post, should arguably be a top priority for research. In other endeavors it is an important means to success (financial or otherwise), but in academia and the research community it is usually an end in itself. The web itself was originally conceived as a venue for the exchange of scientific ideas, but even its creators probably did not envision the full potential nor realize all the implications of democratizing publication.

Modern technology allows more researchers (whether they work for startups, academic institutions, or large corporations) to try out more ideas. In other words, the production of research output is scaling up to unprecedented levels. However, I strongly suspect that traditional ways for evaluating research will not scale for much longer, being unable to keep up with the explosive growth in the rate of new ideas.

The typical process for evaluating and disseminating research—at least in computer science with which I am familiar—seems to be the following (with perhaps a few exceptions). First you come up with an interesting idea. Next, you build a story around it and do the minimal work to support that story. If everything works out, you write it up and submitted to a conference or, more rarely, a journal. On average, three people (chosen largely at random) review your work, making some comments in private. Once your work is published, you move on to the next paper.

I would simply name two artifacts as the main “products” of computer science research: papers and software. The latter is often overlooked, but it’s at least as important as the first. Anyway, what might be the state-of-the-art media for each of those artifacts?

There are some well-known efforts to use the web for the former. For example, there is arXiv for physics and sciences, CoRR for computer science, and PLOS for life sciences. There is also VideoLectures for open access to some talks. All of these, however, largely mirror the established ways of doing things: they are still built using the paradigm of a “library”. Although very important steps in the right direction, they perhaps play second fiddle to traditional media (there is a reason that arXiv is called a “pre-print server”) and thus fail to fully realize the potential offered by the rapidly emerging social media.

Things are perhaps a little more advanced for software artifacts. There are SourceForge, Google Code, and countless other similar sites for hosting source code, tracking issues and holding online discussions. There is also Freshmeat, Ohloh, and other project directories, as well as source code search engines such as Koders. However, none of these (or, as far as I know, anything similar) have been widely embraced by the research community.

Enough about today. It is more interesting to try and imagine how all these things, and more, may come together in the future.

The academic clowd scenario

Shamelessly copying this post, let’s imagine the academic clowd (cloud + crowd).

You have a great new idea and decide to try it out. You write a proof-of-concept implementation and run it on the cloud, using large datasets that also live out there. The implementation itself is available to the clowd, which can analyze the revision control logs and find out who really worked on what.

Your idea works and you decide to write a research article about it. The clowd knows what papers you wrote, who are your co-authors and which conferences and journals you publish in (cf. DBLP). It also knows the content of your papers (cf. CiteSeer). So, when you publish your new article, it compares it with the existing literature and finds the most relevant experts (in terms of content, co-citations, venues of publication, etc) to evaluate your work. It knows who your close friends and relatives are (from Facebook) and automatically excludes them from the list of potential reviewers. It also exlcudes your co-authors from the past three years. Then, it solicits reviews from those experts. Of course, it also allows others who are interested to participate in the discusssion.

In addition to the original paper, all review comments are public and can be moderated (say, similar to Digg or to Slashdot, but perhaps in a more principled and civilized manner). Thus, the review comments are ranked for their correctness, originality and usefulness. These rankings propagate to the papers they refer to.

You present your work in public and the video of your lecture is on the clowd, exposing you to a much larger audience. Anyone can also comment on it and respond to it. The videos are linked to each other, as well as to the articles and to the implementations. They are organized into thousands “virtual research tracks” with several tens of talks in each. “Best of” virtual conference compilations appear on the clowd.

Rising papers and their authors get introduced to each other by the clowd. You can easily find ten potential new collaborators with mutual interests. You try out more things together, write more articles, and so on …until one day you all save the world together (well, maybe not, but it would be nice! :-).

So, what will the future really look like?

Well, who knows? I’m pretty sure the above scenario will seem as ridiculous in ten years, as the SourceForge ideal looks today (what was I thinking then?). Nonetheless, I believe it should be part of the current vision for research. I don’t think that the web and social media will lead to less selection via peer evaluation. Quite the contrary. Nor do I think that they will lead to less elitism. This follows from simple math. Taking the simplistic but common measure of “acceptance ratio”, the numerator cannot grow much, because people’s capacity to absorb information will not grow that much. But, if the potential to produce published content makes the denominator grow to infinity, then the ratio has to approach zero. Methods for evaluating research output need to scale up to this level of filtering, and I simply don’t think that the current way of evaluating research can achieve this.

Comments (1)

The long tail of ideas

It’s not clear how you measure the size of an idea. Is it billions of generated revenue? Is it number of papers published? Number of citations? Brain-ounces (whatever that is)? But, let’s say that based on any or all of these measures, some ideas are big and some are small. The long tail applies here too: a few ideas make it big, but most of them remain small. But how do we find these big ideas?

I’ve heard the following piece of advice several times over the past few years, and more recently in a talk by one VP. It goes something like this: “Find an idea an ask yourself: is the potential market worth at least one billion [sic] dollars? If not, then walk away.” This is very similar to something I read in one of Seth Godin’s riffs, about a large consulting company recommending to a large book publisher that they should “only publish bestsellers.” They would, if they knew in advance which books would be bestsellers. But, in reality, this advice is simply absurd.

For example, who thought back in the 90′s that search would be so important, with search marketing worth about 10 billion and expected to exceed 80 billion within 10 years? Nobody, and perhaps following the above advice, projects such as CLEVER and it’s follow-up (which put a “business intelligence” spin to search), WebFountain, went nowhere. The only thing that went somewhere is the researchers; they moved to Google, Yahoo! and Microsoft.

When I onced talked to someone from WebFountain, two things he said struck me and I still remember them after several years. First, the cost of crawling a reasonable fraction of the web and maintaining an index was quite small (a handful of machines, a T1 pipe and maybe one sysadm). Second, it turns out that back then WebFountain received some flak for “starting small.”

In other words, the engineers seemed to recognize they were starting at the far end of the tail, and decided to put some wheels on their idea and see how to move it up towards the head, growing along the way. But management wanted more to justify the project. Four wheels is just a car, but how about four hundred? “Is this something really big or isn’t it? If it is, then why 10 machines and not 300? Why 5 people and not 50? Why just make 3 features that work, instead of design and advertise 30 or 50?” As far as I can tell as an outsider, this is what happened and such an over-planning (combined, perhaps, with rather poor execution on the development side) did not lead to the expected results.

Perhaps such a mentality would have made sense a few decades ago, when computing power was far from a commodity, barriers to entry were large, and supercomputing was thriving. These days however, instead of asking “how much is this idea worth”, it’s better to ask “how much does it cost to try this idea?” and strive to make the answer “almost zero.” You don’t know in advance what will be big—that was always true. You should not start big because it’s not cost-efficient—that was not always possible, but it is today. Start big and you will likely end up small (like countless startups from the bubble-era); but start small and you may end up big. Google just appeared one day and did only one thing (search) for many years; Amazon sold only books; and so on.

Barriers to entry should not be made artificially high. Some companies seem to recognize this better than others (although this may be changing as they grow), and strive to provide an infrastructure, environment and culture that makes it easy to try out many new things by starting small and cheap. And other companies are enabling the masses to do the same.

I’m not saying that you should try every idea, even if it seems clearly unpromising. I’m also not saying that any idea can become big or that, once an idea becomes big, it will still cost zero to scale it up. But, technology, if properly used and combined with the right organizational structures, allows more ideas from the long tail to be tried out, at minimal cost. You’re expected to fail most of the time, but if the cost to try is near-zero, it doesn’t matter.

Comments (3)