If you're looking for a new project idea, take a look at this.

This was kicked off by a tweet from @bkeegan, quoting a tweet from @ed_hawkins. He suggested that a tool for converting an image of a table to a comma-separated value file would open up new possibilities for climate research.

I replied, "All the pieces for this exist. What lacks is to craft them into a smoothly functioning tool." @Corey_Yanofsky asked a follow-up question.

A great question! Here's what I meant.

The problem decomposes nicely:

  1. Rotate images so that the table is straight
  2. Identify and extract individual rows
  3. From each row, identify and extract individual columns
  4. From each cell, identify and extract individual character images
  5. Convert each character to its Unicode or ASCII equivalent
  6. Combine the results into a single CSV
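
To make steps 2, 3, and 6 concrete, here is a minimal sketch in plain Python. It is one possible approach, not a finished tool. It assumes the image has already been straightened and binarized into a list of rows of 0s and 1s (1 meaning ink), and it assumes a simple projection-profile method: a run of empty pixel rows separates table rows, and a wide enough run of empty pixel columns separates table columns. The function names and the `col_gap` parameter are made up for illustration. Straightening (step 1) and character recognition (steps 4 and 5) are left out.

```python
import csv
import io


def find_runs(profile, min_gap=1):
    """Return (start, end) spans where profile is nonzero.

    Spans separated by fewer than min_gap zero entries are merged, so
    narrow gaps (between characters) do not split a cell.
    """
    runs = []
    for i, value in enumerate(profile):
        if value == 0:
            continue
        if runs and i - runs[-1][1] < min_gap:
            runs[-1][1] = i + 1  # gap too small: extend the previous span
        else:
            runs.append([i, i + 1])
    return [tuple(run) for run in runs]


def split_cells(image, col_gap=2):
    """Cut a binary image into a grid of cell crops (steps 2 and 3).

    col_gap is an assumed tuning knob: the minimum number of empty
    pixel columns that counts as a column boundary.
    """
    row_profile = [sum(row) for row in image]  # ink per pixel row
    grid = []
    for top, bottom in find_runs(row_profile):
        band = image[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]  # ink per pixel column
        grid.append([
            [pixel_row[left:right] for pixel_row in band]
            for left, right in find_runs(col_profile, min_gap=col_gap)
        ])
    return grid


def to_csv(text_rows):
    """Combine recognized cell text into a single CSV string (step 6)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(text_rows)
    return buffer.getvalue()
```

On clean, evenly spaced tables a sketch like this can go a long way. It breaks down as soon as the scan has ruled lines, skew, noise, or cells that span several lines, which is exactly where the robustness work described below begins.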

Methods for doing all of these steps exist. There are even a few tools out there that almost get the job done, like Tabula and ABBYY FineReader, but none of them quite go all the way from scanned-in (or handwritten!) tables to CSV files.

The biggest challenge here is robustness to variety. Getting the method to work on a single set of images with the same formatting is great, but chances are it will fail on the next set. Getting a method dialed in to where it is reliable is where you really get to build your knowledge and show your stuff. Each step in reliability, from 50% to 90%, to 99%, to 99.9%, and on to each nine after that, takes about three times the effort of the step before. This is good news. It means you can scope the project to the scale of your ambition and be assured of challenging technical work throughout.
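
To get a feel for how that rule of thumb compounds, here is a small back-of-the-envelope calculation. It assumes the first step (50% to 90%) costs one unit of effort and takes the factor of three at face value, which is only a rough estimate. The function name is made up for illustration.

```python
def cumulative_effort(steps, factor=3):
    """Total effort after a number of reliability steps.

    Assumes the first step costs 1 unit and each later step costs
    `factor` times the one before it.
    """
    return sum(factor ** k for k in range(steps))


for steps, target in enumerate(["90%", "99%", "99.9%", "99.99%"], start=1):
    print(target, cumulative_effort(steps))
```

Under those assumptions, reaching 99.99% costs about 40 times as much as reaching 90%.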

The potential impact is considerable. There is a vast amount of data in printed form. The ability to automatically bring it into a well-structured data format would let us answer questions not just about climate events, but also about economics and health. If you are looking for a novel PhD topic, portfolio project, or startup idea, this one is worth taking a look at.

Follow-up

I shared this on Twitter and LinkedIn. It looks like there are a couple of tools that might do this. Microsoft Excel Mobile does an OK job, but isn't yet robust. Amazon's Textract shows cool results, but I'd love to see it put through its paces on historical data. A tool called Covis tackles the same problem too, and I'd love to see more testing data on it.

As far as I've seen, creating an open source version of image-to-CSV would still add a ton of value (all of these are proprietary), and it's still not clear how robust the existing solutions are.