Wednesday, August 08, 2012

The Data is Free and Computing is Cheap, but Imagination is Dear

Recently published research, What Makes Paris Look like Paris?, attempts to classify images of street scenes according to their city of origin.  This is a fairly typical supervised machine learning project, but the source of the data is of interest.  The authors obtained a large number of Google Street View images, along with the names of the cities they came from.  Increasingly, large volumes of interesting data are being made available via the Internet, free of charge or at little cost.  Indeed, I published an article about classifying individual pixels within images as "foliage" or "not foliage", using data I gathered through on-line searches for terms like "grass", "leaves", "forest" and so forth.

A bewildering array of data has been put on the Internet.  Much of it is what you'd expect: financial quotes, government statistics, weather measurements and the like, all large tables of numeric information.  However, there is a great deal of other information: 24/7 Web cam feeds which have been live for years, news reports, social media spew and so on.  Additionally, much of the data for which people once charged serious bucks is now free or rather inexpensive.  Already, many firms augment the data they've paid for with free databases on the Web.  An enormous opportunity is opening up for creative data miners to consume and profit from large, often non-traditional, non-numeric data which are freely available to all, but (so far) creatively analyzed by few.


Jeremy Dalletezze said...

I am just learning the tricks of utilizing free web data, but never thought of the image route. With so much SEO emphasis on images, I could definitely see some useful classifying research based on images & alt tags/titles/etc...
Thanks for the idea and cool post. Would definitely appreciate a follow-up post on that foliage project giving some tips/advice. Kind Regards,

Will Dwinnell said...

The publication to which I referred was the Jan-26-2007 posting, Pixel Classification Project, to my Data Mining in MATLAB log, at

Sadly, the figures accompanying that posting are no longer available.

Note that further details are provided in a later (Feb-02-2007) posting,

As I said, this is a very straightforward supervised learning project, and the derived features would be familiar to any image processing novice.

Regardless, this is an excellent illustration of what can be made of free data from the Internet.
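
For readers curious what per-pixel supervised classification of this kind can look like, here is a minimal sketch in Python with NumPy. It is not the code from the postings mentioned above: the two derived features (greenness and brightness), the nearest-centroid classifier, and the handful of hand-made training pixels are all assumptions chosen only to keep the illustration short.

```python
import numpy as np

def pixel_features(rgb):
    """Derive two simple features per pixel from an (N, 3) array of RGB values in [0, 1].

    Both features are illustrative assumptions, not the features of the original project.
    """
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    total = r + g + b + 1e-9          # small constant avoids division by zero
    greenness = g / total             # share of the green channel
    brightness = total / 3.0          # mean intensity
    return np.column_stack([greenness, brightness])

def fit_centroids(features, labels):
    """Compute the mean feature vector of each class."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(features, centroids):
    """Assign each pixel to the class with the nearest centroid."""
    classes = np.array(sorted(centroids))
    dists = np.stack(
        [np.linalg.norm(features - centroids[c], axis=1) for c in classes],
        axis=1,
    )
    return classes[dists.argmin(axis=1)]

# Made-up training pixels: label 1 = "foliage", label 0 = "not foliage".
train_rgb = np.array([
    [0.1, 0.6, 0.1], [0.2, 0.7, 0.2], [0.3, 0.5, 0.2],   # greenish pixels
    [0.7, 0.2, 0.2], [0.5, 0.5, 0.5], [0.2, 0.3, 0.8],   # red, gray, blue pixels
])
train_labels = np.array([1, 1, 1, 0, 0, 0])

centroids = fit_centroids(pixel_features(train_rgb), train_labels)

# Classify two new pixels: one greenish, one reddish.
new_rgb = np.array([[0.15, 0.65, 0.15], [0.6, 0.3, 0.3]])
predicted = predict(pixel_features(new_rgb), centroids)
```

In practice the labeled pixels would come from the downloaded images (for example, pixels sampled from pictures found by searching for "grass" or "leaves"), and any standard classifier could stand in for the nearest-centroid rule.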

Miller said...

I really liked the idea of classifying images from Google Street View by the city from which they came. It reminded me of a different article that I read here: in which the authors use a deep convolutional neural network to read numbers from street signs.

A convolutional neural network takes advantage of the 2D structure of an image (such as one from Google Street View): it slides small learned filters across the image, passes the responses through a thresholding nonlinearity, and then subsamples the result. This lets the network focus on the most distinctive features found in the image and ignore much of the excess noise. Because each filter's weights are shared across the whole image, the network has far fewer parameters than a fully connected neural network, which makes it easier and faster to train. After these stages, the network works much like a traditional neural network, connecting many hidden layers, with the added step of passing the output from previous layers forward as additional features. See for a summary.
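
The filter, threshold and subsample steps described above can be sketched in a few lines of Python with NumPy. This is only a toy illustration of a single convolutional stage, not the network from the article: the hand-written edge filter, the ReLU used as the threshold, and the 2x2 max pooling are all assumptions, and a real network would learn its filters from data.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one filter over a 2D image (no padding, stride 1).

    Like most deep learning libraries, this computes a cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Threshold step: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Subsample by taking the maximum of each non-overlapping size x size block."""
    h = (x.shape[0] // size) * size   # drop rows/columns that do not fill a block
    w = (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def conv_stage(image, kernel):
    """One convolutional stage: filter, threshold, subsample."""
    return max_pool(relu(conv2d_valid(image, kernel)))

# Toy 6x6 image: dark on the left, bright on the right (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written filter that responds to dark-to-bright vertical edges.
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])

features = conv_stage(image, edge_kernel)
```

The same small filter is reused at every image position, which is why such a stage needs far fewer weights than a fully connected layer over the same image.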

It is really interesting that the authors of the other paper are able to predict the numbers on street signs with around 96% accuracy. It makes me wonder how authentication on websites will change, considering many websites use images of numbers to verify that the user is a real person and not a robot. With the advancements in deep learning, privacy and security issues arise that were not possible a few years ago. I have a few friends who are working on a security measure that analyzes an individual’s typing speed to authenticate a user. However, I’m sure that once that solution is implemented, a neural network or other machine learning technique (random forest, Bayesian network, perceptron, rule-based system, etc.) will be able to learn how to type like a human. There seems to be an arms race between computer security techniques and our ability to program algorithms to beat the security.

I completely agree with your blog post, where you say that our imagination is the limiting factor in data analysis. Data mining is not limited to the traditional computer science department, but extends to all aspects of business. Just a few years ago, few would have imagined that data mining techniques would be applied in fields such as biology, mathematics, finance, consumer spending, image classification, and natural language processing. I work in the genetic field, mining DNA. It is amazing that techniques that seem completely isolated from my department (such as the convolutional neural network) can be integrated into my analyses. It reminds me of what Karl Popper, a philosopher of science, said: “We are not students of some subject matter, but students of problems. And problems may cut right across the borders of any subject matter or discipline.” Truth and techniques are relevant wherever they come from. Thank you for the blog post. Best of luck to you!