Wednesday, February 21, 2007

Is Data Mining Too Complicated?

I just read an interesting post on Infoworld entitled Data Mining Donald. In it there is a very interesting comment, and I quote:
Data mining is the future, and as of yet, it's still far too complicated for the ordinary IT guy to grasp.
Is this so? If data mining is too complicated for the typical IT guy, is it also too complicated for the typical grunt analyst?

Before I comment further, I'll just open it up to any commenters here. There are other very interesting and important parts of the post as well.

9 comments:

Kevin Hillstrom said...

During my twenty years in retail analytics, I've yet to meet a single person outside of data mining who finds data mining, on the surface, easy to understand.

What I have learned is that interpersonal skills are critical for data miners. If the data miner communicates in business-speak, and earns trust, the business will allow the data miners to do their jobs, and will respect the work the data miner does.

Vincent Granville said...

I've been in data mining for over 20 years, and I still don't know exactly what data mining is, or how it's different from statistics, for instance. In my opinion it's an art more than a science.

It's one of these words like six sigma, keyword analytics, conjoint analysis, random forests, seemingly covering very complicated topics while in essence representing very simple concepts. Yet it is being misused by many.

Sandro Saitta said...

I don't have the experience you guys have in data mining. However, I think it's certainly more complicated to explain to someone what data mining is than to open a book and learn existing algorithms, their applications and possible traps (But I'm not saying that being a data miner is an easy job :-)

Will Dwinnell said...

It may be helpful to turn this question around. Should one trust the database administrator's role to a seasoned data miner who has only read a few articles or a book on databases? I believe that it is logical to conclude that I.T. and data mining, despite their relationship and occasional overlap, are different specialties which produce very different work-product.

Business analysts, on the other hand, are, at least sometimes, producing a product which is similar to data mining's output. However, in my experience: 1. Their analysis tends to employ model representations which are extremely simplistic by data mining standards (matrices of thresholds on raw variables, for instance), 2. Many do not perform anything resembling model validation and 3. They tend to downplay issues of statistical significance, if they are aware of it at all.

Some business analysts could certainly transition to data mining, but I suspect many would have as much un-learning as learning to do on the way.

Dean Abbott said...

At the risk of sounding like I'm walking both sides of the fence, I'll say the answer is "yes" and "no" both.

I usually don't find it very difficult to explain what data mining does--the end result, that is. We are trying to discover relationships in the data (to predict or describe). In fact, most analysts do it every day. We all look at data and make decisions from that data.

Where communications break down is over methodology rather than function. There are too many communications hurdles to enumerate them all here, but take one example: sampling.

Most IT people (or business people, for that matter) I meet don't inherently understand the need for sampling--it has to be explained, and examples have to be shown. I am working with a customer right now--very very smart people--who built a rules-based system that worked very well, but it apparently failed on new data precisely because it was (in data mining terminology) overfit. It wasn't obvious that this was the case to them when they were building the rules, but it was obvious something was wrong when the system went live because the performance drop was dramatic. They would have detected the problem had they created a hold-out sample (and there was enough data to do this). Again, this isn't saying anything negative about these folks--it just wasn't obvious to them.
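To make the hold-out idea concrete, here is a minimal Python sketch. It is not the customer's system: the data is synthetic, and the function names (make_records, split_holdout, memorizing_rules, accuracy), the 30% label noise and the 70/30 split are all assumptions chosen for illustration. The "rules" simply memorize the training records, which is an extreme form of overfitting, so the model looks perfect on the data it was built from and much worse on records it has never seen.

```python
import random


def make_records(n, rng, noise=0.3):
    """Synthetic records: label is (x > 0.5), flipped with probability `noise`."""
    records = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # noisy label that no rule can truly predict
        records.append((x, label))
    return records


def split_holdout(records, holdout_fraction, rng):
    """Shuffle a copy of the records and set aside a hold-out sample."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def memorizing_rules(train):
    """Overfit 'rule system': answer with the label of the closest training record."""
    def predict(x):
        nearest = min(train, key=lambda rec: abs(rec[0] - x))
        return nearest[1]
    return predict


def accuracy(predict, records):
    """Fraction of records whose label the model reproduces."""
    return sum(predict(x) == label for x, label in records) / len(records)


rng = random.Random(0)  # fixed seed so the run is repeatable
records = make_records(1000, rng)
train, holdout = split_holdout(records, 0.3, rng)

model = memorizing_rules(train)
train_acc = accuracy(model, train)
holdout_acc = accuracy(model, holdout)

print(f"accuracy on the data used to build the rules: {train_acc:.2f}")
print(f"accuracy on the hold-out sample:              {holdout_acc:.2f}")
```

The gap between the two numbers is the warning sign: it shows up before the system goes live, using nothing more than data that was already on hand.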

Compound this with Kevin's point that good technical people (data miners, say) are rarely good at communicating the hows and whys to a non-technical audience. I think this is because it is difficult for any of us to know what we know and know what we don't know. Teaching by analogy is critical, in my opinion, to explaining why data mining is helpful (or statistics or predictive analytics...).

IT people (as Will states) are different. Just because they work with data doesn't mean they understand data from a statistical perspective. So they have little inherent advantage in understanding data mining (or statistics).

So back to the question: is data mining too complicated? Yes, if it has to be explained over your favorite beverage. It just takes too long for the principles to sink in. But, give us some time (weeks at least) to reinforce the message, and yes, it can be explained, I believe.

Donald Farmer said...

As the "Data Mining Donald" in the original article, I guess I have a vested interest here. :-) I'm pleased it's making a little ripple, and not for personal or product ego, but because I really care about this topic.
I do think there is a consensus forming in the industry that data mining is indeed too complicated. Of course, I work for Microsoft, and we have just released our Data Mining Add-ins for Office, which aim to make mining easier for the business analyst, being task-oriented and hopefully attractive and intuitive in use. So I also have a certain interest in making the claim. However, it's not just Microsoft who are working to achieve more usability for the business user. I was chatting with Charlie Berger of Oracle at the DMA conference in December and he kindly showed me their data mining wizards for Excel, which also aim to bring complex mining capabilities to business users. Oracle also have a very interesting paper on data-centric automated mining that's worth reading for anyone interested in this area.
I would say this also ... the question of whether data mining is too complicated rather hinges on how complicated the experience should ideally be. I sometimes hear the argument that, just as you do not need to understand the engineering under the bonnet in order to drive a car, you should not need to understand all the technologies you use. True to an extent - but even drivers of automatics can be better drivers if they understand a little about gear ratios, torque, mechanical grip and so on. Witness the rather futile attempts of many Seattleites to drive in the recent snowstorms!
Business analysts making significant decisions with mining will surely make better informed decisions if they can master at least some of the concepts. Otherwise they are working in the dark. One customer I spoke with described his well-known data mining solution as "The Magic 80k Ball." (He must have the cheapest options!)
The challenge facing folks like Microsoft and Oracle is certainly not how to over-simplify data mining, but how to enable business users to build effective predictive analytics with enough supporting information to have confidence in their results.
Dean certainly has it right in his comments - given time to reinforce the messages of data mining's value, business users can grasp even quite complex scenarios with confidence. And, surely Kevin is right too - the ability to communicate those complexities effectively is an essential skill for those who would promote the use of our favourite technology.
I personally see the user experience of software as a most subtle form of communication, so I hope we're up for this challenge!
BTW, if any of you would like to connect with me directly, feel free to mail me (Donald Farmer) at donald_dot_farmer_at_microsoft_dot_com
Great conversation - thanks!

Will Dwinnell said...

Not to be glib, but it seems reasonable to expect data mining tools and methods to be as complicated as they need to be to solve data mining problems.

I resist the driving analogy since the problem of driving is travel, which is fundamentally a simple problem. The complicated part of the car is the intricate mechanism which makes travel quick, convenient and nearly effortless through the conversion of stored energy to motion. Drivers, however, are concerned with traveling, not the car's engine.

In data mining, I argue, the problem itself is complicated. Witness the difficulty of getting many business managers to understand the most basic modeling principles.

Data mining involves many pitfalls, and at present there is no simple way to make their solution simple or easy.

Ralph Winters said...

I don't believe that being an average analyst or IT person locks you out from comprehending DM concepts. DM is a relatively new field with its own terminology, so communication is very, very important.

However, it is important that the person appreciate the power of statistics as well as the basic algorithms behind powerful DM methods such as Clustering, Decision Trees, etc.

Jamie MacLennan said...

Responding to glibness, travel over long distances is only a "simple problem" due to technological advances. Cross-oceanic or continental travel 200 years ago was a highly complex adventure that only very few seasoned professionals would dare try and required months of complicated planning. Donald's analogy holds for advanced data exploration and data mining. You can argue that the problem is not reducible, but I believe that by being smart enough, we can make the sea-change that will make predictive analytics and, more importantly, data comprehension accessible to everyone. It took people thousands of years to solve the "simple" problem of travel over long distances. We're just starting to solve the problem of handling massive amounts of data. We still have time.