Thursday, March 19, 2009

How many software packages are too many?

I just saw a question at SmartDataCollective about how many data mining packages one needs. The author writes:
we found out that a particular client is using THREE Data Mining softwares. Not statistical softwares or the base versions, but the complete, very expensive Data Mining softwares – SAS EM, SPSS Clementine and KXEN.

I was like, “Wow!!! But do you really need 3 Data Mining softwares???” Our initial questions and the client’s answers confirmed that inconsistent data formats was not the reason as the client already has a BI/DW system. Their reason? Well, they have the opinion that some algorithms/techniques in a particular DM software is much better and accurate than the same algorithms/techniques in another DM software.

I believe there are truly good reasons to have more than one data mining software package. Each tool has its own strengths and weaknesses. As one example, Affinium Model is very good at building hundreds or even thousands of models automatically, whereas Tibco S+ (formerly Insightful Miner) only builds one model at a time. On the other hand, Miner is much more flexible than Model in data preparation, sampling, and settings for building models. I like to have several tools around for these kinds of reasons.

A second reason to have (or to be proficient in) multiple tools as an analytics consultant is that you can plug into nearly any organization, whatever tools they want you to use. Currently, I'm working on projects that are using Clementine, Matlab, Statistica, and Insightful Miner. Last year I worked with a customer that was using CART (Salford Systems), Oracle Data Miner, Polyanalyst, and even, briefly, IBM Intelligent Miner.

However, except in very rare circumstances, the algorithms themselves are not appreciably different from tool to tool. Yes, I know that some tools have extra knobs and options, but backprop is backprop, the Gini index is the Gini index, and entropy is entropy. The only reason I would have both KXEN and SAS/EM or Clementine is if I wanted the automation of KXEN sometimes, and the full control of EM or Clementine at other times (it is hard for me to imagine why I would want both Clementine and EM--any takers on this one?).
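
To make the "Gini is Gini, entropy is entropy" point concrete, here is a minimal Python sketch of the two split measures computed from the class labels at a tree node. The function names are made up for illustration and are not taken from any of the packages mentioned above; the point is only that the definitions are fixed formulas, so any correct implementation should return the same numbers.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of records in each class (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


# A node with an even 50/50 split is as impure as a two-class node can be.
node = ["yes"] * 5 + ["no"] * 5
print(gini_index(node))  # 0.5
print(entropy(node))     # 1.0
```

Where tools do differ is in the knobs around these measures (pruning, stopping rules, handling of missing values), not in the measures themselves.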

Monday, March 16, 2009

eMetrics Conference

Early-bird pricing ends Friday for the May 4-7 eMetrics conference in San Jose. You get a 12% discount if you use the promo code ABBOTT12 (don't worry, I don't get anything except the satisfaction that a reader of this blog got a discount). I can't go, but hope to get to one before too long.


Predictive Analytics Webinar

I'm participating in a free webinar through The Modeling Agency tomorrow at 4pm EDT (1pm PDT) for anyone interested in listening in. Tony Rathburn is doing the first technical part, and I follow with about 20 minutes of vignettes. If you do listen in, feel free to post comments here on the content (all critiques welcome!). We'll repeat the webinar on April 7th and April 22nd.

Sunday, March 08, 2009

Some Interesting Analyses

I find it interesting to learn what other people are working on. To me, the applications can be as interesting as the technology, even if they're not saving millions or curing cancer. Some of these analyses could be a bit more rigorous, but they do suggest avenues for further research, and at least they aren't boring! Here are a few things I've run across in cyberspace recently:

Is Warhammer Balanced?

MLB Payroll Efficiency, 2006-2008

Wired magazine: issue 17.02

Analysis of the price of a piece of a lego set

Modeling Win Probability for a College Basketball Game

Saturday, March 07, 2009

Data Mining: Does It Get Any Better Than This?

The article Doing the Math to Find the Good Jobs appeared in the Jan-26-2009 issue of The Wall Street Journal, listing the top 3 "best" jobs (of 200 studied) as:

1. Mathematician
2. Actuary
3. Statistician

I assume that "data miner" fits in somewhere among these, yippee!