Thursday, March 19, 2009

How many software packages are too many?

I just saw a question at SmartDataCollective about how many data mining packages one needs. He writes,
we found out that a particular client is using THREE Data Mining softwares. Not statistical softwares or the base versions, but the complete, very expensive Data Mining softwares – SAS EM, SPSS Clementine and KXEN.

I was like, “Wow!!! But do you really need 3 Data Mining softwares???” Our initial questions and the client’s answers confirmed that inconsistent data formats was not the reason as the client already has a BI/DW system. Their reason? Well, they have the opinion that some algorithms/techniques in a particular DM software is much better and accurate than the same algorithms/techniques in another DM software.

I believe there are truly good reasons to have more than one data mining software package. Each tool has its own strengths and weaknesses. As one example, Affinium Model is very good at building hundreds or even thousands of models automatically, whereas Tibco S+ (formerly Insightful Miner) only builds one model at a time. On the other hand, Miner's flexibility in data preparation, sampling, and model-building settings is much richer than Model's. I like to have several tools around for these kinds of reasons.

A second reason to have (or to be proficient in) multiple tools as an analytics consultant is that you can plug into nearly any organization if they have tools they want you to use. Currently, I'm working on projects that are using Clementine, Matlab, Statistica, and Insightful Miner. Last year I worked with a customer that was using CART (Salford Systems), Oracle Data Miner, PolyAnalyst, and even, briefly, IBM Intelligent Miner.

However, except in very rare circumstances, the algorithms themselves are not appreciably different from tool to tool. Yes, I know that some tools have extra knobs and options, but backprop is backprop, the Gini index is the Gini index, entropy is entropy. The only reason I would have both KXEN and SAS/EM or Clementine is if I wanted the automation of KXEN sometimes, and the full control of EM or Clementine at other times (it is hard for me to imagine why I would want both Clementine and EM--any takers on this one?).
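The point that these split criteria are tool-independent is easy to check by hand. A minimal sketch in Python (not tied to any of the packages above) of the two impurity measures mentioned:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ["yes", "yes", "no", "no"]
print(gini(labels))     # 0.5 for a 50/50 split
print(entropy(labels))  # 1.0 bit for a 50/50 split
```

Whatever tool computes these, the numbers come out the same; the differences between packages lie in the knobs around the algorithm, not the measure itself.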

8 comments:

Tim Manns said...

re: "Clementine and EM"

Main reason I can think of is to justify more budget :)

I haven't got varied experience of different data mining tools (I've only used Clementine and databases -- Teradata, SQL Server, Oracle, etc.), but it's nice to hear that there's not much difference between the algorithms.

Seriously though, I reckon it's an HR and training issue. I've generally found SAS analysts to be a determined bunch, and an organisation's reliance upon specific SAS data sets is another reason why, once SAS is in, it doesn't easily come out...

Tim Manns said...

I've just read through the original post, and some previous posts on the same blog (I hadn't read the blog before now).

I'm assuming some of the posts are related.

Maybe I'm a mean perfectionist bastard, but I didn't see much evidence of using a data mining tool (in this case Clementine) to *anywhere near* its capability. Maybe the data mining projects using SAS EM were more comprehensive, or maybe the blog post simplified the work for the benefit of the readers.

But if a data mining segmentation project consists of plugging in near-raw summarised data and running TwoStep or k-means on 15 columns of unnormalised/untreated data, then I'm not surprised a customer needs three tools (or maybe even a miracle...)

I work in telco and know that there are tons of data; the problem is too much of it. Our segmentation uses far more than 15 columns, and heavily manipulated data at that. Simply quoting 'call count' or 'minutes of use' as one of 15 inputs would result in disciplinary action in my dept :)
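To illustrate the scaling point with a toy sketch (the telco column names and values here are made up): a raw 'minutes of use' column swamps a [0, 1] ratio column in any Euclidean-distance clustering such as k-means, unless the columns are standardised first:

```python
from statistics import mean, pstdev

# Hypothetical rows: [minutes_of_use, intl_call_ratio] -- made-up values.
# The first column spans thousands, the second spans [0, 1], so raw
# Euclidean distances are driven almost entirely by minutes_of_use.
rows = [
    [3200.0, 0.02],
    [150.0, 0.90],
    [3100.0, 0.05],
]

def zscore_columns(data):
    """Standardise each column to mean 0 and (population) std dev 1."""
    cols = list(zip(*data))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in data]

scaled = zscore_columns(rows)
```

After z-scoring, both columns contribute comparably to the distance calculation, which is the kind of basic treatment the raw-data projects above were skipping.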

So, in my current frame of mind, I reckon they have multiple simplistic solutions which, when combined, give some acceptable results. A more comprehensive project using one tool might give better results and dismiss the idea that many tools are needed (which, btw, I agree with -- I think only one is needed).

- rant over - :)

Will Dwinnell said...

I agree that there's little reason to pay for overlap in tools, but I was thinking about Dean's comment about one's own career.

Very often, I think, people frame skills and experience in terms of software packages, as opposed to analysis techniques. My perspective is that if you understand how logistic regression works, and you have experience using it in <fill in your favorite tool here>, then it should not be too difficult to perform logistic regression in <fill in an alternative tool here>.
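A quick sketch of that point: the technique itself is tool-independent. Below is a toy logistic regression fit by plain gradient descent on a tiny made-up dataset -- the same mathematics underneath whichever package's button you press:

```python
from math import exp

# Made-up, linearly separable data: one feature, binary target.
X = [[0.5], [1.5], [2.5], [3.5]]
y = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Fit weight and bias by stochastic gradient descent on the log loss.
w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi[0] + b)
        err = p - yi          # gradient of log loss w.r.t. the logit
        w -= lr * err * xi[0]
        b -= lr * err

# The fitted model should place the decision boundary between the classes.
print(sigmoid(w * 0.5 + b) < 0.5, sigmoid(w * 3.5 + b) > 0.5)
```

Learning a new tool then means learning where its logistic regression node lives and what its knobs are called, not relearning the method.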

While there is a "ramp up" time for someone who is new to a tool, there is a much longer learning period for understanding the difference between local models and global models, appropriate sampling concepts, rigorous testing techniques, and so forth.

Tim Manns said...

Hi Will,

I completely agree. What I am struggling with, though, is why someone would want multiple tools that perform very similar roles (unless they were a consultant, etc.)???

SAS EM or Clementine is a classic example. I can't see why a telco or bank with an RDBMS would want both. A good user of either would probably have no need for the other tool. After reading a few of the original blog's other posts, my conclusion is that the use of each tool is rather simple, and therefore using both gets results by chance that using one would not (make sense?).

If the data miners have no grasp of "..appropriate sampling concepts, rigorous testing techniques, and so forth..." then using multiple tools might, by chance, land them an acceptable answer.

btw - I consider KXEN a little different, since I understand its main function is in model management.

I think I've said too much and probably appear like a complete mean bastard now...

Sandro Saitta said...

Hi there,

Very interesting discussion as it's often the case on DM&PA :-)

I also consider that one tool should be enough. I can't think of a particular case where several tools would be useful.

Also, in the case of SAS, it is already difficult to choose between the different SAS products. For data aggregation alone, one can use SAS Base, Data Integration Studio, or Enterprise Guide... already difficult enough to choose.

Datalligence said...

Missed this one before!

The major reason for my amazement is that the client is not a consulting firm or a DM services provider.

Let's just assume the client is a telco company. Now why should it have/use 3 DM software packages? :-)

Dean Abbott said...

There are a few combinations I've seen work well personally. For example, some customers I've had use Affinium Model, which is great for automating model building and speaking the language of direct marketers. But if one wants to perform more intensive data prep, it is useful to have another tool for this purpose. One customer I've worked with has Model, Insightful Miner (now Tibco Spotfire Miner) and also CART for building decision trees.

A second type of customer I've dealt with that has multiple tools is one that uses something like Oracle Data Miner for quick turn-around in model building in-database, but then for a more in-depth iterative approach, they use another tool that has more options for data prep or model development.

That stated, it is still unusual--the vast majority of my customers are one-tool shops (at least functionally, even if they own more than one tool). I like having multiple tools available though as there are strengths and weaknesses to each.

Jožo Kováč said...

My SAS Base can (re)train a lot of models in a short time.

I think Clementine (oh yes - PASW Modeler) can do the same job. I just have to learn CLEM syntax...

One DM package is more than enough. If different people in the same organization have different preferences, set up a workshop and let them fight.

Study your software DEEPLY, get used to it, and you will never need (nor want) another.

The neighbour's grass always looks greener, but 99.9% of the time it's not :-)