Monday, January 01, 2007

For the Best Answer, Ask the Best Question

A subject of great interest to data mining novices is the selection of data mining software. Frequently these interests are expressed in terms of what is "the best" software to buy. On-line, such queries are often met with quick and eager responses (and not just from vendors). In a way, this mimics the much more common (and much more incendiary) question about which programming language is "the best".

Not withstanding myriad fast answers, the answer to such questions is, of course, "It depends". What is the problem you are trying to solve? What is your familiarity with any of the available alternatives? How large is your budget? How large is your budget for ongoing subscription costs? How do you intend to deploy the result of your data mining effort?

Vendors, naturally, have an incentive to emphasize any feature which they believe will move product. Some vendors are worse about this than others. Years ago, one neural network shell vendor touted the fact that their software used "32-bit math", without ever demonstrating the benefit of this feature. In truth, competing software, which ran 16-bit fixed-point arithmetic was much faster, gave accurate results, and did not require 32-bit hardware.

The problem of irrelevant features is exacerbated by the presence of individuals in the customer organization who buy into this stuff. Some use this as political leverage on their unaware peers. I attended in a vendor presentation once with a banking client in which one would-be expert asked whether the vendor's computers were SIMD or MIMD. This was like asking whether the vendor's cafeteria served this or that brand of coffee and could not have been less relevant to the conversation. The asking of such a question was clearly a power play and served only as a distraction.

When confronted with unfamiliar features, my recommendation is to ask as many questions as it takes to understand why said features are of benefit. Don't stop with the vendor. Ask associates at other firms what they know about the subject. Try on-line discussion groups. Keep asking "Why?" until you are satisfied. Joe Pesci's character in "My Cousin Vinny" is a good model: "Why does SIMD vs. MIMD matter?" "Is one better than the other?" "Exactly how is it better?" "Is it faster? How much faster?" "Does it cost more?" Remember that diligence is the responsibility of the customer.

Some things to consider when framing the question "What is the best data mining software for my purposes?":

-Up front software cost
-Up front hardware cost, if any
-Continuing software costs (subscription prices)
-Training time for users
-Algorithms which match your needs
-Effective data capacity in variables
-Effective data capacity in examples
-Testing capabilities
-Model deployment options (source code, libraries, etc.)
-Model deployment costs (licensing costs, if any)
-Ease of interface with your data sources
-Ability to deal with missing values, special values, outliers, etc.
-Data preparation capabilities (generation of derived or transformed variables)
-Automatic attribute selection / Data reduction

1 comment:

Dean Abbott said...

"It depends" is one of those answers that isn't very satisfying, but is often the most honest of answers. It is the answer I give to may questions in data mining, such as:

1) how much missing data in a field is too much?
2) what model is best?
3) which algorithm is best?
4) how much data should I use to build models?
5) how much data do I need to validate models?
6) how much correlation between fields is too much before throwing one of the two out?

And these are just a few.