I have been reminded in the past couple of weeks of working with customers that in many applications of data mining and predictive analytics, predictive models are utterly useless unless their stakeholders understand what the models are doing. When rules from a decision tree, no matter how statistically significant, don't resonate with domain experts, they won't be believed. Arguments that "the model wouldn't have picked this rule if it wasn't really there in the data" make no difference when the rule doesn't make sense.
There is always a tradeoff in these cases between the "best" model (i.e., most accurate by some measure) and the "best understood" model (i.e., the one that gets the "ahhhs" from the domain experts). We can coerce models toward the transparent rather than the statistically significant by removing fields that perform well but don't contribute to the story the models tell about the data.
I know what some of you are thinking: if the rule or pattern found by the model is that good, we must try to find the reason for its inclusion, make the case for it, find a surrogate meaning, or just demand it be included because it is so good! I trust the algorithms and our ability to assess whether the algorithms are finding something "real" as opposed to a "happenstance" occurrence. But not all stakeholders share our trust, and it is our job to translate the message for them so that their confidence in the models approaches our own.
Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Tuesday, August 24, 2010
Thursday, August 19, 2010
Building Correlations in Clementine / Modeler
I just responded to this question in the Clementine group on LinkedIn, and thought it might be of interest to a broader audience.
Q: Hi,
Does anyone have any suggestion or any knowledge on how to make cross-correlation in the Modeler/Clementine?
A:
I'm not so familiar with Modeler 14, but in prior versions, there was no good correlation matrix option (the Statistics node does correlations, but it is not easy to build an entire matrix with it).
The way I do it is with the Regression node. In the Expert tab, click on the Expert radio button, then the Output... button, and make sure the "Descriptions" box is checked. Then run the regression with all the inputs (Direction->In) you want in the correlation matrix. Don't worry about having an output that is useful: if you don't have one, create a random number (Range) and use that as the output. After you Execute this, look in the Advanced tab of the gem and you will find a correlation matrix there. I usually then export it and re-import it into Excel (as an html file), where it is much easier to read and to do things like color code big correlations.
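For anyone who wants to sanity-check the exported matrix outside Modeler, here is a minimal sketch in plain Python of the same end result: a full pairwise Pearson correlation matrix, plus a list of the "big" correlations you would otherwise color code in Excel. This is not how the Regression node computes its output. The helper names (`pearson`, `correlation_matrix`, `big_correlations`), the 0.8 threshold, and the small data set are all hypothetical and chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (a zero variance would divide by zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Build {name: {name: r}} for every pair of columns in a dict of lists."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def big_correlations(matrix, threshold=0.8):
    """Return each distinct pair (a, b, r) whose |r| meets the threshold."""
    names = list(matrix)
    return [
        (a, b, matrix[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[a][b]) >= threshold
    ]

# Made-up input fields standing in for the Direction->In fields of a stream.
data = {
    "age": [23, 35, 47, 51, 62],
    "income": [30, 48, 61, 70, 85],
    "visits": [9, 4, 7, 2, 6],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
print(big_correlations(matrix))
```

Note that no output field appears anywhere in the calculation, which is why a random number works fine as the regression target in the recipe above: the correlations among the inputs do not depend on it.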
Friday, August 13, 2010
IBM and Unica, Affinium Model and Clementine
After seeing that IBM has purchased Unica, I have to wonder how this will affect Affinium Model and Clementine (I revert to the names that were used for so long here, now PredictExpress and Modeler, respectively). They are so very different in interfaces, features, and deployment options that it is hard to see how they will be "joined": the big-button wizard interface vs. the block-diagram flow interface.
One thing I always liked about Affinium Model was the ability to automate the building of thousands of models. Clementine now has that same capability, so that advantage is lost. To me, that leaves the biggest advantage of Affinium Model being its language and wizards. Because it uses the language of customer analytics rather than the more technical language of data mining / predictive analytics, it was easier to teach to new analysts. Because it makes generally good decisions on data prep and preprocessing, the analyst didn't need to know a lot about sampling and data transformations to get a model out (we won't dive into how good here, or how much better experts could do the data transformations and sampling).
My fear is that Affinium Model will just be dropped, going the way of Darwin, PRW (the predecessor to Affinium Model), and other data mining tools that were good ideas. Time will tell.
Monday, August 02, 2010
Is there too much data?
I was reading back over some old blog posts and came across this quote from Moneyball: The Art of Winning an Unfair Game.
Intelligence about baseball statistics had become equated in the public mind with the ability to recite arcane baseball stats. What [Bill] James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible; and that point, somehow, had been lost. 'I wonder,' James wrote, 'if we haven't become so numbed by all these numbers that we are no longer capable of truly assimilating any knowledge which might result from them.' [italics mine]
I see this phenomenon often these days: we have so much data that we build models without thinking, hoping that the sheer volume of data and sophisticated algorithms will be enough to find the solution. But even with mounds of data, insight still often occurs on the micro level, with individual cases or customers. The data must tell a story.
The quote is a good reminder that no matter the size of the data, we are in the business of decisions, knowledge, and insight. Connecting the big picture (lots of data) to decisions takes more than analytics.