I'm at the Predictive Analytics Summit in San Francisco. It is interesting to see the titles of Analytics people at the conference (here). They include CTO/Senior/Manager/VP of a variety of analytics variants: Predictive Analytics, Marketing Analytics, just Analytics, Data Analytics, Research & Analytics, Quant Research, etc. Others not here but that I've seen include Business Analytics and the variety of Data Mining titles.
There has been a lot of hype about data mining and predictive analytics being a great field to be in. It's interesting to me that (1) Predictive Analytics is so often part of the title now, lending credence to this term becoming a standard term companies use, and (2) the variety of ways quantitative modeling is described.
This conference is just one of many taking place in a short time period, including Predictive Analytics World, SAS M2010, IBM Information on Demand, the Teradata Partners conference, the SuperMath Conference in San Diego, and the ACM Data Mining Bootcamp in San Jose. Too many to attend all of them!
Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Thursday, November 04, 2010
Thursday, October 28, 2010
A humorous explanation of p-values
After Will's great post on sample sizes that referenced the YouTube video entitled Statistics vs. Marketing, I found an equally funny and informative explanation of p-values here.
Aside from the esoteric explanations of what a p-value is, there is a point that I make often with customers that statistical significance (from p-values) is not the same thing as operational significance; just because you find a p-value of less than 0.05 doesn't mean the result is useful for anything! Enjoy.
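To make the gap between statistical and operational significance concrete, here is a minimal sketch using a pooled two-proportion z-test. The campaign figures (one million customers per group, 5.0% vs. 5.1% response) are invented for illustration, and the function name is hypothetical.

```python
import math

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a difference in proportions (pooled z-test)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical campaign: 5.0% vs 5.1% response, one million customers per group.
n = 1_000_000
p_value = two_proportion_p_value(50_000, n, 51_000, n)
lift = 51_000 / n - 50_000 / n

print(f"p-value: {p_value:.4f}")  # comfortably below 0.05
print(f"lift:    {lift:.4%}")     # only a tenth of a percentage point
```

With samples this large the p-value clears the 0.05 bar easily, yet whether a tenth of a point of lift is worth acting on is a business question the p-value cannot answer.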
From the Archives: A Synopsis of Programming Languages
A departure from the usual data mining and predictive analytics posts...
I was looking at old articles I clipped from the 80s, and came across my favorite programming article from the days I used to program a lot (mostly C, some FORTRAN, sh, csh, tcsh). This one from the C Advisor by Ken Arnold I found funny then, and still do now. I don't know where these are archived, so I'll just quote an excerpt here:
C advisor article by Ken Arnold from years and years ago quoting Richard Curtis
• FORTRAN was like the fifties: It's rigid and procedural, and doesn't even distinguish between cases. Its motto is "Do my thing".
• C is a real sixties language, because it just doesn't care. It doesn't type check, and it lets you get into as much trouble as you can--you own your own life. C's motto: "Do your own thing".
• Pascal is the seventies. It tries to seize control of the wild and woolly sixties, without getting too restrictive. It thus ends up pleasing no one. It's full of self-justification and self-importance--going from C to Pascal is like going from Janis Joplin to Donna Summer. It is smooth and flashy and useless for major work--truly the John Travolta of programming languages. The Pascal motto is: "Do your thing my way".
• ADA is the eighties. There is no overarching philosophy; everything is possible, but there is no ethical compass to tell you what ought to be done. (Actually, I know of two things you can't do in ADA, but I'm not telling for fear they'll be added.) It reflects the eighties notion of freedom, which is that you are free to do anything, as long as you do it the way the government wants you to--that is, in ADA. Its credo: "Do anything any way you want".
Sunday, October 24, 2010
The Data Budget
Larger quantities of data permit greater precision, greater certainty and more detail in analysis. As observation counts increase, standard errors decrease and the opportunity for more detailed, perhaps more segmented, analysis rises. These are things which are obvious to even junior analysts: the standard error of the mean is calculated as the standard deviation divided by the square root of the observation count.
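The square-root relationship is easy to check numerically. A minimal sketch, assuming a made-up standard deviation of 50 (say, dollars of customer spending):

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / sqrt(observation count)."""
    return std_dev / math.sqrt(n)

std_dev = 50.0  # hypothetical standard deviation of customer spending
for n in (100, 400, 1600, 6400):
    # Each fourfold increase in observations halves the standard error.
    print(n, standard_error(std_dev, n))
```

Note the diminishing returns: cutting the standard error in half costs four times the data.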
This general idea may seem obvious when spoken aloud, but it is something which many non-technical people seem to give little thought. Ask any non-technical client whether more data will provide a better answer, and the response will be in the affirmative. It is a simple trend to understand.
However, people who do not analyze data for a living do not necessarily think about such things in precise terms. On too many occasions, I have listened to managers or other customers indicate that they wanted to examine data set X and test Y things. Without performing any calculations, I had strong suspicions that it would not be feasible to test Y things, given the meager size of data set X. Attempts to explain this have been met with various responses. To be fair, some of them were constructive acknowledgments of this unfortunate reality, and new expectations were established. In other cases, I was forced to be the insistent bearer of bad news.
In one such situation, a data set with fewer than twenty thousand observations was to be divided among about a dozen direct mail treatments. Expected response rates were typically in the single-digit percents, meaning that only a few hundred observations would be available for analysis. Treatments were to be compared based on various business metrics (customer spending, etc.). Given the small number of respondents and high variability of this data, I realized that this was unlikely to be productive. I eventually gave up trying to explain the futility of this exercise, and resigned myself to listening to biweekly explanations of the noisy graphs and summaries. One day, though, I noticed that one of the cells contained a single observation! Yes, much energy and attention were devoted to tracking this "cell" of one individual, which of course would have no predictive value whatsoever.
It is important for data analysts to make clear the limitations of our craft. One such limitation is the necessity of sufficient data from which to draw reasonable and useful conclusions. It may be helpful to indicate this important requirement as the data budget: "Given the quality and volume of our historical data, we only have the data budget to answer questions about 3 segments, not 12." Simply saying "We don't have enough data" is not effective (so I have learned through painful experience). Referring to this issue in terms which others can appreciate may help.
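A back-of-the-envelope version of the data budget can be sketched in a few lines. The figures loosely follow the direct mail example above; the 3% response rate and the even split across treatments are assumptions made for illustration.

```python
def respondents_per_cell(total_observations, n_treatments, response_rate):
    """Expected number of respondents in each treatment cell, assuming an even split."""
    mailed_per_cell = total_observations / n_treatments
    return mailed_per_cell * response_rate

# 20,000 observations; the 3% response rate is an assumed value.
per_cell_12 = respondents_per_cell(20_000, 12, 0.03)
per_cell_3 = respondents_per_cell(20_000, 3, 0.03)

print(round(per_cell_12))  # about 50 respondents per cell with 12 treatments
print(round(per_cell_3))   # about 200 respondents per cell with 3 treatments
```

Fifty respondents per cell leaves very little room for comparing noisy metrics such as customer spending, which is the argument for fewer, larger segments.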
Thursday, October 21, 2010
Predictive Analytics World Addresses Risk and Fraud Detection
Eric Siegel focused his plenary session on predicting and assessing risk in the enterprise, and in his usual humorous way, described how, although big, macro, or catastrophic risk often dominates thinking, micro or transactional risk can cost organizations more than macro risk. Micro risk is where predictive analytics is well suited, in what he called data-driven micro risk management.
The point is well-taken because the most commonly used PA techniques work better with larger data than with "one of a kind" events. Micro risk can be quantified well in a PA framework.
During the second day, an excellent talk described a fraud assessment application in the insurance industry. While the entire CRISP-DM process was covered in this talk (from Business Understanding through Deployment), there was one aspect that struck me in particular, namely the definition of the target variable to predict. Of course, the most natural target variable for fraud detection is a label indicating if a claim has been shown to be fraudulent. Fraud often has a legal aspect to it, where a claim can only be truly "fraud" after it has been prosecuted and the case closed. This has at least two difficulties for analytics. First, it can take quite some time for a case to close, making the data one has for building fraud models lag by perhaps years from when the fraud was perpetrated. Patterns of fraud change, and thus models may perpetually be behind in identifying the fraud patterns.
Second, there are far fewer actual proven fraud cases compared to those that are suspicious and worthy of investigation. Cases may be dismissed or "flushed" for a variety of reasons, ranging from lack of resources to investigate to statutory restrictions and legal loopholes, which do not reduce the risk for a particular claim at all, but rather just change the target variable (to 0), making these cases appear the same as benign cases.
In this case study, the author described a process where another label for risk was used, a human-generated label that only indicated a high-enough level of suspicious behavior rather than only using actual claims fraud, a good idea in my opinion.
Friday, October 15, 2010
Thursday, October 07, 2010
A little math humor, and achieving clarity in explaining solutions
This is still one of my favorite cartoons of all time (by S. Harris). I think we've all been there before, trying to wave our hands in place of providing a good reason for the procedures we use.
A closely related phenomenon is when you receive an explanation for a business process that is "proof by confusion", whereby the person explaining the process uses lots of buzzwords and complex terminology in place of clarity, probably because that person doesn't really understand it either.
This is why clarifying questions are so key. I remember a professor of mathematics of mine at Rensselaer Polytechnic Institute named David Isaacson who told a story of a graduate seminar. If you have ever experienced these seminars, there are two distinguishing features: the food, that goes quickly to those who arrive on time, and the game involved of the speaker trying to lose the graduate students during the lecture (an overstatement, but a frequently occurring outcome). Prof. Isaacson told us of a guy there who would ask dumb questions from the get-go: questions that we all knew the answer to and most folks thought were obvious. But as the lecture continued, this guy was the only one left asking questions, and of course was the only one who truly understood the lecture. What was happening is that he was constantly aligning what he thought he heard by asking for clarification. The rest of those in the room thought they understood, but in reality did not.
It reminds me to ask questions, even the dumb ones if it means forcing the one who is teaching or explaining to restate their point in different words, thus providing better opportunity for true communication.
Friday, September 24, 2010
Theory vs. Practice
In many fields, it is common to find a gap between theorists and practitioners. As stereotypes, theorists have a reputation for sniffing at anything which has not been optimized and proven to the nth degree, while practitioners show little interest in theory, as it "only ever works on paper".
I have been amazed at both extremes of this spectrum. Academic and standards journals seem to publish mostly articles which solve theoretical problems which will never arise in practice (but which permit solutions which are elegant or which can be optimized to some ridiculous level), or solutions which are trivial variations on previous work. The same goes for most master's and doctoral theses. On the other hand, I was shocked when software development colleagues (consultants: the last word in practice over theory) were unfamiliar with two's complement arithmetic.
Data mining is certainly not immune to this problem. Not long ago, I came upon technical documentation for a linear regression which had been "fixed" by a logarithmic transformation of the dependent variable. (There is a correct way to fit coefficients in this circumstance, but that was not done in this case.) Even more astounding was the polynomial curve fit which was applied to "undo" the log transformation, to get back to the original units! Sadly, the practitioners in question did not even recognize the classic symptom of their error: residuals were much larger at the high end of their plots.
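The post does not say which correction was intended, so the following is only a sketch of why a naive back-transformation tends to mislead. A model fit to log(y) estimates the average of log(y); exponentiating that average gives the geometric mean, which is never larger than the arithmetic mean and falls well below it for skewed data. The values are invented.

```python
import math

# Hypothetical skewed values of a dependent variable (e.g. customer spending).
y = [10.0, 12.0, 15.0, 20.0, 30.0, 60.0, 200.0]

arithmetic_mean = sum(y) / len(y)
mean_of_logs = sum(math.log(v) for v in y) / len(y)

# Naively exponentiating the average of log(y) yields the geometric mean.
naive_back_transform = math.exp(mean_of_logs)

print(f"arithmetic mean:      {arithmetic_mean:.2f}")
print(f"naive back-transform: {naive_back_transform:.2f}")  # noticeably smaller
```

The gap grows with the spread of the data, which is consistent with errors showing up most clearly at the high end of a plot in original units.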
Data miners (statisticians, quantitative analysts, forecasters, etc.) come from a variety of fields, and enjoy diverse levels of formal training. Grounding in theory follows suit. The people we work for typically are capable of identifying only the most egregious technical errors in our work. This sets the stage for potential problems.
As a practitioner, I have found much that is useful in theory and suggest that it is a fountain which is worth returning to, from time to time. Reviewing new developments in our field, searching for useful techniques and guidance will benefit data miners, regardless of their seniority.
Tuesday, September 07, 2010
DM Radio - Predictive Analytics and Fraud Detection
I'll be on DM Radio Thursday September 9 at 3pm EDT. Here's the blurb:
How many ways to catch a thief? More and more, thanks to predictive analytics, data-as-a-service and other clever computing tricks. Stopping fraud in its tracks can save customers, money and more. Tune into this episode of DM Radio to find out how. We'll hear from Eric Siegel, Prediction Impact; Erick Brethenoux, SPSS; Jason Trunk, Quest Software and Dean Abbott, Abbott Analytics.
Thursday, September 02, 2010
Leo Breiman quote about statisticians
One nice thing about having to move offices is that it forces you to go through old papers and folders. I found my folder containing KDD 97 conference notes, including quotes in the tutorial by David Hand from Leo Breiman (1995):
One problem in the field of statistics has been that everyone wants to be a theorist. Part of this is envy - the real sciences are based on mathematical theory. In the universities for this century, the glamor and prestige has been in mathematical models and theorems, no matter how irrelevant.
I love this quote because it highlights the divide between the practical and the elegant or sophisticated. Data mining and predictive analytics are "low-brow" sciences, empirical, and practical. That doesn't mean that the mathematics aren't important; they are very much so. But while we wait for the elegances of a theory to trickle down to us, we still need solutions.
In courses I teach, one of my objectives is to take the mathematics of the algorithms and translate the practical meaning of what they do into understandable pieces so that practitioners can manipulate learning rates and hidden units, gini and two-ing, radial kernels and polynomials kernels. Understanding backprop isn't important to most practitioners, but understanding how one can improve the performance of backprop is very much a key topic for practitioners.
We need more Breimans to pave the way toward practical innovations in predictive modeling.
Tuesday, August 24, 2010
Predictive Models are Only as Good as Their Acceptance by Decision-Makers
I have been reminded in the past couple of weeks, working with customers, that in many applications of data mining and predictive analytics, unless the stakeholders of predictive models understand what the models are doing, the models are utterly useless. When rules from a decision tree, no matter how statistically significant, don't resonate with domain experts, they won't be believed. Arguments that "the model wouldn't have picked this rule if it wasn't really there in the data" make no difference when the rule doesn't make sense.
There is always a tradeoff in these cases between the "best" model (i.e., most accurate by some measure) and the "best understood" model (i.e., the one that gets the "ahhhs" from the domain experts). We can coerce models toward the transparent rather than the statistically significant by removing fields that perform well but don't contribute to the story the models tell about the data.
I know what some of you are thinking: if the rule or pattern found by the model is that good, we must try to find the reason for its inclusion, make the case for it, find a surrogate meaning, or just demand it be included because it is so good! I trust the algorithms and our ability to assess if the algorithms are finding something "real" compared with those "happenstance" occurrences. But not all stakeholders share our trust, and it is our job to translate the message for them so that their confidence in the models approaches our own.
Thursday, August 19, 2010
Building Correlations in Clementine / Modeler
I just responded to this question on LinkedIn, Clementine group, and thought it might be of interest to a broader audience.
Q: Hi,
Does anyone have any suggestion or any knowledge on how to make cross-correlation in the Modeler/Clementine?
A:
I'm not so familiar with Modeler 14, but in prior versions, there was no good correlation matrix option (the Statistics node does correlations, but it is not easy to build an entire matrix with it).
The way I do it is with the Regression node. In the expert tab, click on the Expert radio button, then the Output... button, and make sure the "Descriptions" box is checked and run the regression with all the inputs (Direction->In) you want in the correlation matrix. Don't worry about having an output that is useful--if you don't have one, create a random number (Range) and use that as the output. After you Execute this, look in the Advanced tab of the gem and you will find a correlation matrix there. I usually then export it and re-import it into Excel (as an html file) where it is much easier to read and do things like color code big correlations.
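For readers who want to check the exported matrix outside Modeler, a correlation matrix is simple to compute directly. This is a minimal pure-Python sketch, not anything Modeler itself does; the field names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of named columns."""
    return {a: {b: pearson(xa, xb) for b, xb in columns.items()}
            for a, xa in columns.items()}

# Hypothetical fields standing in for the model inputs.
data = {
    "age": [23, 35, 47, 51, 62],
    "income": [30, 48, 61, 70, 85],
    "visits": [9, 7, 6, 4, 2],
}
matrix = correlation_matrix(data)

for name, row in matrix.items():
    print(name, [round(r, 3) for r in row.values()])
```

The diagonal is 1 by construction and the matrix is symmetric, so only the upper or lower triangle needs to be scanned for large correlations.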
Friday, August 13, 2010
IBM and Unica, Affinium Model and Clementine
After seeing that IBM has purchased Unica, I have to wonder how this will affect Affinium Model and Clementine (I revert to the names that were used for so long here, now PredictExpress and Modeler, respectively). They are so very different in interfaces, features and deployment options that it is hard to see how they will be "joined": the big-button wizard interface vs. the block-diagram flow interface.
One thing I always liked about Affinium Model was the ability to automate the building of thousands of models. Clementine now has that same capability, so that advantage is lost. To me, that leaves the biggest advantage of Affinium Model being its language and wizards. Because it uses the language of customer analytics rather than the more technical language of data mining / predictive analytics, it was easier to teach to new analysts. Because it makes generally good decisions on data prep and preprocessing, the analyst didn't need to know a lot about sampling and data transformations to get a model out (we won't dive into how good here, or how much better experts could do the data transformations and sampling).
My fear is that Affinium Model will just be dropped, going the way of Darwin, PRW (the predecessor to Affinium Model), and other data mining tools that were good ideas. Time will tell.
Monday, August 02, 2010
Is there too much data?
I was reading back over some old blog posts, and came across this quote from Moneyball: The Art of Winning an Unfair Game:
Intelligence about baseball statistics had become equated in the public mind with the ability to recite arcane baseball stats. What [Bill] James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible; and that point, somehow, had been lost. 'I wonder,' James wrote, 'if we haven't become so numbed by all these numbers that we are no longer capable of truly assimilating any knowledge which might result from them.' [italics mine]
I see this phenomenon often these days; we have so much data that we build models without thinking, hoping that the sheer volume of data and sophisticated algorithms will be enough to find the solution. But even with mounds of data, the insight still occurs often on the micro level, with individual cases or customers. The data must tell a story.
The quote is a good reminder that no matter the size of the data, we are in the business of decisions, knowledge, and insight. Connecting the big picture (lots of data) to decisions takes more than analytics.
Thursday, July 08, 2010
Neural Network books
I was talking with a colleague today who is taking a business-oriented data mining course, and there was a list of neural network books recommended by the instructor. It was fascinating looking at the books in the list because I didn't know several of them. When I examined several of the recommended books on amazon.com, I found they contained what I would call "academic" treatments of neural networks. That means they covered all kinds of varieties of neural networks, including brain-state-in-a-box, Boltzmann machines, Hebbian networks, Adaline, ART1, ART2, and many more. Now I have nothing against learning about these techniques at the graduate school level, or even at the undergraduate level. But for practitioners, I see absolutely no advantage here because they aren't used in practice. Nearly always, when someone says they are building a "neural network" they mean a multilayer perceptron (MLP).
When I use neural networks in major software packages, such as IBM-SPSS Modeler, Statistica, Tibco Spotfire Miner, SAS Enterprise Miner, JMP, Affinium Predictive Insight, and I can go on... I am building MLPs, not ART3 models. So why teach professionals how these other algorithms work? I don't know.
Now neural network experts I'm sure will find times and places to build esoteric varieties of neural nets. But because of the way most practitioners actually build neural networks, I recommend sticking with the MLP, and understanding the vast numbers of options one has just with this algorithm. This is one reason I like the Christopher Bishop Neural Networks for Pattern Recognition. Check out the table of contents--I think these topics are more helpful to understand than learning more neural network algorithms.
Another option for spinning up on neural nets is the excellent SAS Neural Network FAQ, which is old but still a very clear introduction to the subject. Finally, for backpropagation, I also like the Richard Lippmann 1987 classic "An Introduction to Computing with Neural Nets" (8MB here).
Tuesday, June 22, 2010
Salford to Launch New Integrated Data Mining Suite
Tomorrow night is the launch of SPM (Salford Predictive Miner). If you are in San Diego, give them a holler to let them know you are coming. See you there!
A/B Testing and the Need for Clear Business Objectives
The website http://videolectures.net/ contains a wealth of interesting lectures on a wide variety of topics, including data mining. I was reminded of one today by Ronny Kohavi entitled "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO" It's short (only 23 minutes) and filled with some very good common-sense principles.
First, it is a talk about the importance of A/B testing, or in other words, constructing experiments to learn customer behavior rather than having the experts make a best guess at how people will behave. He gives some good examples from Microsoft and the sometimes non-intuitive results from actual testing. A book he recommends is Breakthrough Business Results With MVT: A Fast, Cost-Free, Secret Weapon for Boosting Sales, Cutting Expenses, and Improving Any Business Process
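As a rough illustration of what reading an A/B test involves, here is a minimal sketch of a two-proportion z-test. The visitor and conversion counts are made up for illustration and are not taken from Kohavi's talk; the test itself is a standard textbook choice, not necessarily the one he uses.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 10,000 visitors per variant
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these invented counts, a lift from 5.0% to 5.6% falls just short of the usual 0.05 threshold, which is exactly the kind of result an expert's intuition tends to misjudge in either direction.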
The second part of the lecture I found particularly interesting is what Kohavi calls the Overall Evaluation Criterion (OEC), or what I usually call business objectives. He included the great Lewis Carroll quote, "If you don't know where you are going, any road will take you there." I find this a common problem as well: if we don't define a business objective that truly measures the impact of the predictive models we build, we have no way of determining if they are effective or not. This objective must be tied to the business itself. For example, Kohavi argues for using Customer Lifetime Value (CLV) rather than click-through rates as they are more tied to the bottom line.
I would add that it can be useful to have two objectives that are measurable, especially if two objectives better measure the value. For example, in collections risk models, the age of the debt and the amount of the debt are both important components to risk. These are difficult to put into a single number in general, so the two-dimensional risk score can be helpful operationally.
Wednesday, June 02, 2010
Embedded Analytics and Business Rules: The Holy Grail?
Tomorrow (Thursday) at 3pm EDT I'll be on DM Radio for the broadcast "Embedded Analytics and Business Rules: The Holy Grail?". I'm not sure what the other guests are going to talk about, but my comments will resemble the talk I gave at Predictive Analytics World in February 2010, Rules Rule: Inductive Business-Rule Discovery in Text Mining. In this help-desk case study, we used decision trees to cherry-pick interesting rules, converted them to SQL, and deployed them in a rule system that was applied transactionally, online. I emphasized the text mining portion at PAW, but the methodology was independent of that. In 2002-2003, researchers at the IRS and I applied the same kind of approach to rule discovery in selecting returns for audit: use trees to find interesting rules.
The reason we liked the approach was that it was a fast way to overcome two problems. First, decision trees only find the best solution to a problem (according to its measure of "good"). To obtain a richer set of terminal nodes, one can build ensembles of trees, but then one loses the interpretation. On the other hand, one can build association rules, but then you are left with perhaps thousands to tens of thousands of rules that have to be pruned back to get the gist of the key ideas. Many of the rules will be redundant (some completely identical in which records are "hit" by the rule), and it's easy to become lost in the sheer number of rules.
For the Fortune 500 company, we used CART with the battery option to generate a sequence of trees (we iterated on "priors" and misclassification costs, and I think some more options as well to generate variety), and took only those terminal nodes that had sufficiently high classification accuracy. I think we could have used their hotspot analysis for this too, but I wasn't sufficiently well-versed in it at that time.
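The harvesting step can be sketched roughly as follows. This is not the CART workflow itself: the terminal nodes, column names, counts, thresholds, and the `tickets` table are all hypothetical, and the sketch assumes the leaves have already been exported from the trees as lists of conditions.

```python
# Hypothetical terminal nodes harvested from several trees.
# Each condition is (column, operator, value); all counts are invented.
leaves = [
    {"conditions": [("call_count", ">", 3), ("product", "=", "router")],
     "n_records": 400, "n_target": 352},
    {"conditions": [("call_count", "<=", 3)],
     "n_records": 5000, "n_target": 900},
    {"conditions": [("tenure_months", "<", 6), ("call_count", ">", 1)],
     "n_records": 250, "n_target": 210},
]

def to_sql_predicate(conditions):
    """Join a leaf's conditions into a SQL WHERE predicate.

    Illustration only: string values are quoted naively, without escaping.
    """
    parts = []
    for column, op, value in conditions:
        literal = f"'{value}'" if isinstance(value, str) else str(value)
        parts.append(f"{column} {op} {literal}")
    return " AND ".join(parts)

def select_rules(leaves, min_accuracy=0.8, min_records=100):
    """Keep only leaves that are both accurate enough and large enough."""
    rules = []
    for leaf in leaves:
        accuracy = leaf["n_target"] / leaf["n_records"]
        if accuracy >= min_accuracy and leaf["n_records"] >= min_records:
            rules.append((accuracy, to_sql_predicate(leaf["conditions"])))
    return sorted(rules, reverse=True)  # most accurate rule first

rules = select_rules(leaves)
for accuracy, predicate in rules:
    print(f"-- leaf accuracy {accuracy:.2f}")
    print(f"SELECT * FROM tickets WHERE {predicate};")
```

Only two of the three invented leaves survive the thresholds; the large, low-accuracy leaf is exactly the kind of node a single best tree would report but that makes a poor business rule.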
If you can't join in on the radio broadcast, you can always download the mp3 later.
Thursday, May 27, 2010
PAKDD-10 Data Mining Competition Winner: Ensembles Again!
The PAKDD-10 Data Mining Competition results are in, and ensembles occupied the top 4 positions, and I think the top 5. The winner used Stochastic Gradient Boosting and Random Forests in Statistica, second place a combination of logistic regression and Stochastic Gradient Boosting (and Salford Systems CART for some feature extraction). Interestingly to me, the 5th place finisher used WEKA, an open source software tool.
The problem was credit risk with biased data for building the models, a good way to do the competition because this is the problem we usually face anyway: data was collected based on historic interactions with the company, biased by the approaches the company has used in the past rather than having a pure random sample to build models. Model performance was judged based on Area under the Curve (AUC), with the KS distance as the tie breaker (it's not every day I hear folks pull out the KS distance!).
One submission in particular commented on the difference between how algorithms build models and the metric used to evaluate them. CART splits on the Gini index, logistic regression maximizes the likelihood of a model that is linear in the log-odds, and neural networks (usually) minimize mean squared error; none of these directly maximizes AUC. But this topic is worthy of another post.
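For readers who have not computed the two competition metrics by hand, here is a minimal sketch using their textbook definitions. The scores and labels are invented, and the competition's own scoring code may well have differed in details such as tie handling.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_distance(scores, labels):
    """Largest gap between the score CDFs of the positives and the negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Invented model scores and true labels (1 = bad credit risk)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

model_auc = auc(scores, labels)
model_ks = ks_distance(scores, labels)
print(f"AUC = {model_auc:.4f}, KS = {model_ks:.2f}")
```

Both metrics depend only on how the scores rank the records, which is why a model fitted to squared error or likelihood can look fine on its own criterion and still leave AUC on the table.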
Tuesday, May 25, 2010
The Trimmed Mean has Intuitive Appeal
I was listening to Colin Cowherd of ESPN radio this morning and he made a very interesting observation that we data miners know, or at least should know and make good use of. The context was evaluating teams and programs: are they dynasties or built off of one great player or coach. Lakers? dynasty. Celtics? dynasty. Bulls? without Jordan, they have been a mediocre franchise. The Lakers without Magic are still a dynasty. The Celtics without Bird are still a dynasty.
So his rule of thumb that he applied to college football programs was this: remove the best coach and the worst coach, and then assess the program. If they are still a great program, they are truly a dynasty.
This is the trimmed (truncated) mean idea that he was applying intuitively but is quite valuable in practice. When we assess customer lifetime value, if a small percentage of the customers generate 95% of the profits, examining those outliers or the long tail while valuable does not get at the general trend. When I was analyzing IRS corporate tax returns, the correlation between two line items (that I won't identify here!) was more than 90% over the 30K+ returns. But when we removed the largest 50 corporations, the correlation between these line items dropped to under 30%. Why? Because the tail drove the relationship; the overall trend didn't apply to the entire population. It is easy to be fooled by summary statistics for this reason: they assume characteristics about the data that may not be true.
This all gets back to nonlinearity in the data: if outliers behave differently than the general population, assess them based on the truncated populations. If outliers exist in your data, get the gist from the trimmed mean or median to reduce the bias from the outliers. We know this intuitively, but sometimes we forget to do it and make misleading inferences.
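The effect is easy to reproduce on toy numbers. The sketch below uses invented data (ten ordinary records with essentially no relationship, plus one giant), not the IRS figures, and a simple symmetric trim; other trimming conventions exist.

```python
import statistics

def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest trim_fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return statistics.mean(ordered[k:len(ordered) - k])

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented data: the last record plays the role of the largest corporation
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
y = [5, 3, 6, 2, 7, 4, 6, 3, 5, 4, 900]

print(f"mean of x:         {statistics.mean(x):.1f}")
print(f"trimmed mean of x: {trimmed_mean(x):.1f}")
print(f"correlation, all records:     {pearson(x, y):.3f}")
print(f"correlation, outlier removed: {pearson(x[:-1], y[:-1]):.3f}")
```

One record is enough to push the correlation close to 1 even though the remaining ten show almost none, and the same record drags the plain mean far away from the trimmed one.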
[UPDATE] I neglected to reference a former post that shows the problem of outliers in computing correlation coefficients: Beware of Outliers in Computing Correlations.
Sunday, May 23, 2010
Upcoming DMRadio Interview: Analytics and Business Rules
On June 3rd, a week from this Thursday, I'll be participating in my third DMRadio interview, this time on business rules (the first two were related to text mining, including this one last year). I always have found these interviews enjoyable to do. I'll probably be discussing an inductive rule discovery process I participated in with a Fortune 500 company (and described at last February's Predictive Analytics World Conference in San Francisco).
Even if you can't be there "live", you can download the interview later.
Thursday, May 20, 2010
Data Mining as a Top Career
More good news for data miners: http://www.signonsandiego.com/news/2010/may/19/hot-career-trends-for-college-grads-listed-in/
Of course time will tell. One sign will be how many more resumes (unsolicited) I get!
I think they got it right: data mining (and its siblings Predictive Analytics and Business Analytics) is growing in appeal. But more importantly, I see organizations believing they can do it.
Data mining. The field involves extracting specific information or patterns from large databases. Career prospects are available in areas including advertising technology, scientific research and law enforcement.
Tuesday, May 11, 2010
web analytics and predictive analytics: comments from emetrics
I just got back from the latest (and my first) eMetrics conference in San Jose, CA last week, and was very impressed by the practical nature of the conference. It was also quite a different experience for me to be in a setting where I knew very few people. I was there to co-present "Behavioral Driven Marketing Attribution" with Angel Morales. Angel and I are co-founders of SmarterRemarketer, a new web analytics company, and the solution we described is just one nut we are trying to crack in the industry.
This post though is related to the overlap between web analytics and predictive analytics: very little right now. It really is a different world, and for many I spoke with, the mere mention of "predictive analytics" resulted in one of those unknowing looks back at me. In fairness, much that was spoken to me resulted in the same look!
One such topic was that of "use cases", a term used over and over in talks, but one that I don't encounter in the data mining world. We describe "case studies", but a "use case" is a smaller and more specific example of something interesting or unusual in how individuals or groups of individuals interact with web sites (I hope I got that right). The key though is that this is a thread of usage. In data mining, it is more typical that predictive models are built, and then to understand why the models are the way they are, one might trace through some of the more interesting branches of a tree or unusual variable combinations in something similar to this "use case" idea.
First, what to commend... The analyses I saw were quite good: customer segmentation, A/B testing, web page layout, some attribution, etc. There was a great keynote by Joe Megibow of Expedia describing how Expedia's entire web presence has changed in the past year. One of my favorite bloggers, Kevin Hillstrom of MineThatData fame gave a presentation praising the power of conditional probabilities (very nice!). Lastly, there was one more keynote by someone I had never heard of (not to my credit), but is obviously a great communicator and is well-known in the web analytics world, Avinash Kaushik. One idea I liked very much from his keynote was the long tail: the tail of the distribution of keywords that navigates to his website contains many times more visits than his top 10. In the data mining world, of course, this would push us to characterize these sparsely populated items differently so they produce more influence in any predictive models. Lots to think about.
But I digress. The lack of data mining and predictive analytics at this conference begs (at least from me) the question: why not? They are swimming in data, have important business questions that need to be solved, and clearly not all of these are being solved well enough. That will be the subject of my next post.
Monday, May 10, 2010
Rexer Analytics Data Mining Survey
Calling all data miners! I encourage all to fill out the survey--it is the most complete survey of the data mining world that I am aware of. Use the link and code below, and stay tuned to see the results later in the year.
Access Code: RS2458 The full description sent by Karl Rexer is below: Rexer Analytics, a data mining consulting firm, is conducting our fourth annual survey of the analytic behaviors, views and preferences of data mining professionals. We would greatly appreciate it if you would: 1) Participate in this survey, and 2) Tell other data miners about the survey (forward this email to them). Thank you. Forwarding the survey to others is invaluable for our “snowball sample methodology”. It helps the survey reach a wide and diverse group of data miners. Thank you also to everyone who participated in previous Data Miner Surveys, and especially to the people who provided suggestions for new questions and other survey modifications. This year’s survey incorporates many ideas from survey participants. Your responses are completely confidential: no information you provide on the survey will be shared with anyone outside of Rexer Analytics. All reporting of the survey findings will be done in the aggregate, and no findings will be written in such a way as to identify any of the participants. This research is not being conducted for any third party, but is solely for the purpose of Rexer Analytics to disseminate the findings throughout the data mining community via publication, conference presentations, and personal contact. If you would like a summary of last year’s or this year’s findings emailed to you, there will be a place at the end of the survey to leave your email address. You can also email us directly (DataMinerSurvey@RexerAnalytics.com) if you have any questions about this research or to request research summaries. To participate, please click on the link below and enter the access code in the space provided. The survey should take approximately 20 minutes to complete. Anyone who has had this email forwarded to them should use the access code in the forwarded email. Thank you for your time. 
We hope the results from this survey provide useful information to the data mining community.
Wednesday, February 17, 2010
Predictive Analytics World Recap
Predictive Analytics World (PAW) just ended today, and here are a few thoughts on the conference.
PAW was a bigger conference than October's or last February's and it definitely felt bigger. It seemed to me that there was a larger international presence as well.
Major data mining software vendors included the ones you would expect (in alphabetical order to avoid any appearance of favoritism): Salford Systems, SAS, SPSS (an IBM company), Statsoft, and Tibco. Others who were there included Netezza (a new one for me--they have an innovative approach to data storage and retrieval), SAP, Florio (another new one for me--a drag-and-drop simulation tool) and REvolution.
One surprise to me was how many text mining case studies were presented. John Elder rightfully described text mining as "the wild west" of analytics in his talk and SAS introduced a new initiative in text analytics (including sentiment analysis, a topic that came up in several discussions I had with other attendees).
A second theme emphasized by Eric Siegel in the keynote and discussed in a technical manner by Day 2 Keynote Kim Larsen was uplift modeling, or as Larsen described it, Net Lift modeling. This approach makes so much sense: rather than considering just responders, one sets up the data to identify those individuals who respond because of the marketing campaign, and does not bother those who would respond anyway. I'm interested in understanding the particular way that Larsen approaches Net Lift models with variable selection and a variant of Naive Bayes.
But for me, the key is setting up the data right and Larsen described the data particularly well. A good campaign will have a treatment set and a control set, where the treatment set gets the promotion or mailing, and the control set does not. There are several possible outcomes here. First, in the treatment set, there are those individuals who would have responded anyway, those who respond because of the campaign, and those who do not respond. For the control set, there are those who respond despite not receiving a mailing, and those who do not. The problem, of course, is that in the treatment set, you don't know which individuals would have responded if they had not been mailed, but you suspect that they look like those in the control set who responded.
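The treatment/control bookkeeping can be sketched in a few lines. This is only the simplest difference-in-response-rates view, not Larsen's method (which involves variable selection and a Naive Bayes variant); the segment names and counts are hypothetical.

```python
def response_rate(responders, total):
    return responders / total

# Hypothetical campaign counts per segment: (responders, group size)
segments = {
    "loyal":  {"treatment": (300, 1000), "control": (290, 1000)},
    "lapsed": {"treatment": (120, 1000), "control": (40, 1000)},
    "new":    {"treatment": (80, 1000),  "control": (60, 1000)},
}

def net_lift(segment):
    """Treatment response rate minus control response rate."""
    treated = response_rate(*segment["treatment"])
    control = response_rate(*segment["control"])
    return treated - control

# Rank segments by the response attributable to the mailing
ranked = sorted(segments, key=lambda name: net_lift(segments[name]), reverse=True)
for name in ranked:
    treated = response_rate(*segments[name]["treatment"])
    print(f"{name:7s} treated response {treated:.1%}, net lift {net_lift(segments[name]):+.1%}")
```

In this invented example the "loyal" segment has by far the highest response rate when mailed, yet the lowest net lift, because nearly all of those customers would have responded anyway.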
A third area that struck me was that of big data. There was a session (that I missed, unfortunately) on in-database vs. in-cloud computing (by Neil Raden of Hired Brains), and Robert Grossman's talk on building and maintaining 10K predictive models. This latter application was one that I believe will be the approach that we move toward as data size increases, where the multiple models are customized by geography, product, demographic group, etc.
I enjoyed the conference tremendously, including the conversations with attendees. One of note was the use of ensembles of clustering models that I hope will be presented at a future PAW.
Principal Components for Modeling
Problem Statement
Analysts constructing predictive models frequently encounter the need to reduce the size of the available data, both in terms of variables and observations. One reason is that data sets are now available which are far too large to be modeled directly in their entirety using contemporary hardware and software. Another reason is that some data elements (variables) have an associated cost. For instance, medical tests bring an economic and sometimes human cost, so it would be ideal to minimize their use if possible. Another problem is overfitting: Many modeling algorithms will eagerly consume however much data they are fed, but increasing the size of this data will eventually produce models of increased complexity without a corresponding increase in quality. Model deployment and maintenance, too, may be encumbered by extra model inputs, in terms of both execution time and required data preparation and storage.
Naturally, the goal in data reduction is to decrease the size of needed data while maintaining (as much as is possible) model performance, so this process must be performed carefully.
A Solution: Principal Components
Selection of candidate predictor variables to retain (or to eliminate) is the most obvious way to reduce the size of the data. If model performance is not to suffer, though, then some effective measure of each variable's usefulness in the final model must be employed, which is complicated by the correlations among predictors. Several important procedures have been developed along these lines, such as forward selection, backward selection and stepwise selection.
Another possibility is principal components analysis ("PCA" to its friends), a procedure from multivariate statistics which yields a new set of variables (the same number as before), called the principal components. Conveniently, all of the principal components are simply linear functions of the original variables. As a side benefit, all of the principal components are completely uncorrelated. The technical details will not be presented here (see the reference, below), but suffice it to say that if 100 variables enter PCA, then 100 new variables (called the principal components) come out. You are now wondering, perhaps, where the "data reduction" is? Simple: PCA constructs the new variables so that the first principal component exhibits the largest variance, the second principal component exhibits the second largest variance, and so on.
How well this works in practice depends completely on the data. In some cases, though, a large fraction of the total variance in the data can be compressed into a very small number of principal components. The data reduction comes when the analyst decides to retain only the first n principal components.
Note that PCA does not eliminate the need for the original variables: they are all still used in the calculation of the principal components, no matter how few of the principal components are retained. Also, statistical variance (which is what is concentrated by PCA) may not correspond perfectly to "predictive information", although it is often a reasonable approximation.
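To make the variance-concentration idea concrete, here is a minimal pure-Python sketch that finds only the first principal component by power iteration on the covariance matrix. The synthetic three-variable data set is an assumption made for illustration, and power iteration is just one convenient way to get the leading eigenvector; statistical packages typically compute all components at once with a full eigendecomposition or SVD.

```python
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=500):
    """Power iteration: (largest eigenvalue, unit eigenvector) of cov.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / d ** 0.5] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

# Synthetic data: three variables driven by one hidden factor plus small noise
random.seed(0)
rows = []
for _ in range(200):
    base = random.gauss(0, 1)
    rows.append([base + random.gauss(0, 0.1),
                 2 * base + random.gauss(0, 0.1),
                 -base + random.gauss(0, 0.1)])

cov = covariance_matrix(rows)
eigenvalue, direction = top_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
share = eigenvalue / total_variance
print(f"first component carries {share:.1%} of the total variance")
```

Because the three variables here are nearly copies of one underlying factor, almost all of the variance lands in the first component. Note that the component is still a weighted sum of all three original variables, as described above.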
Last Words
Many statistical and data mining software packages will perform PCA, and it is not difficult to write one's own code. If you haven't tried this technique before, I recommend it: It is truly impressive to see PCA squeeze 90% of the variance in a large data set into a handful of variables.
Note: Related terms from the engineering world: eigenanalysis, eigenvector and eigenfunction.
Reference
For the down-and-dirty technical details of PCA (with enough information to allow you to program PCA), see:
Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)
Note: The first edition is adequate for coding PCA, and is at present much cheaper than the second or third editions.
Analysts constructing predictive models frequently encounter the need to reduce the size of the available data, both in terms of variables and observations. One reason is that data sets are now available which are far too large to be modeled directly in their entirety using contemporary hardware and software. Another reason is that some data elements (variables) have an associated cost. For instance, medical tests bring an economic and sometimes human cost, so it would be ideal to minimize their use if possible. Another problem is overfitting: Many modeling algorithms will eagerly consume however much data they are fed, but increasing the size of this data will eventually produce models of increased complexity without a corresponding increase in quality. Model deployment and maintenance, too, may be encumbered by extra model inputs, in terms of both execution time and required data preparation and storage.
Naturally, the goal in data reduction is to decrease the size of needed data while maintaining (as much as is possible) model performance, so this process must be performed carefully.
A Solution: Principal Components
Selection of candidate predictor variables to retain (or to eliminate) is the most obvious way to reduce the size of the data. If model performance is not to suffer, though, then some effective measure of each variable's usefulness in the final model must be employed, which is complicated by the correlations among predictors. Several important procedures have been developed along these lines, such as forward selection, backward selection and stepwise selection.
Another possibility is principal components analysis ("PCA" to its friends), a procedure from multivariate statistics which yields a new set of variables (the same number as before), called the principal components. Conveniently, all of the principal components are simply linear functions of the original variables. As a side benefit, the principal components are completely uncorrelated with one another. The technical details will not be presented here (see the reference, below), but suffice it to say that if 100 variables enter PCA, then 100 new variables (called the principal components) come out. You are now wondering, perhaps, where the "data reduction" is? Simple: PCA constructs the new variables so that the first principal component exhibits the largest variance, the second principal component exhibits the second largest variance, and so on.
How well this works in practice depends completely on the data. In some cases, though, a large fraction of the total variance in the data can be compressed into a very small number of principal components. The data reduction comes when the analyst decides to retain only the first n principal components.
Note that PCA does not eliminate the need for the original variables: they are all still used in the calculation of the principal components, no matter how few of the principal components are retained. Also, statistical variance (which is what is concentrated by PCA) may not correspond perfectly to "predictive information", although it is often a reasonable approximation.
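To make the idea concrete, here is a minimal sketch of PCA for just two variables, written in plain Python. It is not the procedure from the reference below: it leans on the closed-form eigenvalues of a 2x2 covariance matrix, which only works for two inputs, and the function name `pca_2d` and the small data set are invented for illustration.

```python
import math

def pca_2d(xs, ys):
    """Principal components of two variables via the 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]          # centered copies of the inputs
    cy = [y - my for y in ys]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(u * u for u in cx) / (n - 1)
    c = sum(v * v for v in cy) / (n - 1)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + c) / 2
    half = math.hypot((a - c) / 2, b)
    eigvals = [mid + half, mid - half]
    # Angle of the first principal axis; the second axis is perpendicular.
    theta = 0.5 * math.atan2(2 * b, a - c)
    w1 = (math.cos(theta), math.sin(theta))
    w2 = (-math.sin(theta), math.cos(theta))
    # Each principal component is a linear function of the original variables.
    pc1 = [w1[0] * u + w1[1] * v for u, v in zip(cx, cy)]
    pc2 = [w2[0] * u + w2[1] * v for u, v in zip(cx, cy)]
    return eigvals, pc1, pc2

# Illustrative data only: two strongly correlated variables.
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

eigvals, pc1, pc2 = pca_2d(xs, ys)
share = eigvals[0] / sum(eigvals)
print(f"PC1 holds {share:.1%} of the total variance")
```

With inputs this correlated, nearly all of the variance lands in the first component, so keeping only `pc1` would be the "data reduction" step. Note that both `xs` and `ys` are still needed to compute it.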
Last Words
Many statistical and data mining software packages will perform PCA, and it is not difficult to write one's own code. If you haven't tried this technique before, I recommend it: It is truly impressive to see PCA squeeze 90% of the variance in a large data set into a handful of variables.
Note: Related terms from the engineering world: eigenanalysis, eigenvector and eigenfunction.
Reference
For the down-and-dirty technical details of PCA (with enough information to allow you to program PCA), see:
Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)
Note: The first edition is adequate for coding PCA, and is at present much cheaper than the second or third editions.
Friday, February 12, 2010
Predictive Analytics World - San Francisco
The next Predictive Analytics World is coming up next week. This is a conference I look forward to very much because of the attendees; I have found that at the first two PAWs, there has been a good mix of folks who are experts and those who are spinning up on Predictive Analytics. I'll be teaching a hands-on workshop Monday (using Enterprise Miner), and presenting a talk on using trees to generate business rules for a help-desk text analytics application on Tuesday the 16th. You can still get the 15% discount if you use the registration code DEANABBOTT010 in the registration process (this is not a sales plug--I won't receive any benefit from this).
Look me up if you are going; I will be there both days (16th and 17th).
Tuesday, January 19, 2010
Is there anything new in Predictive Analytics?
Federal Computer Week's John Zyskowski posted an article on Jan 8, 2010 on Predictive Analytics entitled "Deja vu all over again: Predictive analytics look forward into the past". (kudos for the great Yogi Berra quote! But beware, as Berra stated himself, "I really didn't say everything I said")
Back to Predictive Analytics...Pieter Mimno is quoted as stating:
There's nothing new about this (Predictive Analytics). It's just old techniques that are being done better.
To support this argument, John quotes me regarding work done at DFAS 10 years ago. Is this true? Is there nothing new in predictive analytics? If it isn't true, what is new?
I think what is new is not algorithms, but a better integration of data mining software in the business environment, primarily in two places: on the front end and on the back end. On the front end, data mining tools are better at connecting to databases now compared to 10 years ago, and provide the analyst better tools for assessing the data coming into the software. This has always been a big hurdle, and was the reason that at KDD 1999 in San Diego, the panel discussion on "Data Mining into Vertical Solutions" concluded that data mining functionality would be integrated into the database to a large degree. But while it hasn't happened quite the way it was envisioned 10 years ago, it is clearly much easier to do now.
On the back end, I believe the most significant step forward in data mining tools has been giving the analyst the ability to assess models in a manner consistent with the business objectives of the model. So rather than comparing models based on R^2 or overall classification accuracy, most tools give you the ability to generate an ROI chart, or a ROC curve, or build a custom model assessment engine based on rank-ordered model predictions. This means that when we convey what models are doing to decision makers, we can do so in the language they understand and not force them to understand how good an R^2 of 0.4 really is. In addition, data mining tools are to a greater degree producing scoring code that is usable outside of the tool itself by creating SQL code, SAS code, C or Java, or PMML. What I'm waiting for next is for vendors to provide PMML or other code for all the data prep one does in the tool prior to the model itself; typically, PMML code is generated only for the model itself.
Sunday, January 10, 2010
Counting Observations
Data is fodder for the data mining process. One fundamental aspect of the data we analyze is its size, which is most often characterized by the number of observations and the number of variables in the given set of data, typically measured as counts of "rows and columns", respectively. It is worth taking a closer look at this, though, as questions such as "Do we have enough data?" depend on an apt measure of how much data we have.
Outcome Distributions
In many predictive modeling situations, cases are spread fairly evenly among the possible outcomes, but this is not always true. Many fraud detection problems, for instance, involve extreme class imbalance: target class cases (known frauds) may represent a small fraction of 1% of the available records. Despite having many total observations of customer behavior, observations of fraudulent behavior may be rather sparse. Data miners who work in the fraud detection field are acutely aware of this issue and characterize their data sets not just by 'total number of observations', but also by 'observations of the behavior of interest'. When assessing an existing data set, or specifying a new one, such an analyst generally employs both counts.
Numeric outcome variables may also suffer from this problem. Most numeric variables are not uniformly distributed, and areas in which outcome data is sparse (for instance, long tails of high personal income) are areas which may be poorly represented in models derived from that data.
With both class and numeric outcomes, it might be argued that outcome values which are infrequent are, by definition, less important. This may or may not be so, depending on the modeling process and our priorities. If the model is expected to perform well on the top personal income decile, then data should be evaluated by how many cases fall in that range, not just on the total observation count.
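As a small sketch of reporting both counts, the snippet below tallies the total number of observations alongside the observations of the behavior of interest. The labels, the 10-in-10,000 fraud rate and the helper name `outcome_counts` are all invented for illustration.

```python
from collections import Counter

def outcome_counts(labels, target):
    """Return (total observations, target-class observations, target rate)."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_target = counts[target]
    return total, n_target, n_target / total

# Hypothetical data set: 10 known frauds among 10,000 records.
labels = ["ok"] * 9990 + ["fraud"] * 10

total, n_fraud, rate = outcome_counts(labels, "fraud")
print(total, n_fraud, f"{rate:.2%}")   # 10000 10 0.10%
```

The same habit carries over to numeric outcomes: count how many cases fall in the range the model is expected to handle well, not just how many rows there are.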
Predictor Distributions
Issues of coverage occur on the input variable side, as well. Keeping in mind that generalization is the goal of discovered models, the total record count by itself seems inadequate when, for example, data are drawn from a process which has (or may have) a seasonal component. Having 250,000 records in a single data set sounds like many, but if they are only drawn from October, November and December, then one might reasonably take the perspective that only 3 "observations" of monthly behavior are represented, out of 12 possibilities. In fact, (assuming some level of stability from year to year) one could argue that not only should all 12 calendar months be included, but that they should be drawn from multiple historical years, so that there are multiple observations for each calendar month.
Other groupings of cases in the input space may also be important. For instance, hundreds of observations of retail sales may be available, but if they come from only 25 salespeople out of a sales force of 300, then treating the simple record count as the "observation count" may be deceiving.
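One simple way to check coverage of this kind is to count distinct values of a grouping variable rather than rows. The sketch below assumes records are dictionaries with hypothetical `month` and `salesperson` fields; the data and the helper name `coverage` are made up for illustration.

```python
def coverage(records, key):
    """Number of distinct values of `key` that appear in the records."""
    return len({r[key] for r in records})

# Hypothetical records drawn only from the last quarter of the year.
records = [
    {"month": 10, "salesperson": "S01"},
    {"month": 10, "salesperson": "S02"},
    {"month": 11, "salesperson": "S01"},
    {"month": 12, "salesperson": "S03"},
    {"month": 12, "salesperson": "S01"},
]

months_seen = coverage(records, "month")
people_seen = coverage(records, "salesperson")
print(f"{len(records)} records, {months_seen} of 12 months, "
      f"{people_seen} of 300 salespeople")
```

The row count may grow as large as you like, but the distinct counts say how many "observations" of monthly or per-salesperson behavior are actually present.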
Validation Issues
Observations as aggregates of single records should be considered during the construction of train/test data, as well. When pixel-level data are drawn from images for the construction of a pixel-level classifier, for instance, it makes sense to avoid having pixels from a given image serve as training observations, and other pixels from that same image serve as validation observations. Entire images should be labeled as "train" or "test", and pixels drawn as observations accordingly, to avoid "cheating" during model construction, based on the inherent redundancy in image data.
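A minimal sketch of such a split follows. It assigns whole groups (here, images) to "train" or "test" and only then collects the pixel records; the field name `image_id`, the 30% test fraction and the helper name `split_by_group` are assumptions made for illustration.

```python
import random

def split_by_group(records, group_key, test_fraction=0.3, seed=0):
    """Split records so that every group lands entirely in train or in test."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test

# Hypothetical pixel records: 10 images, 5 sampled pixels each.
pixels = [{"image_id": img, "pixel": p} for img in range(10) for p in range(5)]

train, test = split_by_group(pixels, "image_id")
train_images = {r["image_id"] for r in train}
test_images = {r["image_id"] for r in test}
print(len(train), len(test), train_images & test_images)
```

The intersection printed at the end should be empty: no image contributes pixels to both sides. The same function works for any grouping, such as customers or salespeople.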
Conclusion
This posting has only briefly touched on some of the issues which arise when attempting to measure the volume of data in one's possession, and has not explored yet more subtle concepts such as sampling techniques, observation weighting or model performance measures. Hopefully though, it gives the reader some things to think about when assessing data sets in terms of their size and quality.
Wednesday, January 06, 2010
Data Mining and Terrorism... Counterpoint
In a recent posting to this Web log (Data Mining and Privacy...again, Jan-04-2010), Dean Abbott made several points regarding the use of data mining to counter terrorism, and related privacy issues. I'd like to address the question of the usefulness of data mining in this application.
Dean quoted Bruce Schneier's argument against data mining's use in anti-terrorism programs. The specific technical argument that Schneier has made (and he is not alone in this) is: Automatic classification systems are unlikely to be effective at identifying individual terrorists, since terrorists are so rare. Schneier concludes that the rate of "false positives" could never be made low enough for such a system to work effectively.
As far as this specific technical line of thought goes, I agree absolutely, and doubt that any competent data analyst would disagree. It is the extension of this argument to the much broader conclusion that data mining is not a fruitful line of inquiry for those seeking to oppose terrorists that I take issue with.
Many (most?) computerized classification systems in practice output probabilities, as opposed to simple class predictions. Owners of such systems use them to prioritize their efforts (think of database marketers who sort name lists to find the so many who are most likely to respond to an offer). Classifiers need not be perfect to be useful, and portraying them as such is what I call the "Minority Report strawman".
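The prioritization idea can be sketched in a few lines: sort cases by predicted probability and work the top of the list. The scores, outcomes and helper name `top_k` below are invented for illustration and do not come from any real model.

```python
def top_k(scored, k):
    """Return the k cases with the highest predicted probability."""
    return sorted(scored, key=lambda case: case["score"], reverse=True)[:k]

# Hypothetical scored name list from a response model.
cases = [
    {"id": "a", "score": 0.02, "responded": False},
    {"id": "b", "score": 0.41, "responded": True},
    {"id": "c", "score": 0.07, "responded": False},
    {"id": "d", "score": 0.33, "responded": False},
    {"id": "e", "score": 0.58, "responded": True},
    {"id": "f", "score": 0.11, "responded": False},
]

shortlist = top_k(cases, 3)
base_rate = sum(c["responded"] for c in cases) / len(cases)
hit_rate = sum(c["responded"] for c in shortlist) / len(shortlist)
print([c["id"] for c in shortlist], f"{base_rate:.0%} overall", f"{hit_rate:.0%} in top 3")
```

The shortlist still contains a miss, yet it is twice as rich in responders as the list as a whole, which is all the owner of such a system needs in order to spend effort more wisely.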
Beyond this, data mining has been used to great effect in rooting out other criminal behaviors, such as money laundering, which are associated with terrorism. While those who practice our art against terrorism are unlikely to be forthcoming about their work, it is not difficult to imagine data mining systems other than classifiers being used in this struggle, such as analysis on networks of associates of terrorists.
It would take considerable naivety to believe that present computer systems could be trained to throw up red flags on a small number of individuals, previously unknown to be terrorists, with any serious degree of reliability. Given the other chores which data mining systems may perform in this fight, I think it is equally naive to abandon that promise for an overextended technical argument.
Monday, January 04, 2010
The Next Predictive Analytics World
Just a reminder that the next Predictive Analytics World is coming in another 6 weeks--Feb 16-17 in San Francisco.
I'll be teaching a pre-conference Hands-On Predictive Analytics workshop using SAS Enterprise Miner on the 15th, and presenting a text mining case study on the 16th.
For any readers here who may be going, feel free to use this discount code during registration to get a 15% discount off the 2-day conference: DEANABBOTT010
Hope to see you there.
Data Mining and Privacy...again
A Google search tonight on "data mining" referred to the latest DHS Privacy Office 2009 Data Mining Report to Congress. I'm always nervous when I see "data mining" in titles like this, especially when linked to privacy, because of the misconceptions about what data mining is and does. I have long contended that data mining only does what humans would do manually if they had enough time to do it. What most privacy advocates are really complaining about is the data that one has available to make the inferences from; data mining merely makes those inferences more efficiently.
What I like about this article are the common-sense comments made. Data mining on extremely rare events (such as terrorist attacks) is very difficult because there are not enough examples of the patterns to have high confidence that the predictions are not by chance. Or as it is stated in the article:
Security expert Bruce Schneier explains well. When searching for a needle in a haystack, adding more "hay" does no good at all. Computers and data mining are useful only if they are looking for something relatively common compared to the database searched. For instance, out of 900 million credit cards in the US, about 1% are stolen or fraudulently used every year. One in a hundred is certainly the exception rather than the rule, but it is a common enough occurrence to be worth data mining for. By contrast, the 9-11 hijackers were a 19-man needle in a 300 million person haystack, beyond the ken of even the finest super computer to seek out. Even an extremely low rate of false alarms will swamp the system.
Now this is true for the most commonly used data mining techniques (predictive models like decision trees, regression, neural nets, SVM). However, there are other techniques that are used to find links between interesting entities that are extremely unlikely to occur by chance. This isn't foolproof, of course, but while there will be lots of false alarms, they can still be useful. Again from the enlightened layperson:
An NSA data miner acknowledged, "Frankly, we'll probably be wrong 99 percent of the time . . . but 1 percent is far better than 1 in 100 million times if you were just guessing at random."
It's not as if this were a new topic. From the Cato Institute, this article describes the same phenomenon, and links to a Jeff Jonas presentation that describes how good investigation would have linked the 9/11 terrorists (rather than using data mining). Fair enough, but analytic techniques are still valuable in removing the chaff--those individuals or events that are very uninteresting. In fact, I have found this to be a very useful approach to handling difficult problems.
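The arithmetic behind the quoted needle-in-a-haystack argument can be sketched as follows. The population of 300 million and the 19 targets come from the quoted passage; the 99% detection rate and 0.1% false alarm rate are assumptions chosen only to illustrate the point, and `expected_flags` is a made-up helper name.

```python
def expected_flags(population, true_targets, detection_rate, false_alarm_rate):
    """Expected true hits, false alarms, and share of flags that are real."""
    true_hits = true_targets * detection_rate
    false_alarms = (population - true_targets) * false_alarm_rate
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision

# 19 targets in 300 million people (from the quote); rates are assumed.
hits, alarms, precision = expected_flags(
    population=300_000_000,
    true_targets=19,
    detection_rate=0.99,      # assumption: classifier catches 99% of targets
    false_alarm_rate=0.001,   # assumption: flags 0.1% of everyone else
)
print(f"{hits:.1f} true hits, {alarms:,.0f} false alarms, "
      f"{precision:.4%} of flags are real")
```

Even with rates this generous, roughly 300,000 innocent people would be flagged for a handful of true hits, which is why such output is better treated as a way to discard the clearly uninteresting cases than as a list of suspects.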