Depending on whose definition one reads, the list of activities which comprise data mining will vary, but the first two items are always the same...
Number 1: Prediction
The most common data mining function, by far, is prediction (or, more esoterically, supervised learning), which is sometimes listed twice, depending on the type of variable being predicted: classification (when the target is categorical) vs. regression (when the target is numerical). Predictive models learned by machines from historical examples easily account for the largest share of almost any measure of data mining activity: time, money, technical papers published, software packages, etc. The hyperbole of marketers and the fears of data mining critics, also, are most often associated with prediction.
Number 2: Clustering
The second most common data mining function in practice is clustering (sometimes known by the alias unsupervised learning). Gathering things into "natural" groupings has a long history in some fields (cladistics in biology, for instance), though clustering's "no right or wrong answer" quality likely will cement its continuing spot in second place. Despite being second banana to prediction, clustering enjoys widespread application and is well understood even in non-technical circles. What marketer doesn't like a good segmentation?
"... and all the rest!"
What else is in the data mining toolbox? Definitions vary, but the next two most commonly mentioned tasks are anomaly detection and association rule discovery. Other tasks have been included, such as data visualization, though that field dates back well over a hundred years and clearly enjoys a healthy existence outside of the data mining field.
Anomaly detection (a superset of statistical outlier detection) searches for observations which violate patterns in data. Generally, these patterns are discovered (explicitly or not) using prediction or clustering. Given that a wide array of prediction or clustering techniques might be applied, the patterns concluded to exist within a single data set will vary, implying that observations flagged as anomalous will vary. This leaves anomaly detection somewhat in the company of clustering in the sense of having "no right or wrong answers". Still, anomaly detection can be immensely useful, with two common applications being fraud detection and data cleansing. This author has used a simple anomaly detection process to help find errors in predictive model implementation code.
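To make the "pattern first, violations second" idea concrete, here is a minimal sketch, assuming the pattern is a simple straight-line prediction and that "violating the pattern" means having an unusually large residual. The data, the function names and the 3.5 cutoff on the robust z-score are illustrative assumptions, not a description of any particular tool.

```python
from statistics import mean, median

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def flag_anomalies(xs, ys, threshold=3.5):
    """Return indices whose residual from the fitted line is unusually large.

    Residuals are scored with a robust z-score based on the median and the
    median absolute deviation (MAD); the threshold is an assumed cutoff.
    """
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    med = median(residuals)
    mad = median([abs(r - med) for r in residuals])
    if mad == 0:
        return []  # residuals are all (nearly) identical; nothing stands out
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - med) / mad > threshold]

# Made-up data: y is roughly 2 * x, except for one corrupted record.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0, 13.9, 16.2, 17.8, 20.1]
print(flag_anomalies(xs, ys))
```

Swap in a different predictive model, or a clustering, and a different set of records may be flagged, which is exactly the "no right or wrong answers" point made above.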
Association rule discovery attempts to identify patterns among data items which exhibit associations with one another. The classic example is individual items of merchandise in a retail setting (market basket analysis): Each purchase represents an association of a variety of distinct items with one another. After enough purchases, relationships among items can be inferred, such as the frequent purchase of coffee with sugar. Relationships among people, as evidenced by instances of telephone or electronic contact, have also been explored, both for marketing purposes and in law enforcement.
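The coffee-and-sugar example can be sketched in a few lines. This is only an illustration under stated assumptions: the six baskets are made up, and support, confidence and lift are the usual textbook measures for a rule, not the output of any specific market basket product.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift

# Hypothetical purchases; each set is one basket.
baskets = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"coffee", "bread"},
    {"sugar", "bread"},
    {"coffee", "sugar", "bread"},
    {"milk", "bread"},
]

sup, conf, lift = rule_metrics({"coffee"}, {"sugar"}, baskets)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

A lift above 1 suggests the two items appear together more often than they would if they were purchased independently; with only six baskets, of course, nothing here should be taken seriously.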
Neither anomaly detection nor association rule discovery receives nearly the press that the first two members of the data mining club do, but it is worth learning something about them. Some problems fall more naturally into their purview. To get started with these techniques, the standard references will do, such as Witten and Frank, or Han and Kamber. Also consider material on outliers in the traditional statistical literature.
Thursday, July 24, 2014
I’ve been reminded recently of the overlap between business intelligence and predictive analytics. Of course any reader of this blog (or at least the title of the blog) knows I live in the world of data mining (DM) and predictive analytics (PA), not the world of business intelligence (BI). In general, I don’t make comments about BI because I am an outsider looking in. Nevertheless, I view BI as a sibling to PA because we share so much in common: we use the same data, often use similar metrics and even sometimes use the same tools in our analyses.
I was interviewed by Victoria Garment of Software Advice on the topic of testing accuracy of predictive models in January, 2014 (I think I was first contacted about the interview in December, 2013). What I didn’t know was that John Elder and Karl Rexer, two good friends and colleagues in this space, were interviewed as well. The resulting article, "3 Ways to Test the Accuracy of Your Predictive Models", posted on their Plotting Success blog, was well written and generated quite a bit of buzz on Twitter after it was posted.
Prior to the interview, I had no knowledge of Software Advice and after looking at their blog, I understand why: they are clearly a BI blog. But after reading maybe a dozen posts, it is clear that we are siblings, in particular sharing concepts and approaches in big data, data science, staffing and talent acquisition. I've enjoyed going back to the blog.
The similarities of BI and PA are points I’ve tried to make in talks I’ve given at eMetrics and performance management conferences. After making suitable translations of terms, these two fields can understand each other well. Two sample differences in terminology are described here.
First, one rarely hears the term KPI at a PA conference, but will often hear it at BI conferences. If we use Google as an indicator of popularity of the term KPI,
- ' “predictive analytics” KPI' yielded a mere 103,000 hits on Google, whereas
- ' “business intelligence” KPI' yielded 1,510,000 hits.
In PA, one is more likely to hear these ideas described as metrics or even features or derived variables that can be used as inputs to models or as a target variable.
As a second example, a “use case” is frequently presented in BI conferences to explain a reason for creating a particular KPI or analysis. “Use Cases” are rarely described in PA conferences; in PA we say “case studies”. Back to Google, we find
- ' “business intelligence” “use case” ' – 306,000 hits on Google
- ' “predictive analytics” “use case” ' – 58,800 hits on Google
- ' “predictive analytics” “case study” ' – 217,000 hits on Google
Interestingly, the top two links for “predictive analytics” “use case” from the search weren’t even predictive analytics use cases or case studies. The second link of the two actually described how predictive analytics is a use case for cloud computing.
The BI community, however, seems to embrace PA and even consider it part of BI (much to the chagrin of the PA community, I would think). According to the Wikipedia entry on BI, the following chart shows topics that are a part of BI:
Interestingly, DM, PA, and even Prescriptive Analytics are considered a part of BI. I must admit, at all the DM and PA conferences I’ve attended, I’ve never heard attendees describe themselves as BI practitioners. I have heard more cross-branding of BI and PA at other conferences that include BI-specific material, like Performance Management and Web Analytics conferences.
Contrast this with the PA Wikipedia page. This taxonomy of fields related to PA is typical. I would personally include dashed lines to Text Mining and maybe even Link Analysis or Social Networks as they are related though not directly under PA. Interestingly, statistics falls under PA here, I’m sure to the chagrin of statisticians! And, I would guess that at a statistics conference, the attendees would not refer to themselves as predictive modelers. But maybe they would consider themselves data scientists! Alas, that’s another topic altogether. But that is the way these kinds of lists go; they are difficult to perfect and usually generate discussion over where the dividing lines occur.
This tendency to include other fields as part of “our own” is a trap most of us fall into: we tend to be myopic in our views of the fields of study. It frankly reminds me of a map I remember hanging in my house growing up in Natick, MA: “A Bostonian’s Idea of The United States of America”. Clearly, Cape Cod is far more important than Florida or even California!
Be that as it may, my final point is that BI and PA are important but complementary disciplines. BI is a much larger field and understandably so. PA is more of a specialty, but a specialty that is gaining visibility and recognition as an important skill set to have in any organization. Here’s to further collaboration in the future!
Posted by Dean Abbott at 10:56 PM
Monday, May 26, 2014
Thursday, May 01, 2014
Arguably, the most important safeguard in building predictive models is complexity regularization to avoid overfitting the data. When models are overfit, their accuracy is lower on new data that wasn’t seen during training, and therefore when these models are deployed, they will disappoint, sometimes even leading decision makers to believe that predictive modeling “doesn’t work”.
Overfit, however, is thankfully a well-known problem and every algorithm has ways to avoid it. CART® and C5 trees use pruning to remove branches that are prone to overfitting; CHAID trees require that splits be statistically significant before they add complexity to the tree. Neural networks use held-out data to stop training when accuracy on the held-out data becomes worse. Stepwise regression uses information theoretic criteria like the Akaike Information Criterion (AIC), Minimum Description Length (MDL), or the Bayesian Information Criterion (BIC) to add terms only when the additional complexity is offset by enough reduction of error.
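As a rough illustration of the "complexity must be offset by error reduction" idea, here is a minimal sketch of AIC and BIC for a least-squares model. It assumes Gaussian errors and drops additive constants, as is commonly done; the sample size, residual sums of squares (RSS) and term counts are made-up numbers, and the function names are hypothetical.

```python
import math

def aic(n, rss, k):
    """AIC for a least-squares fit with n rows, k fitted terms (constants dropped)."""
    return n * math.log(rss / n) + 2 * k

def bic(n, rss, k):
    """BIC for the same fit; the penalty per term grows with log(n)."""
    return n * math.log(rss / n) + k * math.log(n)

def accept_new_term(n, rss_old, k_old, rss_new, criterion=aic):
    """A stepwise-style check: keep the extra term only if the criterion drops."""
    return criterion(n, rss_new, k_old + 1) < criterion(n, rss_old, k_old)

n = 100
# Candidate term A barely reduces error (250 -> 248): not worth the complexity.
print(accept_new_term(n, 250.0, 3, 248.0))
# Candidate term B reduces error substantially (250 -> 200): worth adding.
print(accept_new_term(n, 250.0, 3, 200.0))
```

Lower is better for both criteria, and BIC penalizes each added term more heavily than AIC once the sample has more than a handful of rows.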
But overfitting causes more problems than merely misclassifying cases in holdout data or incurring large errors in regression problems. Without loss of generality, this discussion will only describe overfitting in classification problems, but the same principles apply in regression problems as well.
One way modelers reduce the likelihood of overfit is to apply the principle of Occam’s Razor, where if two models exhibit the same accuracy, we will prefer the simpler model because it is more likely to generalize well. By simpler, we must keep in mind that we prefer models that behave more simply rather than models that just appear to be simpler because they have fewer terms. John Elder (a regular contributor to the PA Times) has a fantastic discussion of that topic in the book by Seni and Elder, Ensemble Methods in Data Mining.
Consider this example contrasting linear and nonlinear models. The figure below shows decision boundaries for two models that separate two classes of the famous Iris Data (http://archive.ics.uci.edu/ml/datasets/Iris). On the left is the decision boundary from a linear model built using linear discriminant analysis (like LDA or the Fisher Discriminant) and on the right, a decision boundary built by a model using quadratic discriminant analysis (like the Bayes Rule). The image can be found at http://scikit-learn.org/0.5/auto_examples/plot_lda_vs_qda.html.
It appears that the accuracy of both models is the same (let’s assume that it is), yet the behavior of the models is very different. If there is new data to be classified that appears in the upper left of the plot, the LDA model will call the data point versicolor whereas the QDA model will call it virginica. Which is correct? We don’t know which would be correct from the training data, but we do know this: there is no justification in the data to increase the complexity of the model from linear to quadratic. We probably would prefer the linear model here.
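The same situation can be mimicked with a toy sketch. To be clear about the assumptions: the points below are made up (they are not the Iris measurements), and the two rules are hand-written linear and quadratic scoring functions rather than fitted LDA and QDA models. The only purpose is to show two models that agree everywhere on the training data and still disagree where there is no data.

```python
# Hypothetical 2-D training points, NOT the real Iris data.
train = [
    ((1, 1), "versicolor"), ((2, 1), "versicolor"),
    ((1, 2), "versicolor"), ((2, 2), "versicolor"),
    ((4, 4), "virginica"), ((5, 4), "virginica"),
    ((4, 5), "virginica"), ((5, 5), "virginica"),
]

def linear_rule(x1, x2):
    """Hand-written linear boundary: the line x1 + x2 = 6."""
    score = x1 + x2 - 6
    return "virginica" if score > 0 else "versicolor"

def quadratic_rule(x1, x2):
    """Same rule plus an extra quadratic term that bends the boundary."""
    score = x1 + x2 - 6 + 0.5 * (x1 - x2) ** 2
    return "virginica" if score > 0 else "versicolor"

def accuracy(rule, data):
    """Fraction of labeled points the rule classifies correctly."""
    return sum(rule(*point) == label for point, label in data) / len(data)

print(accuracy(linear_rule, train), accuracy(quadratic_rule, train))

# A new point in the "upper left", far from any training data.
new_point = (0, 5)
print(linear_rule(*new_point), quadratic_rule(*new_point))
```

Both rules are perfect on the training points, so accuracy alone cannot choose between them; the disagreement only shows up when the models are asked about a region they never saw.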
Applying models to regions without data is the entire reason for avoiding overfit. The issue with the figure above was with model behavior when doing extrapolation, where we want to make sure that the models behave in a reasonable way for values outside (larger than or smaller than) the data used in training. But models also need to behave well when they interpolate, meaning we want models to behave reasonably for data in between data that exists in the training data.
Consider the second figure below showing decision boundaries for two models built from a data set derived from the famous KDD Cup data from 1998. The two dimensions in this plot are Average Donation Amount (Y) and Recent Donation Amount (X). This data tells the story that higher values of average and recent donation amounts are related to higher likelihoods of donors responding; note that for the smallest values of both average and recent donation amount, at the very bottom left of the data, the regions are colored cyan.
Both models are built using the Support Vector Machines (SVM) algorithm, but with different values of the complexity constant, C. Obviously, the model at the left is more complex than the model on the right. The magenta regions represent responders and the cyan regions represent non-responders.
In the effort to be more accurate on training data, the model on the left creates closed decision boundaries around any and all groupings of responders. The model at the right joins these smaller blobs together into a larger blob where the model classifies data as responders. The complexity constant for the model at the right gives up accuracy to gain simplicity.
Which model is more believable? The one on the left will exhibit strange interpolation properties; data in between the magenta blobs will be called non-responders, sometimes in very thin regions between magenta regions; this behavior isn’t smooth or believable. The model at the right creates a single region of data to be classified as a responder and is clearly better than the model at the left.
Beware of overfitting the data, and test models not just on testing or validation data but, if possible, on values not in the data to ensure that their behavior, whether interpolation or extrapolation, is believable.
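One way to probe a model on values not in the data is to sweep a grid through the input space and look at the shape of the predicted regions. The sketch below is a stand-in under stated assumptions: it uses a one-dimensional k-nearest-neighbor classifier instead of the SVMs discussed above (with k playing the role of the complexity knob), and the donation amounts and labels are made up.

```python
# Hypothetical training data: (recent donation amount, 1 = responder).
train = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 0),
         (7, 1), (8, 1), (9, 0), (10, 1), (11, 1), (12, 1)]

def knn_predict(x, data, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def count_responder_regions(predictions):
    """Count maximal runs of consecutive 1s along the sweep."""
    regions, previous = 0, 0
    for p in predictions:
        if p == 1 and previous == 0:
            regions += 1
        previous = p
    return regions

# Grid values fall between and beyond the training values, never on them.
grid = [0.25 + 0.5 * i for i in range(26)]

for k in (1, 5):
    predictions = [knn_predict(x, train, k) for x in grid]
    print(k, count_responder_regions(predictions))
```

With k = 1 the sweep finds several separate responder islands with thin non-responder gaps between them; with k = 5 the responders form a single region, which is the smoother and more believable story for this made-up data.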
In part II, the problem overfitting causes for model interpretation will be addressed.
This article first appeared at the Predictive Analytics Times, http://www.predictiveanalyticsworld.com/patimes/why-overfitting-is-more-dangerous-than-just-poor-accuracy-part-i/
Posted by Dean Abbott at 1:20 PM