Monday, March 20, 2017

A Question of Resource Allocation

Of the resources consumed in data mining projects, the most precious (read: "expensive") is time, especially the time of the human analyst. Hence, a significant question for the analyst is how best to allocate his or her time.

Long and continuing experience indicates clearly that the most productive use of time in such work is that dedicated to data preparation. I apologize if this seems like an old topic to the reader, but it is an important lesson which seems to be forgotten annually, as each new technical development presents itself. A surprising number of authors, particularly the on-line variety, come to the conclusion that "the latest thing"* will spare us from needing to prepare and enhance the data.

I offer, as yet another data point in favor of this perspective, a recent conversation I had with a colleague. He and a small team conducted parallel modeling efforts for a shared client. Using the same base data, they constructed separate predictive models. His model and theirs achieved similar test performance. The team used a random forest, while he used logistic regression, one of the simplest modeling techniques. The team was perplexed at the similarity in model performance. My associate asked them how they had handled missing values. They responded that they had filled them in. He asked exactly how they had filled the missing values. The response was that they had set them all to zeros (!). By not taking the time and effort to address this issue comprehensively, they had forced their model to do the significant extra work of filling in these gaps itself. Consider that some fraction of their data's predictive power was spent compensating for this mistake, rather than being used to create a better model. Note, too, that it is far easier (less code, fewer input variables to monitor, less to go wrong) to deploy a modestly sized logistic regression than any random forest.
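To make the point concrete, here is a minimal sketch (with made-up numbers, not the client's data) of why zero-filling is so costly compared with even a simple median fill plus a missing-value indicator:

```python
# Hypothetical illustration: naively filling missing values with zeros
# distorts a variable's distribution, while a median fill plus a
# "missing" indicator preserves more information for the model.
incomes = [52000, 48000, None, 61000, None, 55000]

# Naive approach: every missing income becomes zero.
zero_filled = [x if x is not None else 0 for x in incomes]

# More careful approach: fill with the median of observed values,
# and add an indicator so the model can learn from missingness itself.
observed = sorted(x for x in incomes if x is not None)
median = observed[len(observed) // 2]
median_filled = [x if x is not None else median for x in incomes]
was_missing = [1 if x is None else 0 for x in incomes]

print(zero_filled)     # the zeros drag the variable far below reality
print(median_filled)   # the fills sit in a plausible range
print(was_missing)     # the model can still "see" which values were missing
```

With zero fills, the model must expend capacity learning that zero really means "unknown"; the indicator column hands it that information directly.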

Given this context, it is curious to note that so much of what is published (again, especially on-line; think of titles such as: "The 10 Learning Algorithms Every Data Scientist Must Know") and so many job listings emphasize, almost to the point of exclusivity, learning algorithms, as opposed to practical questions of data sampling, data preparation and enhancement, variable reduction, solving the business problem (instead of the technical one), or the ability to deploy the final product.

* For "the latest thing", you may fill in, variously, neural networks, decision trees, SVM, random forests, GPUs, deep learning or whatever comes out as next year's "next big thing".

Monday, April 25, 2016

Tracking Model Performance Over Time


Most introductory data mining texts include substantial coverage of model testing. Various methods of assessing true model performance (holdout testing, k-fold cross validation, etc.) are usually explained, perhaps with some important variants, such as stratification of the testing samples.
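As a refresher, the k-fold idea can be sketched in a few lines (a hand-rolled illustration; any serious toolkit provides this built in):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # Each fold serves once as the test set; the remaining folds form the
    # training set, so every observation is tested exactly once.
    for test_fold in folds:
        test = set(test_fold)
        train = [i for i in idx if i not in test]
        yield train, test_fold

for train, test in k_fold_indices(10, 5):
    print(len(train), len(test))  # 8 train, 2 test on each of the 5 passes
```

Stratification (not shown) would additionally force each fold to preserve the target variable's class proportions.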

Generally, all of this exposition is aimed at in-time analysis: Model development data may span multiple time periods, but the testing is more or less blind to this: all periods are treated as fair game and mixed together. This is fine for model development. Once predictive models are deployed, however, it is desirable to continue testing to track model performance over time. Models which degrade over time need to be adjusted or replaced.

Subtleties of Testing Over Time

Nearly all production model evaluation is performed with new out-of-time data. As new periods of observed outcomes become available, they are used to calculate running performance measures. As far as it goes, focusing on the actual performance metric makes sense. In my experience, though, some clients become distracted by movement in the independent variables or in the predicted or actual outcome distributions, in isolation. It is important to understand the dynamic of these changes to fully understand model performance over time.

For the sake of a thought experiment, consider a very simple problem with one independent variable and one target variable, both real numbers. Historically, the distribution of each of these variables has been confined to specific ranges. A predictive model has been constructed as a linear regression which attempts to anticipate the target variable, using only the input of the single independent variable (and a constant). Assume that errors observed in the development data have been small and otherwise unremarkable (they are distributed normally, their magnitude is relatively constant across the range of the independent variable, there is no obvious pattern to them and so forth).

Once this model is deployed, it is executed on all future cases drawn from the relevant statistical universe, and predictions are saved for further analysis. Likewise, actual outcomes are recorded as they become available. At the conclusion of each future time period, model performance within that period is examined.

Consider the simplest change to a well-developed model: the distribution of the independent variable remains the same, but the actual outcomes begin to depart from the regression line. Any number of changes could be taking place in the output distribution, but the predicted distribution (the regression line) cannot move since it is entirely defined by the independent variable, which in this case is stable. By definition, model performance is degrading. This circumstance is easy to diagnose: the dynamic linking the target and independent variables is changing, hence a new model is necessary to restore performance.

What happens, though, when the independent variable begins to migrate? There are two possible effects (in reality, some combination of these extremes is likely): 1. The distribution of actual outcomes shifts to appropriately match the change ("the dots march along the regression line"), or 2. The distribution of actual outcomes does not shift to match the change. In the first case, the model continues to correctly identify the relationship between the target and the independent variable, and model performance will more or less endure. In the second case, reality begins to wander from the model and performance deteriorates. Notice that, in the second case, the actual outcome distribution may or may not change noticeably; either way, the model no longer correctly anticipates reality and needs to be updated.
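A small simulation makes the two cases vivid. Assume a hypothetical deployed model y = 2x + 1, with all numbers invented for illustration:

```python
import random

rng = random.Random(42)
slope, intercept = 2.0, 1.0          # the deployed model: y_hat = 2x + 1

def mse(xs, ys):
    """Mean squared error of the deployed model on a period of new data."""
    return sum((y - (slope * x + intercept)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Case 1: the input distribution shifts to a new range, but the underlying
# relationship holds -- "the dots march along the regression line."
x1 = [rng.uniform(5, 10) for _ in range(200)]          # migrated inputs
y1 = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in x1]   # same dynamic

# Case 2: the inputs shift AND the true relationship changes.
x2 = [rng.uniform(5, 10) for _ in range(200)]
y2 = [3.0 * x - 2.0 + rng.gauss(0, 0.1) for x in x2]   # new dynamic

print(mse(x1, y1))  # small: the model still anticipates reality
print(mse(x2, y2))  # large: the model needs to be updated
```

In case 1 the error stays near the development-time noise level despite the migration; in case 2 it explodes, even though the input distribution moved identically in both cases.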


The example used here was deliberately chosen to be simple, for illustration's sake. Qualitatively, though, the same basic behaviors are exhibited by much more complex models. Models featuring multiple independent variables or employing complex transformations (neural networks, decision trees, etc.) obey the same fundamental dynamic. Given the sensitivity of nonlinear models to each of their independent variables, a migration in even one of them may provoke the changes described above. Consideration of the components of this interplay in isolation only serves to confuse: changes over time can only be understood as part of the larger whole.

Sunday, December 06, 2015

Predictive Modeling Skills: Expect to be Surprised

Excerpted from Chapter 1 of my book Applied Predictive Analytics, Wiley 2014
Conventional wisdom says that predictive modelers need to have an academic background in statistics, mathematics, computer science, or engineering. A degree in one of these fields is best, but without a degree, at a minimum, one should at least have taken statistics or mathematics courses. Historically, one could not get a degree in predictive analytics, data mining, or machine learning.
This has changed, however, and dozens of universities now offer master’s degrees in predictive analytics. Additionally, there are many variants of analytics degrees, including master’s degrees in data mining, marketing analytics, business analytics, or machine learning. Some programs even include a practicum so that students can learn to apply textbook science to real-world problems.
One reason the real-world experience is so critical for predictive modeling is that the science has tremendous limitations. Most real-world problems have data problems never encountered in the textbooks. The ways in which data can go wrong are seemingly endless; building the same customer acquisition models even within the same domain requires different approaches to data preparation, missing value imputation, feature creation, and even modeling methods.
However, the principles of how one can solve data problems are not endless; the experience of building models for several years will prepare modelers to at least be able to identify when potential problems may arise.
Surveys of top-notch predictive modelers reveal a mixed story, however. While many have a science, statistics, or mathematics background, many do not. Many have backgrounds in social science or humanities. How can this be?
Consider a retail example. The retailer Target was building predictive models to identify likely purchase behavior and to incentivize future behavior with relevant offers. Andrew Pole, a Senior Manager of Media and Database Marketing described how the company went about building systems of predictive models at the Predictive Analytics World Conference in 2010. Pole described the importance of a combination of domain knowledge, knowledge of predictive modeling, and most of all, a forensic mindset in successful modeling of what he calls a “guest portrait.”
They developed a model to predict whether a female customer was pregnant. They noticed patterns of purchase behavior, what he called "nesting" behavior. For example, women were purchasing cribs on average 90 days before the due date. Pole also observed that some products were purchased at regular intervals prior to a woman's due date. The company also observed that if it could acquire these women as purchasers of other products before the birth of their babies, customer value increased significantly; these women would continue to purchase from Target after the baby was born.
The key descriptive terms are "observed" and "noticed." This means the models were not built as black boxes. The analysts asked, "does this make sense?" and leveraged insights gained from the patterns found in the data to produce better predictive models. It undoubtedly was iterative; as they "noticed" patterns, they were prompted to consider other patterns they had not explicitly considered before (and that maybe had not even occurred to them). This forensic mindset of analysts, noticing interesting patterns and making connections between those patterns and how the models could be used, is critical to successful modeling. It is rare that predictive models can be fully defined before a project and modelers can anticipate all of the most important patterns the model will find. So we shouldn't be surprised that we will be surprised, or put another way, we should expect to be surprised.

This kind of mindset is not learned in a university program; it is part of the personality of the individual. Good predictive modelers need to have a forensic mindset and intellectual curiosity, whether or not they understand the mathematics enough to derive the equations for linear regression.
(This post first appeared in the Predictive Analytics Times)

Friday, July 17, 2015

Data Mining's Forgotten Step-Children

Depending on whose definition one reads, the list of activities which comprise data mining will vary, but the first two items are always the same...

Number 1: Prediction

The most common data mining function, by far, is prediction (or, more esoterically, supervised learning), which is sometimes listed twice, depending on the type of variable being predicted: classification (when the target is categorical) vs. regression (when the target is numerical). Predictive models learned by machines from historical examples easily dominate almost any measure of data mining activity: time, money, technical papers published, software packages, etc. The hyperbole of marketers and the fears of data mining critics, also, are most often associated with prediction.

Number 2: Clustering

The second most common data mining function in practice is clustering (sometimes known by the alias unsupervised learning). Gathering things into "natural" groupings has a long history in some fields (cladistics in biology, for instance), though clustering's "no right or wrong answer" quality likely will cement its continuing spot in second place.  Despite being second banana to prediction, clustering enjoys widespread application and is well understood even in non-technical circles. What marketer doesn't like a good segmentation?
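For the curious, the core of one popular clustering technique, k-means, fits in a few lines (a toy one-dimensional sketch, not production code):

```python
from statistics import mean

def kmeans_1d(points, centers, iters=20):
    """Minimal 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious "natural groupings" in a toy customer-spend variable:
spend = [10, 12, 11, 95, 102, 99]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(sorted(centers))  # one center near 11, the other near 98.7
```

Note the "no right or wrong answer" quality mentioned above: a different choice of k or of starting centers can yield a different, equally defensible segmentation.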

"... and all the rest!"

What else is in the data mining toolbox? Definitions vary, but the next two most commonly mentioned tasks are anomaly detection and association rule discovery. Other tasks have been included, such as data visualization, though that field dates back well over a hundred years and clearly enjoys a healthy existence outside of the data mining field.

Anomaly detection (a superset of statistical outlier detection) searches for observations which violate patterns in data. Generally, these patterns are discovered (explicitly or not) using prediction or clustering. Given that a wide array of prediction or clustering techniques might be applied, the patterns concluded to exist within a single data set will vary, implying that observations flagged as anomalous will vary. This leaves anomaly detection somewhat in the company of clustering in the sense of having "no right or wrong answers".  Still, anomaly detection can be immensely useful, with two common applications being fraud detection and data cleansing. This author has used a simple anomaly detection process to help find errors in predictive model implementation code.
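As an illustration, here is about the simplest statistical outlier test one could write (a toy sketch; real anomaly detection is usually far more sophisticated):

```python
from statistics import mean, stdev

def flag_outliers(values, z_cut=2.5):
    """Flag observations more than z_cut standard deviations from the mean,
    a crude statistical outlier test. The threshold is deliberately below
    the textbook 3.0 because an extreme outlier inflates the standard
    deviation and can thereby mask itself."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > z_cut]

# A toy data-cleansing pass: one keying error hides among normal amounts.
amounts = [101, 98, 103, 97, 99, 102, 100, 9800, 101, 99]
print(flag_outliers(amounts))  # [9800]
```

Swap in a different detector (a robust median/MAD test, a clustering-based distance, a model's residuals) and a different set of observations may be flagged, which is exactly the "no right or wrong answers" point above.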

Association rule discovery attempts to identify patterns among data items which exhibit associations with one another. The classic example is individual items of merchandise in a retail setting (market basket analysis): Each purchase represents an association of a variety of distinct items with one another. After enough purchases, relationships among items can be inferred, such as the frequent purchase of coffee with sugar. Relationships among people, as evidenced by instances of telephone or electronic contact, have also been explored, both for marketing purposes and in law enforcement.
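The support-and-confidence arithmetic behind such rules is simple enough to sketch directly (toy baskets, invented for illustration):

```python
from itertools import combinations
from collections import Counter

# Toy market baskets for the classic coffee-and-sugar example.
baskets = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"coffee", "sugar", "bread"},
    {"bread", "milk"},
    {"coffee", "milk"},
]

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(pair for b in baskets
                      for pair in combinations(sorted(b), 2))

n = len(baskets)
# Rule "coffee -> sugar": support = P(coffee and sugar together),
# confidence = P(sugar | coffee).
support = pair_counts[("coffee", "sugar")] / n
confidence = pair_counts[("coffee", "sugar")] / item_counts["coffee"]
print(support, confidence)  # 0.6 0.75
```

Real algorithms such as Apriori do essentially this counting, but prune the exponential space of candidate item sets so it stays tractable on millions of baskets.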

Further Reading

Neither anomaly detection nor association rule discovery receive nearly the press that the first two members of the data mining club do, but it is worth learning something about them. Some problems fall more naturally into their purview. To get started with these techniques, the standard references will do, such as Witten and Frank, or Han and Kamber. Also consider material on outliers in the traditional statistical literature.

Thursday, July 24, 2014

Similarities and Differences Between Predictive Analytics and Business Intelligence

I’ve been reminded recently of the overlap between business intelligence and predictive analytics. Of course any reader of this blog (or at least the title of the blog) knows I live in the world of data mining (DM) and predictive analytics (PA), not the world of business intelligence (BI). In general, I don’t make comments about BI because I am an outsider looking in. Nevertheless, I view BI as a sibling to PA because we share so much in common: we use the same data, often use similar metrics and even sometimes use the same tools in our analyses.

I was interviewed by Victoria Garment of Software Advice on the topic of testing the accuracy of predictive models in January 2014 (I think I was first contacted about the interview in December 2013). What I didn’t know was that John Elder and Karl Rexer, two good friends and colleagues in this space, were interviewed as well. The resulting article, "3 Ways to Test the Accuracy of Your Predictive Models," posted on their Plotting Success blog, was well written and generated quite a bit of buzz on Twitter after it was posted.

Prior to the interview, I had no knowledge of Software Advice and after looking at their blog, I understand why: they are clearly a BI blog. But after reading maybe a dozen posts, it is clear that we are siblings, in particular sharing concepts and approaches in big data, data science, staffing and talent acquisition. I've enjoyed going back to the blog. 

The similarities of BI and PA are points I’ve tried to make in talks I’ve given at eMetrics and performance management conferences. After making suitable translations of terms, these two fields can understand each other well. Two sample differences in terminology are described here.

First, one rarely hears the term KPI at a PA conference, but will often hear it at BI conferences. If we use Google as an indicator of the popularity of the term KPI,
  • ' “predictive analytics” KPI' yielded a mere 103,000 hits on Google, whereas
  • ' “business intelligence” KPI' yielded 1,510,000 hits.
In PA, one is more likely to hear these ideas described as metrics, or even as features or derived variables that can be used as inputs to models or as a target variable.

As a second example, a “use case” is frequently presented at BI conferences to explain the reason for creating a particular KPI or analysis. “Use cases” are rarely described at PA conferences; in PA we say “case studies.” Back to Google, we find
  • ' "business intelligence" "use case" ' – 306,000 hits on Google
  • ' “predictive analytics” “use case” ' – 58,800 hits on Google
  • ' “predictive analytics” “case study” ' – 217,000 hits on Google

Interestingly, the top two links for “predictive analytics” “use case” from the search weren’t even predictive analytics use cases or case studies. The second link of the two actually described how predictive analytics is a use case for cloud computing.

The BI community, however, seems to embrace PA and even consider it part of BI (much to the chagrin of the PA community, I would think). The Wikipedia entry on BI includes a chart of the topics that are considered part of BI.

Interestingly, DM, PA, and even Prescriptive Analytics are considered a part of BI. I must admit, at all the DM and PA conferences I’ve attended, I’ve never heard attendees describe themselves as BI practitioners. I have heard more cross-branding of BI and PA at other conferences that include BI-specific material, like Performance Management and Web Analytics conferences.

Contrast this with the PA Wikipedia page, whose taxonomy of fields related to PA is typical. I would personally include dashed lines to Text Mining and maybe even Link Analysis or Social Networks as they are related though not directly under PA. Interestingly, statistics falls under PA here, I’m sure to the chagrin of statisticians! And, I would guess that at a statistics conference, the attendees would not refer to themselves as predictive modelers. But maybe they would consider themselves data scientists! Alas, that’s another topic altogether. But that is the way these kinds of lists go; they are difficult to perfect and usually generate discussion over where the dividing lines occur.

This tendency to include fields as part of “our own” is a trap most of us fall into: we tend to be myopic in our views of the fields of study. It frankly reminds me of a map I remember hanging in my house growing up in Natick, MA: “A Bostonian’s Idea of The United States of America”.  Clearly, Cape Cod is far more important than Florida or even California!

Be that as it may, my final point is that BI and PA are important but complementary disciplines. BI is a much larger field and understandably so. PA is more of a specialty, but a specialty that is gaining visibility and recognition as an important skill set to have in any organization. Here’s to further collaboration in the future!

Monday, May 26, 2014

Why Overfitting is More Dangerous than Just Poor Accuracy, Part II

In Part I, I explained one problem with overfitting the data: estimates of the target variable in regions without any training data can be unstable, whether those regions require the model to interpolate or extrapolate. Accuracy is a problem, but more precisely, the problems in interpolation and extrapolation are not revealed using any accuracy metrics and only arise when new data points are encountered after the model is deployed.
This month, a second problem with overfitting is described: unreliable model interpretation. Predictive modeling algorithms find variables that associate or correlate with the target variable. When models are overfit, the algorithm has latched onto variables that appear strongly associated with the target, but these relationships are not repeatable; such variables may not be related to the target at all. When we interpret what the model is telling us, we therefore glean the wrong insights, and these insights can be difficult to shed once we rebuild models to simplify them and avoid overfitting.
Consider an example from the 1998 KDD Cup data. One variable, RFA_3, has 70 levels (71 if we include the missing values), a case of a high-cardinality input variable. A decision tree may try to group all levels with the highest association with the target variable, TARGET_B, a categorical variable with labels 0 for non-responders and 1 for responders to a mailing campaign.
RFA_3 turns out to be one of the top predictors when building decision trees. The decision tree may try to group all levels with high average rates of TARGET_B equal to 1. The table below shows the 10 highest rates along with the counts for how many records match each value of RFA_3. The question is this: when a value like L4G matches only 10 records, one of which is a responder (10% response rate), do we believe it? How sure are we that the measured 10% rate in our sample is reproducible for the next 10 values of L4G?
We can gain some insight by applying a simple statistical test, like a binomial distribution test you can find online. The upper and lower bounds of the measured rate, given the sample size, are shown in the table as well. For L4G, we are 95% sure from the statistical test that L4G will have a rate between 0% and 28.6%. This means the 10% rate we measured in the small sample could, in the long run, really be 1%. Or it could be 20%. We just don’t know.
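One such test can be sketched directly; below is a Wilson score interval, a common choice for a binomial proportion (the exact bounds quoted above may come from a different interval formula):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 gives roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 1 responder out of 10 records for level "L4G": the measured 10% rate
# is compatible with a very wide range of true response rates.
lo, hi = wilson_interval(1, 10)
print(round(lo, 3), round(hi, 3))
```

Whatever the exact formula, the lesson is the same: with only 10 records, the interval is so wide that the measured 10% rate tells us almost nothing.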
[Table: for each RFA_3 value, the percent of records with TARGET_B = 1, the confidence interval lower and upper bounds, and whether the rate is 95% confident above average]

For the 1998 KDD Cup data, it turns out that RFA_3 isn’t one of the better predictors of TARGET_B; it only showed up as a significant predictor when overfitting reared its ugly head.
The solution? Beware of overfitting. For high-cardinality variables, apply a complexity penalty to reduce the likelihood of finding these low-count associations. For continuous variables, the problem exists as well and can be just as deceptive. For every problem you solve, assess models on resampled data: held-out testing data, cross-validation, or bootstrap samples.

note: this post first appeared in Predictive Analytics Times (with minor edits added here)