Friday, April 26, 2013

Math and Predictive Analytics - A Personal Account

Last week I taught a workshop at Predictive Analytics World entitled Supercharging Prediction: Hands-On with Ensemble Models. The workshop was intended to introduce predictive modelers to the concept of ensembles through a combination of lecture, to provide an overview of model ensembles, and hands-on exercises, to gain experience building ensembles using Salford Systems SPM v7.0 (Salford Systems sponsored the workshop).

This morning, Heather Hinman, a Marketing Communications Manager at Salford Systems, posted comments on attending that workshop at the Salford Systems blog. Two comments were particularly interesting, especially their implications vis a vis my last blog post on math and predictive analytics:

I will admit I was intimidated at first to be participating in a predictive modeling workshop as I do not have a background in statistics, and only have basic training on decision tree tools by Salford Systems' team of in-house experts. Despite my basic knowledge of decision trees, I was thrilled that I was able to follow along with ease and understanding when learning about tree ensembles and modern hybrid modeling approaches. Marketing folk building predictive models? Yes, we can!
and
Now back at the office in San Diego, along with my usual responsibilities, I feel confident in my ability to build predictive models and gain insights into the data at hand to achieve the email marketing and online campaign goals for our communication efforts!  
In the post, Heather also outlines some of the principles she learned and how she used them to build the predictive models in the workshop.

The point is this: if one uses good software that uses solid principles for building predictive models, and one understands key principles of building predictive models, someone without a mathematics background can build good, profitable models.



Monday, April 01, 2013

Do Predictive Modelers Need to Know Math?

(Note: this post was first published in the March 2013 Edition of the Predictive Analytics Times)
Predictive analytics is just a bunch of math, isn’t it? After all, algorithms in the form of matrix algebra, summations, integrals, multiplies and adds are the core of what predictive modeling algorithms do. Even rule-based approaches need math to compute how good the if-then-else rules are.

I was participating in a predictive analytics course recently and the question a participant asked at the end of two days of instruction was this: “it’s been a long time since I’ve had to do this kind of math and I’m a bit rusty. Is there a book that would help me learn the techniques without the math?”

The question about math was interesting. But do we need to know the math to build models well? Anyone can build a bad model, but to build a good model, don’t we need to know what the algorithms are doing? The answer, of course, depends on the role of the analyst. I contend, however, that for most predictive analytics projects, the answer is “no”.

Let’s consider building decision tree models. What options does one need to set to build good trees? Here is a short list of common knobs that can be set by most predictive analytics software packages:
  1. Splitting metric (CART style trees, C5 style trees, CHAID style trees, etc.)
  2. Terminal node minimum size
  3. Parent node minimum size
  4. Maximum tree depth
  5. Pruning options (standard error, Chi-square test p-value threshold, etc.)

The most mathematical of these knobs is the splitting metric. CART-styled trees use the Gini Index, C5 trees use Entropy (information gain), and CHAID style trees use the chi-square test as the splitting criterion. A book I consider the best technical book on data mining and statistical learning methods, “The Elements of Statistical Learning”, has this description of the splitting criteria for decision trees, including the Gini Index and Entropy:
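For reference, the standard textbook forms of these node impurity measures can be written as below. This is a restatement in common notation, not a reproduction of the book's figure: for a node m and K classes, \hat{p}_{mk} is the proportion of class k records in node m, and k(m) is the majority class in that node.

```latex
\begin{align*}
\text{Misclassification error:}\quad & 1 - \hat{p}_{mk(m)} \\
\text{Gini index:}\quad & \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}) \\
\text{Cross-entropy:}\quad & -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
\end{align*}
```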



To a mathematician, these make sense. But without a mathematics background, these equations will be at best opaque and at worst incomprehensible. (And these are not very complicated. Technical textbooks and papers describing machine learning algorithms can be quite difficult even for more seasoned, but out-of-practice mathematicians to understand).

As a predictive modeler with a mathematics background, I must say that the actual splitting equations almost never matter to me. Gini and Entropy often produce the same splits or at least similar splits. CHAID differs more, especially in how it creates multi-way splits. But even here, the biggest difference for me is not the math, but just that they use different tests for determining "good" splits.

There are, however, very important reasons for someone on the team to understand the mathematics or at least the way these algorithms work qualitatively. First and foremost, understanding the algorithms helps us uncover why models go wrong. Models can be biased toward splitting on particular variables or even particular records. In some cases, it may appear that the models are performing well but in actuality they are brittle. Understanding the math can help remind us that this may happen and why.

The fact that linear regression uses a quadratic cost function tells us that outliers affect overall error disproportionately. Understanding how decision trees measure differences between the parent population and sub-populations informs us why a high-cardinality variable may be showing up at the top of our tree, and why additional penalties may be in order to reduce this bias. Seeing the computation of information gain (derived from Entropy) tells us that binary classification with a small target value proportion (such as having 5% 1s) often won't generate any splits at all.
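To make the last point concrete, here is a minimal Python sketch of the two impurity measures and the gain from a binary split. The function names and the example node sizes and proportions are illustrative assumptions, not taken from any particular software package. It compares the same kind of split on a balanced target and on a target with 5% 1s.

```python
import math

def gini(p):
    """Gini index of a binary node whose proportion of 1s is p."""
    return 2.0 * p * (1.0 - p)

def entropy(p):
    """Entropy (in bits) of a binary node whose proportion of 1s is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def split_gain(impurity, n_left, p_left, n_right, p_right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = n_left + n_right
    p_parent = (n_left * p_left + n_right * p_right) / n
    children = (n_left * impurity(p_left) + n_right * impurity(p_right)) / n
    return impurity(p_parent) - children

# Hypothetical example: 1,000 records split into two halves.
# Balanced target (50% 1s): children at 80% and 20%.
balanced_gain = split_gain(entropy, 500, 0.80, 500, 0.20)
# Rare target (5% 1s): children at 8% and 2%.
rare_gain = split_gain(entropy, 500, 0.08, 500, 0.02)

print("information gain, balanced target:", round(balanced_gain, 4))
print("information gain, rare target:    ", round(rare_gain, 4))
```

The gain on the rare target is far smaller in absolute terms, so a tree with a minimum-gain or pruning threshold tuned for balanced data may, under these assumptions, decline to split at all.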

The answer to the question of whether predictive modelers need to know math is this: no, they don’t need to understand the mathematical notation, but neither should they ignore the mathematics. Instead, we all need to understand the effects of the mathematics on the algorithms we use. “Those who ignore statistics are condemned to reinvent it,” warns Bradley Efron of Stanford University. The same applies to mathematics.

Thursday, February 14, 2013

What To Take Home from Your Next Predictive Analytics Conference

Why should one go to a predictive analytics conference? What should one take home from a conference like Predictive Analytics World (PAW)? There are many reasons conferences are valuable including interacting with thought leaders and practitioners, seeing software and hardware tools (the exhibit hall), and learning principles of predictive analytics from talks and workshops. This post focuses on the talks, and in particular, case studies.

There is no quicker way to upgrade our capabilities than having someone else who has "been there" tell us how they succeeded in their development and implementation of predictive models. When I go to conferences, this is at the top of my list. In the best case studies I am able to see a different way of looking at a problem than I had considered before, how the practitioner overcame obstacles, how their target variable was defined, what data was used in building the models, how the data was prepared, what figure of merit they used to judge a model's effectiveness, and much more.

Almost all case studies we see at conferences are success stories; we all love winners. Yes, we all know that we learn from mistakes and many case studies actually enumerate mistakes. But success sells and given time limitations in a 20-50 minute talk, few mistakes and dead-ends are usually described in the talks. And, as we used to say when I was doing government contracting, one works like crazy on the research and then when the money runs out, one declares victory. Putting a more positive spin on the process, we do as well as we can with the resources we have, and if the final solution improves the current system, we are indeed successful.

But once we observe the successful approach, what can we really take home with us? There are three reasons we should be skeptical of taking case studies and applying them directly to our own problems.

The first two reasons are straightforward. First, our data is different from the data used in the talk. Obviously. But it is likely to be different enough that one cannot take the exact same approach to data preparation or target variable creation that one sees at a conference.

Second, our business is different. The way the question was framed and the way predictions can be used are likely to differ in our organization. If we are building models to predict Medicare fraud, the way the “suspicious” claim is processed and which data elements are available vary significantly for each provider (codes being just one example).

The third reason is more subtle and more difficult to overcome. In a fascinating New Yorker article entitled, "The Truth Wears Off: Is there something wrong with the scientific method?", author Jonah Lehrer describes an effect seen by many researchers over the past few decades. Findings in major studies, published in reputable journals, and showing statistically significant results have been difficult to replicate by the original researcher and by others. This is a huge problem because replicating results is what we do as predictive modelers: we assume that behavior in the past can and will be replicated in the future.

In one example, researcher Jonathan Schooler (who was originally at the University of Washington as a graduate student) “demonstrated that subjects shown a face and asked to describe it were much less likely to recognize the face when shown it later than those who had simply looked at it. Schooler called the phenomenon ‘verbal overshadowing’. The study turned him into an academic star."

A few years later, he tried to replicate the study and didn’t succeed. In fact, he tried many times over the years and never succeeded. The effect he found at first waned each time he tried to replicate the study with additional data. "This was profoundly frustrating. It was as if nature gave me this great result and then tried to take it back.” There have been a variety of potential explanations for the effect, including “regression to the mean”. This might very well be the case because even when we show statistically significant results, defined by having a p value less than 0.05, there is still a chance that the effect found was not really there at all. Over thousands of studies, therefore, dozens will find effects that aren't really there.

Let's assume we are building models and there is actually no significant difference between responders and non-responders (but we don't know that). However, we work very hard to identify an effect, and eventually we find the effect on training and testing data. We publish. But the effect isn't there; we just had good luck (which in the long run is actually bad luck!). Even if the probability of finding the effect by chance is 1 in 100, or 1 in 1000, if we experiment enough and search through enough variables, we may happen upon a seemingly good effect eventually. This process, called "over searching" by Jensen and Cohen (see "Multiple Comparisons in Induction Algorithms"), is a real danger.
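A small simulation can make over searching tangible. The sketch below is only an illustration under assumed conditions: the target and every candidate variable are independent coin flips, so no real effect exists, and the function name and sample sizes are hypothetical. It reports the best apparent accuracy found after searching through a given number of candidates.

```python
import random

def best_chance_accuracy(n_records, n_candidates, seed=0):
    """Best agreement between a random binary target and any of
    n_candidates random binary variables (no true relationship exists)."""
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(n_records)]
    best = 0.0
    for _ in range(n_candidates):
        feature = [rng.randint(0, 1) for _ in range(n_records)]
        agree = sum(f == t for f, t in zip(feature, target)) / n_records
        # A variable that mostly disagrees is just as "predictive" once flipped.
        best = max(best, agree, 1.0 - agree)
    return best

for n_candidates in (1, 10, 100, 1000):
    acc = best_chance_accuracy(n_records=100, n_candidates=n_candidates)
    print(n_candidates, "candidates searched -> best apparent accuracy", acc)
```

Because every variable is pure noise, the true accuracy of any of them is 50%. The best one found can only look better as the search widens, which is exactly the trap.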

So what do we do at conferences? We should take home ideas, principles, and approaches rather than recipes. These should spur us to try ideas we either hadn't yet tried or hadn't even thought about before.

(An earlier version of this post was first published in the Predictive Analytics Times February 2013 issue)

Sunday, February 10, 2013

Using Geographic Data

Most organizations collect and maintain some type of geographic data, yet many ignore this data during analysis. Any business has some record of customer addresses, for instance, but this data is usually formatted in an awkward, non-numeric form. Geographic data can be very predictive, though, since behaviors being predicted often have some correlation to location.

So, how might one use geographic data? Possible answers depend on several factors, most importantly the volume and type of such data. A company serving a national market in the United States, for instance, will have customer shipping and billing addresses (not necessarily the same thing) for each customer (possibly for each transaction). These addresses normally come with a range of spatial granularities: street address, town, state, and associated ZIP Code (a 5-digit postal code).

Even at the largest level of aggregation, the state level, there may be over 50 distinct values (besides the 50 states, American addresses may be in Washington D.C. [technically not part of any state], or any of a number of other American territories, the most common of which is probably Puerto Rico). With 50 or so distinct values, significant data volume is needed to amass the observations needed to draw conclusions about each value. Given the best case scenario, in which all states exhibit equal observation counts, 1,000 observations break out into 50 categories of merely 20 observations each, not even enough to satisfy the old statistician's 30 observation rule of thumb. In data mining circles, we are accustomed to having much larger observation counts, but consider that the distribution of state values is never uniform in real data.

Using individual dummy variables to represent each state may be possible with especially large volumes. Possibly an "other" category covering the least frequent states will be needed. Another technique which I have found to work well is to replace the categorical state variable with a numeric variable representing a summary of the target variable, conditioned by state. In other words, all instances of "Virginia" are replaced by the average of the target variable for all Virginia cases, all instances of "New Jersey" are replaced by the average of the target variable for all New Jersey cases, and so on. This solution concentrates information about the target which comes from the state in a single variable, but makes interactions with other predictors more opaque. Ideally, such summaries are calculated on a special hold-out set of data, used just for this purpose, so as to avoid over-fitting. Again, it may be necessary to lump the smallest states together as "other". While I have used American states in my example, it should not be hard for the reader to extend this idea to Canadian provinces, French départements, etc.
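The state-summary technique can be sketched in a few lines of Python. This is a minimal illustration, not production code: the function names, the `min_count` threshold, and the tiny example data are all assumptions made for the sketch. Summaries are computed on a hold-out set, states with too few records are pooled into "other", and states never seen in the hold-out set fall back to "other" as well.

```python
from collections import defaultdict

def fit_state_means(holdout_rows, min_count=30):
    """Average target per state from (state, target) pairs in a hold-out set.
    States with fewer than min_count records are pooled into 'other'."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for state, target in holdout_rows:
        sums[state] += target
        counts[state] += 1
    n_total = sum(counts.values())
    if n_total == 0:
        raise ValueError("hold-out set is empty")

    means = {s: sums[s] / counts[s] for s in counts if counts[s] >= min_count}
    rare = [s for s in counts if counts[s] < min_count]
    rare_n = sum(counts[s] for s in rare)
    if rare_n > 0:
        means["other"] = sum(sums[s] for s in rare) / rare_n
    else:
        # No rare states observed: use the overall mean for unseen states.
        means["other"] = sum(sums.values()) / n_total
    return means

def encode_state(state, means):
    """Replace a state label with its hold-out target average."""
    return means.get(state, means["other"])

# Hypothetical hold-out data; min_count is lowered only to suit the toy size.
holdout = [("VA", 1), ("VA", 0), ("NJ", 1), ("NJ", 1), ("WY", 0)]
state_means = fit_state_means(holdout, min_count=2)
print(encode_state("VA", state_means), encode_state("PR", state_means))
```

Keeping the fit step separate from the encode step is what lets the summaries come from data the model never trains on.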

Most American states are large enough to provide robust summaries, but as a group they may not provide enough differentiation in the target variable. Changing the spatial scale implies a trade-off: Smaller geographic units exhibit worse summary variance, but improved geographic differentiation. American town names are not necessarily unique within a given state and similar names may be confused (Newtown, Pennsylvania is quite a distance from Newtown Square, Pennsylvania, for instance). In the United States, county names are unambiguous within a state, and present finer spatial detail than states. County names do not, however, normally appear in addresses, but they are easily attached using ZIP Code/County tables found on-line. Another possible aggregation is the Sectional Center Facility, or "SCF", which is the first 3 digits of the ZIP Code.

In the American market, other types of spatial definitions which can be used include: Census Bureau definitions, telephone area codes and Metropolitan Statistical Areas ("MSAs") and related groupings defined by the U.S. Office of Management and Budget. The Census Bureau is a government agency which divides the entire country into spatial units which vary in scale, down to very small areas (much smaller than ZIP Codes). MSAs are very popular with marketers. There are 366 MSAs at present, and they do not cover the entire land area of the United States, though they do cover about 85% of its population.

It is important to note that nearly all geographic entities change in size, shape and character over time. While existing American state and county boundaries almost never change any more, ZIP Code boundaries and Census Bureau definitions, for instance, do change. Changing boundaries obviously complicates analysis, even though historic boundary definitions are often available. Even among entities whose boundaries do not change, radical changes in behavior may happen in geographically distinct ways. Consider that a model built before Hurricane Katrina may no longer perform well in areas affected by the storm.

Also note that some geographic units, by definition, "respect" other definitions. American counties, for instance, only contain land from a single state. Others don't: the third-most populous MSA, Chicago-Joliet-Naperville, IL-IN-WI, for example, overlaps three different states.

Being creative when defining model inputs can be as helpful with geographic data as it is with more conventional data. In addition to the billing address itself, consider transformations such as: Has the billing address ever changed (1) or not (0)? How many times has the billing address changed? How often has the billing address changed (number of times changed divided by number of months the account has been open)? How far is the shipping address from the billing address? And so on...
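These transformations are simple to compute once the raw history is in hand. The sketch below is illustrative only: it assumes the billing address history is available as a chronological list of strings and that both addresses have already been geocoded to latitude and longitude (the post does not describe how; that step and all names here are assumptions). Distance is the great-circle (haversine) distance, a straight-line approximation rather than drive distance.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    earth_radius_miles = 3958.8  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

def address_features(billing_history, months_open, billing_latlon, shipping_latlon):
    """Derive model inputs from a chronological list of billing addresses
    and the geocoded current billing and shipping addresses."""
    changes = sum(1 for prev, cur in zip(billing_history, billing_history[1:]) if prev != cur)
    return {
        "billing_ever_changed": 1 if changes > 0 else 0,
        "billing_change_count": changes,
        "billing_changes_per_month": changes / months_open if months_open > 0 else 0.0,
        "ship_bill_distance_miles": haversine_miles(*billing_latlon, *shipping_latlon),
    }

# Hypothetical account: one billing address change in 24 months.
features = address_features(
    billing_history=["12 Oak St", "12 Oak St", "9 Elm Ave"],
    months_open=24,
    billing_latlon=(40.2298, -74.9363),
    shipping_latlon=(39.9868, -75.4010),
)
print(features)
```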

Much more sophisticated use may be made of geographic data than has been described in this short posting. Software is available commercially which will determine drive time contours about locations, which would be useful, for instance, when modeling retail store revenue by location. Additionally, there is an entire field of statistics, called spatial statistics, which defines a class of analysis procedures specific to this sort of thing.

I encourage readers who have avoided geographic data to consider even simple mechanisms to include it in model construction. Opening up a new dimension in your analysis may provide significant returns.





Saturday, February 02, 2013

When Analysis Isn't the Answer

Data mining is an important tool whose benefits have been demonstrated in diverse fields, among business, government and non-profit organizations. Its application areas continue to grow, especially given the ever-shrinking cost of gathering and organizing data. Yet, there are problems for which data mining is wholly unsuited as a solution.

To understand when data mining is not applicable, it will be helpful to define precisely when it is applicable. Data mining (inferential statistics, predictive analytics, etc.) requires data stored in a machine format of sufficient volume, quality and relevance so as to permit the construction of predictive models which assist in real-world decision making.

Most of our time as data miners is spent worrying over the quality of the data and the process of turning data into models. However, it is important to realize the usual context of data mining. Most organizations can perform basic decision making competently, and they have done so for thousands of years. Whether the base decision process is human judgment, a simple set of rules or a spreadsheet, much performance potential is already realized before data mining is applied. Consultants' marketing notwithstanding, data mining typically inhabits the margin of performance, where it tries to bring an extra "edge".

So, if the above two paragraphs describe conditions conducive to data mining success, what sorts of real-world situations defy data mining? The most obvious would be problems featuring data that is too small, too narrow, too noisy or of too little relevance to allow effective modeling. Organizations which have not maintained good records, which still rely on non-computer procedures and those with too little history are good examples. Even within very large organizations which collect and store enormous databases, there may be no relevant data for the problem at hand (for instance, when a new line of business is being opened, or new products introduced). It is surprising how often business people expect to extract value from a situation when they have failed to invest in appropriate data gathering.

Another large area with minimal data mining potential is organizations whose basic business process is so fundamentally broken that the usual decision making procedures have failed to do the usual "heavy lifting". Any of us can easily recall experiences in retail establishments whose operation was so flawed that it was obvious that the profit potential was nowhere near being fully exploited. Data mining cannot fine tune a process which is so far gone. No amount of quantitative analysis will fix unkempt shelves, a weak product offering or poor employee behavior.

Wednesday, January 16, 2013

Three Ways to Get Your Predictive Models Deployed


We all know that given reasonable data, a good predictive modeler can build a model that works well and helps make better decisions than what is currently used in your organization (at least in our own minds). Newer data, sophisticated algorithms, and a seasoned analyst are all working in our favor when we build these models, and if success were measured by accuracy (as it is in most data mining competitions), we're in great shape. Yes, there are always gotchas and glitches along the way. But when my deliverable is only slideware, even if the modeling is hard, I'm confident of being able to declare victory at the end.

However, the reality is that there is much more to the transition from cool model to actual deployment than a nice slide deck and paper accepted at one's favorite predictive analytics, data mining or big data conference. In these venues, the winning models are those that are "accurate" (more on that later) and have used creative analysis techniques to find the solution; we won't submit a paper when we only had to press the "go" button and have the data mining software give us a great solution!

For me, the gold standard is deployment. If the model gets used and improves the decisions an organization makes, I've succeeded. Three ways to increase the likelihood your models are deployed are:

1) Make sure the model stakeholder designs deployment into the project from the beginning

The model stakeholder is the individual, usually a manager, who is the advocate of predictive models to decision-makers. It is possible that a senior-level modeler can do this task, but that person must be able to switch hit: he or she must be able to speak the language of management and be able to talk technical detail to analysts. This may require more than one trusted person: the manager, who is responsible and makes the ultimate decisions about the models, and the lead modeler, who is responsible for the technical aspects of the model. It is more than "talking the talk" and knowing buzz-words in both realms; the person or persons must truly be "one of" both groups.

For those who have followed my blog posts and conference talks, you know I am a big advocate of the CRISP-DM process model (or equivalent methodologies, which seem to be endless). I've referred to CRISP-DM often, including on topics related to what data miners need to learn and Defining the Target Variable, just as two examples.

The stakeholder must not only understand the business objectives of the model (Business Understanding in CRISP-DM), but must also be present when discussions take place about which models will be built. It is essential that reasonable expectations are put into place from the beginning, including what a good model will "look like" (accuracy and interpretability) and how the final model will be deployed.

I've seen far too many projects die or become inconsequential because either the wrong objectives were used in building the models, meaning the models were operationally useless, or because the deployment of the models was not considered, meaning again that the models were operationally useless. As an example, on one project, the model was assumed to be able to be run within a rules engine, but the models that were built were not rules at all, but were complex non-linear models that could not be translated into rules. The problem obviously could have been avoided had this disconnect been verbalized early in the modeling process.

2) Make sure modelers understand the purpose of the models

The modelers must know how the models will be used and what metrics should be used to judge model performance. A good summary of typical error metrics used by modelers is found here. However, for most of the models I have deployed in customer acquisition, retention, and risk modeling, the treatment based on the model is never applied to the entire population (we don't mail everyone, just a subset). So the metrics that make the most sense are often ones like "lift after the top decile", maximum cumulative net revenue, top 1000 scores to be investigated, etc. I've actually seen negative correlations between the ranking of models based on global metrics (like classification error or R^2) vs. the ranking based on subset selection ranking, such as top 1000 scores; very different models may be deployed depending on the metric one uses to assess them. If modelers aren't aware of the metric to be used, the wrong model can be selected, even one that does worse than the current approach.
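The disagreement between global and subset metrics is easy to reproduce on toy data. In the sketch below, the scores, labels and function names are all made up for illustration: model B has the lower classification error, yet model A finds more responders among its top-scored records, so the two metrics would pick different models.

```python
def classification_error(scores, labels, threshold=0.5):
    """Fraction of records misclassified when scoring at or above threshold means 1."""
    wrong = sum((s >= threshold) != bool(y) for s, y in zip(scores, labels))
    return wrong / len(labels)

def top_n_hits(scores, labels, n):
    """Number of actual 1s among the n highest-scoring records."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:n])

def lift_at_depth(scores, labels, fraction=0.1):
    """Response rate in the top fraction of scores divided by the overall rate."""
    n = max(1, int(round(len(labels) * fraction)))
    base_rate = sum(labels) / len(labels)
    return (top_n_hits(scores, labels, n) / n) / base_rate

# Hypothetical data: 10 records, 2 responders.
labels  = [1,    1,    0,    0,    0,    0,    0,    0,    0,    0]
model_a = [0.45, 0.40, 0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
model_b = [0.60, 0.55, 0.90, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

for name, scores in (("A", model_a), ("B", model_b)):
    print(name,
          "error:", classification_error(scores, labels),
          "hits in top 2:", top_n_hits(scores, labels, 2),
          "lift at 20% depth:", lift_at_depth(scores, labels, 0.2))
```

Model A never crosses the 0.5 threshold, so it misclassifies both responders, but it ranks them first. If only the top-scored records will ever be treated, ranking is what matters.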

Second, if the modelers don't understand how the models will be deployed operationally, they may find a fantastic model, one that maximizes the right metric, but is useless. The Netflix Prize is a great example: the final winning model was accurate but far too complex to be used. Netflix extracted key pieces of the models to operationalize instead. I've had customers stipulate to me that "no more than 10 variables can be included in the final model". If modelers aren't aware of specific timelines or implementation constraints, a great but useless model can be the result.

3) Make sure the model stakeholder understands what the models can and can't do

In the effort to get models deployed, I've seen models elevated to a status they don't deserve, most often by exaggerating their accuracy and expected performance once in operation. I understand why modelers may do this: they have a direct stake in what they did. But the manager must be more skeptical and conservative.

One of the most successful colleagues I've ever worked with used to assess model performance on held-out data using the metric we had been given (maximum depth one could mail to and still achieve the pre-determined response rate). But then he always backed off what was reported to his managers by about 10% to give some wiggle room. Why? Because even in our best efforts, there is still a danger that the data environment after the model is deployed will differ from that used in building the models, thus reducing the effectiveness of the models.

A second problem for the model stakeholder is communicating an interpretation of the models to decision-makers. I've had to do this exercise several times in the past few months and it is always eye-opening when I try to explain the patterns a model is finding when the model is itself complex. We can describe overall trends ("on average", more of X increases the model score) and we can also describe specific patterns (when observable fields X and Y are both high, the model score is high). Both are needed to communicate what the models do, but both have to connect with what a decision-maker understands about the problem. If it doesn't make sense, the model won't be used. If it is too obvious, the model isn't worth being used.

The ideal model for me is one where the decision-maker nods knowingly at the "on average" effects (these should usually be obvious). Then, once you throw in some specific patterns, he or she should scrunch his/her eyes, think a bit, then smile as the implications of the pattern dawn on them, because that pattern really does make sense (but was previously not considered).

As predictive modelers, we know that absolutes are hard to come by, so even if these three principles are adhered to, other factors can sabotage the deployment of a model. Nevertheless, in general, these steps will increase the likelihood that models are deployed. In all three steps, communication is the key to ensuring the model built addresses the right business objective, the right scoring metric, and can be deployed operationally.

NOTE: this post was originally posted for the Predictive Analytics Times at http://www.predictiveanalyticsworld.com/patimes/january13/ 

Friday, January 04, 2013

Top Posts in 2012

For the second consecutive year, a quick look back at posts from the prior year.

For posts published in 2012, in order of popularity:
  1. Target, Pregnancy, and Predictive Analytics, Part I
  2. Target, Pregnancy, and Predictive Analytics, Part II
  3. Predictive Analytics World Had the Target Story First
  4. Why Defining the Target Variable in Predictive Analytics is Critical
  5. Dilbert, Database Marketing, and Spam
I’m also adding #6 because Will’s post in December did very well, but of course has had only one month to accumulate views.
  6. 6 Reasons You Hired the Wrong Data Miner
From posts prior to 2012, in order of popularity for 2012:
  1. What Do Data Miners Need to Learn (June 2011)
  2. Free and Inexpensive Data Mining Software (November 2006) This post needs to be updated!
  3. Why Normalization Matters for K Means (April 2009) It always amazes me why this post persists as one of the most popular, but nearly ¼ of the visits used the search term “K Means Noisy Data”
  4. Data Mining Data Sets (April 2008) This post also needs to be updated
  5. Business Analytics vs. Business Intelligence (December 2009)
One final note: When I look back at visits since the start of this blog, 4 of the top 5 posts are the top 4 “prior to 2012” above. The #5 most popular post over all the years I’ve had the blog is one by Will from 2007, “Missing Values and Special Values: The Plague of Data Analysis”, one that I have always liked very much.

Best to all of you in 2013!

Wednesday, December 19, 2012

6 Reasons You Hired the Wrong Data Miner

As in any discipline, talent within the data mining community varies greatly.  Generally, business people and others who hire and manage technical specialists like data miners are not themselves technical experts.  This makes it difficult to evaluate the performance of data miners, so this posting is a short list of possible deficiencies in a data miner's performance.  Hopefully, this will spare some heartache in the coming year.  Merry Christmas!


1. The data miner has little or no programming skill.

Most work environments require someone to extract and prepare the data.  The more of this process which the data miner can accomplish, the less her dependence on others.  Even in ideal situations with prepared analytical data tables, the data miner who can program can wring more from the data than her counterpart who cannot (think: data transformations, re-coding, etc.).  Likewise, when her predictive model is to be deployed in a production system, it helps if the data miner can provide code as near to finished as possible.


2. The data miner is unable to communicate effectively with non-data miners.

Life is not all statistics: Data mining results must be communicated to colleagues with little or no background in math.  If other people do not understand the analysis, they will not appreciate its significance and are unlikely to act on it.  The data miner who can express himself clearly to a variety of audiences (internal customers, management, regulators, the press, etc.) is of greater value to the organization than his counterpart who cannot.  The data miner should receive questions eagerly.


3. The data miner never does anything new.

If the data miner always approaches new problems with the same solution, something is wrong.  She should be, at least occasionally, suggesting new techniques or ways of looking at problems.  This does not require that new ideas be fancy: Much useful work can be done with basic summary statistics.  It is the way they are applied that matters.


4. The data miner cannot explain what they've done.

Data mining is a subtle craft: there are many pitfalls, and important aspects of statistics and probability are counter-intuitive.  Nonetheless, the data miner who cannot provide at least a glimpse into the specifics of what he has done and why is not doing all he might for the organization.  Managers want to understand why so many observations are needed for analysis (after all, they pay for those observations), and the data miner should be able to provide some justification for his decisions.


5. The data miner does not establish the practical benefit of her work.

A data miner who cannot connect the numbers to reality is working in a vacuum and is not helping her manager (team, company, etc.) to assess or utilize her work product.  Likewise, there's a good chance that she is pursuing technical targets rather than practical ones.  Improving p-values, accuracy, AUC, etc. may or may not improve profit (retention, market share, etc.).


6. The data miner never challenges you.

The data miner has a unique view of the organization and its environment.  The data miner works on a landscape of data which few of his coworkers ever see, and he is less likely to be blinded by industry prejudices.  It is improbable that he will agree with his colleagues 100% of the time.  If the data miner never challenges assumptions (business practices, conclusions, etc.), then something is wrong.

Tuesday, November 06, 2012

Why Predictive Modelers Should be Suspicious of Statistical Tests (or why the Redskin Rule fools us)

Well, the danger is really not the statistical test per se; it is the interpretation of the statistical test.

Yesterday I tweeted (@deanabb) this fun factoid: "Redskins predict Romney wins POTUS #overfit. if Redskins lose home game before election => challenger wins (17/18) http://www.usatoday.com/story/gameon/2012/11/04/nfl-redskins-rule-romney/1681023/" I frankly had never heard of this "rule" before and found it quite striking. It even has its own Wikipedia page (http://en.wikipedia.org/wiki/Redskins_Rule).

Those of us in the predictive analytics or data mining community, and those of us who use statistical tests to help interpret small data, know that 17/18 is a hugely significant finding. This can frequently be good: statistical tests help us gain intuition about the value of relationships in data even when they aren't obvious.

In this case, an appropriate test is a chi-square test based on the two binary variables (1) did the Redskins win on the Sunday before the general election (call it the input or predictor variable) vs. (2) did the incumbent political party win the general election for President of the United States (POTUS).

According to the Redskins Rule, the answer is "yes" in 17 of 18 cases since 1940. Could this be by chance? If we apply the chi-square test to it, it sure does look significant! (chi-square = 14.4, p < 0.001). I like the decision tree representation of this that shows how significant it is (built using the Interactive CHAID tree in IBM Modeler on Redskin Rule data I put together here):


It's great data--9 Redskin wins, 9 Redskin losses, great chi-square statistic!
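For readers who want to check the arithmetic, here is a minimal sketch of the chi-square calculation in plain Python. It uses the Pearson statistic without a continuity correction, which is an assumption on my part about how the 14.4 was produced. The 2x2 counts are reconstructed from "9 wins, 9 losses, 17 of 18 correct"; putting the single miss in the "Redskins lost" row is also an assumption (the statistic is the same if the miss sits in the other row).

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: Redskins won / Redskins lost. Columns: incumbent party won / lost.
# Assumed layout: the one miss is placed in the "Redskins lost" row.
redskins_table = [[9, 0], [1, 8]]
chi2, p_value = chi_square_2x2(redskins_table)
print(f"chi-square = {chi2:.1f}, p = {p_value:.5f}")
```

The statistic only sees the four counts; nothing in it knows that the table was picked out of a very large pool of possible coincidences.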

OK, so it's obvious that this is just another spurious correlation in the spirit of all of those fun examples in history, such as the Super Bowl winning conference predicting whether the stock market would go up or down in the next year, at a stunning 20 of 22 correct. It was even the subject of academic papers!

The broader question (and concern) for predictive modelers is this: how do we recognize when the correlations we have uncovered in the data are merely spurious? This can happen especially when we don't have deep domain knowledge and therefore wouldn't necessarily identify variables or interactions as spurious. In examples such as the election or stock market predictions, no amount of "hold out" samples, cross-validation or bootstrap sampling would uncover the problem: it is in the data itself.

We need to think about this because inductive learning techniques search through hundreds, thousands, even millions of variables and combinations of variables. The phenomenon of "over searching" is a real danger with inductive algorithms as they search and search for patterns in the input space. Jensen and Cohen have a very nice and readable paper on this topic (PDF here). For trees, they recommend using the Bonferroni adjustment which does help penalize the combinatorics associated with splits. But our problem here goes far deeper than overfitting due to combinatorics.
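As a rough illustration of what a Bonferroni adjustment does, here is a minimal sketch: the raw p-value is multiplied by the number of tests examined and capped at 1. The p-value and the number of candidate splits below are hypothetical, and this is the textbook adjustment, not necessarily the exact procedure any particular tree package applies.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value: scale by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

raw_p = 0.0001            # hypothetical p-value of the best split found
n_candidate_splits = 500  # hypothetical number of splits the search examined
adjusted_p = bonferroni_adjust(raw_p, n_candidate_splits)
print(f"raw p = {raw_p}, adjusted p = {adjusted_p:.4f}")
```

A split that looks overwhelming on its own can end up borderline once the size of the search is taken into account.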

Of course the root problem with all of these spurious correlations is small data. Even if we have lots of data, what I'll call here the "illusion of big data", some algorithms make decisions based on smaller populations, like decision trees, rule induction and nearest neighbor (i.e., algorithms that build bottom-up). Anytime decisions are made from populations of 15, 20, 30 or even 50 examples, there is a danger that our search through hundreds of variables will turn out a spurious relationship.

What do we do about this? First, make sure you have enough data so that these small-data effects don't bite you. This is why I strongly recommend doing data audits and looking for categorical variables that contain levels with at most dozens of examples--these are potential overfitting categories.
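One simple form of this audit is to count the examples per level and flag the sparse ones. The sketch below is only an illustration: the field, its values, and the threshold of 50 are all hypothetical, and the right cutoff depends on your data and algorithm.

```python
from collections import Counter

def sparse_levels(values, min_count=50):
    """Return {level: count} for levels seen fewer than min_count times."""
    counts = Counter(values)
    return {level: n for level, n in counts.items() if n < min_count}

# Hypothetical categorical field: two common levels and one rare one.
state = ["CA"] * 400 + ["TX"] * 250 + ["WY"] * 12
print(sparse_levels(state))
```

Levels flagged this way are candidates for grouping into an "other" bucket or for closer review before modeling.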

Second, don't hold strongly to any patterns discovered in your data based solely on the data, especially if they are based on relatively small sample sizes. These must be validated with domain experts. Decision trees are notorious for allowing splits deep in the trees that are "statistically significant" but dangerous nevertheless because of small data sizes.

Third, the gist of your models has to make sense. If it doesn't, put on your "Freakonomics" hat and dig in to understand why the patterns were detected by the models. In our Redskin Rule, clearly this doesn't make sense causally, but sometimes the pattern picked up by the algorithm is just a surrogate for a real relationship. Nevertheless, I'm still curious to see if the Redskin Rule will prove to be correct once again. This year it predicts a Romney win because the Redskins lost and therefore the incumbent party (D) by the rule should lose.

UPDATE: by way of comparison...the chance of having 17/18 or 18/18 coin flips turn up heads (or tails--we're assuming a fair coin after all!) is 7 in 100,000 or 1 in 14,000. Put another way, if we examined 14K candidate variables unrelated to POTUS trends, the chances are that one of them would line up 17/18 or 18/18 of the time. Unusual? Yes. Impossible? No!
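The coin-flip figure can be checked with a few lines of Python. This is a minimal sketch that computes the binomial tail for one named side of a fair coin (17 or 18 heads out of 18); counting a streak in either direction would double the probability. The 14,000 candidate variables are the round number used above, not a count of anything real.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_tail = prob_at_least(17, 18)
print(f"P(17 or 18 of 18) = {p_tail:.6f}, about 1 in {1 / p_tail:,.0f}")

# Expected number of chance "hits" among unrelated candidate variables.
n_candidates = 14_000
expected_hits = n_candidates * p_tail
print(f"expected hits among {n_candidates:,} candidates: {expected_hits:.2f}")
```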

Tuesday, October 23, 2012

Data Preparation: Know Your Records!

Data preparation in data mining and predictive analytics (dare I also say Data Science?) rightfully focuses on how the fields in one's data should be represented so that modeling algorithms either will work properly or at least won't be misled by the data. These data preprocessing steps may involve filling missing values, reining in the effects of outliers, transforming fields so they better comply with algorithm assumptions, binning, and much more.

In recent weeks I've been reminded how important it is to know your records. I've heard this described in many ways, four of which are:

  • the unit of analysis
  • the level of aggregation
  • what a record represents
  • unique description of a record


  • For example, does each record represent a customer? If so, over their entire history or over a time period of interest? In web analytics, the time period of interest may be a single session, which, if true, means that an individual customer may appear in the modeling data multiple times, as if each visit or session were an independent event.

    Where this especially matters is when disparate data sources are combined. If one is joining a table of customerID/Session data with another table with each record representing a customerID, there's no problem. But if the second table represents customerID/store visit data, there will obviously be a many-to-many join resulting in a big mess.

    This is probably obvious to most readers of this blog. What isn't always obvious is when our assumptions about the data lead to unexpected results. What if we expect the unit of analysis to be customerID/Session but there are duplicates in the data? Or what if we had assumed customerID/Session data but it was in actuality customerID/Day data (where one's customers typically have one session per day, but could have a dozen)?

    The answer: just as we need to perform a data audit to identify potential problems with fields in the data, we need to perform record audits to uncover unexpected record-level anomalies. We've all had those data sources where the DBA swears up and down that there are no dups in the data, but when we group by customerID/Session, we find 1000 dups.

    So before the joins and after joins, we need to do those group by operations to find examples with unexpected numbers of matches.

    In conclusion: know what your records are supposed to represent, and verify, verify, verify. Otherwise, your models (which have no common sense) will exploit these issues in undesirable ways!
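The group-by checks described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two tables, their field names (customerID, session, store), and the helper functions are hypothetical stand-ins for whatever your own data and tools look like. The first helper finds keys that appear more than once; the second counts how many rows an inner join would produce, so a many-to-many blow-up shows up before you run the join.

```python
from collections import Counter

def duplicate_keys(records, key_fields):
    """Return {key: count} for every key that appears more than once."""
    counts = Counter(tuple(rec[f] for f in key_fields) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

def join_fanout(left, right, key_field):
    """Number of rows an inner join on key_field would produce."""
    left_counts = Counter(rec[key_field] for rec in left)
    right_counts = Counter(rec[key_field] for rec in right)
    return sum(n * right_counts[k] for k, n in left_counts.items())

# Hypothetical tables; field names are illustrative only.
sessions = [
    {"customerID": 1, "session": "a"},
    {"customerID": 1, "session": "b"},
    {"customerID": 1, "session": "b"},  # unexpected duplicate
    {"customerID": 2, "session": "c"},
]
store_visits = [
    {"customerID": 1, "store": "north"},
    {"customerID": 1, "store": "south"},
    {"customerID": 2, "store": "north"},
]

dups = duplicate_keys(sessions, ["customerID", "session"])
joined_rows = join_fanout(sessions, store_visits, "customerID")
print("duplicate keys:", dups)
print("rows before join:", len(sessions), "rows after join:", joined_rows)
```

If the row count after the join is larger than the count before it and you expected a one-to-one or many-to-one match, the unit of analysis is not what you assumed.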