Why should one go to a predictive analytics conference? What should one take home from a conference like Predictive Analytics World (PAW)? There are many reasons conferences are valuable including interacting with thought leaders and practitioners, seeing software and hardware tools (the exhibit hall), and learning principles of predictive analytics from talks and workshops. This post focuses on the talks, and in particular, case studies.
There is no quicker way to upgrade our capabilities than having someone else who has "been there" tell us how they succeeded in their development and implementation of predictive models. When I go to conferences, this is at the top of my list. In the best case studies I am able to see different way of looking at a problem than I had considered before, how the practitioner overcame obstacles, how their target variable was defined, what data was used in building the models, how the data was prepared, what figure of merit they used to judge a model's effectiveness, and much more.
Almost all case studies we see at conferences are success stories; we all love winners. Yes, we all know that we learn from mistakes and many case studies actually enumerate mistakes. But success sells and given time limitations in a 20-50 minute talk, few mistakes and dead-ends are usually described in the talks. And, as we used to say in when I was doing government contracting, one works like crazy on the research and then when the money runs out, one declares victory. Putting a more positive spin on the process, we do as well as we can with the resources we have, and if the final solution improves the current system, we are indeed successful.
But once we observe the successful approach, what can we really take home with us? There are three reasons we should be skeptical taking case studies and applying them directly to our own problems.
The first two reasons are straightforward. First, our data is different from the data used in the talk. Obviously. But it is likely to be different enough that one cannot not take the exact same approach to data preparation or target variable creation that one sees at a conference.
Second, our business is different. The way the question was framed and the way predictions can be used are likely to differ in our organization. If we are building models to predict Medicare fraud, they way the “suspicious” claim is processed and which data elements are available vary significantly for each provider (codes being just one example).
The third reason is more subtle and more difficult to overcome. In a fascinating New Yorker article entitled, "The Truth Wears Off: Is there something wrong with the scientific method?", author Jonah Lehrer describes an effect seen by many researchers over the past few decades. Findings in major studies, published in reputable journals, and showing statistically significant results have been difficult to replicate by the original researcher and by others. This is a huge problem because replicating results is what we do as predictive modeler: we assume that behavior in the past can and will be replicated in the future.
In one example, researcher Jonathan Schooler (who was originally at the University of Washington as a graduate student) “demonstrated that subjects shown a face and asked to describe it were much less likely to recognize the face when shown it later than those who had simply looked at it. Schooler called the phenomenon ‘verbal overshadowing’. The study turned him into an academic star."
A few years later, he tried to replicate the study didn’t succeed. In fact, he tried many times over the years and never succeeded. The effect he found at first waned each time he tried to replicate the study with additional data. "This was profoundly frustrating. It was as if nature gave me this great result and then tried to take it back.”
There have been a variety of potential explanations for the effect, including “regression to the mean”. This might very well be the case because even when we show statistically significant results defined by having a p value less than 0.05, there is still a chance that the effect found was not really there at all. Over thousands of studies, dozens find effects therefore that aren't really there.
Let's assume we are building models and there is actually no significant difference between responders and non-responders (but we don't know that). However, we work very hard to identify an effect, and eventually we find the effect on training and testing data. We publish. But the effect isn't there; we happened upon the effect just had good luck (which in the long run is actually bad luck!). Even if the chance of finding the effect by chance is 1 in 100, or 1 in 1000, if we experiment enough and search through enough variables, we may happen upon a seemingly good effect eventually. This process, called "over searching" by Jensen and Cohen (see "Multiple Comparisons in Induction Algorithms"), is a real danger.
So what do we do at conferences? We should take home ideas, principles, and approaches rather than recipes. It should spur us to try ideas we either hadn't yet tried or even thought about before.
(An earlier version of this post was first published in the Predictive Analytics Times February 2013 issue)
Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Thursday, February 14, 2013
Sunday, February 10, 2013
Using Geographic Data
Most organizations collect and maintain some type of geographic data, yet many ignore this data during analysis. Any business has some record of customer addresses, for instance, but this data is usually formatted in an awkward, non-numeric form. Geographic data can be very predictive, though, since behaviors being predicted often have some correlation to location.
So, how might one use geographic data? Possible answers depend on several factors, most importantly the volume and type of such data. A company serving a national market in the United States, for instance, will have customer shipping and billing addresses (not necessarily the same thing) for each customer (possibly for each transaction). These addresses normally come with a range of spatial granularities: street address, town, state, and associated ZIP Code (a 5-digit postal code).
Even at the largest level of aggregation, the state level, there may be over 50 distinct values (besides the 50 states, American addresses may be in Washington D.C. [technically not part of any state], or any of a number of other American territories, the most common of which is probably Puerto Rico). With 50 or so distinct values, significant data volume is needed to amass the observations needed to draw conclusions about each value. Given the best case scenario, in which all states exhibit equal observation counts, 1,000 observations breaks out into 50 categories of merely 20 observations each- not even enough to satisfy the old statistician's 30 observation rule of thumb. In data mining circles, we are accustomed to having much larger observation counts, but consider that the distribution of state values is never uniform in real data.
Using individual dummy variables to represent each state may be possible with especially large volumes. Possibly an "other" category covering the least frequent so many states will be needed. Another technique which I have found to work well is to replace the categorical state variable with a numeric variable representing a summary of the target variable, conditioned by state. In other words, all instances of "Virginia" are replaced by the average of the target variable for all Virginia cases, all instances of "New Jersey" are replaced by the average of the target variable for all New Jersey cases, and so on. This solution concentrates information about the target which comes from the state in a single variable, but makes interactions with other predictors more opaque. Ideally, such summaries are calculated on special hold-out set of data, used just for this purpose, so as to avoid over-fitting. Again, it may be necessary to lump the smallest so many states together as "other". While I have used American states in my example, it should not be hard for the reader to extend this idea to Canadian provinces, French départements, etc.
Most American states are large enough to provide robust summaries, but as a group they may not provide enough differentiation in the target variable. Changing the spatial scale implies a trade-off: Smaller geographic units exhibit worse summary variance, but improved geographic differentiation. American town names are not necessarily unique within a given state and similar names may be confused (Newtown, Pennsylvania is quite a distance from Newtown Square, Pennsylvania, for instance). In the United States, county names are unambiguous, and present finer spatial detail than states. County names do not, however, normally appear in addresses, but they are easily attached using ZIP Code/County tables easily found on-line. Another possible aggregation is the Section Code Facility, or "SCF", which is the first 3 digits of the ZIP Code.
In the American market, other types of spatial definitions which can be used include: Census Bureau definitions, telephone area codes and Metropolitan Statistical Areas ("MSAs") and related groupings defined by the U.S. Office of Management and Budget. The Census Bureau is a government agency which divides the entire country in to spatial units which vary in scale, down to very small areas (much smaller than ZIP Codes). MSAs are very popular with marketers. There are 366 MSAs at present, and they do not cover the entire land area of the United States, though they do cover about 85% of its population.
It is important to note that nearly all geographic entities change in size, shape and character over time. While existing American state and county boundaries almost never change any more, ZIP code boundaries and Census Bureau definitions, for instance, do change. Changing boundaries obviously complicates analysis, even though historic boundary definitions are often available. Even among entities whose boundaries do not change, radical changes in behavior may happen in geographically distinct ways. Consider that a model built before hurricane Katrina may no longer perform well in areas affected by the storm.
Also note that some geographic units, by definition, "respect" other definitions. American counties, for instance, only contain land from a single state. Others don't: the third-most populous MSA, Chicago-Joliet-Naperville, IL-IN-WI, for example, overlaps three different states.
Being creative when defining model inputs can be as helpful with geographic data as it is with more conventional data. In addition to the billing address itself, consider transformations such as: Has the billing address ever changed (1) or not (0)? How many times has the billing address changed? How often has the billing address changed (number of times changed divided by number of months the account has been open)? How far is the shipping address from the billing address? And so on...
Much more sophisticated use may be made of geographic data than has been described in this short posting. Software is available commercially which will determine drive time contours about locations, which would be useful, for instance when modeling retail store location revenue models. Additionally, there is an entire branch of statistics, called spatial statistics, which defines an entire class of analysis procedures specific to this sort of thing.
I encourage readers who have avoided geographic data to consider even simple mechanisms to include it in model construction. Opening up a new dimension in your analysis may provide significant returns.
So, how might one use geographic data? Possible answers depend on several factors, most importantly the volume and type of such data. A company serving a national market in the United States, for instance, will have customer shipping and billing addresses (not necessarily the same thing) for each customer (possibly for each transaction). These addresses normally come with a range of spatial granularities: street address, town, state, and associated ZIP Code (a 5-digit postal code).
Even at the largest level of aggregation, the state level, there may be over 50 distinct values (besides the 50 states, American addresses may be in Washington D.C. [technically not part of any state], or any of a number of other American territories, the most common of which is probably Puerto Rico). With 50 or so distinct values, significant data volume is needed to amass the observations needed to draw conclusions about each value. Given the best case scenario, in which all states exhibit equal observation counts, 1,000 observations breaks out into 50 categories of merely 20 observations each- not even enough to satisfy the old statistician's 30 observation rule of thumb. In data mining circles, we are accustomed to having much larger observation counts, but consider that the distribution of state values is never uniform in real data.
Using individual dummy variables to represent each state may be possible with especially large volumes. Possibly an "other" category covering the least frequent so many states will be needed. Another technique which I have found to work well is to replace the categorical state variable with a numeric variable representing a summary of the target variable, conditioned by state. In other words, all instances of "Virginia" are replaced by the average of the target variable for all Virginia cases, all instances of "New Jersey" are replaced by the average of the target variable for all New Jersey cases, and so on. This solution concentrates information about the target which comes from the state in a single variable, but makes interactions with other predictors more opaque. Ideally, such summaries are calculated on special hold-out set of data, used just for this purpose, so as to avoid over-fitting. Again, it may be necessary to lump the smallest so many states together as "other". While I have used American states in my example, it should not be hard for the reader to extend this idea to Canadian provinces, French départements, etc.
Most American states are large enough to provide robust summaries, but as a group they may not provide enough differentiation in the target variable. Changing the spatial scale implies a trade-off: Smaller geographic units exhibit worse summary variance, but improved geographic differentiation. American town names are not necessarily unique within a given state and similar names may be confused (Newtown, Pennsylvania is quite a distance from Newtown Square, Pennsylvania, for instance). In the United States, county names are unambiguous, and present finer spatial detail than states. County names do not, however, normally appear in addresses, but they are easily attached using ZIP Code/County tables easily found on-line. Another possible aggregation is the Section Code Facility, or "SCF", which is the first 3 digits of the ZIP Code.
In the American market, other types of spatial definitions which can be used include: Census Bureau definitions, telephone area codes and Metropolitan Statistical Areas ("MSAs") and related groupings defined by the U.S. Office of Management and Budget. The Census Bureau is a government agency which divides the entire country in to spatial units which vary in scale, down to very small areas (much smaller than ZIP Codes). MSAs are very popular with marketers. There are 366 MSAs at present, and they do not cover the entire land area of the United States, though they do cover about 85% of its population.
It is important to note that nearly all geographic entities change in size, shape and character over time. While existing American state and county boundaries almost never change any more, ZIP code boundaries and Census Bureau definitions, for instance, do change. Changing boundaries obviously complicates analysis, even though historic boundary definitions are often available. Even among entities whose boundaries do not change, radical changes in behavior may happen in geographically distinct ways. Consider that a model built before hurricane Katrina may no longer perform well in areas affected by the storm.
Also note that some geographic units, by definition, "respect" other definitions. American counties, for instance, only contain land from a single state. Others don't: the third-most populous MSA, Chicago-Joliet-Naperville, IL-IN-WI, for example, overlaps three different states.
Being creative when defining model inputs can be as helpful with geographic data as it is with more conventional data. In addition to the billing address itself, consider transformations such as: Has the billing address ever changed (1) or not (0)? How many times has the billing address changed? How often has the billing address changed (number of times changed divided by number of months the account has been open)? How far is the shipping address from the billing address? And so on...
Much more sophisticated use may be made of geographic data than has been described in this short posting. Software is available commercially which will determine drive time contours about locations, which would be useful, for instance when modeling retail store location revenue models. Additionally, there is an entire branch of statistics, called spatial statistics, which defines an entire class of analysis procedures specific to this sort of thing.
I encourage readers who have avoided geographic data to consider even simple mechanisms to include it in model construction. Opening up a new dimension in your analysis may provide significant returns.
Saturday, February 02, 2013
When Analysis Isn't the Answer
Data mining is an important tool whose benefits have been demonstrated in diverse fields, among business, government and non-profit organizations. Its application areas continue to grow, especially given the ever-shrinking cost of gathering and organizing data. Yet, there are problems for which data mining is wholly unsuited as a solution.
To understand when data mining is not applicable, it will be helpful to define precisely when it is applicable. Data mining (inferential statistics, predictive analytics, etc.) requires data stored in a machine format of sufficient volume, quality and relevance so as to permit the construction of predictive models which assist in real-world decision making.
Most of our time as data miners is spent worrying over the quality of the data and the process of turning data into models, however it is important to realize the usual context of data mining. Most organizations can perform basic decision making competently, and they have done so for thousands of years. Whether the base decision process is human judgment, a simple set of rules or a spreadsheet, much performance potential is already realized before data mining is applied. Consultants' marketing notwithstanding, data mining typically inhabits the margin of performance, where it tries to bring an extra "edge".
So, if the above two paragraphs describe conditions conducive to data mining success, what sorts of real-world situations defy data mining? The most obvious would be problems featuring data that is too small, too narrow, too noisy or of too little relevance to allow effective modeling. Organizations which have not maintained good records, which still rely on non-computer procedures and those with too little history are good examples. Even within very large organizations which collect and store enormous databases, there may be no relevant data for the problem at hand (for instance, when a new line of business is being opened, or new products introduced). It is surprising how often business people expect to extract value from a situation when they have failed to invest in appropriate data gathering.
Another large area with minimal data mining potential is organizations whose basic business process is so fundamentally broken that the usual decision making procedures have failed to do the usual "heavy lifting". Any of us can easily recall experiences in retail establishments whose operation was so flawed that it was obvious that the profit potential was not nearly being exploited. Data mining cannot fine tune a process which is so far gone. No amount of quantitative analysis will fix unkept shelves, weak product offering or poor employee behavior.
To understand when data mining is not applicable, it will be helpful to define precisely when it is applicable. Data mining (inferential statistics, predictive analytics, etc.) requires data stored in a machine format of sufficient volume, quality and relevance so as to permit the construction of predictive models which assist in real-world decision making.
Most of our time as data miners is spent worrying over the quality of the data and the process of turning data into models, however it is important to realize the usual context of data mining. Most organizations can perform basic decision making competently, and they have done so for thousands of years. Whether the base decision process is human judgment, a simple set of rules or a spreadsheet, much performance potential is already realized before data mining is applied. Consultants' marketing notwithstanding, data mining typically inhabits the margin of performance, where it tries to bring an extra "edge".
So, if the above two paragraphs describe conditions conducive to data mining success, what sorts of real-world situations defy data mining? The most obvious would be problems featuring data that is too small, too narrow, too noisy or of too little relevance to allow effective modeling. Organizations which have not maintained good records, which still rely on non-computer procedures and those with too little history are good examples. Even within very large organizations which collect and store enormous databases, there may be no relevant data for the problem at hand (for instance, when a new line of business is being opened, or new products introduced). It is surprising how often business people expect to extract value from a situation when they have failed to invest in appropriate data gathering.
Another large area with minimal data mining potential is organizations whose basic business process is so fundamentally broken that the usual decision making procedures have failed to do the usual "heavy lifting". Any of us can easily recall experiences in retail establishments whose operation was so flawed that it was obvious that the profit potential was not nearly being exploited. Data mining cannot fine tune a process which is so far gone. No amount of quantitative analysis will fix unkept shelves, weak product offering or poor employee behavior.