Monday, November 18, 2013

A Good Business Objective Beats a Good Algorithm

Predictive modeling competitions, once the arena of a few data mining conferences, have now become big business. Kaggle (kaggle.com) is perhaps the best-known forum for modeling competitions, built on a crowd-sourcing mentality: if more people try to solve a problem, the likelihood that someone will create an excellent solution to that problem increases.
The participants, and there have been tens of thousands of them since Kaggle's 2011 beginning, sometimes have no predictive modeling background and sometimes have extensive data science backgrounds. Some very clever algorithms and solutions have been developed, on some occasions with ground-breaking results.
One conclusion to draw from these competitions is that what we need in the predictive analytics space is more data scientists with different, innovative ideas for solving problems, and perhaps more in-depth training of data scientists so they can create these innovative solutions. After all, the Netflix Prize winner created a solution that was an ensemble of model ensembles, comprising hundreds of models (not a Kaggle competition, but one created by and for Netflix).
This idea of the importance of machine learning expertise was the topic of a Strata conference debate in 2012, tackling the question, “which is more important, domain expertise or machine learning expertise”, or the way it was phrased for the debate, “who should your first hire be: a domain expert or data scientist?”
The conclusion of the majority at the Strata conference was that machine learning is more important, but even the moderator, Mike Driscoll, concluded the following:
“Could you currently prepare your data for a Kaggle competition?  If so, then hire a machine learner.  If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.” (http://medriscoll.com/post/18784448854/the-data-science-debate-domain-expertise-or-machine)
The point is that defining the competition objectives and the data needed to solve the problem is critically important. Non-domain experts, the data scientists, cannot ever hope to understand the domain well enough to determine what the most effective question to answer would be, where to find the data to build a modeling data set, what the target variable should be, and how one should assess which model is best. These are specific to the business domain.
Even companies building the same kinds of models, let’s say customer retention or churn, will approach them differently depending on the kind of business, the lead time needed to act on potential churners, and the metrics for churn that relate to ROI for that company. I’ve built models for companies in the same domain area that took very different approaches; even though I had some domain experience from company 1, that didn’t translate into developing business objectives well for company 2.
It’s the partnership that matters. I often think of these partnerships within an organization as a three-legged stool, all three legs of which are needed for the modeling project to succeed: a business stakeholder who understands what business objectives matter to the company and how to articulate them, IT staff who know where the data is, what it means, and how to access it, and the analysts who know how to take the data and the business objectives and translate them into modeling objectives that address the business problem. Without all three, projects fail. We modelers could build the best models in the world that solve the wrong problem exceedingly well!
(first posted at http://www.predictiveanalyticsworld.com/patimes/a-good-business-objective-beats-a-good-algorithm/)

Saturday, September 07, 2013

On Data Mining Contests

Data mining contests have grown in popularity over the years, from the annual competitions at technical conferences to the continuous stream of events at sites like Kaggle. This has yielded several benefits: allowing many experts to work on difficult problems, giving novices a chance to work on real data, and showcasing successful solutions. These competitions have even garnered the attention of the mainstream press. While this author believes that the spread of these technical contests has been largely positive, it is worth noting their limitations.

Despite using real data, the problems, as formulated, are somewhat artificial. Questions of sampling and initial variable selection have already been decided, as have the evaluation function and the model's part in the ultimate solution. To some extent, these are necessary constraints, but they are constraints nonetheless. In real world data mining, all of these questions are the responsibility of the data miner and his or her clients, and they are not trivial considerations. In most larger organizations, the data is large enough that there is always "one more" table in the database which could be tapped for candidate predictors. Likewise, how the model might best be positioned as part of the total solution is not always obvious, especially in more complex problems. A minority of contests permit the use of outside data, but even this is somewhat unrealistic since real organizations have budgets for the purchase of outside data, such as demographic data to be appended to a customer population. I've yet to learn of anyone paying for outside variables to append to competition data, though.

Another issue is the large number of competitors which these contests attract. Though it is good to have many analysts take a crack at a problem, one must wonder about the statistical significance of having hundreds of statisticians test God-only-knows how many hypotheses against the same data. Further, the number of competitors and the similarity of top contestants' performance figures make selection of a single "winner" a dubious proposition.

Finally, it has become rather common for winners of these contests to construct solutions of vast proportions, typically ensembles of a gigantic number of base models. While such models may be feasible to deploy in some circumstances, they are far too computationally demanding to execute on many real databases quickly enough to be practical.

Some of these criticisms are probably unavoidable, especially the ones regarding the pre-selected, pre-digested contest data. Still, it'd be interesting to see future data mining competitions address at least some of these issues. For one thing, it might be interesting to see solution sizes (lines of SQL or C++ or something similar) limited to something which ordinary IT departments would be capable of executing during a typical overnight run. Averaging across an increased number of tasks might begin to improve the significance of differences among contestants' performances.

Wednesday, August 21, 2013

Beware Phantom Data

One of the perennial challenges facing the data analyst is missing values. A great deal has been written about the importance of identifying the source of missing values, the danger of overly simplistic solutions and, of course, the many and varied mechanisms for "filling them in" with synthetic data ("imputation").

Of the tremendous volume of material written on this subject, nearly all assumes that the analyst knows precisely which items are missing from the data. In reality, this is sometimes not the case. Relational databases and statistical software files, as a rule, have a special value to indicate "missing", though that does not mean that it is always used. Some file formats offer only indirect provision for missings, if any at all, and how software reacts to such missings varies.

Consider, too, the popular practice of using special values (such as -9999) to represent missing values. What could possibly go wrong? For one thing, the person writing the data may not consider whether the flag value might represent a legitimate value. Is it possible, for instance, to have an account balance of -9999 dollars (euros, etc.)? In my career, I have seen databases which used different flag values for each field (-99, -9999, -99999, etc.), making the writing of code against such data extremely tedious (and error-prone). I have also seen -9999 used to indicate one type of missing value, and -9998 to indicate another type of missing. When the hand-off of information from one person (system, process, etc.) to another is confused or incomplete, interpretation of the data becomes incorrect.
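As a small illustrative sketch (the file name, column names, and sentinel codes are hypothetical), one defensive habit is to convert known flag values into true missing values while keeping a record of which rows carried a flag:

```python
import numpy as np
import pandas as pd

# Hypothetical account table; column names and sentinel codes are assumptions
df = pd.read_csv("accounts.csv")

# Map each field to the flag value(s) its source system uses for "missing"
sentinels = {
    "balance": [-9999],
    "days_delinquent": [-99, -9999],          # different flags for different fields
    "last_payment_amount": [-9999, -9998],    # two flags, two kinds of "missing"
}

for col, codes in sentinels.items():
    # Remember which rows carried a flag before overwriting it
    df[col + "_was_flagged"] = df[col].isin(codes)
    # Replace the flag values with a true missing-value marker
    df[col] = df[col].replace(codes, np.nan)
```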

Another aspect of this problem is the precise definition given to fields, and their possible misinterpretation by data consumers (such as data miners). Imagine that a particular integer field is being used to record the number of times each customer has made a payment on their loan within the past 6 months. As customers begin their tenure, this variable starts with a value of zero. Suppose our model included this field as an independent variable. Presumably low-risk customers have higher values, while higher-risk customers have lower values. Without having missed any payments, early-lifecycle customers are penalized arbitrarily by the model. One could argue that this variable should be undefined (recorded as a database missing-value flag) until a customer has a full 6-month track record, but this is exactly the sort of conversation which very often fails to materialize in real organizations.

These are all instances of "phantom data": Items in the database which are missing values, but mistaken for real data. It shouldn't take much imagination on the reader's part to conjure similar problematic situations in his or her own field. The lesson is to look beyond the known missings for more subtle gaps in the data. Time spent investigating the nature of database systems, company procedures and so forth which generate data is insurance against being burned by serious misunderstanding of the data.

Tuesday, July 23, 2013

The NSA, Link Analysis and Fraud Detection

The recent leaks about the NSA’s use of data mining and predictive analytics have certainly raised awareness of our field and have resulted in hours of discussions with family, relatives, friends, and reporters about what predictive analytics can (and can’t) do with phone records, emails, chat messages, and other structured and unstructured data. Eric Siegel and I have been interviewed on multiple occasions to address this issue from a predictive analytics perspective, and in one case, in the same article: “What the NSA can’t do with your data (probably)”. Part of my goal in these conversations has been to bring back to reality many of the inflated expectations of what can be done with predictive analytics: predictive analytics is a powerful approach to finding patterns in data, but it isn’t magic, nor is it fool-proof.
First, let me be clear: I have no direct knowledge of the analytics the NSA is doing. I have worked on many fraud detection projects for the U.S. Government and the private sector, some including what I would describe as a “social networking” component, where the connections between parties are an important part of the risk factors.
The phone call metadata shows simple information about each call: origination, destination, date of the call, duration, and perhaps some geographic information about the origination and destination. One of the valuable aspects of the data is that connections can be made between origination and destination numbers, and as a result, one can build social networks for every origination phone number in the data. The U.S. had more than 326.4 million cell phone subscriptions as of December 2012, according to CTIA. The Pew Research survey found that individual cell phone users had on average 664 social connections (not all of which are cell connections). The number of links needed to build a U.S.-wide social map of phone call connections easily outstrips any possible visualization method, and therefore, without filtering connections and networks, these social maps would be useless. One of the factors working in our favor, if we are concerned with privacy issues related to this metadata, is therefore the sheer size of the network.
The networks of phone calls, I believe, are particularly useful in connecting high-risk individuals with others whom the NSA may not know beforehand are connected to the person of interest. In other words, a starting point is needed first and the social network is built from this starting point. If one has multiple starting points, one can also find linkages between networks even if the networks themselves don’t overlap significantly.
The strength of a link can include information such as number of calls, duration of calls, regularity of calls, most recent call, oldest call, and more. Think of these as a cell-phone version of RFM analysis. The networks can be pruned easily based on thresholds for these key features, simplifying the networks considerably.
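As an illustration only (I have no knowledge of the tools actually in use; the file, column names, and thresholds below are hypothetical), here is a sketch of aggregating call metadata into RFM-style link features, pruning weak links, and expanding a network outward from a known number of interest:

```python
import pandas as pd
import networkx as nx

# Hypothetical call-detail records: origin, destination, duration (minutes), call_date
calls = pd.read_csv("call_metadata.csv")
calls["call_date"] = pd.to_datetime(calls["call_date"])

# RFM-style link features: frequency, total duration, recency
links = (
    calls.groupby(["origin", "destination"])
         .agg(n_calls=("duration", "size"),
              total_minutes=("duration", "sum"),
              last_call=("call_date", "max"))
         .reset_index()
)

# Prune weak links using (arbitrary) thresholds to simplify the network
strong = links[(links["n_calls"] >= 5) & (links["total_minutes"] >= 30)]

# Build the pruned graph and expand outward from a known number of interest
G = nx.from_pandas_edgelist(strong, "origin", "destination",
                            edge_attr=["n_calls", "total_minutes", "last_call"])
seed = "555-0100"                                    # hypothetical starting point
if seed in G:
    neighborhood = nx.ego_graph(G, seed, radius=2)   # contacts of contacts
    print(len(neighborhood), "numbers within two hops of the seed")
```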
But even if the connections are made, this data is woefully incomplete on its own. First, there is no connection to the person who actually made the call, only the phone number and who it is registered to. Finding who made the calls requires more investigation. Second, it doesn’t necessarily connect all the phones an individual might use. If a person uses 5 or 6 cell phones, one doesn’t know that the same person is behind these phone numbers. Third, one certainly doesn’t know the intent or content of the call.
Given these limitations, what value is there in the network of calls? These networks are usually best used as lead-generation engines. Which other phone numbers in a network are connected to multiple high-risk individuals (but weren’t heretofore considered high risk)? Is the timeline of calls temporally correlated with other known events?
Analytics, and link analysis in particular, provide tremendously powerful techniques to identify new leads and remove unfruitful leads by finding connections unlikely to occur randomly.
NOTE: this article first appeared as an article in the PATimes: http://www.predictiveanalyticsworld.com/patimes/the-nsa-link-analysis-and-fraud-detection/

Tuesday, June 18, 2013

Big Data is Not Enough

Big data is the big buzzword in the world of analytics today. According to Google Trends, shown in the figure, searches for "big data" have been growing exponentially since 2010, though the growth perhaps is beginning to level off. Or take a look on amazon.com sometime for books with "big data" in the title: the publication dates, for the most part, are in 2012 or 2013.


But what's the key to unlock the big data door? In his interview with Eric Siegel on April 12, Ned Smith of Business News Daily (http://www.businessnewsdaily.com/4326-predictive-analytics-unlocks-big-data.html) starts with this apt insight: "Predictive Analytics is the 'Open Sesame' for the world of Big Data." Big data is what we have; predictive analytics (PA) is what we do with it.

Why is the data so big? Where does it come from? We who do PA usually think of doing predictive modeling on structured data pulled from a database, probably flattened into a single modeling table by a query so that the data is loadable into a software tool. We then clean the data, create features, and away we go with predictive modeling.

But according to a 2012 IBM study, "Analytics: The real-world use of big data", 88% of big data comes from transactions, 73% from log data, and significant proportions come from audio and video (still and motion). These are not structured data. Log files are often unstructured data containing nothing more than notes, sometimes freehand, sometimes machine-created, and therefore cannot be used without first preprocessing the data using text mining techniques. Those of us who have built models augmented with log files or other text data know how much work is involved in transforming text into useful attributes that can then be used in predictive models.

Even the most structured of the big data sources, transactional data, often are nothing more than dates, IDs and very simple information about the nature of the transaction (an amount, time period, and perhaps a label about the nature of the transaction).

Transactional data is rarely used directly; it is usually transformed into a form more useful for predictive modeling. For example, rather than building models where each row is a web page transaction, we transform the data so that each row is a person (the ID) and the fields are aggregations of that person’s history for as long as their cookie has persisted; the individual transactions have to be linked together and aggregated to be useful.
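As a sketch of that roll-up (the file and column names are hypothetical), each row of the raw transaction log becomes part of a per-person aggregate:

```python
import pandas as pd

# Hypothetical clickstream log: visitor_id, timestamp, page, amount
tx = pd.read_csv("web_transactions.csv")
tx["timestamp"] = pd.to_datetime(tx["timestamp"])

# Roll row-per-transaction data up to a row-per-person modeling table
per_person = (
    tx.groupby("visitor_id")
      .agg(n_visits=("timestamp", "size"),
           first_seen=("timestamp", "min"),
           last_seen=("timestamp", "max"),
           total_spend=("amount", "sum"),
           distinct_pages=("page", "nunique"))
      .reset_index()
)

# Cookie tenure in days, as one example of a derived, person-level feature
per_person["tenure_days"] = (per_person["last_seen"] - per_person["first_seen"]).dt.days
```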

The big data wave we are experiencing is therefore not directly helpful for improving predictive models; we first need to determine the unit of analysis needed to build useful models, i.e., what a record in the modeling data represents. The unit of analysis is determined by the question the model is intended to answer, or put another way, the decision the model is intended to improve within the organization. This is determined by defining the business objectives of the models, normally by a program manager or other domain expert in the organization, and not by the modeler.

The second step in building data for predictive modeling is creating the features to include as predictors for the models. How do we determine the features? I see three ways:
  1. the analyst can define the features based on his / her experience in the field, or do research to find what others have done in the field through google searching and academic articles. This assumes the analyst is, to some degree, a domain expert.
  2. the key features can be determined by other domain experts either handed down to the analyst or through interviews of domain experts by the analyst. This is better than a google search because the answers are focused on the organization’s perspective on solving the problem.
  3. the analyst can rely on algorithm-based feature creation. In this approach, the analyst merely provides the raw input fields and allows the algorithms to find the appropriate transformations of individual fields (easy) or multivariate combinations (more complex). Some algorithms and software implementations can do this quite effectively. This third approach I see advocated implicitly by data scientists in particular.
In reality, a combination of all three is usually used, and I recommend all three. But features based on domain expertise almost always provide the largest gains in model performance compared with algorithm-based (automatic) feature creation.
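As a small sketch of the third, algorithm-based approach (the input matrix here is synthetic and the transformations are just examples), the analyst supplies raw fields and lets automated transformations generate candidate features for the modeling algorithm to sift:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for four raw numeric input fields
X = np.random.rand(1000, 4)

# Simple univariate transforms plus pairwise interaction terms
X_log = np.log1p(X)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)          # original fields + pairwise products

X_expanded = np.hstack([X, X_log, X_interactions])
print(X_expanded.shape)                         # many candidate features from 4 raw inputs
```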

This is the new three-legged stool of predictive modeling: big data provides the information, augmenting what we have used in the past; domain experts provide the structure for how to set up the data for modeling, including what a record means and the key attributes that reflect information expected to be helpful in solving the problem; and predictive analytics provides the muscle to open the doors to what is hidden in the data. Those who take advantage of all three will be the winners in operationalizing analytics.

First posted at The Predictive Analytics Times

Friday, June 07, 2013

Dean Abbott Featured in "Popular Mechanics" On-Line Article

Our own Dean Abbott has been consulted for an on-line Popular Mechanics article, "Why the NSA Wants All That Verizon Metadata" (Jun-06-2013), by Glenn Derene. Since the initial report connecting the NSA with Verizon, details have emerged suggesting similar large-scale information-gathering by the American government from other telecommunication and Internet companies.

Some applications of data mining to law enforcement and anti-terrorism problems have clearly been fruitful (for detection of money laundering, for instance, which is one source of funding for criminal and terrorist organizations). On the other hand, direct application of these techniques to plucking out the bad guys from large numbers of innocents strikes this author as dubious, and has long been criticized by experts, such as Bruce Schneier. What's plain is that people in democratic societies must remain vigilant of the balance of information and power granted to their governments, lest the medicine become worse than the disease.

Friday, April 26, 2013

Math and Predictive Analytics - A Personal Account

Last week I taught a workshop at Predictive Analytics World entitled Supercharging Prediction: Hands-On with Ensemble Models. The workshop was intended to introduce predictive modelers to the concept of ensembles through a combination of lecture, providing an overview of model ensembles, and hands-on exercises building ensembles using Salford Systems SPM v7.0 (Salford Systems sponsored the workshop).

This morning, Heather Hinman, a Marketing Communications Manager at Salford Systems, posted comments on attending that workshop at the Salford Systems blog. Two comments were particularly interesting, especially their implications vis-à-vis my last blog post on math and predictive analytics:

I will admit I was intimidated at first to be participating in a predictive modeling workshop as I do not have a background in statistics, and only have basic training on decision tree tools by Salford Systems' team of in-house experts. Despite my basic knowledge of decision trees, I was thrilled that I was able to follow along with ease and understanding when learning about tree ensembles and modern hybrid modeling approaches. Marketing folk building predictive models? Yes, we can!
and
Now back at the office in San Diego, along with my usual responsibilities, I feel confident in my ability to build predictive models and gain insights into the data at hand to achieve the email marketing and online campaign goals for our communication efforts!  
In the post, Heather also outlines some of the principles she learned and how she used them to build the predictive models in the workshop.

The point is this: if one uses good software that uses solid principles for building predictive models, and one understands key principles of building predictive models, someone without a mathematics background can build good, profitable models.



Monday, April 01, 2013

Do Predictive Modelers Need to Know Math?

(Note: this post was first published in the March 2013 Edition of the Predictive Analytics Times)
Predictive analytics is just a bunch of math, isn’t it? After all, algorithms in the form of matrix algebra, summations, integrals, multiplies and adds are the core of what predictive modeling algorithms do. Even rule-based approaches need math to compute how good the if-then-else rules are.

I was participating in a predictive analytics course recently and the question a participant asked at the end of two days of instruction was this: “it’s been a long time since I’ve had to do this kind of math and I’m a bit rusty. Is there a book that would help me learn the techniques without the math?”

The question about math was interesting. But do we need to know the math to build models well? Anyone can build a bad model, but to build a good model, don’t we need to know what the algorithms are doing? The answer, of course, depends on the role of the analyst. I contend, however, that for most predictive analytics projects, the answer is “no”.

Let’s consider building decision tree models. What options does one need to set to build good trees? Here is a short list of common knobs that can be set in most predictive analytics software packages:
  1. Splitting metric (CART-style trees, C5-style trees, CHAID-style trees, etc.)
  2. Terminal node minimum size
  3. Parent node minimum size
  4. Maximum tree depth
  5. Pruning options (standard error, chi-square test p-value threshold, etc.)
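For concreteness, here is roughly how these knobs appear in one open-source implementation, scikit-learn's decision tree (other packages expose similar options under different names; scikit-learn limits the splitting metric to Gini or entropy and uses cost-complexity pruning):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",     # 1. splitting metric (Gini or entropy here)
    min_samples_leaf=50,     # 2. terminal node minimum size
    min_samples_split=100,   # 3. parent node minimum size
    max_depth=6,             # 4. maximum tree depth
    ccp_alpha=0.001,         # 5. pruning (cost-complexity, CART-style)
)
```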

The most mathematical of these knobs is the splitting metric. CART-styled trees use the Gini Index, C5 trees use Entropy (information gain), and CHAID style trees use the chi-square test as the splitting criterion. A book I consider the best technical book on data mining and statistical learning methods, “The Elements of Statistical Learning”, has this description of the splitting criteria for decision trees, including the Gini Index and Entropy:
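In standard form, with $\hat{p}_{mk}$ denoting the proportion of class $k$ observations in node $m$, these two criteria are:

$$
\text{Gini index: } \sum_{k \neq k'} \hat{p}_{mk}\,\hat{p}_{mk'} \;=\; \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}),
\qquad
\text{Cross-entropy: } -\sum_{k=1}^{K} \hat{p}_{mk}\,\log \hat{p}_{mk}
$$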



To a mathematician, these make sense. But without a mathematics background, these equations will be at best opaque and at worst incomprehensible. (And these are not very complicated. Technical textbooks and papers describing machine learning algorithms can be quite difficult even for more seasoned, but out-of-practice mathematicians to understand).

As someone with a mathematics background who is also a predictive modeler, I must say that the actual splitting equations almost never matter to me. Gini and Entropy often produce the same splits, or at least similar splits. CHAID differs more, especially in how it creates multi-way splits. But even here, the biggest difference for me is not the math, but simply that they use different tests for determining "good" splits.

There are, however, very important reasons for someone on the team to understand the mathematics or at least the way these algorithms work qualitatively. First and foremost, understanding the algorithms helps us uncover why models go wrong. Models can be biased toward splitting on particular variables or even particular records. In some cases, it may appear that the models are performing well but in actuality they are brittle. Understanding the math can help remind us that this may happen and why.

The fact that linear regression uses a quadratic cost function tells us that outliers affect overall error disproportionately. Understanding how decision trees measure differences between the parent population and sub-populations informs us why a high-cardinality variable may be showing up at the top of our tree, and why additional penalties may be in order to reduce this bias. Seeing the computation of information gain (derived from Entropy) tells us that binary classification with a small target value proportion (such as having 5% 1s) often won't generate any splits at all.
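A small numerical illustration of that last point, using hypothetical class proportions for a 5%-positive parent node and one candidate split:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a binary node with positive-class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# Hypothetical parent node: 5% positives
parent_h = entropy(0.05)

# Hypothetical split: 80% of records go left with 4% positives,
# 20% go right with 9% positives (weighted average is still 5%)
children_h = 0.8 * entropy(0.04) + 0.2 * entropy(0.09)

print(f"parent entropy:   {parent_h:.4f} bits")
print(f"children entropy: {children_h:.4f} bits")
print(f"information gain: {parent_h - children_h:.4f} bits")  # tiny gain (~0.005 bits)
```

Even a split that noticeably shifts the positive rate in the children barely moves the entropy, so the algorithm may see too little gain to bother splitting at all.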

The answer to the question of whether predictive modelers need to know math is this: no, they don’t need to understand the mathematical notation, but neither should they ignore the mathematics. Instead, we all need to understand the effects of the mathematics on the algorithms we use. “Those who ignore statistics are condemned to reinvent it,” warns Bradley Efron of Stanford University. The same applies to mathematics.

Thursday, February 14, 2013

What To Take Home from Your Next Predictive Analytics Conference

Why should one go to a predictive analytics conference? What should one take home from a conference like Predictive Analytics World (PAW)? There are many reasons conferences are valuable including interacting with thought leaders and practitioners, seeing software and hardware tools (the exhibit hall), and learning principles of predictive analytics from talks and workshops. This post focuses on the talks, and in particular, case studies.

There is no quicker way to upgrade our capabilities than having someone else who has "been there" tell us how they succeeded in their development and implementation of predictive models. When I go to conferences, this is at the top of my list. In the best case studies I am able to see different ways of looking at a problem than I had considered before, how the practitioner overcame obstacles, how their target variable was defined, what data was used in building the models, how the data was prepared, what figure of merit they used to judge a model's effectiveness, and much more.

Almost all case studies we see at conferences are success stories; we all love winners. Yes, we all know that we learn from mistakes, and many case studies actually enumerate mistakes. But success sells, and given the time limitations of a 20-50 minute talk, few mistakes and dead-ends are usually described. And, as we used to say when I was doing government contracting, one works like crazy on the research and then, when the money runs out, one declares victory. Putting a more positive spin on the process, we do as well as we can with the resources we have, and if the final solution improves the current system, we are indeed successful.

But once we observe the successful approach, what can we really take home with us? There are three reasons we should be skeptical about taking case studies and applying them directly to our own problems.

The first two reasons are straightforward. First, our data is different from the data used in the talk. Obviously. But it is likely to be different enough that one cannot take the exact same approach to data preparation or target variable creation that one sees at a conference.

Second, our business is different. The way the question was framed and the way predictions can be used are likely to differ in our organization. If we are building models to predict Medicare fraud, the way the “suspicious” claim is processed and which data elements are available vary significantly for each provider (codes being just one example).

The third reason is more subtle and more difficult to overcome. In a fascinating New Yorker article entitled "The Truth Wears Off: Is there something wrong with the scientific method?", author Jonah Lehrer describes an effect seen by many researchers over the past few decades. Findings in major studies, published in reputable journals and showing statistically significant results, have been difficult to replicate by the original researchers and by others. This is a huge problem because replicating results is what we do as predictive modelers: we assume that behavior in the past can and will be replicated in the future.

In one example, researcher Jonathan Schooler (who was originally at the University of Washington as a graduate student) “demonstrated that subjects shown a face and asked to describe it were much less likely to recognize the face when shown it later than those who had simply looked at it. Schooler called the phenomenon ‘verbal overshadowing’. The study turned him into an academic star."

A few years later, he tried to replicate the study and didn’t succeed. In fact, he tried many times over the years and never succeeded. The effect he found at first waned each time he tried to replicate the study with additional data. "This was profoundly frustrating. It was as if nature gave me this great result and then tried to take it back.” There have been a variety of potential explanations for the effect, including “regression to the mean”. This might very well be the case, because even when we show statistically significant results, defined by having a p-value less than 0.05, there is still a chance that the effect found was not really there at all. Over thousands of studies, dozens will therefore find effects that aren't really there.

Let's assume we are building models and there is actually no significant difference between responders and non-responders (but we don't know that). However, we work very hard to identify an effect, and eventually we find one on training and testing data. We publish. But the effect isn't there; we happened upon it by sheer good luck (which in the long run is actually bad luck!). Even if the chance of finding the effect by chance is 1 in 100, or 1 in 1000, if we experiment enough and search through enough variables, we may happen upon a seemingly good effect eventually. This process, called "oversearching" by Jensen and Cohen (see "Multiple Comparisons in Induction Algorithms"), is a real danger.

So what do we do at conferences? We should take home ideas, principles, and approaches rather than recipes. What we see should spur us to try ideas we either hadn't yet tried or hadn't even thought about before.

(An earlier version of this post was first published in the Predictive Analytics Times February 2013 issue)

Sunday, February 10, 2013

Using Geographic Data

Most organizations collect and maintain some type of geographic data, yet many ignore this data during analysis. Any business has some record of customer addresses, for instance, but this data is usually formatted in an awkward, non-numeric form. Geographic data can be very predictive, though, since behaviors being predicted often have some correlation to location.

So, how might one use geographic data? Possible answers depend on several factors, most importantly the volume and type of such data. A company serving a national market in the United States, for instance, will have customer shipping and billing addresses (not necessarily the same thing) for each customer (possibly for each transaction). These addresses normally come with a range of spatial granularities: street address, town, state, and associated ZIP Code (a 5-digit postal code).

Even at the largest level of aggregation, the state level, there may be over 50 distinct values (besides the 50 states, American addresses may be in Washington D.C. [technically not part of any state], or any of a number of other American territories, the most common of which is probably Puerto Rico). With 50 or so distinct values, significant data volume is needed to amass the observations needed to draw conclusions about each value. In the best-case scenario, in which all states exhibit equal observation counts, 1,000 observations break out into 50 categories of merely 20 observations each, not even enough to satisfy the old statistician's 30-observation rule of thumb. In data mining circles, we are accustomed to having much larger observation counts, but consider that the distribution of state values is never uniform in real data.

Using individual dummy variables to represent each state may be possible with especially large volumes. Possibly an "other" category covering however many of the least frequent states will be needed. Another technique which I have found to work well is to replace the categorical state variable with a numeric variable representing a summary of the target variable, conditioned by state. In other words, all instances of "Virginia" are replaced by the average of the target variable for all Virginia cases, all instances of "New Jersey" are replaced by the average of the target variable for all New Jersey cases, and so on. This solution concentrates information about the target which comes from the state in a single variable, but makes interactions with other predictors more opaque. Ideally, such summaries are calculated on a special hold-out set of data, used just for this purpose, so as to avoid over-fitting. Again, it may be necessary to lump the smallest states together as "other". While I have used American states in my example, it should not be hard for the reader to extend this idea to Canadian provinces, French départements, etc.
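A minimal sketch of that summary encoding, with hypothetical file and column names (customers.csv, state, and a binary response target), computing the per-state averages on a hold-out portion of the data:

```python
import pandas as pd

# Hypothetical customer table with a categorical 'state' and binary 'response' target
df = pd.read_csv("customers.csv")

# Hold out a slice of data used only to compute the per-state summaries,
# so the encoding is not estimated from the same rows used to fit the model
encode_df = df.sample(frac=0.3, random_state=42)
model_df = df.drop(encode_df.index)

# Average response by state, computed on the hold-out rows only
state_means = encode_df.groupby("state")["response"].mean()
overall_mean = encode_df["response"].mean()

# Replace the categorical state with its target summary; rare or unseen states
# fall back to the overall average (an "other" category, in effect)
model_df["state_response_rate"] = model_df["state"].map(state_means).fillna(overall_mean)
```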

Most American states are large enough to provide robust summaries, but as a group they may not provide enough differentiation in the target variable. Changing the spatial scale implies a trade-off: smaller geographic units exhibit worse summary variance, but improved geographic differentiation. American town names are not necessarily unique within a given state, and similar names may be confused (Newtown, Pennsylvania is quite a distance from Newtown Square, Pennsylvania, for instance). In the United States, county names are unambiguous, and present finer spatial detail than states. County names do not, however, normally appear in addresses, but they are easily attached using ZIP Code/county tables readily found on-line. Another possible aggregation is the Sectional Center Facility, or "SCF", which is the first 3 digits of the ZIP Code.

In the American market, other types of spatial definitions which can be used include Census Bureau definitions, telephone area codes, and Metropolitan Statistical Areas ("MSAs") and related groupings defined by the U.S. Office of Management and Budget. The Census Bureau is a government agency which divides the entire country into spatial units which vary in scale, down to very small areas (much smaller than ZIP Codes). MSAs are very popular with marketers. There are 366 MSAs at present, and they do not cover the entire land area of the United States, though they do cover about 85% of its population.

It is important to note that nearly all geographic entities change in size, shape and character over time. While existing American state and county boundaries almost never change any more, ZIP code boundaries and Census Bureau definitions, for instance, do change. Changing boundaries obviously complicates analysis, even though historic boundary definitions are often available. Even among entities whose boundaries do not change, radical changes in behavior may happen in geographically distinct ways. Consider that a model built before hurricane Katrina may no longer perform well in areas affected by the storm.

Also note that some geographic units, by definition, "respect" other definitions. American counties, for instance, only contain land from a single state. Others don't: the third-most populous MSA, Chicago-Joliet-Naperville, IL-IN-WI, for example, overlaps three different states.

Being creative when defining model inputs can be as helpful with geographic data as it is with more conventional data. In addition to the billing address itself, consider transformations such as: Has the billing address ever changed (1) or not (0)? How many times has the billing address changed? How often has the billing address changed (number of times changed divided by number of months the account has been open)? How far is the shipping address from the billing address? And so on...
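A minimal sketch of a few of these derived inputs, assuming hypothetical tables (address_history.csv, accounts.csv) and that the addresses have already been geocoded to latitude/longitude:

```python
import numpy as np
import pandas as pd

# Hypothetical tables: one row per billing address ever on file, plus current account info
hist = pd.read_csv("address_history.csv")   # account_id, billing_address
acct = pd.read_csv("accounts.csv")          # account_id, months_open,
                                            # ship_lat, ship_lon, bill_lat, bill_lon

# Address-change features: ever changed, how many times, how often
n_addresses = hist.groupby("account_id")["billing_address"].nunique()
acct["n_billing_changes"] = acct["account_id"].map(n_addresses).fillna(1).sub(1)
acct["billing_ever_changed"] = (acct["n_billing_changes"] > 0).astype(int)
acct["billing_changes_per_month"] = acct["n_billing_changes"] / acct["months_open"]

# Great-circle distance (km) between geocoded shipping and billing addresses
def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

acct["ship_to_bill_km"] = haversine_km(acct["ship_lat"], acct["ship_lon"],
                                       acct["bill_lat"], acct["bill_lon"])
```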

Much more sophisticated use may be made of geographic data than has been described in this short posting. Software is available commercially which will determine drive-time contours about locations, which would be useful, for instance, when modeling revenue for retail store locations. Additionally, there is an entire branch of statistics, called spatial statistics, which defines a whole class of analysis procedures specific to this sort of data.

I encourage readers who have avoided geographic data to consider even simple mechanisms to include it in model construction. Opening up a new dimension in your analysis may provide significant returns.





Saturday, February 02, 2013

When Analysis Isn't the Answer

Data mining is an important tool whose benefits have been demonstrated in diverse fields, among business, government and non-profit organizations. Its application areas continue to grow, especially given the ever-shrinking cost of gathering and organizing data. Yet, there are problems for which data mining is wholly unsuited as a solution.

To understand when data mining is not applicable, it will be helpful to define precisely when it is applicable. Data mining (inferential statistics, predictive analytics, etc.) requires data stored in a machine format of sufficient volume, quality and relevance so as to permit the construction of predictive models which assist in real-world decision making.

Most of our time as data miners is spent worrying over the quality of the data and the process of turning data into models; however, it is important to realize the usual context of data mining. Most organizations can perform basic decision making competently, and they have done so for thousands of years. Whether the base decision process is human judgment, a simple set of rules or a spreadsheet, much performance potential is already realized before data mining is applied. Consultants' marketing notwithstanding, data mining typically inhabits the margin of performance, where it tries to bring an extra "edge".

So, if the above two paragraphs describe conditions conducive to data mining success, what sorts of real-world situations defy data mining? The most obvious would be problems featuring data that is too small, too narrow, too noisy or of too little relevance to allow effective modeling. Organizations which have not maintained good records, which still rely on non-computer procedures and those with too little history are good examples. Even within very large organizations which collect and store enormous databases, there may be no relevant data for the problem at hand (for instance, when a new line of business is being opened, or new products introduced). It is surprising how often business people expect to extract value from a situation when they have failed to invest in appropriate data gathering.

Another large area with minimal data mining potential is organizations whose basic business process is so fundamentally broken that the usual decision-making procedures have failed to do the usual "heavy lifting". Any of us can easily recall experiences in retail establishments whose operation was so flawed that it was obvious the profit potential was not nearly being exploited. Data mining cannot fine-tune a process which is so far gone. No amount of quantitative analysis will fix unkept shelves, a weak product offering or poor employee behavior.

Wednesday, January 16, 2013

Three Ways to Get Your Predictive Models Deployed


We all know that, given reasonable data, a good predictive modeler can build a model that works well and helps make better decisions than what is currently used in your organization (at least in our own minds). Newer data, sophisticated algorithms, and a seasoned analyst are all working in our favor when we build these models, and if success were measured by accuracy (as it is in most data mining competitions), we're in great shape. Yes, there are always gotchas and glitches along the way. But when my deliverable is only slideware, even if the modeling is hard, I'm confident of being able to declare victory at the end.

However, the reality is that there is much more to the transition from cool model to actual deployment than a nice slide deck and a paper accepted at one's favorite predictive analytics, data mining or big data conference. In these venues, the winning models are those that are "accurate" (more on that later) and have used creative analysis techniques to find the solution; we won't submit a paper when we only had to press the "go" button and have the data mining software give us a great solution!

For me, the gold standard is deployment. If the model gets used and improves the decisions an organization makes, I've succeeded. Three ways to increase the likelihood your models are deployed are:

1) Make sure the model stakeholder designs deployment into the project from the beginning

The model stakeholder is the individual, usually a manager, who is the advocate of predictive models to decision-makers. It is possible that a senior-level modeler can do this task, but that person must be able to switch-hit: he or she must be able to speak the language of management and be able to talk technical detail with the analysts. This may require more than one trusted person: the manager, who is responsible and makes the ultimate decisions about the models, and the lead modeler, who is responsible for the technical aspects of the model. It is more than "talking the talk" and knowing buzzwords in both realms; the person or persons must truly be "one of" both groups.

For those who have followed my blog posts and conference talks, you know I am a big advocate of the CRISP-DM process model (or equivalent methodologies, which seem to be endless). I've referred to CRISP-DM often, including on topics related to what data miners need to learn and Defining the Target Variable, just as two examples.

The stakeholder must not only understand the business objectives of the model (Business Understanding in CRISP-DM), but must also be present during the discussions that take place about which models will be built. It is essential that reasonable expectations are put into place from the beginning, including what a good model will "look like" (accuracy and interpretability) and how the final model will be deployed.

I've seen far too many projects die or become inconsequential because either the wrong objectives were used in building the models, meaning the models were operationally useless, or because the deployment of the models was not considered, meaning again that the models were operationally useless. As an example, on one project, the model was assumed to be able to be run within a rules engine, but the models that were built were not rules at all, but were complex non-linear models that could not be translated into rules. The problem obviously could have been avoided had this disconnect been verbalized early in the modeling process.

2) Make sure modelers understand the purpose of the models

The modelers must know how the models will be used and what metrics should be used to judge model performance. A good summary of typical error metrics used by modelers is found here. However, for most of the models I have deployed in customer acquisition, retention, and risk modeling, the treatment based on the model is never applied to the entire population (we don't mail everyone, just a subset). So the metrics that make the most sense are often ones like "lift after the top decile", maximum cumulative net revenue, the top 1000 scores to be investigated, etc. I've actually seen negative correlations between the ranking of models based on global metrics (like classification error or R^2) and the ranking based on subset-selection metrics, such as the top 1000 scores; very different models may be deployed depending on the metric one uses to assess them. If modelers aren't aware of the metric to be used, the wrong model can be selected, even one that does worse than the current approach.
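As a sketch of what such subset-based metrics look like (the function names and depth/N choices here are my own, assuming a hold-out set with actual outcomes and model scores):

```python
import pandas as pd

def cumulative_lift(y_true, scores, depth=0.10):
    """Response rate among the top `depth` fraction of scores,
    divided by the overall response rate (lift at a mailing depth)."""
    df = pd.DataFrame({"y": y_true, "score": scores}).sort_values("score", ascending=False)
    n_top = max(1, int(len(df) * depth))
    return df["y"].head(n_top).mean() / df["y"].mean()

def responders_in_top_n(y_true, scores, n=1000):
    """Count of actual responders among the N highest-scoring records."""
    df = pd.DataFrame({"y": y_true, "score": scores}).sort_values("score", ascending=False)
    return int(df["y"].head(n).sum())
```

Ranking two candidate models with functions like these on the same hold-out set can easily reverse a ranking based on overall classification error or R^2.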

Second, if the modelers don't understand how the models will be deployed operationally, they may find a fantastic model, one that maximizes the right metric, but is useless. The Netflix Prize is a great example: the final winning model was accurate but far too complex to be used. Netflix instead extracted key pieces of the models to operationalize. I've had customers stipulate to me that "no more than 10 variables can be included in the final model". If modelers aren't aware of specific timelines or implementation constraints, a great but useless model can be the result.

3) Make sure the model stakeholder understands what the models can and can't do

In the effort to get models deployed, I've seen models elevated to a status they don't deserve, most often by exaggerating their accuracy and expected performance once in operation. I understand why modelers may do this: they have a direct stake in what they did. But the manager must be more skeptical and conservative.

One of the most successful colleagues I've ever worked with used to assess model performance on held-out data using the metric we had been given (maximum depth one could mail to and still achieve the pre-determined response rate). But then he always backed off what was reported to his managers by about 10% to give some wiggle room. Why? Because even in our best efforts, there is still a danger that the data environment after the model is deployed will differ from that used in building the models, thus reducing the effectiveness of the models.

A second problem for the model stakeholder is communicating an interpretation of the models to decision-makers. I've had to do this exercise several times in the past few months, and it is always eye-opening when I try to explain the patterns a model is finding when the model is itself complex. We can describe overall trends ("on average", more of X increases the model score) and we can also describe specific patterns (when observable fields X and Y are both high, the model score is high). Both are needed to communicate what the models do, but they have to connect with what a decision-maker understands about the problem. If it doesn't make sense, the model won't be used. If it is too obvious, the model isn't worth using.

The ideal model for me is one where the decision-maker nods knowingly at the "on average" effects (these should usually be obvious). Then, once you throw in some specific patterns, he or she should scrunch his/her eyes, think a bit, then smile as the implications of the pattern dawn on them, because that pattern really does make sense (but was previously not considered).

As predictive modelers, we know that absolutes are hard to come by, so even if these three principles are adhered to, other factors can sabotage the deployment of a model. Nevertheless, in general, these steps will increase the likelihood that models are deployed. In all three steps, communication is the key to ensuring the model built addresses the right business objective, the right scoring metric, and can be deployed operationally.

NOTE: this post was originally posted for the Predictive Analytics Times at http://www.predictiveanalyticsworld.com/patimes/january13/ 

Friday, January 04, 2013

Top Posts in 2012

For the second consecutive year, a quick look back at posts from the prior year.

For posts published in 2012, in order of popularity:
  1. Target, Pregnancy, and Predictive Analytics, Part I
  2. Target, Pregnancy, and Predictive Analytics, Part II
  3. Predictive Analytics World Had the Target Story First
  4. Why Defining the Target Variable in Predictive Analytics is Critical
  5. Dilbert, Database Marketing, and Spam
I’m also adding #6 because Will’s post in December did very well, but of course it has had only one month to accumulate views.
  6. 6 Reasons You Hired the Wrong Data Miner
From posts prior to 2012, in order of popularity for 2012:
  1. What Do Data Miners Need to Learn (June 2011)
  2. Free and Inexpensive Data Mining Software (November 2006) This post needs to be updated!
  3. Why Normalization Matters for K Means (April 2009) It always amazes me that this post persists as one of the most popular, but nearly ¼ of the visits used the search term “K Means Noisy Data”
  4. Data Mining Data Sets (April 2008) This post also needs to be updated
  5. Business Analytics vs. Business Intelligence (December 2009)
One final note: When I look back at visits since the start of this blog, 4 of the top 5 posts are the top 4 “prior to 2012” above. The #5 most popular post over all the years I’ve had the blog is one by Will from 2007, “Missing Values and Special Values: The Plague of Data Analysis”, one that I have always liked very much.

Best to all of you in 2013!