Wednesday, December 19, 2012

6 Reasons You Hired the Wrong Data Miner

As in any discipline, talent within the data mining community varies greatly.  Generally, business people and others who hire and manage technical specialists like data miners are not themselves technical experts.  This makes it difficult to evaluate the performance of data miners, so this posting is a short list of possible deficiencies in a data miner's performance.  Hopefully, this will spare some heartache in the coming year.  Merry Christmas!

1. The data miner has little or no programming skill.

Most work environments require someone to extract and prepare the data.  The more of this process which the data miner can accomplish, the less her dependence on others.  Even in ideal situations with prepared analytical data tables, the data miner who can program can wring more from the data than her counterpart who cannot (think: data transformations, re-coding, etc.).  Likewise, when her predictive model is to be deployed in a production system, it helps if the data miner can provide code as near to finished as possible.

2. The data miner is unable to communicate effectively with non-data miners.

Life is not all statistics: Data mining results must be communicated to colleagues with little or no background in math.  If other people do not understand the analysis, they will not appreciate its significance and are unlikely to act on it.  The data miner who can express himself clearly to a variety of audiences (internal customers, management, regulators, the press, etc.) is of greater value to the organization than his counterpart who cannot.  The data miner should receive questions eagerly.

3. The data miner never does anything new.

If the data miner always approaches new problems with the same solution, something is wrong.  She should be, at least occasionally, suggesting new techniques or ways of looking at problems.  This does not require that new ideas be fancy: Much useful work can be done with basic summary statistics.  It is the way they are applied that matters.

4. The data miner cannot explain what they've done.

Data mining is a subtle craft: there are many pitfalls, and important aspects of statistics and probability are counter-intuitive.  Nonetheless, the data miner who cannot provide at least a glimpse into the specifics of what he has done and why is not doing all he might for the organization.  Managers want to understand why so many observations are needed for analysis (after all, they pay for those observations), and the data miner should be able to provide some justification for his decisions.

5. The data miner does not establish the practical benefit of his work.

A data miner who cannot connect the numbers to reality is working in a vacuum and is not helping her manager (team, company, etc.) to assess or utilize her work product.  Likewise, there's a good chance that she is pursuing technical targets rather than practical ones.  Improving p-values, accuracy, AUC, etc. may or may not improve profit (retention, market share, etc.).

6. The data miner never challenges you.

The data miner has a unique view of the organization and its environment.  The data miner works on a landscape of data which few of his coworkers ever see, and he is less likely to be blinded by industry prejudices.  It is improbable that he will agree with his colleagues 100% of the time.  If the data miner never challenges assumptions (business practices, conclusions, etc.), then something is wrong.

Tuesday, November 06, 2012

Why Predictive Modelers Should be Suspicious of Statistical Tests (or why the Redskin Rule fools us)

Well, the danger is really not the statistical test per se; it is the interpretation of the statistical test.

Yesterday I tweeted (@deanabb) this fun factoid: "Redskins predict Romney wins POTUS #overfit. if Redskins lose home game before election => challenger wins (17/18)". I frankly had never heard of this "rule" before and found it quite striking. It even has its own Wikipedia page.

For those of us in the predictive analytics or data mining community, and those of us who use statistical tests to help interpret small data, 17/18 is, we know, a hugely significant finding. This can frequently be good: statistical tests help us gain intuition about the value of relationships in data even when they aren't obvious.

In this case, an appropriate test is a chi-square test based on the two binary variables (1) did the Redskins win on the Sunday before the general election (call it the input or predictor variable) vs. (2) did the incumbent political party win the general election for President of the United States (POTUS).

According to the Redskins Rule, the answer is "yes" in 17 of 18 cases since 1940. Could this be by chance? If we apply the chi-square test to it, it sure does look significant! (chi-square = 14.4, p < 0.001). I like the decision tree representation of this that shows how significant it is (built using the Interactive CHAID tree in IBM Modeler on Redskin Rule data I put together here):

It's great data--9 Redskin wins, 9 Redskin losses, great chi-square statistic!
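The chi-square figure is easy to check by hand. The sketch below assumes the 18 elections fall into a 2x2 table with 9 Redskins wins, 9 losses and a single miss; which cell holds the miss is my assumption (only the totals are given above), though either placement produces the same statistic. No continuity correction is applied, and the function names are purely illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Assumed layout: rows = Redskins won / lost, columns = incumbent party won / lost.
# The rule held 17 of 18 times, so exactly one election sits off the diagonal.
chi2 = chi_square_2x2(9, 0, 1, 8)
p = p_value_1df(chi2)
print(f"chi-square = {chi2:.1f}, p < 0.001: {p < 0.001}")
```

With these assumed counts the statistic comes out at 14.4, matching the value quoted above.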

OK, so it's obvious that this is just another spurious correlation in the spirit of all of those fun examples in history, such as the Super Bowl winning conference predicting whether the stock market would go up or down in the next year, at a stunning 20 of 22 correct. It was even the subject of academic papers!

The broader question (and concern) for predictive modelers is this: how do we recognize when correlations we have uncovered in the data are merely spurious? This can happen especially when we don't have deep domain knowledge and therefore wouldn't necessarily identify variables or interactions as spurious. In examples such as the election or stock market predictions, no amount of "hold out" samples, cross-validation or bootstrap sampling would uncover the problem: it is in the data itself.

We need to think about this because inductive learning techniques search through hundreds, thousands, even millions of variables and combinations of variables. The phenomenon of "over searching" is a real danger with inductive algorithms as they search and search for patterns in the input space. Jensen and Cohen have a very nice and readable paper on this topic (PDF here). For trees, they recommend using the Bonferroni adjustment which does help penalize the combinatorics associated with splits. But our problem here goes far deeper than overfitting due to combinatorics.

Of course the root problem with all of these spurious correlations is small data. Even if we have lots of data, what I'll call here the "illusion of big data", some algorithms make decisions based on smaller populations, like decision trees, rule induction and nearest neighbor (i.e., algorithms that build bottom-up). Anytime decisions are made from populations of 15, 20, 30 or even 50 examples, there is a danger that our search through hundreds of variables will turn out a spurious relationship.

What do we do about this? First, make sure you have enough data so that these small-data effects don't bite you. This is why I strongly recommend doing data audits and looking for categorical variables that contain levels with at most dozens of examples--these are potential overfitting categories.
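A data audit of this kind can be as simple as counting the examples behind each level of a categorical variable. The following is a minimal sketch; the function name, the field values and the threshold of 50 are all assumptions chosen for illustration.

```python
from collections import Counter

def sparse_levels(values, min_count=50):
    """Return {level: count} for levels seen fewer than min_count times."""
    counts = Counter(values)
    return {level: n for level, n in counts.items() if n < min_count}

# Hypothetical categorical field with two thinly populated levels.
states = ["CA"] * 400 + ["NY"] * 250 + ["WY"] * 12 + ["VT"] * 7
flagged = sparse_levels(states)
print(flagged)
```

Levels flagged this way are candidates for grouping or for a closer look before modeling.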

Second, don't hold strongly to any patterns discovered in your data based solely on the data, especially if they are based on relatively small sample sizes. These must be validated with domain experts. Decision trees are notorious for allowing splits deep in the trees that are "statistically significant" but dangerous nevertheless because of small data sizes.

Third, the gist of your models has to make sense. If it doesn't, put on your "Freakonomics" hat and dig in to understand why the patterns were detected by the models. In our Redskin Rule, clearly this doesn't make sense causally, but sometimes the pattern picked up by the algorithm is just a surrogate for a real relationship. Nevertheless, I'm still curious to see if the Redskin Rule will prove to be correct once again. This year it predicts a Romney win because the Redskins lost and therefore the incumbent party (D) by the rule should lose.

UPDATE: by way of comparison...the chance of having 17/18 or 18/18 coin flips turn up heads (or tails--we're assuming a fair coin after all!) is 7 in 100,000 or 1 in 14,000. Put another way, if we examined 14K candidate variables unrelated to POTUS trends, the chances are that one of them would line up 17/18 or 18/18 of the time. Unusual? Yes. Impossible? No!
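The coin-flip comparison can be reproduced with a few lines of arithmetic. This sketch counts runs of heads only, which is what the quoted 7 in 100,000 corresponds to; counting a run of either heads or tails would double the probability.

```python
from math import comb

n = 18
# Ways to get 17 or 18 heads in 18 flips of a fair coin.
favorable = comb(n, 17) + comb(n, 18)
p = favorable / 2 ** n
print(f"P(17 or 18 heads) = {favorable}/{2 ** n} = {p:.6f}")
print(f"about 1 in {round(1 / p):,}")

# Expected number of unrelated candidate variables, out of 14,000,
# that would match this well purely by chance.
expected_hits = 14_000 * p
print(f"expected chance matches among 14,000 candidates: {expected_hits:.2f}")
```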

Tuesday, October 23, 2012

Data Preparation: Know Your Records!

Data preparation in data mining and predictive analytics (dare I also say Data Science?) rightfully focuses on how the fields in one's data should be represented so that modeling algorithms either will work properly or at least won't be misled by the data. These data preprocessing steps may involve filling missing values, reining in the effects of outliers, transforming fields so they better comply with algorithm assumptions, binning, and much more.

In recent weeks I've been reminded how important it is to know your records. I've heard this described in many ways, four of which are:

  • the unit of analysis
  • the level of aggregation
  • what a record represents
  • unique description of a record

    For example, does each record represent a customer? If so, over their entire history or over a time period of interest? In web analytics, the time period of interest may be a single session, which, if true, means that an individual customer may be in the modeling data multiple times, as if each visit or session were an independent event.

    Where this especially matters is when disparate data sources are combined. If one is joining a table of customerID/Session data with another table with each record representing a customerID, there's no problem. But if the second table represents customerID/store visit data, there will obviously be a many-to-many join resulting in a big mess.

    This is probably obvious to most readers of this blog. What isn't always obvious is when our assumptions about the data result in unexpected results. What if we expect the unit of analysis to be customerID/Session but there are duplicates in the data? Or what if we had assumed customerID/Session data but it was in actuality customerID/Day data (where one's customers typically have one session per day, but could have a dozen)?

    The answer is that, just as we need to perform a data audit to identify potential problems with fields in the data, we need to perform record audits to uncover unexpected record-level anomalies. We've all had those data sources where the DBA swears up and down that there are no dups in the data, but when we group by customerID/Session, we find 1000 dups.

    So before the joins and after joins, we need to do those group by operations to find examples with unexpected numbers of matches.
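The group-by check described above can be sketched in a few lines. The table, field names and helper function here are hypothetical; the point is only that counting records per key immediately exposes keys that occur more often than the assumed unit of analysis allows.

```python
from collections import Counter

def find_duplicate_keys(records, key_fields):
    """Return {key: count} for every key that appears more than once."""
    counts = Counter(tuple(rec[f] for f in key_fields) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical session table; field names are illustrative only.
sessions = [
    {"customer_id": 1, "session_id": "a", "pages": 3},
    {"customer_id": 1, "session_id": "b", "pages": 5},
    {"customer_id": 2, "session_id": "a", "pages": 1},
    {"customer_id": 1, "session_id": "b", "pages": 5},  # unexpected duplicate
]

dups = find_duplicate_keys(sessions, ["customer_id", "session_id"])
print(dups)
```

Running the same function on the join key of each table before the join, and on the expected unit of analysis after it, covers both audits.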

    In conclusion: know what your records are supposed to represent, and verify, verify, verify. Otherwise, your models (which have no common sense) will exploit these issues in undesirable ways!

    Thursday, September 13, 2012

    Budgeting Time on a Modeling Project

    Within the time allotted for any empirical modeling project, the analyst must decide how to allocate time for various aspects of the process.  As is the case with any finite resource, more time spent on this means less time spent on that.  I suspect that many modelers enjoy the actual modeling part of the job most.  It is easy to try "one more" algorithm: Already tried logistic regression and a neural network?  Try CART next.

    Of course, more time spent on the modeling part of this means less time spent on other things.  An important consideration for optimizing model performance, then, is: Which tasks deserve more time, and which less?

    Experimenting with modeling algorithms at the end of a project will no doubt produce some improvements, and it is not argued here that such efforts be dropped.  However, work done earlier in the project establishes an upper limit on model performance.  I suggest emphasizing data clean-up (especially missing value imputation) and creative design of new features (ratios of raw features, etc.) as being much more likely to make the model's job easier and produce better performance.

    Consider how difficult it is for a simple 2-input model to discern "healthy" versus "unhealthy" when provided the input variables height and weight alone.  Such a model must establish a dividing line between healthy and unhealthy weights separately for each height.  When the analyst uses instead the ratio of weight to height, this becomes much simpler.  Note that the commonly used BMI (body mass index) is slightly more complicated than this, and would likely perform even better.  Crossing categorical variables is another way to simplify the problem for the model.  Though we deal with a process we call "machine learning", it is a pragmatic matter to make the job as easy as possible for the machine.
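As a small illustration of the height and weight example, the sketch below derives both the plain ratio and BMI (weight in kilograms divided by the square of height in meters). The record layout and function name are assumptions; the two made-up people differ in both raw inputs yet land on the same BMI, which is exactly what lets a single threshold do the work.

```python
def add_ratio_features(person):
    """Return a copy of the record with weight/height and BMI added."""
    height_m = person["height_m"]
    weight_kg = person["weight_kg"]
    enriched = dict(person)
    enriched["weight_per_height"] = weight_kg / height_m
    enriched["bmi"] = weight_kg / height_m ** 2  # standard BMI definition
    return enriched

# Two hypothetical people with different heights and weights.
people = [
    {"height_m": 1.60, "weight_kg": 64.0},
    {"height_m": 1.90, "weight_kg": 90.25},
]
enriched_people = [add_ratio_features(p) for p in people]
for row in enriched_people:
    print(row)
```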

    The same is true for handling missing values.  Simple global substitution using the non-missing mean or median is a start, but think about the spike that creates in the variable's distribution.  Doing this over multiple variables creates a number of strange artifacts in the multivariate distribution.  Spending the time and energy to fill in those missing values in a smarter way (possibly by building a small model) cleans up the data dramatically for the downstream modeling process.
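To make the contrast concrete, here is a minimal sketch comparing global mean substitution with about the smallest possible "model": the mean within a group the record belongs to. The data, field names and choice of grouping variable are assumptions for illustration, and a real project might well use something richer.

```python
from statistics import mean

def impute_global_mean(rows, field):
    """Fill missing values (None) with the mean of all non-missing values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill} if r[field] is None else dict(r) for r in rows]

def impute_group_mean(rows, field, group_field):
    """Fill missing values with the non-missing mean of the record's own group.

    Assumes every group has at least one non-missing value.
    """
    observed = {}
    for r in rows:
        if r[field] is not None:
            observed.setdefault(r[group_field], []).append(r[field])
    fills = {group: mean(values) for group, values in observed.items()}
    return [
        {**r, field: fills[r[group_field]]} if r[field] is None else dict(r)
        for r in rows
    ]

# Hypothetical records; income is in thousands.
rows = [
    {"segment": "student", "income": 10.0},
    {"segment": "student", "income": 14.0},
    {"segment": "student", "income": None},
    {"segment": "executive", "income": 90.0},
    {"segment": "executive", "income": 110.0},
    {"segment": "executive", "income": None},
]

global_filled = impute_global_mean(rows, "income")
group_filled = impute_group_mean(rows, "income", "segment")
print([r["income"] for r in global_filled])
print([r["income"] for r in group_filled])
```

The global fill puts both missing records at the same middling value, which resembles neither group; the group fill keeps each record near its peers.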

    Tuesday, September 11, 2012

    What do we call what we do?

    I've called myself a data miner for about 15 years, and the field I was a part of as Data Mining (DM). Before then, I referred to what I did as "Pattern Recognition", "Machine Learning", "Statistical Modeling", or "Statistical Learning". In recent years, I've called what I do Predictive Analytics (PA) more often and even co-titled my blog with both Data Mining and Predictive Analytics. That stated, I don't have a good noun to go along with PA. A "predictive analytist" (as if I myself were a "predictor")? A "predictive analyzer"? I often call someone who does PA a Predictive Analytics Professional. But according to Google, the trend for data mining is down. Pattern recognition? Down. Machine Learning? Flat or slightly up. Only Predictive Analytics and its closely related sibling, Business Analytics, are up. Even the much-touted Data Science has been relatively flat, though it has been spiking in Q4 the past few years.
    [Google trend charts: Data Mining, Pattern Recognition, Machine Learning, Predictive Analytics, Business Analytics]
    The big winner? Big Data of course! It has exploded this year. Will that trend continue? It's hard to believe it will continue, but this wave has grown and it seems that every conference related to analytics or databases is touting "big data".

    [Google trend charts: Big Data, Data Science]

    I have no plans of calling what I do "big data" or "data science". The former term will pass when data gets bigger than big data. The latter may or may not stick, but seems to resonate more with theoreticians and leading-edge types than with practitioners. For now, I'll continue to call myself a data miner and what I do predictive analytics or data mining.

    Friday, August 31, 2012

    Choose Your Target Carefully

    Every so often, an article or survey will appear stressing the importance of data preparation as an early step in the process of data mining.  One often-overlooked part of data preparation is to clearly define the problem, and, in particular, the target variable.  Often, a nominal definition of the target variable is given.

    As an example, a common problem in banking is to predict future balances of a loan customer.  The current balance is a matter of record and a host of explanatory variables (previous payment history, delinquency history, etc.) are available for model construction.  It is easy to move forward with such a project without considering carefully whether the raw target variable is the best choice for the model to approximate.  It may be, for instance, that it is easier to predict the logarithm of balance, due to a strongly skewed distribution.  Or, it might be that it is easier to predict the ratio of future balances to the current balance.  These two alternatives result in models whose outputs are easily transformed back into the original terms (by exponentiation or multiplication by the current balance, respectively).  More sophisticated targets may be designed to stabilize other aspects of the behavior being studied, and certain other loose ends may be cleaned up as well, for instance when the minimum or maximum target values are constrained.
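The two alternative targets and their back-transformations can be written down directly. This is only a sketch: the function names and balances are made up, and the log version assumes strictly positive balances.

```python
import math

def to_log_target(future_balance):
    """Log-transformed target; assumes the balance is strictly positive."""
    return math.log(future_balance)

def from_log_target(predicted_log):
    """Back-transform a log-scale prediction by exponentiation."""
    return math.exp(predicted_log)

def to_ratio_target(future_balance, current_balance):
    """Target expressed as future balance relative to current balance."""
    return future_balance / current_balance

def from_ratio_target(predicted_ratio, current_balance):
    """Back-transform a ratio prediction by multiplying by the current balance."""
    return predicted_ratio * current_balance

# Hypothetical loan: the model would be trained on one of these targets.
current_balance = 2_000.0
future_balance = 1_500.0

log_target = to_log_target(future_balance)
ratio_target = to_ratio_target(future_balance, current_balance)

# Round trip: a perfect prediction maps back to the original balance.
print(from_log_target(log_target))
print(from_ratio_target(ratio_target, current_balance))
```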

    When considering various possible targets, it helps to keep in mind that the idea is to stabilize behavior, so that as many observations as possible align in the solution space.  If retail sales include a regular variation, such as by day of the week or month of the year, then that might be a good candidate for normalization: Possibly we want to model retail sales divided by the average for that day of the week, or retail sales divided by a trailing average for that day of the week for the past 4 weeks.  Some problems lend themselves to decomposition, such as profit being modeled by predicting revenue and cost separately.  One challenge to using multiple models in series this way is that their (presumably independent) errors will compound.
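As a sketch of the day-of-week normalization, the code below divides each sales figure by the average for its weekday, using the overall per-weekday average (the simpler of the two options mentioned; a trailing 4-week average would follow the same pattern). The data and names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def normalize_by_weekday(sales):
    """sales: list of (weekday, amount) pairs.

    Returns (normalized, day_mean), where normalized holds
    (weekday, amount / mean amount for that weekday).
    """
    by_day = defaultdict(list)
    for weekday, amount in sales:
        by_day[weekday].append(amount)
    day_mean = {day: mean(amounts) for day, amounts in by_day.items()}
    normalized = [(day, amount / day_mean[day]) for day, amount in sales]
    return normalized, day_mean

# Two hypothetical weeks: Saturdays sell three times what Mondays do,
# and the second week runs 20% above the first.
sales = [("Mon", 100.0), ("Sat", 300.0), ("Mon", 120.0), ("Sat", 360.0)]
normalized, day_mean = normalize_by_weekday(sales)
print(day_mean)
print(normalized)

# A prediction on the normalized scale is converted back by multiplying
# by the weekday average.
predicted_index = 1.0
print(predicted_index * day_mean["Sat"])
```

After normalization the Monday and Saturday observations from the same week line up on the same value, which is the alignment in the solution space that the paragraph above describes.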

    Experience indicates that it is difficult in practice to tell which technique will work best in any given situation without experimenting, but performance gains are potentially quite high for making this sort of effort.

    Wednesday, August 08, 2012

    The Data is Free and Computing is Cheap, but Imagination is Dear

    Recently published research, What Makes Paris Look like Paris?, attempts to classify images of street scenes according to their city of origin.  This is a fairly typical supervised machine learning project, but the source of the data is of interest.  The authors obtained a large number of Google Street View images, along with the names of the cities they came from.  Increasingly, large volumes of interesting data are being made available via the Internet, free of charge or at little cost.  Indeed, I published an article about classifying individual pixels within images as "foliage" or "not foliage", using information I obtained using on-line searches for things like "grass", "leaves", "forest" and so forth.

    A bewildering array of data has been put on the Internet.  Much of this data is what you'd expect: financial quotes, government statistics, weather measurements and the like (large tables of numeric information).  However, there is a great deal of other information: 24/7 Web cam feeds which are live for years, news reports, social media spew and so on.  Additionally, much of the data for which people once charged serious bucks is now free or rather inexpensive.  Already, many firms augment the data they've paid for with free databases on the Web.  An enormous opportunity is opening up for creative data miners to consume and profit from large, often non-traditional, non-numeric data which are freely available to all, but (so far) creatively analyzed by few.

    Monday, July 30, 2012

    Predicting Crime

    Applying inferential statistics to criminology is not new, but it appears that the market has been maturing.  See, for instance, a recent article, "Police using ‘predictive analytics’ to prevent crimes before they happen", published by Agence France-Presse on The Raw Story (Jul-29-2012).

    Setting aside obvious civil liberties questions, consider the application of this technology.  My suspicion is that targeting police efforts by geographic locale and day-of-week/time-of-day using this approach will decrease the overall level of crime, but by how much is not clear. This is typical of problems faced by businesses: It is not enough to predict what we already know, nor is it enough to trot out glowing but artificial technical measures of performance.  Knowledge that real improvement has occurred requires more.  For instance, at least some effect of police effort on the street does not decrease crime, but merely moves it to new locations.

    Were I mayor of a small town approached by the vendor of such a solution, I'd want to see some sort of experimental design which made apples-to-apples comparison between our best estimates of what happens with the new tool, and what happens without it.  Only once this firm measure of benefit has been obtained could one reasonably weigh it against the financial and political costs.

    Thursday, June 21, 2012

    Where it began: John Elder and Dean Abbott at Barron Associates

    I came across this photo today and couldn't resist. John Elder and I worked at the company Barron Associates, Inc. (BAI) in Charlottesville, VA in the 80s (John was employee #1). Hopefully you can identify John in the back right and me in the front right, though it will take some good pattern matching to do so!

    The Founder and President of the company was Roger Barron. Both John and I were introduced to statistical learning methods at BAI and of course went on to careers in the field now known as Data Mining or Predictive Analytics (among other things). I write about my experience with BAI in the forthcoming book Journeys to Data Mining: Experiences from 15 Renowned Researchers, Ed. by Dr Mohamed Medhat Gaber, published by Springer. Authors in the book include John (thanks to John for recommending my inclusion in the book), Gregory Piatetsky-Shapiro, Mohammed J. Zaki, and of course several others.

    The photo appeared as I was searching for descriptions of our field back in the 80s and was looking in particular for the Barron and Barron paper "Statistical Learning Networks: A Unifying View", where "Statistical Learning Networks" was the phrase of interest, along with "Models", "Classifiers", and "Neural Networks". We used to refer to the field as "Pattern Recognition" and "Artificial Intelligence". It's interesting to note that pattern recognition on Wikipedia contains a list of "See Also" terms that includes the more modern terms such as data mining and predictive analytics.

    I will post within a couple days on the pattern recognition terms of the day and how they are changing.

    Wednesday, May 02, 2012

    Predictive Analytics World Had the Target Story First

    The New York Times Magazine article "How Companies Learn Your Secrets" by Charles Duhigg, with its key descriptions of Target, pregnancy, and predictive analytics (blogged on here and here), certainly generated a lot of buzz; if you are unable to see the NYTimes Magazine article, the Forbes piece is a good summary. However, few know that Eric Siegel booked Andy Pole for the October 2010 Predictive Analytics World conference as a keynote speaker. The full video of that talk is here.

    In this talk, Mr. Pole discussed how Target was using Predictive Analytics, including descriptions of using potential value models, coupon models, and...yes...predicting when a woman is due (if you aren't the patient type, it is at about 34:30 in the video). These models were very profitable at Target, adding an additional 30% to the number of women suspected or known to be pregnant over those already known to be (via self-disclosure or baby registries). The fact that this went on for over a year after the Predictive Analytics World talk and before the fallout tells me that it didn't cause significant problems for Target prior to the attention brought to the subject by the NYTimes article.

    After watching the talk, what struck me most was that Target was applying a true "360 deg" customer view by linking guests through store visits, web, email, and mobile interactions. In addition, close attention was paid to linking the interactions so that coupons made sense: they didn't print coupons for those who had just purchased the items they scored high on for couponing, and they identified which mediums didn't generate responses and stopped using those channels.

    I suspect what Target is doing is no different from what most retailers do, but this talk was an interesting glimpse into how much they value the different channels and try to find ways to provide customers with what they are most interested in, and suppress what they are not interested in.

    Thursday, April 26, 2012

    Another Wisdom of Crowds Prediction Win at eMetrics / Predictive Analytics World

    This past week at Predictive Analytics World / Toronto (PAW) has been a great time for connecting with thought leaders and practitioners in the field. Sometimes there are unexpected pleasures as well, which is certainly the case this time. One of the exhibitors for the eMetrics conference, co-locating with PAW at the venue, was Unilytics, a web analytics company. At their booth there was a cylindrical container filled with crumpled dollar bills with a sign soliciting predictions of how many dollar bills were in the container (the winner getting all the dollars). After watching the announcement of the winner, who guessed $352, only $10 off from the actual $362, I thought this would be the perfect opportunity for another Wisdom of Crowds test, just like the one conducted 9 months ago and blogged on here.
    Two Unilytics employees at the booth, Gary Panchoo and Keith MacDonald, were kind enough to indulge me and my request to compute the average of all the guesses. John Elder was also there, licking his wounds from finishing a close second, as his guess, $374, was off by $12, a mere $2 away from the winning entry! The results of the analysis are here (summary statistics created by JMP Pro 10 for the Mac). In summary, the results are as follows:

    Dollar Bill Guess Scores

    Method                                    Guess Value   Error
    Ensemble/Average (N=61)                   365           3
    Winning Guess (person)                    352           10
    John Elder                                374           12
    Guess without outlier (2000), 3rd place   338           24
    Median, 19th place                        275           87
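The "without outlier" row can be reproduced from the other numbers in the table alone. The sketch below takes the reported average of 365 over 61 guesses at face value (it may be rounded), removes the single guess of 2000, and recomputes the mean of the remaining 60.

```python
n_guesses = 61
mean_all = 365.0   # ensemble average as reported in the table
outlier = 2000.0   # the one "crazy" guess
actual = 362.0     # true number of dollar bills

# Recover the sum of all guesses, drop the outlier, and re-average.
total = mean_all * n_guesses
mean_without_outlier = (total - outlier) / (n_guesses - 1)
error_without_outlier = abs(mean_without_outlier - actual)

print(round(mean_without_outlier, 2), round(error_without_outlier, 2))
```

The result is consistent with the 338 and 24 shown in the table: a single guess out of 61 moves the average by roughly 27 dollars.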

    So once again, the average of the entries (the "Crowds" answer) beat the single best entry. What is fascinating to me about this is not that the average won (though this in and of itself isn't terribly surprising), but rather how it won. Summary statistics are below. Note that the Median is 275, far below the mean. Note too how skewed the distribution of guesses is (skew = 3.35). The fact that the guesses are skewed positively for a relatively small answer (362) isn't a surprise, but the amount of skew is a bit surprising to me. What these statistics tell us is that while the mean value of the guesses would have been the winner, a more robust statistic would not, meaning that the skew was critical in obtaining a good guess. Or put another way, people more often than not under-guessed by quite a bit (the median is off by 87). Or put a third way, the outlier (2000), which one might naturally want to discount because it was a crazy guess, was instrumental to the average being correct. In the prior post on this from July 2011, I trimmed the guesses, removing the "crazy" ones. So when should we remove the wild guesses and when shouldn't we? (If I had removed the 2000, the "average" still would have finished 3rd). I have no answer to when the guesses are not reasonable, but wasn't inclined to remove the 2000 initially here. Full stats from JMP are below, with the histogram showing the amount of skew that exists in this data.

    Distribution of Dollar Bill Guesses - Built with JMP

    Summary Statistics

    Statistic             Guess Value
    Std Dev               299.80071
    Std Err Mean          38.385548
    Upper 95% Mean        441.78253
    Lower 95% Mean        288.21747
    2% Trimmed Mean       331.45614
    Interquartile Range   185.5

    Note: The mode shown is the smallest of 2 modes with a count of 3.



    Monday, April 09, 2012

    Dilbert, Database marketing and spam

    Ruben's comment that referred to spam reminded me of an old Dilbert comic which conveys the misconception about database marketing (e-marketing) and spam.

    I know Ruben well and know he was poking fun, though I still have to correct folks who after finding out I do "data mining" actually comment that I'm responsible for spam. Answer: "No, I'm the reason you don't get as much spam!"

    Friday, April 06, 2012

    What I'm Working On

    Sometimes folks ask me what I'm doing, so I thought I'd share a few things on my plate right now:

    Courses and Conferences
    1. Reading several papers for the KDD 2012 Conference Industrial / Government Track
    2. Preparing for the Predictive Analytics World / Toronto "Advanced Methods Hands-on: Predictive Modeling Techniques" workshop on April 27. I'm using the Statsoft Statistica package.
    3. Starting preparation for a talk at the Salford Analytics and Data Mining Conference 2012, "A More Transparent Interpretation of Health Club Surveys" on May 24. It will highlight use of the CART software package in the analysis. This was work that motivated interviews with New York Times reporter Charles Duhigg, and ended with a mention (albeit *very* briefly) in the fascinating new book by Duhigg, "The Power of Habit: Why We Do What We Do in Life and Business".
    4. Working through data exercises for the next UCSD-Extension Text Mining Course on May 11, 18, and 25th. I'm using KNIME for this course.

    Approximately 80% of my time is spent on active consulting. While I can't describe most of the work I'm doing, my current clients are in the following domains:
    1. Web Analytics and email remarketing for retail via a great startup company, Smarter Remarketer headed by Angel Morales (Founder and Chief Innovation Officer), Howard Bates (CEO), and me (Founder and Chief Scientist).
    2. Customer Acquisition, web/online and offline (2 clients)
    3. Tax Modeling (2 clients)
    4. Data mining software tool selection for a large health care provider.

    Here's to productive application of predictive analytics!

    Thursday, April 05, 2012

    Why Defining the Target Variable in Predictive Analytics is Critical

    Every data mining project begins with defining what problem will be solved. I won't describe the CRISP-DM process here, but I use that general framework often when working with customers so they have an idea of the process.

    Part of the problem definition is defining the target variable. I argue that this is the most critical step in the process that relates to the data, and more important than data preparation, missing value imputation, and the algorithm that is used to build models, as important as they all are.

    The target variable carries with it all the information that summarizes the outcome we would like to predict from the perspective of the algorithms we use to build the predictive models. Yet this can be misleading in many ways. I'm addressing one way we can be fooled by the target variable here, and please indulge me to lead you down the path.

    Let's say we are building fraud models in our organization. Let's assume that in our organization, the process for determining fraud is first to identify possible fraud cases (by tips or predictive models), then assign the case to a manager who determines which investigator will get the case (assuming the manager believes there is value in investigating the case), then assign the case to an investigator, and if fraud is found, the case is tried in court, and ultimately a conviction is made or the party is found not guilty.

    Our organization would like to prioritize which cases should be sent to investigators using predictive modeling. It is decided that we will use as a target variable all cases that were found to be fraudulent, that is, all cases that had been tried and a conviction achieved. Let's assume here that all individuals involved are good at their jobs and do not make arbitrary or poor decisions (which of course is also a problem!)

    Let's also put aside for a moment the time lag involved here (a problem itself) and just consider the conviction as a target variable. What does the target variable actually convey to us? Of course our desire is that this target variable conveys fraud risk. Certainly when the conviction has occurred, we have high confidence that the case was indeed fraudulent, so the "1"s are strong and clear labels for fraud.

    But, what about the "0"s? Which cases do they include?
    --cases never investigated (i.e., we suspect they are not fraud, but don't know)
    --cases assigned to a manager who never assigned the case to an investigator (he/she didn't think they were worth investigating)
    --cases assigned to an investigator where the investigation has not yet been completed, was never completed, or was determined not to contain fraud
    --cases that went to court but were found "not guilty"

    Remember, all of these are given the identical label: "0"

    That means that any cases that look on the surface to be fraudulent, but for which there were insufficient resources to investigate, are called "not fraudulent". That means cases that were investigated, but where the investigator was taken off the case to investigate other cases, are called "not fraudulent". It means too that court cases that were thrown out of court due to a technicality unrelated to the fraud itself are called "not fraudulent".

    In other words, the target variable defined as only the "final conviction" represents not only the risk of fraud for a case, but also the investigation and legal system. Perhaps complex cases that are high risk are thrown out because they aren't (at this particular time, with these particular investigators) worth the time. Is this what we want to predict? I would argue "no". We want our target variable to represent the risk, not the system.

    This is why when I work on fraud detection problems, the definition of the target variable takes time: we have to find measures that represent risk and are informative and consistent, but don't measure the system itself. For different customers this means different trade-offs, but usually it means using a measure from earlier in the process.
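
    To make the labeling problem concrete, here is a minimal sketch in Python. The stage names and the "earlier in the process" rule are assumptions made for illustration, not a description of any particular customer's system. It contrasts the conviction-only target with one possible earlier-stage target that leaves unknown outcomes unlabeled instead of calling them "0".

```python
# Hypothetical pipeline stages, named for illustration only.
NEVER_INVESTIGATED = "never_investigated"
MANAGER_DECLINED = "manager_declined"
INVESTIGATION_INCOMPLETE = "investigation_incomplete"
INVESTIGATED_NO_FRAUD = "investigated_no_fraud"
TRIED_NOT_GUILTY = "tried_not_guilty"
TRIED_CONVICTED = "tried_convicted"


def conviction_target(stage):
    """Conviction-only target: every case without a conviction becomes 0."""
    return 1 if stage == TRIED_CONVICTED else 0


def investigator_target(stage):
    """Earlier-stage target (an assumed alternative, one trade-off among many).

    1    -> the investigator found enough to send the case to court
    0    -> the investigation was completed and found no fraud
    None -> outcome unknown; leave the case out of training
    """
    if stage in (TRIED_CONVICTED, TRIED_NOT_GUILTY):
        return 1
    if stage == INVESTIGATED_NO_FRAUD:
        return 0
    return None


stages = [
    NEVER_INVESTIGATED,
    MANAGER_DECLINED,
    INVESTIGATION_INCOMPLETE,
    INVESTIGATED_NO_FRAUD,
    TRIED_NOT_GUILTY,
    TRIED_CONVICTED,
]

conviction_labels = [conviction_target(s) for s in stages]
investigator_labels = [investigator_target(s) for s in stages]

# Print the two labelings side by side for each stage.
for stage, c_label, i_label in zip(stages, conviction_labels, investigator_labels):
    print(f"{stage:26s} conviction={c_label}  investigator={i_label}")
```

    Under the conviction-only definition, five very different situations collapse into the same "0". The alternative still depends on the investigators' judgment, so it is a sketch of the trade-off rather than a recommendation.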

    So in summary, think carefully about the target variable you are defining, and don't be surprised when your predictive models predict exactly what you told them to!

    Tuesday, February 21, 2012

    Target, Pregnancy, and Predictive Analytics, Part II

    This is part II of my thoughts on the New York Times article "How Companies Learn Your Secrets".

    In the first post, I commented on the quote
    “It’s like an arms race to hire statisticians nowadays,” said Andreas Weigend, the former chief scientist at “Mathematicians are suddenly sexy.”
    Comments on this can be seen in Part I here.

    In this post, the next portion of the article I found fascinating can be summarized by the section that says
    Habits aren’t destiny — they can be ignored, changed or replaced. But it’s also true that once the loop is established and a habit emerges, your brain stops fully participating in decision-making. So unless you deliberately fight a habit — unless you find new cues and rewards — the old pattern will unfold automatically.
    Habits are what predictive models are all about. Or, putting it as a question, "is customer behavior predictable based on their past behavior?" The Frawley, Piatetsky-Shapiro, Matheus definition of knowledge discovery in databases (KDD) is as follows:
    Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. (PDF of the paper can be found here)
    This quote has often been applied to data mining and predictive analytics, and rightfully so. We believe there are patterns hidden in the data and want to characterize those patterns with predictive models. Predictive models usually work best when individuals don't even realize what they are doing, so we can capture their behavior based solely on what they want to do rather than behavior influenced by how they want to be perceived, which is exactly how the Target models were built.

    So what does this have to do with the NYTimes quote? The "habits" that "unfold automatically" as described in the article were fascinating precisely because predictive models rely on habits; we wish to find connections between past behavior and expected results, as captured in the data, that are consistent and repeatable (that is, habitual!). These expected results could be "is likely to respond to a mailing", "is likely to purchase a product online", "is likely to commit fraud", or in the case of the article, "is likely to be pregnant". Duhigg (and presumably Pole describing it to Duhigg) characterizes this very well. The behavior Target measured was shoppers' purchasing behavior when they were to give birth some weeks or months in the future, and nothing more. The patterns had to apply broadly to thousands of "Guest IDs" for the models to work effectively.

    The description of what Andy Pole did for Target is an excellent summary of what predictive modelers can and should do. The approach included domain knowledge, understanding of what predictive models can do, and most of all a forensic mindset. I quote again from the article:
    "Target has a baby-shower registry, and Pole started there, observing how shopping habits changed as a woman approached her due date, which women on the registry had willingly disclosed. He ran test after test, analyzing the data, and before long some useful patterns emerged. Lotions, for example. Lots of people buy lotion, but one of Pole’s colleagues noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date." (emphases mine)
    To me, the key descriptive terms in the quote from the article are "observed", "noticed" and "noted". This means the models were not built as black boxes; the analysts asked "does this make sense?" and leveraged insights gained from the patterns found in the data to produce better predictive models. It undoubtedly was iterative; as they "noticed" patterns, they were prompted to consider other patterns they had not explicitly considered before (and maybe had not even occurred to them before). But it was these patterns that turned out to be the difference-makers in predicting pregnancy.

    So after all my preamble here, the key take-home messages from the article are:
    1) understand the data,
    2) understand why the models are focusing on particular input patterns,
    3) ask lots of questions (why does the model like these fields best? why not these other fields?)
    4) be forensic (now that's interesting or that's odd...I wonder...),
    5) be prepared to iterate (how can we predict better for those customers we don't characterize well?),
    6) be prepared to learn during the modeling process

    We have to "notice" patterns in the data and connect them to behavior. This is one reason I like to build multiple models: different algorithms can find different kinds of patterns. Regression is a global predictor (one continuous equation for all data), whereas decision trees and kNN are local estimators.
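
    The global-versus-local distinction can be sketched in a few lines of plain Python. The toy data (a step function), the helper names fit_line and knn_predict, and the choice of k=3 are all assumptions made for illustration. A single least-squares line must compromise across the entire range, while kNN averages only the nearest points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def knn_predict(xs, ys, x0, k=3):
    """Local estimate: average the targets of the k points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


# Toy data: the target jumps from 0 to 10 at x = 5.
xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5

slope, intercept = fit_line(xs, ys)

for x0 in (1, 8):
    line_estimate = slope * x0 + intercept
    knn_estimate = knn_predict(xs, ys, x0)
    print(f"x={x0}: global line={line_estimate:.2f}, local kNN={knn_estimate:.2f}")
```

    On this data the kNN estimates sit exactly on the two plateaus, while the line overshoots both of them because one equation has to serve all ten points. Neither is "right" in general, which is the argument for looking at more than one kind of model.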

    So we shouldn't be surprised that we will be surprised, or put another way, we should expect to be surprised. The best models I've built contain surprises, and I'm glad they did!

    Saturday, February 18, 2012

    Target, Pregnancy, and Predictive Analytics, Part I

    There has been a plethora of tweets about the New York Times article "How Companies Learn Your Secrets", mostly focused on the story of how Target can predict if a customer is pregnant. The tweets I've seen on this most often react that this is somewhat creepy or invasive. I may write more on this topic at some future time (which probably means I won't!) because I don't find it creepy at all that a company would try to understand my behavior and infer the cause of that behavior. But I digress…

    The parts of the article I find far more interesting include these:

    “It’s like an arms race to hire statisticians nowadays,” said Andreas Weigend, the former chief scientist at “Mathematicians are suddenly sexy.”


    Habits aren’t destiny — they can be ignored, changed or replaced. But it’s also true that once the loop is established and a habit emerges, your brain stops fully participating in decision-making. So unless you deliberately fight a habit — unless you find new cues and rewards — the old pattern will unfold automatically.

    Part I will address the first quote, and next week I'll post the second, much longer part.

    First, mathematics and predictive analytics…

    The first quote is a tremendous statement and one that all of us in the field should take notice of. As college student enrollment in STEM majors continues to decline, we have fewer and fewer candidates (as a percentage) to choose from.

    But I don't think this is necessarily hopeless. I just finished teaching a text mining course, and one woman in the course told me that she never liked mathematics, yet it was obvious that she not only did data mining, but she understood it and was able to use the techniques successfully. There is something different about statistics, data mining and predictive analytics. It isn't math, it's forensic. It's like solving a puzzle rather than proving a theorem or solving for "x".

    Almost every major retailer, from grocery chains to investment banks to the U.S. Postal Service, has a “predictive analytics” department devoted to understanding not just consumers’ shopping habits but also their personal habits, so as to more efficiently market to them.

    Really? I appreciate the statement of how widespread predictive analytics is. But I think it overstates the case. I've personally done work for retailers and other major organizations without predictive analytics departments. Now they may have several individuals who are analysts, but they aren't organized as a department. More often, they are part of the "marketing" department with an "analyst" title. This matters because collaboration is key in building predictive models well. One thing I try to encourage with all of my customers is building a collaborative environment where ideas, insights, and lessons learned are exchanged. With most customers, this is something they already do or are eager to do. With a few it has been more challenging.

    “But Target has always been one of the smartest at this,” says Eric Siegel, a consultant and the chairman of a conference called Predictive Analytics World. “We’re living through a golden age of behavioral research. It’s amazing how much we can figure out about how people think now.”

    I completely agree with Eric that we live in a world now where we finally have enough data, enough accessible data, the technical ability, and the interest in understanding that data. These are indeed good times to be in predictive analytics!

    We need both kinds of analysts: the mathematically astute ones, and those who don't care about the math but understand deeply how to build and use predictive models. We need to develop both kinds of analysts, but there are far more of the latter, and they can do the job.

    Thursday, January 05, 2012

    Top 5 Posts from 2011

    By far, the most visited post of 2011 was the "What Do Data Miners Need to Learn" post from June.

    The top five visited posts that were first posted in 2011 are (with actual ranks for all posts):
    1. What Do Data Miners Need to Learn
    2. Statistical Rules of Thumb, Part III
    3. Statistical Rules of Thumb, Part II
    4. Number of Hidden Layer Neurons to Use
    5. Statistics: The Need for Integration

    The top six viewed posts in 2011 originally created prior to 2011 were:
    1. Why Normalization Matters with K-Means (2009)
    2. Free and Inexpensive Data Mining Software (2006)
    3. Data Mining Data Sets (2008)
    4. Can you Learn Data Mining in Undergraduate or Graduate School (2009)
    5. Quotes from Moneyball (2007)
    6. Business Analytics vs. Business Intelligence (2009)

    The "Free and Inexpensive Data Mining Software" post is understandably relatively popular, even after 5 years. The Moneyball quotes post has a particularly high bounce rate. I'm most surprised that the K-Means normalization post has remained popular for so long.