Tuesday, January 19, 2010

Is there anything new in Predictive Analytics?

Federal Computer Week's John Zyskowski posted an article on Jan 8, 2010 on Predictive Analytics entitled "Deja vu all over again: Predictive analytics look forward into the past". (kudos for the great Yogi Berra quote! But beware, as Berra stated himself, "I really didn't say everything I said")

Back to Predictive Analytics...Pieter Mimno is quoted as stating:
There's nothing new about this (Predictive Analytics). It's just old techniques that are being done better.
To support this argument, John quotes me related to work done at DFAS 10 years ago. Is this true? Is there nothing new in predictive analytics? If it isn't true, what is new?

I think what is new is not algorithms, but a better integration of data mining software in the business environment, primarily in two places: on the front end and on the back end. On the front end, data mining tools are better at connecting to databases now compared to 10 years ago, and provide the analyst better tools for assessing the data coming into the software. This has always been a big hurdle, and was the reason that at KDD 1999 in San Diego, the panel discussion on "Data Mining into Vertical Solutions" concluded that data mining functionality would be integrated into the database to a large degree. But while it hasn't happened quite the way it was envisioned 10 years ago, it is clearly much easier to do now.

On the back end, I believe the most significant step forward in data mining tools has been giving the analyst the ability to assess models in a manner consistent with the business objectives of the model. So rather than comparing models based on R^2 or overall classification accuracy, most tools give you the ability to generate an ROI chart, or a ROC curve, or build a custom model assessment engine based on rank-ordered model predictions. This means that when we convey what models are doing to decision makers, we can do so in the language they understand and not force them to understand how good an R^2 of 0.4 really is. In addition, data mining tools are to a greater degree producing scoring code that is usable outside of the tool itself by creating SQL code, SAS code, C or Java, or PMML. What I'm waiting for next is for vendors to provide PMML or other code for all the data prep one does in the tool prior to the model itself; typically, PMML code is generated only for the model itself.
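As a rough illustration of assessment based on rank-ordered predictions, here is a minimal sketch of a cumulative gains calculation. The function name `cumulative_gains`, the bin count, and the toy scores are assumptions made for this example; they are not taken from any particular tool.

```python
def cumulative_gains(scores, actuals, n_bins=10):
    """Fraction of all responders captured after each bin, with records
    sorted by descending model score (hypothetical helper)."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(actual for _, actual in ranked)
    if total_positives == 0:
        raise ValueError("need at least one positive case to compute gains")
    n = len(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(b * n / n_bins)          # records contacted so far
        captured = sum(actual for _, actual in ranked[:cutoff])
        gains.append(captured / total_positives)
    return gains


# Toy data: 10 scored records, 4 of which are true responders.
scores = [0.91, 0.85, 0.40, 0.77, 0.10, 0.66, 0.05, 0.30, 0.20, 0.55]
actuals = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
gains = cumulative_gains(scores, actuals, n_bins=5)
print(gains)  # [0.5, 0.75, 0.75, 1.0, 1.0]
```

In this toy case, contacting the top 40% of the list captures 75% of the responders, which is the kind of statement a decision maker can act on without interpreting R^2.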

Sunday, January 10, 2010

Counting Observations

Data is fodder for the data mining process. One fundamental aspect of the data we analyze is its size, which is most often characterized by the number of observations and the number of variables in the given set of data- typically measured as counts of "rows and columns", respectively. It is worth taking a closer look at this, though, as questions such as "Do we have enough data?" depend on an apt measure of how much data we have.


Outcome Distributions

In many predictive modeling situations, cases are spread fairly evenly among the possible outcomes, but this is not always true. Many fraud detection problems, for instance, involve extreme class imbalance: target class cases (known frauds) may represent a small fraction of 1% of the available records. Despite having many total observations of customer behavior, observations of fraudulent behavior may be rather sparse. Data miners who work in the fraud detection field are acutely aware of this issue and characterize their data sets not just by 'total number of observations', but also by 'observations of the behavior of interest'. When assessing an existing data set, or specifying a new one, such an analyst generally employs both counts.

Numeric outcome variables may also suffer from this problem. Most numeric variables are not uniformly distributed, and areas in which outcome data is sparse- for instance, long tails of high personal income- are areas which may be poorly represented in models derived from that data.

With both class and numeric outcomes, it might be argued that outcome values which are infrequent are, by definition, less important. This may or may not be so, depending on the modeling process and our priorities. If the model is expected to perform well on the top personal income decile, then data should be evaluated by how many cases fall in that range, not just on the total observation count.
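A minimal sketch of reporting both counts side by side follows. The helper name `describe_size` and the synthetic fraud flags are assumptions for illustration only.

```python
def describe_size(outcomes, is_of_interest):
    """Report total observations alongside observations of the behavior
    of interest (hypothetical helper)."""
    total = len(outcomes)
    of_interest = sum(1 for y in outcomes if is_of_interest(y))
    return {"total": total, "of_interest": of_interest, "share": of_interest / total}


# Synthetic example: 10,000 records, of which 25 are known frauds (flag = 1).
fraud_flags = [1] * 25 + [0] * 9975
summary = describe_size(fraud_flags, lambda y: y == 1)
print(summary)
```

The same function works for a numeric outcome by passing a range test, for example `lambda income: income >= cutoff` for a top income decile, where `cutoff` is whatever decile boundary applies to the data at hand.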


Predictor Distributions

Issues of coverage occur on the input variable side, as well. Keeping in mind that generalization is the goal of discovered models, the total record count by itself seems inadequate when, for example, data are drawn from a process which has (or may have) a seasonal component. Having 250,000 records in a single data set sounds like many, but if they are only drawn from October, November and December, then one might reasonably take the perspective that only 3 "observations" of monthly behavior are represented, out of 12 possibilities. In fact, (assuming some level of stability from year to year) one could argue that not only should all 12 calendar months be included, but that they should be drawn from multiple historical years, so that there are multiple observations for each calendar month.

Other groupings of cases in the input space may also be important. For instance, hundreds of observations of retail sales may be available, but if they come from only 25 salespeople out of a sales force of 300, then treating the simple record count as the "observation count" may be deceiving.
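The idea of counting groups rather than records can be sketched as follows. The helper `group_coverage` and the record layout (a dict with a `month` field) are assumptions made for this example.

```python
from collections import Counter


def group_coverage(records, key, n_possible):
    """Compare the raw record count with the number of distinct groups
    (months, salespeople, ...) actually represented (hypothetical helper)."""
    counts = Counter(record[key] for record in records)
    return {
        "records": len(records),
        "groups_seen": len(counts),
        "groups_possible": n_possible,
        "coverage": len(counts) / n_possible,
    }


# 250,000 synthetic records drawn only from October, November and December.
records = [{"month": 10 + (i % 3)} for i in range(250_000)]
coverage = group_coverage(records, key="month", n_possible=12)
print(coverage)
```

Here the record count is large, but only 3 of 12 calendar months are represented, so coverage of the seasonal cycle is 25%. The same call with `key="salesperson"` and `n_possible=300` would cover the retail example.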


Validation Issues

Observations as aggregates of single records should be considered during the construction of train/test data, as well. When pixel-level data are drawn from images for the construction of a pixel-level classifier, for instance, it makes sense to avoid having some pixels from a given image serve as training observations while other pixels from that same image serve as validation observations. Entire images should be labeled as "train" or "test", and pixels drawn as observations accordingly, to avoid "cheating" during model construction based on the inherent redundancy in image data.
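A minimal sketch of assigning whole images, rather than individual pixels, to train or test is shown below. The function `split_by_group`, the `image_id` field, and the 30% test fraction are assumptions for illustration.

```python
import random


def split_by_group(records, group_key, test_fraction=0.3, seed=0):
    """Assign every record of a group (e.g. all pixels of one image) to the
    same side of the train/test split (hypothetical helper)."""
    groups = sorted({record[group_key] for record in records})
    rng = random.Random(seed)
    rng.shuffle(groups)                       # randomize which groups go to test
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Synthetic data: 10 images with 100 pixel records each.
pixels = [{"image_id": img, "pixel": p} for img in range(10) for p in range(100)]
train, test = split_by_group(pixels, group_key="image_id", test_fraction=0.3)

train_images = {r["image_id"] for r in train}
test_images = {r["image_id"] for r in test}
print(len(train), len(test), train_images & test_images)  # 700 300 set()
```

Because the split is made on image identifiers first, no image contributes pixels to both sets, whichever images the shuffle happens to select.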


Conclusion

This posting has only briefly touched on some of the issues which arise when attempting to measure the volume of data in one's possession, and has not explored yet more subtle concepts such as sampling techniques, observation weighting or model performance measures. Hopefully though, it gives the reader some things to think about when assessing data sets in terms of their size and quality.

Wednesday, January 06, 2010

Data Mining and Terrorism... Counterpoint

In a recent posting to this Web log (Data Mining and Privacy...again, Jan-04-2010), Dean Abbott made several points regarding the use of data mining to counter terrorism, and related privacy issues. I'd like to address the question of the usefulness of data mining in this application.

Dean quoted Bruce Schneier's argument against data mining's use in anti-terrorism programs. The specific technical argument that Schneier has made (and he is not alone in this) is: Automatic classification systems are unlikely to be effective at identifying individual terrorists, since terrorists are so rare. Schneier concludes that the rate of "false positives" could never be made low enough for such a system to work effectively.

As far as this specific technical line of thought goes, I agree absolutely, and doubt that any competent data analyst would disagree. It is the extension of this argument to the much broader conclusion that data mining is not a fruitful line of inquiry for those seeking to oppose terrorists that I take issue with.

Many (most?) computerized classification systems in practice output probabilities, as opposed to simple class predictions. Owners of such systems use them to prioritize their efforts (think of database marketers who sort name lists to find the names most likely to respond to an offer). Classifiers need not be perfect to be useful, and portraying them as though they must be perfect is what I call the "Minority Report strawman".

Beyond this, data mining has been used to great effect in rooting out other criminal behaviors, such as money laundering, which are associated with terrorism. While those who practice our art against terrorism are unlikely to be forthcoming about their work, it is not difficult to imagine data mining systems other than classifiers being used in this struggle, such as analysis on networks of associates of terrorists.

It would take considerable naivety to believe that present computer systems could be trained to throw up red flags on a small number of individuals, previously unknown to be terrorists, with any serious degree of reliability. Given the other chores which data mining systems may perform in this fight, I think it is equally naive to abandon that promise for an overextended technical argument.

Monday, January 04, 2010

The Next Predictive Analytics World

Just a reminder that the next Predictive Analytics World is coming in another 6 weeks--Feb 16-17 in San Francisco.

I'll be teaching a pre-conference Hands-On Predictive Analytics workshop using SAS Enterprise Miner on the 15th, and presenting a text mining case study on the 16th.

For any readers here who may be going, feel free to use this discount code during registration to get a 15% discount off the 2-day conference: DEANABBOTT010

Hope to see you there.

Data Mining and Privacy...again

A Google search tonight on "data mining" referred to the latest DHS Privacy Office 2009 Data Mining Report to Congress. I'm always nervous when I see "data mining" in titles like this, especially when linked to privacy, because of the misconceptions about what data mining is and does. I have long contended that data mining only does what humans would do manually if they had enough time to do it. What most privacy advocates are really complaining about is the data that one has available to make the inferences from; data mining just makes those inferences more efficiently.

What I like about this article are the common-sense comments made. Data mining on extremely rare events (such as terrorist attacks) is very difficult because there are not enough examples of the patterns to have high confidence that the predictions are not by chance. Or as it is stated in the article:

Security expert Bruce Schneier explains well. When searching for a needle in a haystack, adding more "hay" does no good at all. Computers and data mining are useful only if they are looking for something relatively common compared to the database searched. For instance, out of 900 million credit cards in the US, about 1% are stolen or fraudulently used every year. One in a hundred is certainly the exception rather than the rule, but it is a common enough occurrence to be worth data mining for. By contrast, the 9-11 hijackers were a 19-man needle in a 300 million person haystack, beyond the ken of even the finest super computer to seek out. Even an extremely low rate of false alarms will swamp the system.

Now this is true for the most commonly used data mining techniques (predictive models like decision trees, regression, neural nets, SVM). However, there are other techniques that are used to find links between interesting entities that are extremely unlikely to occur by chance. This isn't foolproof, of course, but while there will be lots of false alarms, they can still be useful. Again from the enlightened layperson:

An NSA data miner acknowledged, "Frankly, we'll probably be wrong 99 percent of the time . . . but 1 percent is far better than 1 in 100 million times if you were just guessing at random."
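To make the "false alarms will swamp the system" point concrete, here is a small worked calculation. The population and target counts come from the quote above; the detection rate (100%) and the false alarm rate (1 in 10,000) are purely hypothetical values chosen for illustration, as is the function name.

```python
def alarm_summary(population, true_targets, detection_rate, false_alarm_rate):
    """Expected true alarms, false alarms, and the share of alarms that are
    correct (precision) for a rare-event classifier (illustrative only)."""
    true_alarms = true_targets * detection_rate
    false_alarms = (population - true_targets) * false_alarm_rate
    precision = true_alarms / (true_alarms + false_alarms)
    return true_alarms, false_alarms, precision


# 19 targets in 300 million people; the two rates below are assumptions.
true_alarms, false_alarms, precision = alarm_summary(
    population=300_000_000,
    true_targets=19,
    detection_rate=1.0,        # assume every real target is flagged
    false_alarm_rate=0.0001,   # assume 1 false alarm per 10,000 innocents
)
print(round(false_alarms), precision)
```

Even under these generous assumptions, roughly 30,000 people are flagged in error for 19 flagged correctly, so fewer than 1 alarm in 1,000 is right. That is still far better than random guessing, which is the point the NSA quote makes.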

It's not as if this were a new topic. From the Cato Institute, this article describes the same phenomenon, and links to a Jeff Jonas presentation that describes how good investigation would have linked the 9/11 terrorists (rather than using data mining). Fair enough, but analytic techniques are still valuable in removing the chaff--those individuals or events that are very uninteresting. In fact, I have found this to be a very useful approach to handling difficult problems.

Tuesday, December 29, 2009

2009 Retrospective

I was thinking about top data mining trends in 2009, and searched for what others thought about it. I'll combine a few 2009 "top 3" lists here, including top trends (as described at Enterprise Regulars here), and posts here that generated the most buzz.

First, the top data mining news story was IBM's purchase of SPSS. It will be very interesting to see if this continues the trend toward integration of Business Intelligence and Predictive Analytics that one sees with SAS, Tibco and now IBM/SPSS.

The Enterprise Regulars post included a few interesting 2010 trends (appropriately enough, since data mining is all about using historical data to make predictions of future behavior, assuming past behavior will continue). In particular, there are 4 mentioned that were of interest to me:
  1. The holy grail of the predictive, real-time enterprise (his #2)
  2. SaaS / Cloud BI Tools will steal significant revenue from on-premise vendors but also fight for limited oxygen amongst themselves. (his #5)
  3. Advanced Visualization will continue to increase in depth and relevance to broader audiences. (his #7)
  4. Open Source offerings will continue to make in-roads against on-premise offerings. (his #8)
I agree with his #2 and #7 (integration of BI/PA and visualization). Several customers I work with are trying to integrate predictive analytics into the database to make better decisions. The difference now is that there is also interest in integrating this process with other data-centric (BI) operations to provide the right information to decision-makers with the right level of granularity (detail). This is typically a combination of creating the ability to perform ad hoc queries along with examining the results (rankings and projections) from predictive analytics.

However, I have not seen Cloud computing and Open Source take off from the perspective of customers I work with. These two certainly have generated buzz, and in the courses I teach, there is considerable interest in open source computing (R in particular), but it has still been interest rather than action. I expect though that as the allure of data mining and predictive analytics extends its reach deeper into organizations, the need for inexpensive tools (in dollars) will result in increased use of the open source and free tools, such as R, RapidMiner, Weka, Tanagra, Orange, Knime, and others. Lastly, from this blog, the top posts of 2009 were
  1. Why normalization matters with K-Means
  2. How many software packages are too much?
  3. Data Mining: Does it get any better than this?
  4. Text Mining and Regular Expressions

Happy New Year!

Tuesday, December 15, 2009

Overlap in the Business Intelligence / Predictive Analytics Space

I've received considerable feedback on the post Business Intelligence vs. Business Analytics, which has also caused me to think more about the BI space and its overlap with data mining (DM) / predictive analytics (PA) / business analytics (BA). One place to look for this, of course, is with Gartner, how they define Business Intelligence, and which vendors overlap between these industries. (I think of this in much the same way as I do DM; I look to data miners to define themselves and what they do rather than to other industries and how they define data mining).

I found the Gartner Magic Quadrant for Business Intelligence in 2009 here, and was very curious to understand (1) how they define BI, and (2) which BI players are also big players in the data mining space. Answering the first question, data analysis in the BI world is defined here as comprising four parts: OLAP, visualization, scorecards, and data mining. So DM in this view is a subset of BI.

Second, the key players in the quadrant interestingly include only a few vendors I would consider to be top data mining vendors: SAS, Oracle, IBM (Cognos), and Microsoft in the "Leaders" category, and Tibco in the "Visionaries" category. Of these, only SAS (with Enterprise Miner) and Microsoft (SQL Server) showed up in the top 10 of the Rexer Analytics 2008 software tool survey, though Tibco showed up in the top 20 (with Tibco Spotfire Miner).

I think this emphasizes again that BI and DM/PA/BA approach analysis differently, even if the end result is the same (a scorecard, dashboard, report, or transactional decisioning system).

Sunday, December 06, 2009

Business Analytics vs. Business Intelligence

I used to be one that thought the term "data mining" would stay as the description of the kind of analytic work I do. To a large degree it has, but there are always new spins on things, and it seems that quite often in the business world, Predictive Analytics or Business Analytics are the terms of the day.

I just came across this post from the Smart Data Collective: OLAP is Dead (Long Live Analytics), which had some fascinating graphs on hits related to the phrases OLAP and Analytics. The first shows the steady decline of OLAP as a searched term to the point where even the OLAP Report has been renamed The BI Verdict. Meanwhile, "analytics" has been increasing steadily in hits. SAS even touts itself as a leader in "Business Analytics" now.

Which brings me to the question in the title of this post. It seems to me that Business Intelligence has taken over the role that OLAP and dashboarding used to take on (at least in the circles I worked in). Is there a difference between Business Intelligence and Business Analytics? James Taylor, someone whom I respect tremendously, doesn't think so.
As SAS talked about its business analytics framework it became clear that they envision the results of data mining and predictive analytics (where they genuinely have offerings superior to almost everyone) will be delivered in reports or dashboards. This is what I have somewhat dismissively called "predictive reporting" and while it is better than purely historical reporting, it does not do much to make every decision analytically based as it leaves out the decisions made by machines (which don't read reports) and those made by people with too little time to read a report (most call center or retail staff, for instance) or no skill at interpreting it.

I guess I just don't see the difference between BI and BA...

If all of business analytics is reduced to "predictive reporting", then I can see why some might consider it no more than business intelligence. But even so, are they the same? I don't mean are the results the same either. For that matter, the final decisions from analytics for say classification look just the same as a human decision (buy or not buy? fraud or not?). But is the process the same? I would argue "no". Much of the power of predictive analytics comes from the automation in searching for and assessing nonlinearities, interaction effects, and combinatorics relating observables to outcomes. So, rather than manually assessing these, one automates the process through the use of "decision trees", "neural networks", or some other algorithm. So the difference lies in efficiency in the process.

Now how the predictive information is used, in a report, as part of an automated system or in some other way, is a critically important question, but independent of how the decisions are generated.

Tuesday, December 01, 2009

Computer Science and Theology

I have been reading a book by Don Knuth called Things a Computer Scientist Rarely Talks About (Center for the Study of Language and Information - Lecture Notes)--a very good read for those of you interested in theology as well as analytics. This post is not about the theology of the book (as interesting as that is to me), but rather the reason described in this book for his writing of another book called 3:16, a study of all the 3:16 verses in the Bible. In his chapter on randomized testing (I like to think of model ensembles here), he describes how random sampling is a good way to get an idea of the content of "stuff", whether computer science assignments (he actually does this--randomly take page X of a project and look at that in depth), or understanding books (like the Bible). His 3:16 book takes this verse from every book in the Bible to get a sense of the overall message of the Bible. He admittedly chose 3:16 because of John 3:16 so that he would get at least one great verse, but this was a concession to making the book marketable.

At first I wasn't a big fan of this idea. After all, it is a small sample. But he describes how he then studied these verses in depth. Whereas his prior understanding of the Bible was vague and general (which has its positive points), this exercise led also to a deeper (albeit narrow) understanding as well. I recommend this approach very much.

What does this have to do with analytics? Data Mining often is viewed as a way to get the gist of your data, see the big picture, understand patterns through summarized views. But just as important is the deep view, looking at a few examples (prototypes) in depth. In the text mining project I'm working on right now, while we extract "concepts", much of our time is also spent tracing a few text blocks through the processing to understand in detail why the analytics is working the way it does. I'm a "both / and" kind of guy, so this suits me well; big picture analytics as well as deep dives into record-level descriptions.

Monday, November 23, 2009

Stratified Sampling vs. Posterior Probability Thresholds

One of the great things about conferences like the recent Predictive Analytics World is how many technical interactions one has with top practitioners; this past October was no exception. One such interaction was with Tim Manns who blogs here. We were talking about Clementine and what to do with small populations of 1s in the target variable, which prompted me to jump onto my soapbox with an issue that I had never read about, but which occurs commonly in data mining problems such as response modeling and fraud detection.

The setup goes something like this: you have 1% responders, you build models, and the model "says" every record is a 0. My explanation for this was always that errors in classification models take place when the same pattern of inputs can produce both outcomes. In this situation, what is the best guess? The most commonly occurring output variable value. If you have 99% 0s, that is most likely a 0, and therefore data mining tools will produce the answer "0". The common solution to this is to resample the data (stratify) so that one has equal numbers of 0s and 1s in the data, and then rebuild the model. While this is true, it misses an important factor.

I can't claim credit for this (thanks Marie!). I was working on a consulting project with a statistician, and when we were building logistic regression models, I recommended resampling so we don't have the "model calls everything a 0" problem. She seemed puzzled by this, and asked why not threshold at the prior probability level. It was clear right away that this is true, and I've been doing it ever since (with logistic regression or neural networks in particular).

What was she saying? First, it needs to be stated that no algorithm produces "decisions". Logistic regression produces probabilities. Neural networks produce confidence values (though I just had a conversation with one of the smartest machine learning guys I know who talked about neural networks producing true probabilities--maybe I'll blog on this more another time). The decisions that one sees ("all records are called 0s") are produced by the software, interpreting the probabilities or confidence values by thresholding them at 0.5. It is always 0.5. I don't think I've ever found a data mining software package that doesn't threshold at 0.5, in fact. So the software expects the prior probabilities of 0s and 1s to be equal. When they are not (like with 99% 0s and 1% 1s), this threshold is completely inappropriate; the distribution of predicted probabilities will center roughly on the prior probability level (0.01 for the 1% response rate problem). I show some examples of this in my data mining course that make this clearer.

So what can one do? If one thresholds at 0.01 rather than 0.5, one gets a nice confusion matrix out of the classification problem. Of course if you use a ROC curve, Lift Chart or Gains Chart to assess your model, you don't worry about thresholding anyway.
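A minimal sketch of the difference between the two thresholds follows. The predicted probabilities are made up for illustration (they are not output from any real model), and `confusion_matrix` is a hypothetical helper rather than a function from a particular tool.

```python
def confusion_matrix(probabilities, actuals, threshold):
    """Count true/false positives and negatives when a record is called a 1
    whenever its predicted probability is at or above the threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, y in zip(probabilities, actuals):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Synthetic 1% response problem: 10 responders among 1,000 records.
# Made-up model output: probabilities cluster near the prior, none near 0.5.
actuals = [1] * 10 + [0] * 990
probabilities = [0.05] * 10 + [0.02] * 90 + [0.005] * 900

prior = sum(actuals) / len(actuals)   # proportion of 1s in the data: 0.01
at_half = confusion_matrix(probabilities, actuals, threshold=0.5)
at_prior = confusion_matrix(probabilities, actuals, threshold=prior)
print(at_half)   # every record is called a 0
print(at_prior)  # all 10 responders are flagged, with 90 false positives
```

The model and its probabilities are identical in both cases; only the cutoff used to turn probabilities into decisions changes, and no resampling is needed.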

Which brings me to the conversation with Tim Manns. I'm glad he tried it out himself, though I don't think one has to make the target variable continuous to make this work. Tim did his testing in Clementine, but the same holds for any other data mining software tool. What Tim's trick does is correct: if you make the [0,1] target variable numeric, you can build a neural network just fine and the predicted value is "exposed". In Clementine, if you keep it as a "flag" variable, you would threshold the propensity value ($NRP-target).

So, read Tim's post (and his other posts!). This trick can be used with nearly any tool--I've done it with Matlab and Tibco Spotfire Miner, among others.

Now, if tools would only include an option to threshold the propensity at 0.5 or the prior probability (or more precisely, the proportion in the training data).