Tuesday, January 19, 2010

Is there anything new in Predictive Analytics?

Federal Computer Week's John Zyskowski posted an article on predictive analytics on Jan 8, 2010, entitled "Deja vu all over again: Predictive analytics look forward into the past". (Kudos for the great Yogi Berra quote! But beware; as Berra himself said, "I really didn't say everything I said.")

Back to Predictive Analytics...Pieter Mimno is quoted as stating:
There's nothing new about this (Predictive Analytics). It's just old techniques that are being done better.
To support this argument, John quotes me regarding work done at DFAS 10 years ago. Is this true? Is there nothing new in predictive analytics? And if it isn't true, what is new?

I think what is new is not algorithms, but better integration of data mining software into the business environment, primarily in two places: on the front end and on the back end. On the front end, data mining tools are better at connecting to databases now than they were 10 years ago, and they provide the analyst with better tools for assessing the data coming into the software. This has always been a big hurdle, and was the reason that at KDD 1999 in San Diego, the panel discussion on "Data Mining into Vertical Solutions" concluded that data mining functionality would largely be integrated into the database. That integration hasn't happened quite the way it was envisioned 10 years ago, but connecting to and assessing data is clearly much easier to do now.
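As a rough illustration of that front-end step, here is a minimal Python sketch (the database file and the transactions table are hypothetical, for illustration only) that pulls a table straight from a database and profiles what actually arrived:

# Pull data directly from a database and assess it before modeling.
# The database file and table name are assumptions for illustration.
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")          # hypothetical database
df = pd.read_sql_query("SELECT * FROM transactions", conn)
conn.close()

# Quick assessment of the incoming data
print(df.shape)                                  # rows and columns
print(df.dtypes)                                 # data types as read
print(df.isna().mean().sort_values(ascending=False).head(10))  # worst missing-value rates
print(df.describe(include="all").transpose())    # per-column summary statistics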

On the back end, I believe the most significant step forward in data mining tools has been giving the analyst the ability to assess models in a manner consistent with the business objectives of the model. So rather than comparing models based on R^2 or overall classification accuracy, most tools give you the ability to generate an ROI chart or a ROC curve, or to build a custom model assessment based on rank-ordered model predictions. This means that when we convey what models are doing to decision makers, we can do so in the language they understand, and not force them to understand how good an R^2 of 0.4 really is. In addition, data mining tools are increasingly producing scoring code that is usable outside the tool itself, in the form of SQL, SAS code, C or Java, or PMML. What I'm waiting for next is for vendors to provide PMML or other code for all the data prep one does in the tool prior to the model; typically, PMML is generated only for the model itself.
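As a rough illustration of what I mean by assessment based on rank-ordered predictions, here is a minimal Python sketch (the score and outcome columns, and the simulated data, are purely hypothetical) that builds a simple gains/lift table by decile:

# Rank cases by model score and summarize response by decile.
import numpy as np
import pandas as pd

def gains_table(scores, outcomes, n_bins=10):
    df = pd.DataFrame({"score": scores, "outcome": outcomes})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    df["bin"] = pd.qcut(df.index, n_bins, labels=False)        # 0 = top decile
    table = df.groupby("bin")["outcome"].agg(["count", "sum", "mean"])
    table["cum_pct_of_responders"] = table["sum"].cumsum() / table["sum"].sum()
    table["lift"] = table["mean"] / df["outcome"].mean()
    return table

# Example with simulated scores and roughly a 5% base response rate
rng = np.random.default_rng(0)
scores = rng.random(10_000)
outcomes = (rng.random(10_000) < 0.02 + 0.06 * scores).astype(int)
print(gains_table(scores, outcomes))

A table like this is something a decision maker can read directly: how much of the response is captured in the top one or two deciles, and how much better that is than random.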

Sunday, January 10, 2010

Counting Observations

Data is fodder for the data mining process. One fundamental aspect of the data we analyze is its size, which is most often characterized by the number of observations and the number of variables in the given set of data, typically measured as counts of "rows" and "columns", respectively. It is worth taking a closer look at this, though, as questions such as "Do we have enough data?" depend on an apt measure of how much data we have.


Outcome Distributions

In many predictive modeling situations, cases are spread fairly evenly among the possible outcomes, but this is not always true. Many fraud detection problems, for instance, involve extreme class imbalance: target class cases (known frauds) may represent a small fraction of 1% of the available records. Despite having many total observations of customer behavior, observations of fraudulent behavior may be rather sparse. Data miners who work in the fraud detection field are acutely aware of this issue and characterize their data sets not just by the total number of observations, but also by the number of observations of the behavior of interest. When assessing an existing data set, or specifying a new one, such analysts generally employ both counts.
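A minimal sketch of characterizing a data set by both counts, in Python with a hypothetical file and fraud flag column:

import pandas as pd

df = pd.read_csv("transactions.csv")            # hypothetical file
total = len(df)
frauds = int(df["is_fraud"].sum())              # hypothetical 0/1 target flag
print(f"{total:,} total records, {frauds:,} known frauds ({frauds / total:.4%})")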

Numeric outcome variables may also suffer from this problem. Most numeric variables are not uniformly distributed, and regions in which outcome data are sparse (for instance, the long tail of high personal incomes) may be poorly represented in models derived from that data.

With both class and numeric outcomes, it might be argued that outcome values which are infrequent are, by definition, less important. This may or may not be so, depending on the modeling process and our priorities. If the model is expected to perform well on the top personal income decile, then data should be evaluated by how many cases fall in that range, not just on the total observation count.
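A quick coverage check along those lines might look like this minimal sketch (the income column and the decile boundary are hypothetical; in practice the cutoff might come from census or population figures rather than from the sample itself):

import pandas as pd

df = pd.read_csv("customers.csv")               # hypothetical file
TOP_DECILE_CUTOFF = 150_000                     # assumed population figure
n_top = (df["income"] >= TOP_DECILE_CUTOFF).sum()
print(f"{n_top:,} of {len(df):,} records fall in the top income decile")

The absolute count is the number that matters here: a few hundred cases in that range may be too few to support a model expected to perform well there, no matter how large the overall file is.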


Predictor Distributions

Issues of coverage occur on the input variable side as well. Keeping in mind that generalization is the goal of discovered models, the total record count by itself seems inadequate when, for example, data are drawn from a process which has (or may have) a seasonal component. Having 250,000 records in a single data set sounds like a lot, but if they are drawn only from October, November and December, then one might reasonably take the perspective that only 3 "observations" of monthly behavior are represented, out of 12 possibilities. In fact (assuming some level of stability from year to year), one could argue that not only should all 12 calendar months be included, but that they should be drawn from multiple historical years, so that there are multiple observations for each calendar month.

Other groupings of cases in the input space may also be important. For instance, hundreds of observations of retail sales may be available, but if they come from only 25 salespeople out of a sales force of 300, then the simple record count as "observation count" may be deceiving.
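A minimal sketch of this kind of coverage audit, in Python with hypothetical file and column names, simply counts the distinct months and salespeople actually represented:

import pandas as pd

df = pd.read_csv("retail_sales.csv", parse_dates=["sale_date"])  # hypothetical file

months_present = df["sale_date"].dt.month.nunique()
print(f"{len(df):,} records, but only {months_present} of 12 calendar months represented")
print(df["sale_date"].dt.to_period("M").value_counts().sort_index())  # records per year-month

sellers_present = df["salesperson_id"].nunique()
print(f"{sellers_present} distinct salespeople represented")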


Validation Issues

Observations as aggregates of single records should be considered during the construction of train/test data as well. When pixel-level data are drawn from images for the construction of a pixel-level classifier, for instance, it makes sense to avoid having some pixels from a given image serve as training observations while other pixels from that same image serve as validation observations. Entire images should be labeled as "train" or "test", and pixels drawn as observations accordingly, to avoid "cheating" during model construction based on the inherent redundancy in image data.
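One reasonable way to enforce this, sketched below in Python with a hypothetical image_id column, is to split at the image level and let the pixels follow their image (here using scikit-learn's GroupShuffleSplit, though any group-aware split would do):

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

pixels = pd.read_csv("pixel_features.csv")      # hypothetical pixel-level file
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(pixels, groups=pixels["image_id"]))

train, test = pixels.iloc[train_idx], pixels.iloc[test_idx]
# Sanity check: no image contributes pixels to both sides of the split
assert set(train["image_id"]).isdisjoint(set(test["image_id"]))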


Conclusion

This posting has only briefly touched on some of the issues which arise when attempting to measure the volume of data in one's possession, and has not explored yet more subtle concepts such as sampling techniques, observation weighting or model performance measures. Hopefully though, it gives the reader some things to think about when assessing data sets in terms of their size and quality.

Wednesday, January 06, 2010

Data Mining and Terrorism... Counterpoint

In a recent posting to this Web log (Data Mining and Privacy...again, Jan-04-2010), Dean Abbott made several points regarding the use of data mining to counter terrorism, and related privacy issues. I'd like to address the question of the usefulness of data mining in this application.

Dean quoted Bruce Schneier's argument against data mining's use in anti-terrorism programs. The specific technical argument that Schneier has made (and he is not alone in this) is: Automatic classification systems are unlikely to be effective at identifying individual terrorists, since terrorists are so rare. Schneier concludes that the rate of "false positives" could never be made low enough for such a system to work effectively.

As far as this specific technical line of thought goes, I agree absolutely, and doubt that any competent data analyst would disagree. It is the extension of this argument to the much broader conclusion that data mining is not a fruitful line of inquiry for those seeking to oppose terrorists that I take issue with.

Many (most?) computerized classification systems in practice output probabilities, as opposed to simple class predictions. Owners of such systems use them to prioritize their efforts (think of database marketers who sort name lists to find the prospects most likely to respond to an offer). Classifiers need not be perfect to be useful, and insisting that they must be is what I call the "Minority Report" strawman.
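To make that concrete, here is a minimal sketch (in Python, with hypothetical file and column names) of how probabilistic output is actually used: rank the cases by predicted probability and work the top of the list first, rather than treating the classifier as a binary oracle.

import pandas as pd

scored = pd.read_csv("scored_cases.csv")        # hypothetical file with a probability column
work_queue = scored.sort_values("probability", ascending=False).head(500)
print(work_queue[["case_id", "probability"]])   # hand the top of the list to investigators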

Beyond this, data mining has been used to great effect in rooting out other criminal behaviors, such as money laundering, which are associated with terrorism. While those who practice our art against terrorism are unlikely to be forthcoming about their work, it is not difficult to imagine data mining systems other than classifiers being used in this struggle, such as analysis of networks of terrorists' associates.

It would take considerable naivety to believe that present computer systems could be trained to throw up red flags on a small number of individuals, previously unknown to be terrorists, with any serious degree of reliability. Given the other chores which data mining systems may perform in this fight, I think it is equally naive to abandon that promise for an overextended technical argument.

Monday, January 04, 2010

The Next Predictive Analytics World

Just a reminder that the next Predictive Analytics World is coming in another 6 weeks--Feb 16-17 in San Francisco.

I'll be teaching a pre-conference Hands-On Predictive Analytics workshop using SAS Enterprise Miner on the 15th, and presenting a text mining case study on the 16th.

For any readers here who may be going, feel free to use this discount code during registration to get a 15% discount off the 2-day conference: DEANABBOTT010

Hope to see you there.

Data Mining and Privacy...again

A Google search tonight on "data mining" led me to the latest DHS Privacy Office 2009 Data Mining Report to Congress. I'm always nervous when I see "data mining" in titles like this, especially when linked to privacy, because of the misconceptions about what data mining is and does. I have long contended that data mining only does what humans would do manually if they had enough time to do it. What most privacy advocates are really complaining about is the data one has available to make inferences from; data mining just draws those inferences more efficiently.

What I like about this article is its common-sense commentary. Data mining on extremely rare events (such as terrorist attacks) is very difficult because there are not enough examples of the patterns to have high confidence that the predictions are not due to chance. Or as the article states:

Security expert Bruce Schneier explains it well. When searching for a needle in a haystack, adding more "hay" does no good at all. Computers and data mining are useful only if they are looking for something relatively common compared to the database searched. For instance, out of 900 million credit cards in the US, about 1% are stolen or fraudulently used every year. One in a hundred is certainly the exception rather than the rule, but it is a common enough occurrence to be worth data mining for. By contrast, the 9-11 hijackers were a 19-man needle in a 300 million person haystack, beyond the ken of even the finest supercomputer to seek out. Even an extremely low rate of false alarms will swamp the system.
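To make the arithmetic behind that last sentence concrete, here is a back-of-the-envelope sketch; the population and target counts come from the quote, while the false-alarm and detection rates are assumptions chosen to be generously optimistic.

# Even a very accurate screen is swamped by false alarms at this base rate.
population = 300_000_000        # people screened (quoted figure)
true_targets = 19               # actual positives (quoted figure)
false_positive_rate = 0.001     # assumed, optimistically low 0.1% false-alarm rate
true_positive_rate = 0.99       # assumed, optimistically high detection rate

false_alarms = (population - true_targets) * false_positive_rate
hits = true_targets * true_positive_rate
print(f"{false_alarms:,.0f} false alarms for roughly {hits:.0f} true hits")
# Roughly 300,000 false alarms for about 19 hits: nearly every flag is wrong.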

Now this is true for the most commonly used data mining techniques (predictive models such as decision trees, regression, neural networks, and SVMs). However, there are other techniques that are used to find links between interesting entities that are extremely unlikely to occur by chance. This isn't foolproof, of course, and there will still be plenty of false alarms, but the results can still be useful. Again from the enlightened layperson:

An NSA data miner acknowledged, "Frankly, we'll probably be wrong 99 percent of the time . . . but 1 percent is far better than 1 in 100 million times if you were just guessing at random."

It's not as if this were a new topic. From the Cato Institute, this article describes the same phenomenon, and links to a Jeff Jonas presentation that describes how good investigation would have linked the 9/11 terrorists (rather than using data mining). Fair enough, but analytic techniques are still valuable in removing the chaff--those individuals or events that are clearly uninteresting. In fact, I have found this to be a very useful approach to handling difficult problems.