Wednesday, February 17, 2010

Predictive Analytics World Recap

Predictive Analytics World (PAW) just ended today, and here are a few thoughts on the conference.

PAW was a bigger conference than October's or last February's, and it definitely felt bigger. It also seemed to me that there was a larger international presence.

Major data mining software vendors included the ones you would expect (in alphabetical order to avoid any appearance of favoritism): Salford Systems, SAS, SPSS (an IBM company), Statsoft, and Tibco. Others who were there included Netezza (a new one for me--they have an innovative approach to data storage and retrieval), SAP, Florio (another new one for me--a drag-and-drop simulation tool) and REvolution.

One surprise to me was how many text mining case studies were presented. John Elder rightly described text mining as "the wild west" of analytics in his talk, and SAS introduced a new initiative in text analytics (including sentiment analysis, a topic that came up in several discussions I had with other attendees).

A second theme, emphasized by Eric Siegel in the keynote and discussed in a technical manner by Day 2 keynote speaker Kim Larsen, was uplift modeling, or as Larsen described it, Net Lift modeling. This approach makes so much sense: rather than modeling responders alone, one sets up the data to identify those individuals who respond because of the marketing campaign, and avoids bothering those who would have responded anyway. I'm interested in understanding the particular way that Larsen approaches Net Lift models with variable selection and a variant of Naive Bayes.

But for me, the key is setting up the data right, and Larsen described the data particularly well. A good campaign will have a treatment set and a control set, where the treatment set gets the promotion or mailing and the control set does not. There are several possible outcomes here. In the treatment set, there are those individuals who would have responded anyway, those who respond because of the campaign, and those who do not respond. In the control set, there are those who respond despite not receiving a mailing, and those who do not. The problem, of course, is that in the treatment set you don't know which individuals would have responded had they not been mailed, but you suspect they look like those in the control set who responded.
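To make that treatment/control accounting concrete, here is a minimal sketch in Python. This is not Larsen's method, just the basic bookkeeping, and it assumes a pandas DataFrame with hypothetical columns "treated" and "responded":

import pandas as pd

def net_lift_summary(df):
    """Compare response rates of the treatment and control groups."""
    rates = df.groupby("treated")["responded"].mean()
    treatment_rate = rates.loc[1]  # responders-anyway plus responders due to the campaign
    control_rate = rates.loc[0]    # responders-anyway (no mailing received)
    return pd.Series({
        "treatment_response_rate": treatment_rate,
        "control_response_rate": control_rate,
        "net_lift": treatment_rate - control_rate,  # responses attributable to the campaign
    })

# Toy example: 8 individuals, half mailed, half held out as controls
campaign = pd.DataFrame({
    "treated":   [1, 1, 1, 1, 0, 0, 0, 0],
    "responded": [1, 1, 0, 0, 1, 0, 0, 0],
})
print(net_lift_summary(campaign))

The difference between the two response rates estimates the responses generated by the campaign itself, which is the quantity a net lift model tries to predict at the individual level.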

A third area that struck me was that of big data. There was a session (that I missed, unfortunately) on in-database vs. in-cloud computing (by Neil Raden of Hired Brains), and Robert Grossman's talk on building and maintaining 10K predictive models. I believe this latter application is the approach we will move toward as data size increases, with the multiple models customized by geography, product, demographic group, etc.
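The mechanics of that approach are straightforward, even if managing thousands of models is not. Below is a rough Python sketch of fitting one model per segment; the column names ("segment", "target") and the use of scikit-learn's logistic regression are illustrative assumptions on my part, not details from Grossman's talk:

import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_segment_models(df, feature_cols, target_col="target", segment_col="segment"):
    """Fit a separate model for each segment (geography, product, etc.)."""
    models = {}
    for segment, group in df.groupby(segment_col):
        model = LogisticRegression(max_iter=1000)
        model.fit(group[feature_cols], group[target_col])
        models[segment] = model
    return models  # e.g., models["northeast"].predict_proba(new_data[feature_cols])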

I enjoyed the conference tremendously, including the conversations with attendees. One of note concerned the use of ensembles of clustering models, a topic I hope will be presented at a future PAW.

Principal Components for Modeling

Problem Statement

Analysts constructing predictive models frequently encounter the need to reduce the size of the available data, both in terms of variables and observations. One reason is that data sets are now available which are far too large to be modeled directly in their entirety using contemporary hardware and software. Another reason is that some data elements (variables) have an associated cost. For instance, medical tests bring an economic and sometimes human cost, so it would be ideal to minimize their use if possible. Another problem is overfitting: Many modeling algorithms will eagerly consume however much data they are fed, but increasing the size of this data will eventually produce models of increased complexity without a corresponding increase in quality. Model deployment and maintenance, too, may be encumbered by extra model inputs, in terms of both execution time and required data preparation and storage.

Naturally, the goal in data reduction is to decrease the size of the needed data while maintaining (as much as possible) model performance, so this process must be performed carefully.


A Solution: Principal Components

Selection of candidate predictor variables to retain (or to eliminate) is the most obvious way to reduce the size of the data. If model performance is not to suffer, though, then some effective measure of each variable's usefulness in the final model must be employed, which is complicated by the correlations among predictors. Several important procedures have been developed along these lines, such as forward selection, backward selection and stepwise selection.
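As one illustration, here is a sketch of forward selection using scikit-learn's SequentialFeatureSelector; the data set, the estimator, and the number of variables to keep are arbitrary choices for the example, not recommendations:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 30 candidate predictors

selector = SequentialFeatureSelector(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    n_features_to_select=5,   # keep only 5 of the 30 predictors
    direction="forward",      # "backward" gives backward elimination
    cv=5,
)
selector.fit(X, y)
print("Selected variable indices:", selector.get_support(indices=True))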

Another possibility is principal components analysis ("PCA" to its friends), a procedure from multivariate statistics which yields a new set of variables (the same number as before), called the principal components. Conveniently, all of the principal components are simply linear functions of the original variables. As a side benefit, all of the principal components are completely uncorrelated. The technical details will not be presented here (see the reference, below), but suffice it to say that if 100 variables enter PCA, then 100 new variables (called the principal components) come out. You may now be wondering where the "data reduction" is. Simple: PCA constructs the new variables so that the first principal component exhibits the largest variance, the second principal component exhibits the second largest variance, and so on.
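Here is a minimal sketch of that behavior in Python using scikit-learn (just one of many packages that will perform PCA); the toy data set is an illustrative assumption:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)       # 30 original variables
X_std = StandardScaler().fit_transform(X)        # PCA is sensitive to variable scaling

pca = PCA()                                      # keep all 30 components for now
scores = pca.fit_transform(X_std)                # each component is a linear function of the originals

print(scores.shape)                              # 30 variables in, 30 components out
print(pca.explained_variance_ratio_[:5])         # share of variance, largest first
print(np.round(np.corrcoef(scores[:, :5], rowvar=False), 3))   # components are uncorrelated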

How well this works in practice depends completely on the data. In some cases, though, a large fraction of the total variance in the data can be compressed into a very small number of principal components. The data reduction comes when the analyst decides to retain only the first n principal components.
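In practice, the retention decision is often made by accumulating the explained variance until it crosses a threshold; the 90% figure below is only an example, and the data set is the same illustrative one as above:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

cumulative = np.cumsum(pca.explained_variance_ratio_)
n = int(np.searchsorted(cumulative, 0.90)) + 1   # first n components reaching 90% of the variance
print(f"Retain the first {n} of {len(cumulative)} components")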

Note that PCA does not eliminate the need for the original variables: they are all still used in the calculation of the principal components, no matter how few of the principal components are retained. Also, statistical variance (which is what is concentrated by PCA) may not correspond perfectly to "predictive information", although it is often a reasonable approximation.


Last Words

Many statistical and data mining software packages will perform PCA, and it is not difficult to write one's own code. If you haven't tried this technique before, I recommend it: It is truly impressive to see PCA squeeze 90% of the variance in a large data set into a handful of variables.
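For those who want to roll their own, here is a from-scratch sketch using only NumPy, built on an eigendecomposition of the correlation matrix (hence the "eigen" terminology noted below); the toy data are random and purely illustrative:

import numpy as np

def pca(X):
    """Return (eigenvalues, eigenvectors, scores), ordered from largest variance to smallest."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize: correlation-matrix PCA
    R = np.cov(Z, rowvar=False)                        # correlation matrix of the original variables
    eigvals, eigvecs = np.linalg.eigh(R)               # eigendecomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1]                  # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                               # principal component scores
    return eigvals, eigvecs, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # correlated toy data
eigvals, _, _ = pca(X)
print("Variance explained by each component:", np.round(eigvals / eigvals.sum(), 3))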

Note: Related terms from the engineering world: eigenanalysis, eigenvector and eigenfunction.


Reference

For the down-and-dirty technical details of PCA (with enough information to allow you to program it yourself), see:

Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for coding PCA, and is at present much cheaper than the second or third editions.

Friday, February 12, 2010

Predictive Analytics World - San Francisco

The next Predictive Analytics World is coming up next week. This is a conference I look forward to very much because of the attendees; I have found that at the first two PAWs there has been a good mix of folks who are experts and those who are spinning up on Predictive Analytics. I'll be teaching a hands-on workshop Monday (using Enterprise Miner), and presenting a talk on using trees to generate business rules for a help-desk text analytics application on Tuesday the 16th. You can still get the 15% discount if you use the registration code DEANABBOTT010 in the registration process (this is not a sales plug--I won't receive any benefit from this).


Look me up if you are going; I will be there both days (16th and 17th).