Wednesday, February 17, 2010

Predictive Analytics World Recap

Predictive Analytics World (PAW) just ended today, and here are a few thoughts on the conference.

PAW was a bigger conference than October's or last February's and it definitely felt bigger. It seemed to me that there was a larger international presence as well.

Major data mining software vendors included the ones you would expect (in alphabetical order to avoid any appearance of favoritism): Salford Systems, SAS, SPSS (an IBM company), Statsoft, and Tibco. Others who were there included Netezza (a new one for me--they have an innovative approach to data storage and retrieval), SAP, Florio (another new one for me--a drag-and-drop simulation tool) and REvolution.

One surprise to me was how many text mining case studies were presented. John Elder rightfully described text mining as "the wild west" of analytics in his talk and SAS introduced a new initiative in text analytics (including sentiment analysis, a topic that came up in several discussions I had with other attendees).

A second theme, emphasized by Eric Siegel in the keynote and discussed in a technical manner by Day 2 keynote speaker Kim Larsen, was uplift modeling, or as Larsen described it, Net Lift modeling. This approach makes a lot of sense: rather than considering just responders, one should set up the data to identify those individuals who respond because of the marketing campaign, and not bother those who would respond anyway. I'm interested in understanding the particular way that Larsen approaches Net Lift models with variable selection and a variant of Naive Bayes.

But for me, the key is setting up the data right and Larsen described the data particularly well. A good campaign will have a treatment set and a control set, where the treatment set gets the promotion or mailing, and the control set does not. There are several possible outcomes here. First, in the treatment set, there are those individuals who would have responded anyway, those who respond because of the campaign, and those who do not respond. For the control set, there are those who respond despite not receiving a mailing, and those who do not. The problem, of course, is that in the treatment set, you don't know which individuals would have responded if they had not been mailed, but you suspect that they look like those in the control set who responded.
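To make the treatment/control setup concrete, here is a minimal sketch in Python. It is not Larsen's Net Lift method (which involves variable selection and a Naive Bayes variant); it only illustrates the data layout and the simplest possible estimate: the response rate in the treatment set minus the response rate in the control set, computed per segment. The records, segment names, and the function name `net_lift_by_segment` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical campaign records: (segment, group, responded).
# "treatment" rows received the mailing; "control" rows did not.
records = [
    ("new_customers", "treatment", 1),
    ("new_customers", "treatment", 1),
    ("new_customers", "treatment", 1),
    ("new_customers", "treatment", 0),
    ("new_customers", "control", 1),
    ("new_customers", "control", 0),
    ("new_customers", "control", 0),
    ("new_customers", "control", 0),
    ("loyal_customers", "treatment", 1),
    ("loyal_customers", "treatment", 1),
    ("loyal_customers", "treatment", 0),
    ("loyal_customers", "treatment", 0),
    ("loyal_customers", "control", 1),
    ("loyal_customers", "control", 1),
    ("loyal_customers", "control", 0),
    ("loyal_customers", "control", 0),
]


def net_lift_by_segment(records):
    """Return {segment: treatment response rate - control response rate}."""
    # For each segment and group, keep [number of responders, number of records].
    counts = defaultdict(lambda: {"treatment": [0, 0], "control": [0, 0]})
    for segment, group, responded in records:
        tally = counts[segment][group]
        tally[0] += responded
        tally[1] += 1

    lift = {}
    for segment, groups in counts.items():
        t_resp, t_n = groups["treatment"]
        c_resp, c_n = groups["control"]
        if t_n == 0 or c_n == 0:
            continue  # lift cannot be estimated without both groups
        lift[segment] = t_resp / t_n - c_resp / c_n
    return lift


for segment, value in net_lift_by_segment(records).items():
    print(f"{segment}: net lift = {value:+.2f}")
```

In this toy data the hypothetical "loyal_customers" segment responds at the same rate with or without the mailing, so its net lift is zero even though its raw response rate is high. That is exactly the group a response-only model would target and a net lift view would leave alone.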

A third area that struck me was that of big data. There was a session (that I missed, unfortunately) on in-database vs. in-cloud computing (by Neil Raden of Hired Brains), and Robert Grossman's talk on building and maintaining 10K predictive models. I believe this latter approach is the one we will move toward as data size increases, where the multiple models are customized by geography, product, demographic group, etc.

I enjoyed the conference tremendously, including the conversations with attendees. One conversation of note concerned the use of ensembles of clustering models, a topic that I hope will be presented at a future PAW.

4 comments:

Rama Ramakrishnan said...

This is a very helpful summary of the macro themes, thank you.

Did you see much conversation/interest around the use of "social network data" to improve prediction effectiveness?

Lou Bajuk-Yorgan said...

Dean

Great summary of the conference. I agree, this conference definitely felt bigger than the last one, and it was good to see a larger presence by the end users (in addition to vendors and consultants).

In terms of key themes, another one that struck me was the importance of operationalizing predictive analytics. In his opening keynote speech, Eric Siegel (the conference chair) saw the most important innovation in the field of predictive analytics as its application to operational decisions (as opposed to more established application areas such as customer churn & product recommendations). In a later talk, James Taylor of Decision Management Solutions (and co-author of the great book “Smart (Enough) Systems”) echoed Eric’s emphasis on operational results, encapsulated in the phrase “Action support, not just decision support.” James advised building an analytic platform that focused on the end game: the need to operationalize analytic decisions.

This is great validation for us, since operationalizing analytics is at the heart of TIBCO’s vision for its combined platform with Spotfire and S+ (as shown in products like Operations Analytics).

Evan Stubbs said...

I'm still curious to know why more people don't use text analytics as part of their standard process. There's no good reason on the technology side - it's been fairly well integrated into most decent tools for quite a while. My anecdotal experience is that people seem to think it's in the "too hard" basket, but that doesn't make sense to me; if you've got data, you should be using it. Text is just another field type - why not include it?

Dean Abbott said...

Evan:

There is such a wide range of "maturity levels" of text that one can use for data mining that it is hard to generalize here. In the applications I have worked on, it has taken months to either extract the text well enough for analytics to use it effectively, or to present the text concepts in a way that was useful for the data mining algorithms.

The application I presented at PAW had this very problem: the straightforward text extraction did nothing for us; it was only after thinking more about how the text keywords could be used that we found an effective solution.

That said, text is being integrated into data mining software so well now that it is worth giving it a shot even if one is on a short time schedule, so in that sense I agree with you. I'm just not as optimistic about how effective it would be in "version 1".