Predictive Analytics World (PAW) just ended today, and here are a few thoughts on the conference.
This PAW was larger than October's or last February's, and it definitely felt that way. It seemed to me that there was a larger international presence as well.
Major data mining software vendors included the ones you would expect (in alphabetical order to avoid any appearance of favoritism): Salford Systems, SAS, SPSS (an IBM company), Statsoft, and Tibco. Others who were there included Netezza (a new one for me--they have an innovative approach to data storage and retrieval), SAP, Florio (another new one for me--a drag-and-drop simulation tool) and REvolution.
One surprise to me was how many text mining case studies were presented. John Elder rightly described text mining as "the wild west" of analytics in his talk, and SAS introduced a new initiative in text analytics (including sentiment analysis, a topic that came up in several discussions I had with other attendees).
A second theme, emphasized by Eric Siegel in the keynote and discussed in a technical manner by Day 2 keynote speaker Kim Larsen, was uplift modeling, or as Larsen described it, Net Lift modeling. This approach makes a great deal of sense: rather than modeling responders alone, one should set up the data to identify those individuals who respond because of the marketing campaign, and not bother those who would respond anyway. I'm interested in understanding the particular way that Larsen approaches Net Lift models with variable selection and a variant of Naive Bayes.
But for me, the key is setting up the data correctly, and Larsen described the data setup particularly well. A good campaign will have a treatment set and a control set, where the treatment set gets the promotion or mailing, and the control set does not. There are several possible outcomes here. First, in the treatment set, there are those individuals who would have responded anyway, those who respond because of the campaign, and those who do not respond. For the control set, there are those who respond despite not receiving a mailing, and those who do not. The problem, of course, is that in the treatment set, you don't know which individuals would have responded if they had not been mailed, but you suspect that they look like those in the control set who responded.
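To make the treatment/control setup concrete, here is a minimal sketch in Python. It is not Larsen's method (his talk involved variable selection and a variant of Naive Bayes); it only computes the simplest possible net lift estimate, the treatment response rate minus the control response rate, within each segment. The records, the segment labels "A" and "B", and the function name `net_lift_by_segment` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical campaign records: (segment, group, responded).
# "treatment" individuals received the mailing; "control" individuals did not.
records = [
    ("A", "treatment", 1), ("A", "treatment", 1), ("A", "treatment", 0), ("A", "treatment", 0),
    ("A", "control", 1), ("A", "control", 0), ("A", "control", 0), ("A", "control", 0),
    ("B", "treatment", 1), ("B", "treatment", 1), ("B", "treatment", 1), ("B", "treatment", 0),
    ("B", "control", 1), ("B", "control", 1), ("B", "control", 1), ("B", "control", 0),
]


def net_lift_by_segment(records):
    """Return {segment: treatment response rate - control response rate}."""
    # For each segment and group, track [number of responders, number of individuals].
    counts = defaultdict(lambda: {"treatment": [0, 0], "control": [0, 0]})
    for segment, group, responded in records:
        counts[segment][group][0] += responded
        counts[segment][group][1] += 1

    lift = {}
    for segment, groups in counts.items():
        t_resp, t_n = groups["treatment"]
        c_resp, c_n = groups["control"]
        if t_n == 0 or c_n == 0:
            continue  # no estimate is possible without both groups
        # The control rate stands in for "would have responded anyway".
        lift[segment] = t_resp / t_n - c_resp / c_n
    return lift


lift = net_lift_by_segment(records)
for segment in sorted(lift):
    print(segment, lift[segment])
```

In this made-up data, segment B responds at a higher rate than segment A when mailed, but it responds at the same rate without the mailing, so its net lift is zero. A model trained on responders alone would favor B; the net lift view favors A.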
A third area that struck me was that of big data. There was a session (that I missed, unfortunately) on in-database vs. in-cloud computing (by Neil Raden of Hired Brains), and Robert Grossman's talk on building and maintaining 10K predictive models. I believe this latter approach is the one we will move toward as data size increases, with multiple models customized by geography, product, demographic group, etc.
I enjoyed the conference tremendously, including the conversations with attendees. One conversation of note concerned the use of ensembles of clustering models, a topic that I hope will be presented at a future PAW.