Tuesday, January 19, 2010

Is there anything new in Predictive Analytics?

Federal Computer Week's John Zyskowski posted an article on Jan 8, 2010 on Predictive Analytics entitled "Deja vu all over again: Predictive analytics look forward into the past". (kudos for the great Yogi Berra quote! But beware, as Berra stated himself, "I really didn't say everything I said")

Back to Predictive Analytics...Pieter Mimno is quoted as stating:
There's nothing new about this (Predictive Analytics). It's just old techniques that are being done better.
To support this argument, John quotes me related to work done at DFAS 10 years ago. Is this true? Is there nothing new in predictive analytics? If it isn't true, what is new?

I think what is new is not algorithms, but a better integration of data mining software in the business environment, primarily in two places: on the front end and on the back end. On the front end, data mining tools are better at connecting to databases now compared to 10 years ago, and provide the analyst better tools for assessing the data coming into the software. This has always been a big hurdle, and was the reason that at KDD 1999 in San Diego, the panel discussion on "Data Mining into Vertical Solutions" concluded that data mining functionality would be integrated into the database to a large degree. But while it hasn't happened quite the way it was envisioned 10 years ago, it is clearly much easier to do now.

On the back end, I believe the most significant step forward in data mining tools has been giving the analyst the ability to assess models in a manner consistent with the business objectives of the model. So rather than comparing models based on R^2 or overall classification accuracy, most tools give you the ability to generate an ROI chart, or a ROC curve, or build a custom model assessment engine based on rank-ordered model predictions. This means that when we convey what models are doing to decision makers, we can do so in the language they understanding and not force them to understand how good an R^2 of 0.4 really is. And then, data mining tools are to a greater degree producing scoring code that is usable outside of the tool itself by creating SQL code, SAS code, C or Java, or PMML. What I'm waiting for next is for vendors to provide PMML or other code for all the data prep one does in the tool prior to the model itself; typically, PMML code is generated only for the model itself.

No comments: