Sunday, April 08, 2007

Future Data Mining Trends

In his latest post, Sandro has a nice summary of future data mining trends. Like him, I don't do much prognosticating, but I do have one prediction that I still think will come true.

First, let me say that of the references provided by Sandro, the Tom Dietterich one is something I like very much, especially his treatment of model ensembles.

I think there was a roundtable discussion at the 1999 or 2000 KDD conference in San Diego on the future of data mining, with particular emphasis on whether data mining would occur inside the database or external to it. The general consensus was that mining would move more inside the database, and frankly I agreed. This has not materialized nearly to the degree I expected, though it has progressed, especially in the past couple of years, with improvements to Oracle Data Miner and SQL Server 2005 Business Intelligence. (I'm not familiar with the current state of DB2 Data Warehouse Edition, and I don't think much work has been done in recent years on the Teradata Warehouse Miner product, formerly TeraMiner.)

However, most folks I know who do data mining still pull data from a data mart or warehouse, build models in a standalone app, and then push models and/or scores back up to the warehouse. I think this work is going to move more and more into the warehouse, either through improved software in the warehouse (like what we're seeing with Oracle and Microsoft) or through improved interfaces to warehouse functions from standalone data mining software. For example, Clementine from SPSS allows you to push database functions back to the database itself rather than operating on data that has been pulled from the warehouse. I've found this speeds up basic data processing considerably. I think the latter route is the more likely area of growth, both in data mining software and in how practitioners use it.
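
To make the difference between pulling data and pushing work back to the database concrete, here is a minimal sketch in Python. It is not how Clementine or any particular product does it; it uses an in-memory SQLite database as a stand-in for the warehouse, and the table name (transactions), the column names, and the function names are all made up for illustration. The first function pulls every row out and aggregates on the client; the second hands the same aggregation to the database as SQL.

```python
import sqlite3


def build_demo_warehouse():
    """Create a tiny in-memory table standing in for a warehouse (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
    rows = [(1, 10.0), (1, 30.0), (2, 5.0), (2, 15.0), (2, 40.0)]
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def average_spend_pulled(conn):
    """Pull every row to the client, then aggregate outside the database."""
    totals = {}
    cursor = conn.execute("SELECT customer_id, amount FROM transactions")
    for customer_id, amount in cursor:
        count, total = totals.get(customer_id, (0, 0.0))
        totals[customer_id] = (count + 1, total + amount)
    return {cid: total / count for cid, (count, total) in totals.items()}


def average_spend_pushed(conn):
    """Push the aggregation into the database; only one row per customer comes back."""
    query = "SELECT customer_id, AVG(amount) FROM transactions GROUP BY customer_id"
    return dict(conn.execute(query).fetchall())


if __name__ == "__main__":
    conn = build_demo_warehouse()
    print(average_spend_pulled(conn))
    print(average_spend_pushed(conn))
```

Both functions return the same per-customer averages. The difference is how much data crosses the wire: the pulled version transfers every transaction row, while the pushed version transfers only one summary row per customer, which is where I'd expect the speedup to come from on a real warehouse.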