Dean, your comments in data mining and software development are interesting. At this point, I largely use my own MATLAB code for data mining. I have access to the Statistics and Curve Fitting Toolboxes, which provide some modeling capability and some useful utility functions. My experience is that, very often I need something which commercial tools (at the convenient interface level) do not provide. With MATLAB, once I have the data, I can prepare the data, perform modeling and report and graph results all under one roof. MATLAB-specific benefits aside, the same sort of thing could be done in other, more conventional languages like Fortran, Java or C++, perhaps with libraries like those from IMSL.
The dark side is the responsibility. I have to do all the things which the commercial shells do, such as manage the data. Occasionally I even need to manage the RAM, on really big problems. My current work machine is a Windows workstation with 2GB RAM (soon to move to a faster machine with 4GB RAM). While I have much more flexibility than the commercial tools provide, sometimes my fingers bleed (figuratively, not literally- yuck) taking care of all the details.
Still, once a decent code base is established, it isn't so bad. For instance, my feature selection process at this point is fairly efficient and robust, being implemented as a few MATLAB functions.
Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Friday, October 27, 2006
data mining and software development
I've been posting a bit at the yahoo group "datamining2"--we'll see how interesting that group is. I recently responded to a post about java and data mining, and even found another blog that discussed that very issue earlier this week (I just found that post today) -- http://dataminingresearch.blogspot.com/
I don't code anymore, at least not seriously. One reason for that is because the data mining software environments have progessed to the point that I don't need to dust off my C/C++ skills (or lack thereof). And in those relatively rare cases that I do need to program, I can use 4th generation languages, which are quite powerful (if I can ever remember the syntax, but that's another story altogether). Nearly every data mining software package has it's own language: S-Plus command line for S-Plus and Insightful Miner, Clem for Clementine, CART has its own language, Matlab, Visual Basic for Statistica, and of course SAS in Enterprise Miner. This is just naming a few, of course.
I don't code anymore, at least not seriously. One reason for that is because the data mining software environments have progessed to the point that I don't need to dust off my C/C++ skills (or lack thereof). And in those relatively rare cases that I do need to program, I can use 4th generation languages, which are quite powerful (if I can ever remember the syntax, but that's another story altogether). Nearly every data mining software package has it's own language: S-Plus command line for S-Plus and Insightful Miner, Clem for Clementine, CART has its own language, Matlab, Visual Basic for Statistica, and of course SAS in Enterprise Miner. This is just naming a few, of course.
Cluster Ensembles
This past week I received the November 2006 issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence, and found very interesting the article "Evaluation of Stability of k-Means Cluster Ensembles with Respect to Random Initialization". This is something that I have thought about, but (to my discredit) haven't read on or even experimented with beyond very simple case studies.
If it of course the logical extension of the ensemble techniques that have been used for the past decade. The method that I found most accessible was to (1) resample the data with bootstrap samples, (2) create k-means cluster models for each sample, and (3) use the cluster labels to associate with each record (at this point, you have R records, M fields used to build the clusters, and P cluster models, one new field for each model). Finally, you can built a hierarchical clustering model based on records and the new "P" fields.
More on this after some experiments.
If it of course the logical extension of the ensemble techniques that have been used for the past decade. The method that I found most accessible was to (1) resample the data with bootstrap samples, (2) create k-means cluster models for each sample, and (3) use the cluster labels to associate with each record (at this point, you have R records, M fields used to build the clusters, and P cluster models, one new field for each model). Finally, you can built a hierarchical clustering model based on records and the new "P" fields.
More on this after some experiments.
Thursday, October 19, 2006
Data Mining vs. Predictive Analytics
I find that the terminology associated with specialized fields like data mining very interesting to track. My first boss, Roger Barron (better described as a mentor and later truly a friend--I owe much of who I am as a professional to him), used to talk of the transitions of terminology in technology: bionics, cybernetics, artificial intelligence, neural networks, etc.
I find that data mining and predictive analytics fall into the same category--they are the same basic technology but described from different perspectives. Sometimes colleagues have tried to point out distinctions, and I think one of the better ones was posted by Eric King here, where my definition of "better" means simple and clear.
Predictive analytics is a term I see more in the CRM and database worlds (TDWI conferences come to mind). Perhaps some of this is due to the encroachment of BI into the data mining world, where queries and OLAP are sometimes called data mining (after all, you are "drilling" down into the data!). This would necessitate creating further distinctions in terminology.
However, I don't see data mining losing hold on the style of predictive modeling that is largely empirical and data driven. So I include predictive analytics in the title of this blog as an alternative to data mining in name only, not in purpose.
I find that data mining and predictive analytics fall into the same category--they are the same basic technology but described from different perspectives. Sometimes colleagues have tried to point out distinctions, and I think one of the better ones was posted by Eric King here, where my definition of "better" means simple and clear.
Predictive analytics is a term I see more in the CRM and database worlds (TDWI conferences come to mind). Perhaps some of this is due to the encroachment of BI into the data mining world, where queries and OLAP are sometimes called data mining (after all, you are "drilling" down into the data!). This would necessitate creating further distinctions in terminology.
However, I don't see data mining losing hold on the style of predictive modeling that is largely empirical and data driven. So I include predictive analytics in the title of this blog as an alternative to data mining in name only, not in purpose.
How to doom data mining solutions before even beginning to build models
I was reminding today while speaking with an email marketing expert of the reason many data mining projects fail. It is usually the case that in developing a data mining approach to solve a business objective that there is a disconnect between the two. When data mining algorithms look at data, they are thinking in terms like "minimum squared error", or "R-squared", or "Percent Correct Classification".
These are usually of little importance to the business objective, which may be to find a population of customers who will purchase at least $100 of goods, or respond at a rate greater than 8% to a campaign. In these cases, a model that performs "well" in the algorithm's view may not be particular good at identifying the top-tier responders. Therefore, the problem should be set up with the business objective in mind, not the data mining algorithm's objective in mind, and the models should be assessed using a metric that matches as closely as possible to the business objective.
These are usually of little importance to the business objective, which may be to find a population of customers who will purchase at least $100 of goods, or respond at a rate greater than 8% to a campaign. In these cases, a model that performs "well" in the algorithm's view may not be particular good at identifying the top-tier responders. Therefore, the problem should be set up with the business objective in mind, not the data mining algorithm's objective in mind, and the models should be assessed using a metric that matches as closely as possible to the business objective.