Friday, November 03, 2006

In Praise of Simpler Models -- at least in practice

Originally, this was going to be a comment, but it became so long that I'm just posting it.

I'll never forget the hubbub generated by a talk by Pedro Domingos, I think at the 1998 KDD conference in New York City. His talk, "Occam's Two Razors: The Sharp and the Blunt," turned the usually calm and measured data mining crowd into an agitated horde! I mean, there was even yelling (and if you've never been to a conference of statisticians or data miners, you won't know how unbelievable this scene was). This is from the abstract, also posted at Citeseer in the link above:

Occam's razor has been the subject of much controversy. This paper argues
that this is partly because it has been interpreted in two quite different ways,
the first of which (simplicity is a goal in itself) is essentially correct,
while the second (simplicity leads to greater accuracy) is not. The paper
reviews the large variety of theoretical arguments and empirical evidence for
and against the "second razor," and concludes that the balance is strongly
against it.

John Elder provided a very useful and reasoned explanation for the apparent difficulties with Occam's Razor here (in a summary of KDD '98 presented at my favorite conference, the Symposium on the Interface). To quote,

Jianming Ye presented an intriguing alternative metric, Generalized
Degrees of Freedom, which finds the sensitivity of the fitted values to
perturbations in the outputs [5]. This allows complexity to be measured
and compared in an entirely experimental manner [Ye, J. (March 1998). On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93, no. 441, pp. 120-131.]

The key idea is this: just because a model has lots of weights or splits doesn't mean it is acting in a complex way. John has shown subsequently (at the most recent Salford Systems conference in San Diego, March 2006) that, using the GDoF idea, ensembles of trees can actually behave in a less complex way than the individual trees within the ensemble.
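For readers who want to experiment, here is a rough sketch of the Monte Carlo recipe behind GDoF as I understand Ye's paper: perturb the targets slightly, refit, and measure how sensitively each fitted value tracks its own perturbation. The code is Python, the helper names are mine, and a single regression tree stands in for whatever modeling procedure you care about.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def generalized_dof(X, y, fit_predict, n_perturb=50, tau=0.1, seed=0):
    """Monte Carlo estimate of Ye-style Generalized Degrees of Freedom:
    the summed sensitivity of each fitted value to small perturbations
    of its own target value."""
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = np.empty((n_perturb, n))
    fits = np.empty((n_perturb, n))
    for t in range(n_perturb):
        delta = rng.normal(0.0, tau, size=n)  # small noise added to the targets
        deltas[t] = delta
        fits[t] = fit_predict(X, y + delta)   # refit the model, keep fitted values
    # slope of fitted value i against its own perturbation, summed over i
    return sum(np.polyfit(deltas[:, i], fits[:, i], 1)[0] for i in range(n))

def tree_fit_predict(X, y):
    """A regression tree as the stand-in modeling procedure."""
    return DecisionTreeRegressor(max_depth=4).fit(X, y).predict(X)

# Tiny synthetic check: a tree's "effective" complexity is usually far
# from the raw count of its splits.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * rng.normal(size=200)
print(generalized_dof(X, y, tree_fit_predict))
```

Running the same estimator on a bagged ensemble of trees is the natural next experiment, and that is essentially what John's result is about.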

So I agree with the idea that simpler models are usually better, and for that reason I always try linear models first. However, sometimes the best models in terms of accuracy are complex-looking, but in reality behave in relatively simple ways.

In Praise of Simple Models

At a time when ever more subtle and complex modeling techniques are emerging, it is interesting to note the continuing effectiveness of comparatively simple modeling methods, such as logistic regression. In my work in the finance industry over the years, I have found repeated success with such methods. One reason for their success is that data may not be as complex as is commonly believed, especially in high-dimensional spaces; see the writings on Holte's 1R algorithm as an example. Also, labeling a model or modeling technique as "simple" may be deceptive. Outside of clinical or academic settings, the majority of real-world models based on transformed (or even untransformed) linear functions are helped by derived features, which can make the model function very complex in the space of the raw variables.
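To make the 1R point concrete, here is a minimal sketch of the one-rule idea in Python (my own simplified rendering, not Holte's code): bin each attribute, assign each bin its majority class, and keep the single attribute whose rule classifies the training data best.

```python
import numpy as np

def one_r(X, y, n_bins=5):
    """Holte-style 1R: for each attribute, bin it, give each bin the
    majority class, and keep the single attribute whose rule makes the
    most correct training classifications."""
    classes = np.unique(y)
    best = None
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1,
                       0, n_bins - 1)
        rule, correct = {}, 0
        for b in range(n_bins):
            mask = bins == b
            if not mask.any():
                continue
            counts = [(y[mask] == c).sum() for c in classes]
            rule[b] = classes[int(np.argmax(counts))]  # majority class in this bin
            correct += max(counts)
        if best is None or correct > best[2]:
            best = (j, (edges, rule), correct)
    return best  # (attribute index, (bin edges, bin-to-class rule), training hits)
```

On many benchmark datasets a single rule like this gets surprisingly close to far fancier models, which is the point.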

Some observations:

1. Simple models are typically very fast to train, allowing more time for handling other aspects of the modeling problem, such as attribute selection.

2. Most non-technical people are more comfortable with explanations of linear-based models than with any other kind. In regulated industries, this is of tremendous benefit.

3. When most modelers think of "simple models," linear regression, linear discriminants, and logistic regression come to mind, but there are other, less well-known options, such as extreme-value regression (also called complementary log-log regression). Indeed, the transfer function, where one is needed, can be any monotonic function. Further, the linear portion of the model need not be fit by traditional methods: I trained a linear model via a global optimizer to maximize class separation (AUC) for a commercial application (a sketch of the general idea appears after this list). That model was found to be highly effective and has now been in service for 18 months.

4. Complex models can be built out of collections of simple ones. One of my recent responsibilities was to create and maintain a predictive model used in the management of several billion dollars worth of assets. Any individual case falls into one of 20 mutually exclusive segments, each with its own logistic regression (a minimal sketch of that structure closes this post).
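For point 3 above, here is a minimal sketch of fitting a linear scoring model by maximizing AUC directly with a general-purpose global optimizer. This is not the commercial model or the optimizer I actually used; scipy's differential evolution and the synthetic data are stand-ins.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import roc_auc_score

def fit_linear_auc(X, y, weight_range=5.0, seed=0):
    """Fit the weights of a linear score s = Xw by maximizing AUC
    (class separation) directly, rather than a likelihood or squared error."""
    n_features = X.shape[1]

    def neg_auc(w):
        return -roc_auc_score(y, X @ w)

    bounds = [(-weight_range, weight_range)] * n_features
    result = differential_evolution(neg_auc, bounds, seed=seed, tol=1e-6)
    return result.x, -result.fun

# Synthetic illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=500) > 0).astype(int)
w, auc = fit_linear_auc(X, y)
print("weights:", np.round(w, 2), "training AUC:", round(auc, 3))
```

Note that AUC depends only on the rank ordering of the scores, so no intercept is needed and the weights are identified only up to a positive scale factor; the bounds simply keep the search well-behaved.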

I recommend Generalized Linear Models by McCullagh and Nelder for readers interested in exploring this subject further.
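And, as promised in point 4, a minimal sketch of the segment-wise structure: one logistic regression per mutually exclusive segment, dispatched by a segment key. The class, the segmentation, and the data below are all illustrative, not the production model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SegmentedLogistic:
    """A complex model built from simple ones: one logistic regression
    per mutually exclusive segment, selected by a segment label."""
    def __init__(self):
        self.models = {}

    def fit(self, X, y, segment):
        for s in np.unique(segment):
            mask = segment == s
            self.models[s] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        return self

    def predict_proba(self, X, segment):
        p = np.empty(len(X))
        for s, model in self.models.items():
            mask = segment == s
            if mask.any():
                p[mask] = model.predict_proba(X[mask])[:, 1]
        return p

# Synthetic illustration with 3 segments instead of 20
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
segment = rng.integers(0, 3, size=600)
y = ((X[:, 0] + segment) > 1).astype(int)
model = SegmentedLogistic().fit(X, y, segment)
print(np.round(model.predict_proba(X[:5], segment[:5]), 3))
```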

Friday, October 27, 2006

Data Mining and Software Development

Dean, your comments on data mining and software development are interesting. At this point, I largely use my own MATLAB code for data mining. I have access to the Statistics and Curve Fitting Toolboxes, which provide some modeling capability and some useful utility functions. My experience is that, very often, I need something which commercial tools (at the convenient interface level) do not provide. With MATLAB, once I have the data, I can prepare it, perform the modeling, and report and graph the results all under one roof. MATLAB-specific benefits aside, the same sort of thing could be done in other, more conventional languages like Fortran, Java, or C++, perhaps with libraries like those from IMSL.

The dark side is the responsibility. I have to do all the things which the commercial shells do, such as manage the data. Occasionally, on really big problems, I even need to manage the RAM. My current work machine is a Windows workstation with 2GB of RAM (soon to be replaced by a faster machine with 4GB). While I have much more flexibility than the commercial tools provide, sometimes my fingers bleed (figuratively, not literally -- yuck) taking care of all the details.

Still, once a decent code base is established, it isn't so bad. For instance, my feature selection process at this point is fairly efficient and robust, being implemented as a few MATLAB functions.
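For what it's worth, the core pattern is simple enough to sketch: a plain greedy forward selection driven by a cross-validated score. This is shown in Python rather than my MATLAB, with a logistic regression as a placeholder model; my actual functions do more bookkeeping.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def forward_select(X, y, max_features=10, cv=5):
    """Greedy forward selection: repeatedly add the feature that most
    improves the cross-validated score; stop when nothing helps."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = []
        for j in remaining:
            cols = selected + [j]
            s = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, cols], y, cv=cv).mean()
            scores.append((s, j))
        s, j = max(scores)
        if s <= best_score:
            break  # no remaining candidate improves the score
        best_score = s
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```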

data mining and software development

I've been posting a bit at the Yahoo group "datamining2" -- we'll see how interesting that group is. I recently responded to a post about Java and data mining, and even found another blog that discussed that very issue earlier this week (I just found that post today) -- http://dataminingresearch.blogspot.com/

I don't code anymore, at least not seriously. One reason for that is that the data mining software environments have progressed to the point that I don't need to dust off my C/C++ skills (or lack thereof). And in those relatively rare cases where I do need to program, I can use 4th-generation languages, which are quite powerful (if I can ever remember the syntax, but that's another story altogether). Nearly every data mining software package has its own language: the S-Plus command line for S-Plus and Insightful Miner, CLEM for Clementine, CART has its own language, MATLAB, Visual Basic for Statistica, and of course SAS in Enterprise Miner. This is just naming a few, of course.

Cluster Ensembles

This past week I received the November 2006 issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence, and found the article "Evaluation of Stability of k-Means Cluster Ensembles with Respect to Random Initialization" very interesting. This is something I have thought about, but (to my discredit) haven't read up on or even experimented with beyond very simple case studies.

It is, of course, the logical extension of the ensemble techniques that have been used for the past decade. The method I found most accessible was to (1) resample the data with bootstrap samples, (2) create a k-means cluster model for each sample, and (3) use the cluster labels to create a new field for each record (at this point, you have R records, the M fields used to build the clusters, and P cluster models, with one new field per model). Finally, you can build a hierarchical clustering model on the records using the new "P" fields.
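Here is a minimal sketch of that three-step recipe plus the final hierarchical step, in Python on synthetic data; the number of component models, the value of k, and the average linkage are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_ensemble(X, k=3, n_models=25, seed=0):
    """(1) bootstrap the rows, (2) fit k-means to each bootstrap sample,
    (3) label every original record with each model, then cluster the
    records hierarchically on those P label columns."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = np.empty((n, n_models), dtype=int)
    for p in range(n_models):
        boot = rng.integers(0, n, size=n)              # bootstrap sample of row indices
        km = KMeans(n_clusters=k, n_init=10, random_state=p).fit(X[boot])
        labels[:, p] = km.predict(X)                   # label all R records with this model
    # records that share cluster labels across many models are "close"
    dist = pdist(labels, metric="hamming")
    Z = linkage(dist, method="average")
    return fcluster(Z, t=k, criterion="maxclust"), labels

# Synthetic illustration: three well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 3, 6)])
consensus, labels = cluster_ensemble(X, k=3)
print(np.bincount(consensus))
```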

More on this after some experiments.

Thursday, October 19, 2006

Data Mining vs. Predictive Analytics

I find the terminology associated with specialized fields like data mining very interesting to track. My first boss, Roger Barron (better described as a mentor and later truly a friend -- I owe much of who I am as a professional to him), used to talk of the transitions of terminology in technology: bionics, cybernetics, artificial intelligence, neural networks, etc.

I find that data mining and predictive analytics fall into the same category--they are the same basic technology but described from different perspectives. Sometimes colleagues have tried to point out distinctions, and I think one of the better ones was posted by Eric King here, where my definition of "better" means simple and clear.

Predictive analytics is a term I see more in the CRM and database worlds (TDWI conferences come to mind). Perhaps some of this is due to the encroachment of BI into the data mining world, where queries and OLAP are sometimes called data mining (after all, you are "drilling" down into the data!). This would necessitate creating further distinctions in terminology.

However, I don't see data mining losing hold on the style of predictive modeling that is largely empirical and data driven. So I include predictive analytics in the title of this blog as an alternative to data mining in name only, not in purpose.

How to doom data mining solutions before even beginning to build models

I was reminded today, while speaking with an email marketing expert, of the reason many data mining projects fail: in developing a data mining approach to meet a business objective, there is usually a disconnect between the two. When data mining algorithms look at data, they think in terms like "minimum squared error," "R-squared," or "percent correct classification."

These are usually of little importance to the business objective, which may be to find a population of customers who will purchase at least $100 of goods, or who will respond at a rate greater than 8% to a campaign. In these cases, a model that performs "well" in the algorithm's view may not be particularly good at identifying the top-tier responders. Therefore, the problem should be set up with the business objective in mind, not the data mining algorithm's objective, and the models should be assessed using a metric that matches the business objective as closely as possible.
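As a small illustration of the gap, here is a sketch that scores a model on a business-style metric, the response rate in the top-scoring slice of the file, next to a "percent correct" number that looks impressive but says nothing useful. The numbers are synthetic and the 10% slice is just an example.

```python
import numpy as np

def top_slice_response_rate(y_true, scores, fraction=0.10):
    """Response rate among the top-scoring `fraction` of cases --
    closer to 'whom do we actually mail?' than overall accuracy."""
    n_top = max(1, int(round(fraction * len(scores))))
    top = np.argsort(scores)[::-1][:n_top]   # indices of the highest scores
    return y_true[top].mean()

# Synthetic campaign: ~5% of people respond overall, and the score is
# mildly informative about who will.
rng = np.random.default_rng(0)
scores = rng.uniform(size=10_000)
y = (rng.uniform(size=10_000) < 0.01 + 0.08 * scores).astype(int)

# "Percent correct classification" rewards a useless model: predicting
# that nobody responds is roughly 95% accurate here.
print("accuracy of 'nobody responds':", round((y == 0).mean(), 3))
print("overall response rate:        ", round(y.mean(), 3))
print("top-decile response rate:     ", round(top_slice_response_rate(y, scores), 3))
```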

Starting again

It's time to start up again. Stay tuned for posts on a regular basis.

Tuesday, June 07, 2005

Beware of Being Fooled with Model Performance

Interpreting model performance is a minefield. If one wants model performance to be as good as possible, it is critical to define exactly what "good" means. How does one measure "goodness"? The easiest way to communicate performance is with a single-valued score, such as percent correct classification or R-squared. However, it is precisely this reduction of a complex idea (what the model is predicting) to a single number that can cause one to be fooled. A simple example follows.

Let's assume that a non-profit organization wants a model built that predicts the propensity of individuals to send donations, and that this model has 80+% classification accuracy, even on a test set. Furthermore, assume that the two indicators "Recent Donation Amount" (X1) and "Average Donation Amount" (X2) are two of the top predictors in the model. The figure at the left shows what a Support Vector Machine model did with this data. Even with the good accuracy, there is something disturbing about the model that isn't clear unless one sees a picture: the model isn't finding ranges of average and recent donation amounts that are associated with donors, but rather it is finding islands of donors. The second model (on the right) applied corrective measures to smooth the model, and it is much more pleasing. It says (roughly) that when someone donates between about $10 and $50 on average (X2), they are more likely to respond. It is smooth and there are no pockets of isolated donation amounts, making this model much more believable, even though some accuracy was lost in the process.
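For anyone who wants to reproduce the flavor of this on made-up data, here is a sketch: an RBF support vector machine with a very narrow kernel carves out islands around individual donors, while a wider kernel with more regularization yields the smoother, more believable band, at some cost in training accuracy. This uses scikit-learn on synthetic donation-like data, not the actual model or data from the project.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for "Recent Donation Amount" (x1) and
# "Average Donation Amount" (x2); donors cluster around a $10-$50 average.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(800, 2))
p_donor = np.where((X[:, 1] > 10) & (X[:, 1] < 50), 0.7, 0.2)
y = (rng.uniform(size=800) < p_donor).astype(int)

# A narrow kernel memorizes islands around individual donors ...
island_model = SVC(kernel="rbf", gamma=5.0, C=100.0).fit(X, y)
# ... while a wide kernel with more regularization gives a smooth band.
smooth_model = SVC(kernel="rbf", gamma=0.001, C=1.0).fit(X, y)

for name, model in [("island", island_model), ("smooth", smooth_model)]:
    print(name, "training accuracy:", round(model.score(X, y), 3))
```

The narrow-kernel model wins on training accuracy precisely by memorizing isolated cases, which is the deception the single-number score hides.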

Thursday, May 05, 2005

Gain Insights by Building Models from Several Algorithms

A frequent question in data mining is, "Which algorithm performs best?" While I would argue that getting the data right is far more important than which algorithm is used, there are differences among algorithms that the analyst can use to his or her advantage.

Decision trees build rules to identify homogeneous groups of data (essentially, little rectangular chunks), so a tree builds a whole set of local descriptions of the data. In contrast, regression builds a single global model of all the data. Neural networks are somewhere in between: a single global model is built, but it can be highly nonlinear, including finding local pockets of homogeneous data. It can be useful, therefore, to look at these different styles of models to understand both general trends (regression) and local behavior (trees). It is often the case that different algorithms arrive at the same level of performance in different ways, and seeing these differences can provide additional insight into the business processes underlying the models.
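Here is a sketch of the kind of side-by-side look I have in mind: a tree, a logistic regression, and a small neural network fit to the same data and compared by cross-validation. The dataset and settings are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)

models = {
    "tree (local rectangular chunks)": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic regression (one global linear trend)": LogisticRegression(max_iter=1000),
    "neural net (global but nonlinear)": MLPClassifier(hidden_layer_sizes=(20,),
                                                       max_iter=2000, random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
# Similar accuracies reached in different ways: the tree's rules describe
# local pockets, while the regression coefficients describe the global trend.
```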