Wednesday, December 27, 2006

Two Book Recommendations

In my data mining courses, there are two books I always recommend to course attendees who are new to data mining. The first is Data Preparation for Data Mining by Dorian Pyle. I like this book because data preparation is usually the most time-consuming step in the data mining process, and Pyle's is the only book I know of devoted entirely to the subject (the second hit in the Amazon list I linked is a data preparation book for SAS, but that one is SAS-specific).

The second book I recommend, for the analyst who is not a statistician, is Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank. They do a great job of describing data mining algorithms and techniques in an intuitive way; there are few equations and derivations to cloud the issues for non-mathematicians. The biggest critique I have is that there is no description of neural networks, one of the key algorithms in data mining software packages. But that doesn't dampen my enthusiasm for the book. (If you would like a good, free description of neural networks, go to the SAS Neural Network FAQ.)

Data mining and Terrorism

Just came across an article by Jim Harper entitled Data Mining Can't Improve Our Security that argues persuasively against the use of data mining in such matters. He writes

Data mining for terrorism prediction has two fundamental flaws:

— First, terrorist acts and their precursors are too rare in our society for there to be patterns to find. There simply is no nugget of information to mine.

— Second, the lack of suitable patterns means that any algorithm used to turn up supposedly suspicious behavior or suspicious people will yield so many false positives as to make it useless. A list of potential terror suspects generated from pattern analysis would not be sufficiently targeted to justify investigating people on the list.

and concludes

Unfortunately, there is no magic bullet that solves the security conundrums created by terrorism. Data mining is a useful technique in many areas, but not this one.


I must say that in general I agree, and have argued the same privately to course attendees who were interested in this type of analysis. [A disclaimer: I have never worked with intelligence data, so these opinions are not based on direct experience with terrorist-related data.] However, I also believe that while the data is particularly difficult, it is not useless, and therefore I disagree with the final conclusion. The question I think should be asked is this: can data mining improve the ability of law enforcement to identify suspected terrorists? Now, it may be that any improvement in information provided by data mining is not worth the effort (as he argues)--this I don't know. But if it can improve the odds of finding a dangerous individual 10-fold or 100-fold over what is currently done, is it not helpful? (I won't comment on the privacy issues here--that is another important issue, but unrelated to his premise that data mining is not useful here.)

For those who have worked on difficult problems, the exploratory data analysis that takes place during a data mining project nearly always yields useful information, even if no final models are produced. Therefore, the conclusion to not do any data mining at all seems to me an overreaction.






Tuesday, December 19, 2006

The Best Data Mining Book of 2005

A bit late, but better late than never! Actually, I just heard Stephen Levitt speak at SPSS Directions in November and was reminded, of course, of his book Freakonomics. In 2005, I recommended the book to my data mining course attendees as my favorite data mining book of the year, despite the term "data mining" never appearing (to the best of my knowledge) in the book at all. I think a quote in the preface summarizes why I liked it:

"What interested Levitt were the stuff and riddles of everyday life... He [Levitt] is an intuitionist. He sifts through a pile of data to find a story no one else has found. He figures a way to measure an effect that veteran economists had declared unmeasurable."

It was the idea of "sifting," a prominent term in the Gartner Group definition (and one that I like in particular), that struck me. And all the examples Levitt gives in his book are examples of uncovering patterns in data that are not the most obvious answers, but rather are ones that fit the data better (in his opinion). I like the book because he approaches data with a forensic mindset.

Tuesday, December 05, 2006

Distance metrics

Will Dwinnell (a fellow poster here, who also has a blog specific to MATLAB at http://matlabdatamining.blogspot.com) recently posted on the Mahalanobis distance as an alternative to Euclidean distance. We are kindred spirits on this one, as I have long advocated the Mahalanobis distance, particularly for data that is close to normally distributed (there are fixes to make numeric data more normally distributed, of course, but that's for another post, perhaps).

The reasons he gives are right on point, but I'd like to expand on the application side. I was first introduced to the Mahalanobis distance in the context of nearest-mean classifiers. In case anyone is not familiar with it, the Mahalanobis distance weights the Euclidean distance by the inverse of the covariance matrix (think of the Euclidean distance as weighting by the identity matrix). But it is useful in many contexts beyond nearest-mean classifiers or Radial Basis Function networks; in fact, any time an algorithm computes a distance, the Mahalanobis distance is my preferred metric, including in k-nearest neighbor and in clustering (Isodata or K-Means, for example).

The problem is that very few data mining software packages include it. I was introduced to it in a now-obsolete tool called OLPARS (for which I also wrote code, including Perceptrons, RBFs, and some neural networks), but MATLAB handles it quite easily.
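For anyone who wants to see the mechanics spelled out, here is a minimal sketch of the computation (in Python with NumPy, purely as a stand-in for whatever tool you use; the data is made up for illustration):

```python
import numpy as np

# Toy data: 200 observations of 3 correlated variables (made up for illustration).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]],
                            size=200)

mu = X.mean(axis=0)                              # class (or cluster) mean
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix

def mahalanobis(x, mu, S_inv):
    """Mahalanobis distance of a single point x from the mean mu."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

def euclidean(x, mu):
    """Euclidean distance: the special case where the covariance is the identity."""
    d = x - mu
    return np.sqrt(d @ d)

x_new = np.array([2.0, 1.0, -1.0])
print("Euclidean:  ", euclidean(x_new, mu))
print("Mahalanobis:", mahalanobis(x_new, mu, S_inv))
```

The same mean and inverse covariance can be dropped into a nearest-mean classifier, k-nearest neighbor, or a K-Means variant wherever the Euclidean distance would normally be computed.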

Thursday, November 30, 2006

Error Measures

All models must be assessed somehow. Despite the existence of a bewildering array of performance measures, much commercial modeling software provides a surprisingly limited range of options. I will provide a short introduction to such measures in this article.



Numerical Models (Regressions)

Mean Squared Error (MSE) is by far the most common measure of numerical model performance. It is simply the average of the squares of the differences between the predicted and actual values. It is a reasonably good measure of performance, though it could be argued that it overemphasizes the importance of larger errors. Many modeling procedures directly minimize the MSE.

Mean Absolute Error (MAE) is similar to the Mean Squared Error, but it uses absolute values instead of squaring. This measure is not as popular as MSE, though its meaning is more intuitive (the "average error").

Bias is the average of the differences between the predicted and actual values. With this measure, positive errors cancel out negative ones. Bias is intended to assess how much higher or lower predictions are, on average, than actual values.

Mean Absolute Percent Error (MAPE) is the average of the absolute errors, as a percentage of the actual values. This is a relative measure of error, which is useful when larger errors are more acceptable on larger actual values.
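To make these definitions concrete, here is a small sketch computing all four measures on a handful of made-up predicted and actual values:

```python
import numpy as np

actual    = np.array([10.0, 25.0, 40.0, 55.0, 80.0])   # made-up values
predicted = np.array([12.0, 22.0, 45.0, 50.0, 85.0])

errors = predicted - actual

mse  = np.mean(errors ** 2)                              # Mean Squared Error
mae  = np.mean(np.abs(errors))                           # Mean Absolute Error
bias = np.mean(errors)                                   # positives cancel negatives
mape = np.mean(np.abs(errors) / np.abs(actual)) * 100.0  # percent of actual values

print(f"MSE:  {mse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"Bias: {bias:.2f}")
print(f"MAPE: {mape:.1f}%")
```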



Classifiers

Classifiers come in two basic varieties: those which produce class outputs, and those which produce probabilities of classes.


Classifiers: Class Output

Accuracy is the proportion of the time that the predicted class equals the actual class, usually expressed as a percentage. Its meaning is straightforward, but it may obscure important differences in the costs associated with different errors. The classic example of such costs is the medical diagnostic situation, in which one can err by either: 1. keeping a healthy patient in the hospital (low cost), or 2. sending home a sick patient (very high cost).
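As a small, entirely hypothetical illustration of why accuracy alone can mislead in that situation (the counts and costs below are invented):

```python
import numpy as np

# Confusion matrix counts: rows = actual (healthy, sick), columns = predicted.
#                      predicted healthy   predicted sick
confusion = np.array([[900,                 50],    # actually healthy
                      [ 10,                 40]])   # actually sick

# Hypothetical costs: keeping a healthy patient is cheap, sending home a sick one is not.
cost = np.array([[   0,                    100],    # healthy kept in hospital: low cost
                 [5000,                      0]])   # sick patient sent home: very high cost

accuracy = np.trace(confusion) / confusion.sum()
expected_cost = (confusion * cost).sum() / confusion.sum()

print(f"Accuracy:          {accuracy:.1%}")
print(f"Average cost/case: {expected_cost:.2f}")
```

A second model with slightly lower accuracy but fewer sick patients sent home would win easily on average cost.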


Classifiers: Probability Output

These classifiers need to be checked for both the accuracy of their probabilities (Do cases predicted to have a 5% (30%, 80%, etc.) probability really belong to the target class 5% (30%, 80%, etc.) of the time?) and their ability to separate the classes in question.

Accuracy can be measured using many of the same metrics used to evaluate numerical models (MSE, MAE, etc.). One interesting alternative which is specific to classification, the informational loss, is based on information theory and is described in Data Mining by Witten and Frank (ISBN 1-55860-552-5).

Some applications (as in marketing) are focused on how many items from the target class can be identified in the best so-many percent of the population. If, for example, one only has the resources to mail marketing literature to 10% of the customer file, the ideal would be to pack as many actual respondents as possible into that best 10%. The mirror situation is typified by lenders who wish to cram as many bad loans as possible into the worst 10% of their file. Probably the most popular measure of class separation in the current literature is the Area Under the ROC Curve (AUC or AUROC), which summarizes separation across the whole range of possible thresholds.
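As a rough illustration of both checks, here is a small sketch using made-up probabilities and labels; the log loss is closely related to the informational loss Witten and Frank describe (theirs is expressed in bits, this one in nats), and the AUC summarizes class separation:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Made-up predicted probabilities of the target class and the true labels.
p_hat = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80, 0.90, 0.95])
y     = np.array([0,    0,    0,    1,    0,    1,    1,    1,    1,    1])

# Accuracy of the probabilities themselves (smaller is better).
print("Log loss (natural log):", log_loss(y, p_hat))
print("Brier score (MSE on probabilities):", np.mean((p_hat - y) ** 2))

# Ability to separate the classes, across all possible thresholds.
print("AUC:", roc_auc_score(y, p_hat))
```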



The intrepid data miner is invited to explore these performance measures and related topics on his or her own:

confusion matrix
F-measure
sensitivity and specificity
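As a starting point, the last two (and the ingredients of the F-measure) can be read straight off a two-class confusion matrix; a minimal sketch with made-up counts:

```python
import numpy as np

# Rows = actual class (negative, positive), columns = predicted class.
confusion = np.array([[850, 150],    # actual negatives: TN, FP
                      [ 30, 170]])   # actual positives: FN, TP

tn, fp = confusion[0]
fn, tp = confusion[1]

sensitivity = tp / (tp + fn)          # a.k.a. recall, the true positive rate
specificity = tn / (tn + fp)          # the true negative rate
precision   = tp / (tp + fp)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F-measure:   {f_measure:.3f}")
```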

Tuesday, November 14, 2006

Ensembles everywhere

After reading the cluster ensembles article recently, I just saw another article in IEEE PAMI entitled "Rotation Forest: A New Classifier Ensemble Method". The approach is interesting: much like Random Forests (where the diversity of the trees in the ensemble comes from both bootstrap sampling and random variable selection), there is a random selection of variables for each tree. But the twist here (the "rotation") is applying PCA to the random subset of candidate variables.

I'm sure there are nearly infinite ways to tweak the ideas of random record/variable selection, but once again the keys to the success of ensembles, here and always, are:
1) diversity in the information (i.e., data) the modeling algorithm sees, and
2) base algorithms that are weak learners, which benefit most from ensembling.

Random Forests, it seems to me, works well because not only are trees coarse, blunt (and unstable) predictors, but they are also greedy searches that can be fooled into going down a sub-optimal path. By constraining the splits to consider only some of the variables, the tree is forced out of its greedy perspective and made to consider other ways to achieve the solution. This new algorithm does the same thing, with the twist of using PCA to develop linear projections of the original data (of subsets, to be more precise).
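As a much-simplified sketch of the general idea (not the published Rotation Forest algorithm, which partitions the features into disjoint subsets and retains all principal components), each tree below sees a bootstrap sample whose randomly chosen candidate variables have been rotated by PCA:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

n_trees, subset_size = 25, 6
ensemble = []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), len(X))                       # bootstrap sample
    cols = rng.choice(X.shape[1], subset_size, replace=False)    # random variable subset
    pca = PCA().fit(X[rows][:, cols])                            # "rotation" of that subset
    tree = DecisionTreeClassifier().fit(pca.transform(X[rows][:, cols]), y[rows])
    ensemble.append((cols, pca, tree))

def predict(ensemble, X_new):
    # Simple majority vote over the rotated trees.
    votes = np.array([t.predict(p.transform(X_new[:, c])) for c, p, t in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)

print("Training accuracy:", (predict(ensemble, X) == y).mean())
```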

I think we'll be seeing more and more variations on the same theme in the coming years.

Free And Inexpensive Data Mining Software

I recently came across an article on-line which included the claim that data mining was "hugely expensive". I disagree. Given reasonably capable desktop hardware, and a qualified data miner (who is the most expensive component of data mining cost!), a variety of capable data mining software packages are available for free or relatively little (meaning < US$100). I cannot vouch that any one of these tools is a good fit for your specific application, but they are certainly worth a look:

DMSK
KNIME
SNNS (Stuttgart Neural Network Simulator)
YALE (Yet Another Learning Environment)
Weka

There are some commercial tools which sell for less than US$250, such as:

BrainMaker

Of course, there is always the "roll-your-own" approach, in which the data miner constructs or gathers his or her own tools. The internet houses a wealth of source code in a variety of languages. Aside from searching on general-purpose Internet search engines for things like:

"quadratic discriminant" "source code"

or

backpropagation Java

...there are also source code repositories, such as:

Google Code
LiteratePrograms
MATLAB Central

Sunday, November 12, 2006

Seeking the "Best" Model or, Data Mining's El Dorado

As their explorations of South America in the 1500s progressed, Spanish explorers encountered stories of El Dorado, originally "a man covered in gold", but eventually coming to mean a lost city of gold. These stories varied in the details, but at the center of all of them was the gold. Such legends circulated widely and provoked considerable interest. Significant energy (and time, money, and lives) was spent in an effort to discover the location of El Dorado, without success.

There apparently was a real ritual in part of South America involving a king covered in gold dust, which perhaps gave rise to the legend. It has also been theorized that the people inhabiting lands being consumed by the Spanish empire essentially told the Spanish whatever they wanted to hear--whatever it took to get them to go away. Either way, the Spanish searched in vain for something that wasn't there.

Today, organizations search for models which approximate reality. A frequent feature of this search, particularly prevalent among novices, is the search for the "best" model. Models vary in their performance, with some clearly being "better" than others. It is natural enough, then, to think that data mining is an undertaking whose ideal outcome is the "best" model.

Some time ago, I was being interviewed by auditors regarding methods I used in building models of customer behavior. They asked questions about various aspects of data gathering, model building and testing methods. Their questions eventually came around to the issue of how I knew that my models were the best possible. I told them that I did not know that my models were the best possible. I told them that, in fact, I knew that my models were not the best possible. I said that I knew that my models were technically sound, had historically been reliable over long periods of time and that they answered the needs of the business. Searching for a "best possible" model, I asserted, was an academic exercise.

How much time a given modeling effort deserves is an open question, and I am not advocating doing a bad job. Remember, though, that with enough data, a search for better models could continue indefinitely, but this would be a fool's errand. There will always be modeling techniques and derived variables not yet explored. That a deployed model fails to meet some unknown and theoretical ideal is only important as a missed opportunity (but a potentially very expensive one to explore). Much more important is whether a model meets a business need and whether it is likely to continue to do so. This implies that time in data mining is better spent validating models, rather than exhaustively chasing trivial improvements in apparent performance.

Small News Item: I have started another, tool-specific Web log, Data Mining in MATLAB.

Tuesday, November 07, 2006

Family Recipe For Neural Networks

I continue to read accounts of neural networks (specifically multilayer perceptrons) which characterize them as "slow" to train or "difficult" to configure. I suspect that most such descriptions are written by people with relatively little experience using neural networks. I believe that people who have enjoyed success with neural networks, like myself, have developed some standard procedure for configuring and training them. What follows is my basic neural network recipe. Season to taste.

Training to "Convergence"

Don't. Use early stopping instead. Use a separately drawn "tuning" data set to track the model fit. Train until the error on the "tuning" set reaches a minimum, then stop. Training further implies over-fitting.

The other sensible alternative is to train until convergence, but limit the number of hidden neurons. This is theoretically acceptable, but in practice requires longer training runs and more experimentation than early stopping.


Number of Hidden Layers

Use one hidden layer, always. In my experience, stacking on extra layers rarely improves performance enough to justify the computational effort. The time spent on longer training runs and on experimenting with combinations of two or more hidden layers is better spent on other performance-boosting efforts, such as dreaming up better derived features.


Number of Hidden Neurons

Training with early stopping means that over-fitting should not be an issue, so I generally don't worry about having "too many" hidden neurons unless memory use or training time really gets out of hand. Since early stopping typically means shorter training runs (fewer passes over the data) than "training to convergence", one can afford a little experimentation here.

Surprisingly small hidden layers (as few as, say, 5 neurons) can often be effective, and are quick to train. If this does not produce satisfactory results, make the hidden layer larger by perhaps 50% at a time.


Scaling Inputs

If your software does not scale the input variables automatically, it may be beneficial to do so even though it is theoretically unnecessary. Wildly different input variable ranges can slow training.


Scaling Outputs

If your software does not do this automatically, this is necessary, as neural networks have a limited output range, either 0.0 to 1.0 or -1.0 to +1.0. Incidentally, unless I have a good reason, I don't bother with slightly reduced scales (like 0.001 to 0.999) as some authors suggest.


There are many other "tricks" one can use to turbo-charge performance, naturally, but this basic recipe has served me well over the years. Despite other analysts' stories of needing thousands of passes over the data, in my experience (with many thousands of training observations and over one hundred inputs), training with early stopping will often complete in less than 50 runs. Your mileage, of course, may vary.


What I've been reading lately: Common Errors in Statistics (and How to Avoid Them) Second Edition by Phillip I. Good and James W. Hardin (ISBN: 0471794317).

Sunday, November 05, 2006

Keep Your Eye On The Problem

An occupational hazard of statistical modeling is to view the model as the solution to the problem. Direct application of modeling, in an effort to leap from the available data to a final result in one step, is not always feasible or efficient. Models are always part of a solution, and gauging how large a role they play in that solution is important.

Consider the solution described in the paper, Fuzzy Decision System for Threshold Selection to Cluster Cauliflower Plant Blobs From Field Visual Images, by L. García-Pérez, M. C. García-Alegre, J. Marchant and T. Hague, which involves an agricultural machine vision problem. One fundamental step in the solution of this particular problem is segmentation of the digitized image into two classes: plants and background. In this case, the authors have elected to employ a single threshold of brightness to be applied to the entire image, above which pixels will be regarded as plants, and below which pixels will be regarded as background.

A complete solution to this problem accepts image data as input and produces a suitable threshold value as output. Having gotten in the habit of viewing the model as the solution, it is easy for the modeler to prematurely conclude that his or her model should, by itself, bridge this entire gap.

In the paper mentioned above, however, the authors took a different perspective. They constructed a model (a "policy", really--in this case composed of fuzzy logic rules) which indicates only by how much a given threshold should be incremented or decremented to improve performance. The threshold is initialized and updated by iteratively firing the model to modify the threshold. This is essentially a search across candidate threshold values. When the threshold value stabilizes (when the model indicates zero suggested change in the threshold), the process is finished and the final threshold is output. Note that the model does not directly determine the threshold. This model accepts image data and a candidate threshold as input and produces a suggested change for the threshold value as output. The total solution, which does accept image data as input and produce a suitable threshold value as output, is composed of the fuzzy logic model plus a simple looping search mechanism.
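A minimal sketch of that looping arrangement, with the fuzzy rule base replaced by a hypothetical suggest_change function (the point here is the structure of inputs and outputs, not the rules themselves):

```python
import numpy as np

def suggest_change(image, threshold):
    """Stand-in for the fuzzy model: returns a suggested increment/decrement.
    Here it simply nudges the threshold toward the image's median brightness."""
    return 0.5 * (np.median(image) - threshold)

def find_threshold(image, threshold=128.0, tol=0.5, max_iter=100):
    # The model only proposes a change; the loop around it does the searching.
    for _ in range(max_iter):
        delta = suggest_change(image, threshold)
        if abs(delta) < tol:          # model suggests (almost) no change: done
            break
        threshold += delta
    return threshold

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))   # made-up 8-bit "field image"
t = find_threshold(image)
print("Selected threshold:", round(t, 1))
print("Pixels classed as plant:", int((image > t).sum()))
```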

I suggest that it is worth trying to find the best way to leverage the power of modeling, whether as a complete solution unto itself or otherwise. Re-arranging what, exactly, is treated as model input and model output is one way to accomplish this.

Friday, November 03, 2006

In Praise of Simpler Models -- at least in practice

Originally, this was going to be a comment, but it became so long that I'm just posting it.

I'll never forget the hubbub generated by a talk by Pedro Domingos, I think at the 1998 KDD conference in New York City. His talk, "Occam's Two Razors: The Sharp and the Blunt", turned the usually calm and measured data mining crowd into an agitated horde! I mean, there was even yelling (and if you've never been to a conference of statisticians or data miners, you won't know how unbelievable this scene was). This is from the abstract, also posted at Citeseer in the link above:

Occam's razor has been the subject of much controversy. This paper argues that this is partly because it has been interpreted in two quite different ways, the first of which (simplicity is a goal in itself) is essentially correct, while the second (simplicity leads to greater accuracy) is not. The paper reviews the large variety of theoretical arguments and empirical evidence for and against the "second razor," and concludes that the balance is strongly against it.

John Elder provided a very useful and reasoned explanation for the apparent difficulties with Occam's Razor here (in a summary of KDD '98 presented at my favorite conference, the Symposium on the Interface). To quote,

Jianming Ye presented an intriguing alternative metric, Generalized Degrees of Freedom, which finds the sensitivity of the fitted values to perturbations in the outputs. This allows complexity to be measured and compared in an entirely experimental manner. [Ye, J. (March 1998). On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93, no. 441, pp. 120-131.]

The key idea is this: just because a model has lots of weights or splits doesn't mean it is acting in a complex way. John has since shown (at the most recent Salford Systems conference in San Diego, March 2006) that, using the GDoF idea, ensembles of trees can actually behave in a less complex manner than the individual trees within the ensemble.
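For the curious, Ye's measure can be approximated experimentally: perturb the target values with a little Gaussian noise, refit, and see how much the fitted values move in sympathy. Below is a rough Monte Carlo sketch of that idea (my own simplification, not Ye's exact procedure), comparing a shallow and a deep regression tree on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)

def gdf_estimate(model, X, y, tau=0.1, n_reps=50):
    """Rough Monte Carlo estimate of Generalized Degrees of Freedom:
    the summed sensitivity of fitted values to perturbations of the targets."""
    deltas, fits = [], []
    for _ in range(n_reps):
        d = rng.normal(scale=tau, size=len(y))
        fits.append(model.fit(X, y + d).predict(X))
        deltas.append(d)
    deltas, fits = np.array(deltas), np.array(fits)
    # cov(fitted value, perturbation) / tau^2, summed over observations
    cov = ((fits - fits.mean(axis=0)) * (deltas - deltas.mean(axis=0))).mean(axis=0)
    return cov.sum() / tau ** 2

shallow = DecisionTreeRegressor(max_depth=2)
deep    = DecisionTreeRegressor(max_depth=10)
print("GDF, depth-2 tree: ", round(gdf_estimate(shallow, X, y), 1))
print("GDF, depth-10 tree:", round(gdf_estimate(deep, X, y), 1))
```

The same estimate could be run on a bagged set of trees and compared against a single large tree, which is essentially the comparison John describes.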

So I agree with the idea that simpler models are usually better, and for that reason, I always try linear models first. However, sometimes the best models in terms of accuracy are complex looking, but in reality behave in relatively simple ways.

In Praise of Simple Models

At a time when ever more subtle and complex modeling techniques are emerging, it is interesting to note the continuing effectiveness of comparatively simple modeling methods, such as logistic regression. In my work over the years in the finance industry, I have found repeated success with such methods. One reason for their success is that data may not be as complex as is commonly believed, especially in high-dimensional spaces; see writings on Holte's 1R algorithm as an example. Also, labeling a model or modeling technique as "simple" may be deceptive. Outside clinical or academic settings, the majority of real-world models based on transformed (or even un-transformed) linear functions are helped by derived features, which can make the model function quite complex in the space of the raw variables.

Some observations:

1. Simple models are typically very fast to train, allowing more time for handling other aspects of the modeling problem, such as attribute selection.

2. Most non-technical people are more comfortable with explanations of linear-based models than with any other kind. In regulated industries, this is of tremendous benefit.

3. When most modelers think of "simple models", linear regression, linear discriminants and logistic regression come to mind, but there are other, less well-known options, such as extreme-value regression (also called complementary log-log regression). Indeed, the transfer function, where one is needed, can be any monotonic function. Further, the linear portion of the model need not be fit by traditional methods: I trained a linear model via a global optimizer to maximize class separation (AUC) for a commercial application, and that model has proven highly effective and has now been in service for 18 months.

4. Complex models can be built out of collections of simple ones. One of my recent responsibilities was to create and maintain a predictive model used in the management of several billion dollars' worth of assets. Any individual case falls into one of 20 mutually exclusive segments, each with its own logistic regression (sketched below with made-up data).
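The sketch uses scikit-learn (an assumption of tooling on my part) and collapses the twenty segments to two:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up data plus a column indicating which segment each case falls into.
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
segment = np.where(X[:, 0] > 0, "A", "B")   # stand-in for a real business segmentation

# One simple logistic regression per mutually exclusive segment.
models = {s: LogisticRegression(max_iter=1000).fit(X[segment == s], y[segment == s])
          for s in np.unique(segment)}

def score(X_new, segment_new):
    """Route each case to its segment's model and return predicted probabilities."""
    p = np.empty(len(X_new))
    for s, model in models.items():
        mask = segment_new == s
        p[mask] = model.predict_proba(X_new[mask])[:, 1]
    return p

print("Mean predicted probability per segment:")
for s in models:
    print(" ", s, round(score(X, segment)[segment == s].mean(), 3))
```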

I recommend Generalized Linear Models by McCullagh and Nelder for readers interested in exploring this subject further.

Friday, October 27, 2006

Data Mining and Software Development

Dean, your comments on data mining and software development are interesting. At this point, I largely use my own MATLAB code for data mining. I have access to the Statistics and Curve Fitting Toolboxes, which provide some modeling capability and some useful utility functions. My experience is that, very often, I need something which commercial tools (at the convenient interface level) do not provide. With MATLAB, once I have the data, I can prepare it, perform the modeling, and report and graph the results all under one roof. MATLAB-specific benefits aside, the same sort of thing could be done in other, more conventional languages like Fortran, Java or C++, perhaps with libraries like those from IMSL.

The dark side is the responsibility. I have to do all the things which the commercial shells do, such as manage the data. Occasionally I even need to manage the RAM on really big problems. My current work machine is a Windows workstation with 2GB RAM (soon to move to a faster machine with 4GB RAM). While I have much more flexibility than the commercial tools provide, sometimes my fingers bleed (figuratively, not literally--yuck) taking care of all the details.

Still, once a decent code base is established, it isn't so bad. For instance, my feature selection process at this point is fairly efficient and robust, being implemented as a few MATLAB functions.

data mining and software development

I've been posting a bit at the Yahoo group "datamining2"--we'll see how interesting that group is. I recently responded to a post about Java and data mining, and even found another blog that discussed that very issue earlier this week (I just found the post today): http://dataminingresearch.blogspot.com/

I don't code anymore, at least not seriously. One reason is that data mining software environments have progressed to the point that I don't need to dust off my C/C++ skills (or lack thereof). And in those relatively rare cases where I do need to program, I can use fourth-generation languages, which are quite powerful (if I can ever remember the syntax, but that's another story altogether). Nearly every data mining software package has its own language: the S-Plus command line for S-Plus and Insightful Miner, CLEM for Clementine, CART's own language, MATLAB, Visual Basic for Statistica, and of course SAS in Enterprise Miner. This is just naming a few, of course.

Cluster Ensembles

This past week I received the November 2006 issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence, and found the article "Evaluation of Stability of k-Means Cluster Ensembles with Respect to Random Initialization" very interesting. This is something that I have thought about, but (to my discredit) haven't read up on or experimented with beyond very simple case studies.

It is, of course, the logical extension of the ensemble techniques that have been used for the past decade. The method I found most accessible is to (1) resample the data with bootstrap samples, (2) create a k-means cluster model for each sample, and (3) use each model's cluster labels as new fields associated with each record (at this point, you have R records, the M fields used to build the clusters, and P cluster models, giving one new field per model). Finally, you can build a hierarchical clustering model on the records using the new P fields.
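Here is a rough sketch of those steps (plus the final hierarchical pass) on made-up data, using scikit-learn and SciPy as stand-ins for whatever tool you prefer:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # R records, M fields
P = 20                                                        # number of cluster models

# Steps 1-3: bootstrap, fit k-means, record each model's label for every record.
labels = np.empty((len(X), P), dtype=int)
for p in range(P):
    rows = rng.integers(0, len(X), len(X))                    # bootstrap sample
    km = KMeans(n_clusters=4, n_init=10, random_state=p).fit(X[rows])
    labels[:, p] = km.predict(X)                              # one new field per model

# Final step: hierarchical clustering of the records on the P label fields.
# Hamming distance = fraction of models on which two records land in different clusters.
Z = linkage(pdist(labels, metric="hamming"), method="average")
consensus = fcluster(Z, t=4, criterion="maxclust")
print("Records per consensus cluster:", np.bincount(consensus)[1:])
```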

More on this after some experiments.

Thursday, October 19, 2006

Data Mining vs. Predictive Analytics

I find the terminology associated with specialized fields like data mining very interesting to track. My first boss, Roger Barron (better described as a mentor and later truly a friend--I owe much of who I am as a professional to him), used to talk of the transitions of terminology in technology: bionics, cybernetics, artificial intelligence, neural networks, etc.

I find that data mining and predictive analytics fall into the same category--they are the same basic technology but described from different perspectives. Sometimes colleagues have tried to point out distinctions, and I think one of the better ones was posted by Eric King here, where my definition of "better" means simple and clear.

Predictive analytics is a term I see more in the CRM and database worlds (TDWI conferences come to mind). Perhaps some of this is due to the encroachment of BI into the data mining world, where queries and OLAP are sometimes called data mining (after all, you are "drilling" down into the data!). This would necessitate creating further distinctions in terminology.

However, I don't see data mining losing hold on the style of predictive modeling that is largely empirical and data driven. So I include predictive analytics in the title of this blog as an alternative to data mining in name only, not in purpose.

How to doom data mining solutions before even beginning to build models

I was reminded today, while speaking with an email marketing expert, of the reason many data mining projects fail: in developing a data mining approach to meet a business objective, there is often a disconnect between the two. When data mining algorithms look at data, they are thinking in terms like "minimum squared error", "R-squared", or "percent correct classification".

These are usually of little importance to the business objective, which may be to find a population of customers who will purchase at least $100 of goods, or respond at a rate greater than 8% to a campaign. In these cases, a model that performs "well" in the algorithm's view may not be particularly good at identifying the top-tier responders. Therefore, the problem should be set up with the business objective in mind, not the data mining algorithm's objective, and the models should be assessed using a metric that matches the business objective as closely as possible.
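For example, if the objective is response in the best decile, the model can be scored on exactly that: rank the file by model score and measure the response rate (and lift) in the top 10%. A small sketch with made-up scores and responses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
score    = rng.uniform(size=n)                         # made-up model scores
response = rng.uniform(size=n) < 0.02 + 0.10 * score   # made-up, score-related responses

top10 = np.argsort(score)[::-1][: n // 10]             # best-scored 10% of the file

print(f"Overall response rate:    {response.mean():.1%}")
print(f"Top-decile response rate: {response[top10].mean():.1%}")
print(f"Lift in the top decile:   {response[top10].mean() / response.mean():.2f}x")
```

If the campaign only goes to that top decile, the lift figure speaks directly to the business objective; a model chosen on accuracy or R-squared alone might not be the one that maximizes it.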

Starting again

It's time to start up again. Stay tuned for posts on a regular basis.