Wednesday, December 27, 2006
The second book I recommend, for the analyst who is not a statistician, is Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank. They do a great job of describing algorithms and techniques in data mining in an intuitive way; there are few equations and derivations to cloud the issues for non-mathematicians. The biggest critique I have is that there is no description of neural networks, one of the key algorithms in data mining software packages. But that doesn't dampen my enthusiasm for the book. (If you would like a good, free description of neural networks, go to the SAS Neural Network FAQ.)
Data mining for terrorism prediction has two fundamental flaws:
— First, terrorist acts and their precursors are too rare in our society for there to be patterns to find. There simply is no nugget of information to mine.
— Second, the lack of suitable patterns means that any algorithm used to turn up supposedly suspicious behavior or suspicious people will yield so many false positives as to make it useless. A list of potential terror suspects generated from pattern analysis would not be sufficiently targeted to justify investigating people on the list.
Unfortunately, there is no magic bullet that solves the security conundrums created by terrorism. Data mining is a useful technique in many areas, but not this one.
I must say that in general I agree, and have argued the same privately to course attendees who were interested in this type of analysis. [A disclaimer: I have never worked with intelligence data, so these opinions are not based on direct experience with terrorist-related data.] However, I also believe that while the data is particularly difficult, it is not useless, and therefore I disagree with the final conclusion. The question I think should be asked is this: can data mining improve the ability of law enforcement to identify suspected terrorists? Now it may be that any improvement in information provided by data mining is not worth the effort (as he argues)--this I don't know. But if it can improve the odds of finding a dangerous individual 10-fold or 100-fold over what is currently done, then is it not helpful? (I won't comment on the privacy issues here--that is another important issue, but unrelated to his premise that data mining is not useful here.)
For those who have worked on difficult problems, the exploratory data analysis that takes place during a data mining project nearly always yields useful information, even if no final models are produced. Therefore, the conclusion to not do any data mining at all seems to me to be an overreaction.
Tuesday, December 19, 2006
What interested Levitt were the stuff and riddles of everyday life... 'He (Levitt) is an intuitionist. He sifts through a pile of data to find a story no one else has found. He figures a way to measure an effect that veteran economists had declared unmeasurable.' It was the idea of "sifting", a prominent term in the Gartner Group definition (and one that I like in particular), that struck me. And all the examples Levitt gives in his book are examples of uncovering patterns in data that are not the most obvious answers, but rather are ones that fit the data better (in his opinion). I like the book because he approaches data with a forensic mindset.
Tuesday, December 05, 2006
The reasons he gives are right on point, but I'd like to expand on the application side. I was first introduced to the Mahalanobis distance in the context of Nearest Mean classifiers. In case anyone is not familiar with the M. distance, it weights the Euclidean distance by the inverse of the covariance matrix (think of the Euclidean distance as using the identity matrix instead). But it is very useful in many other contexts in addition to Nearest Mean classifiers or Radial Basis Function networks; in fact, any time an algorithm computes a distance, the M. distance is my preferred distance metric, including in k-nearest neighbor and clustering (like Isodata or k-means).
The problem is that very few data mining software packages have it. I was introduced to it in an obsolete tool called OLPARS (I also wrote code for it, including Perceptrons, RBFs, and some neural networks). But Matlab does it quite easily.
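For the curious, the computation itself is only a few lines of MATLAB. A minimal sketch, assuming X holds the reference sample (rows are cases) and Y holds the points to score; the variable names are just for illustration, and the Statistics Toolbox function mahal(Y, X) does the same thing in one call:

    mu = mean(X);                          % sample mean of the reference data
    S  = cov(X);                           % sample covariance matrix
    D  = Y - repmat(mu, size(Y,1), 1);     % center each point to be scored
    d2 = sum((D / S) .* D, 2);             % (y-mu) * inv(S) * (y-mu)' per row

Replacing S with the identity matrix recovers squared Euclidean distance, which makes the "weighting by the covariance" interpretation concrete.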
Thursday, November 30, 2006
Numerical Models (Regressions)
Mean Squared Error (MSE) is by far the most common measure of numerical model performance. It is simply the average of the squares of the differences between the predicted and actual values. It is a reasonably good measure of performance, though it could be argued that it overemphasizes the importance of larger errors. Many modeling procedures directly minimize the MSE.
Mean Absolute Error (MAE) is similar to the Mean Squared Error, but it uses absolute values instead of squaring. This measure is not as popular as MSE, though its meaning is more intuitive (the "average error").
Bias is the average of the differences between the predicted and actual values. With this measure, positive errors cancel out negative ones. Bias is intended to assess how much higher or lower predictions are, on average, than actual values.
Mean Absolute Percent Error (MAPE) is the average of the absolute errors, as a percentage of the actual values. This is a relative measure of error, which is useful when larger errors are more acceptable on larger actual values.
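As a quick reference, all four measures are one-liners in MATLAB, given a vector y of actual values and yhat of predictions (illustrative names only):

    err  = yhat - y;                   % signed errors
    mse  = mean(err.^2);               % Mean Squared Error
    mae  = mean(abs(err));             % Mean Absolute Error
    bias = mean(err);                  % positive and negative errors cancel
    mape = 100 * mean(abs(err ./ y));  % assumes no actual value equals zero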
Classifiers come in two basic varieties: those which produce class outputs, and those which produce probabilities of classes.
Classifiers: Class Output
Accuracy is the proportion of the time that the predicted class equals the actual class, usually expressed as a percentage. Its meaning is straightforward, but it may obscure important differences in the costs associated with different errors. The classic example of such costs is the medical diagnostic situation, in which one can err by either: 1. keeping a healthy patient in the hospital (low cost), or 2. sending home a sick patient (very high cost).
Classifiers: Probability Output
These classifiers need to be checked for both the accuracy of their probabilities (Do cases predicted to have a 5% (30%, 80%, etc.) probability really belong to the target class 5% (30%, 80%, etc.) of the time?) and their ability to separate the classes in question.
Accuracy can be measured using many of the same metrics used to evaluate numerical models (MSE, MAE, etc.). One interesting alternative which is specific to classification, the informational loss, is based on information theory and is described in Data Mining by Witten and Frank (ISBN 1-55860-552-5).
Some applications (as in marketing) are focused on how many items from the target class can be identified in the best so-many percent of the population. If, for example, one only has the resources to mail marketing literature to 10% of the customer file, the ideal would be to pack as many actual respondents as possible into that best 10%. The mirror situation is typified by lenders, who wish to cram as many bad loans as possible into the worst 10% of their file. Probably the most popular measure of class separation at present in the literature is the Area Under the ROC Curve (AUC or AUROC), which measures separation across the whole spectrum of trade-offs rather than at a single cut-off.
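Both checks are easy to sketch in MATLAB. The sketch below assumes 0/1 actual classes in y, predicted probabilities in p, and the Statistics Toolbox function perfcurve for the AUC (variable names are illustrative):

    % Calibration: compare predicted probability bins to observed rates
    edges = 0:0.1:1;
    [~, bin] = histc(p, edges);
    for b = 1:10
        inBin = (bin == b);
        if any(inBin)
            fprintf('predicted %.1f-%.1f: observed rate %.3f (n=%d)\n', ...
                    edges(b), edges(b+1), mean(y(inBin)), sum(inBin));
        end
    end
    % Separation: Area Under the ROC Curve
    [~, ~, ~, auc] = perfcurve(y, p, 1);   % 1 = the target class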
The intrepid data miner is invited to explore these performance measures and related topics on his or her own:
sensitivity and specificity
Tuesday, November 14, 2006
I'm sure there are nearly infinite ways to tweak the ideas of random record/variable selection, but once again the keys to the success of ensembles, here and always, are:
1) Diversity in information (i.e., data) the modeling algorithm sees.
2) Use of algorithms that are weak learners, which benefit most from ensembles.
Random Forests, it seems to me, works well because not only are trees coarse, blunt (and unstable) predictors, but they are also greedy searches that can be fooled into going down a sub-optimal path. By constraining the splits to consider only some of the variables, the tree is forced out of its greedy perspective to consider other ways to achieve the solution. This new algorithm does the same thing, with the twist of using PCA to develop linear projections of the original data (of subsets of the variables, to be more precise).
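For readers who want to try the basic recipe, here is a hedged MATLAB sketch using the Statistics Toolbox's TreeBagger (available in recent releases; variable names are illustrative). Each tree gets a bootstrap sample of the records and considers only a random subset of the variables at each split:

    nTrees = 200;
    m = max(1, floor(sqrt(size(X, 2))));     % typical per-split variable count
    rf = TreeBagger(nTrees, X, y, 'Method', 'classification', ...
                    'NumPredictorsToSample', m);
    yhat = predict(rf, Xnew);                % majority vote over the ensemble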
I think we'll be seeing more and more variations on the same theme in the coming years.
SNNS (Stuttgart Neural Network Simulator)
YALE (Yet Another Learning Environment)
There are some commercial tools which sell for less than US$250, such as:
Of course, there is always the "roll-your-own" approach, in which the data miner constructs or gathers his or her own tools. The internet houses a wealth of source code in a variety of languages. Aside from searching on general-purpose Internet search engines for things like:
"quadratic discriminant" "source code"
...there are also source code repositories, such as:
Sunday, November 12, 2006
There apparently was a real ritual in part of South America involving a king covered in gold dust, which perhaps gave rise to the legend. It has also been theorized that the people inhabiting lands being consumed by the Spanish empire essentially told the Spanish whatever they wanted to hear: whatever it took to get them to go away. Either way, the Spanish searched in vain for something that wasn't there.
Today, organizations search for models which approximate reality. A frequent feature of this search, particularly prevalent among novices, is the quest for the "best" model. Models vary in their performance, with some clearly being "better" than others. It is natural enough, then, to think that data mining is an undertaking whose ideal outcome is the "best" model.
Some time ago, I was being interviewed by auditors regarding methods I used in building models of customer behavior. They asked questions about various aspects of data gathering, model building and testing methods. Their questions eventually came around to the issue of how I knew that my models were the best possible. I told them that I did not know that my models were the best possible. I told them that, in fact, I knew that my models were not the best possible. I said that I knew that my models were technically sound, had historically been reliable over long periods of time and that they answered the needs of the business. Searching for a "best possible" model, I asserted, was an academic exercise.
How much time a given modeling effort deserves is an open question, and I am not advocating doing a bad job. Remember, though, that with enough data, a search for better models could continue indefinitely, but this would be a fool's errand. There will always be modeling techniques and derived variables not yet explored. That a deployed model fails to meet some unknown and theoretical ideal is only important as a missed opportunity (but a potentially very expensive one to explore). Much more important is whether a model meets a business need and whether it is likely to continue to do so. This implies that time in data mining is better spent validating models, rather than exhaustively chasing trivial improvements in apparent performance.
Small News Item: I have started another, tool-specific Web log, Data Mining in MATLAB.
Tuesday, November 07, 2006
Training to "Convergence"
Don't. Use early stopping instead. Use a separately drawn "tuning" data set to track the model fit. Train until the error on the "tuning" set reaches a minimum, then stop. Training further implies over-fitting.
The other sensible alternative is to train until convergence, but limit the number of hidden neurons. This is theoretically acceptable, but in practice requires longer training runs and more experimentation than early stopping.
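In MATLAB's Neural Network Toolbox (recent releases), early stopping is what you get by holding out a validation ("tuning") set. A minimal sketch, assuming inputs X and targets T with one column per case:

    net = feedforwardnet(5);             % one hidden layer, start small
    net.divideFcn = 'dividerand';        % randomly hold out a tuning set
    net.divideParam.trainRatio = 0.70;
    net.divideParam.valRatio   = 0.15;   % the "tuning" set
    net.divideParam.testRatio  = 0.15;
    net.trainParam.max_fail    = 6;      % stop after 6 epochs of rising tuning error
    [net, tr] = train(net, X, T);        % training halts near the tuning-set minimum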
Number of Hidden Layers
Use one hidden layer, always. In my experience, stacking on extra layers rarely improves performance enough to justify the computational effort. The time spent on longer training times and experimentation with the combination of 2 or more hidden layers is better spent on other performance-boosting efforts, such as dreaming up better derived features.
Number of Hidden Neurons
Training with early stopping means that over-fitting should not be an issue, so I generally don't worry about having "too many" hidden neurons unless memory use or training time really gets out of hand. Since early stopping typically means shorter training runs (fewer passes over the data) than "training to convergence", one can afford a little experimentation here.
Surprisingly small hidden layers (as few as, say, 5 neurons) can often be effective, and are quick to train. If this does not produce satisfactory results, make the hidden layer larger by perhaps 50% at a time.
If your software does not scale the input variables automatically, it may be beneficial to do so even though it is theoretically unnecessary. Wildly different input variable ranges can slow training.
If your software does not scale the target (output) variable automatically, this is necessary, since neural networks have a limited output range, either 0.0 to 1.0 or -1.0 to +1.0. Incidentally, unless I have a good reason, I don't bother with slightly reduced scales (like 0.001 to 0.999), as some authors suggest.
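In recent MATLAB releases, the toolbox function mapminmax handles both scalings. A sketch, again with illustrative variable names (rows are variables, columns are cases):

    [Xs, psX] = mapminmax(X);                   % inputs scaled to [-1, +1]
    [Ts, psT] = mapminmax(T);                   % targets scaled to [-1, +1]
    % ... train on Xs and Ts ...
    pred = mapminmax('reverse', netOut, psT);   % map predictions back to raw units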
There are many other "tricks" one can use to turbo-charge performance, naturally, but this basic recipe has served me well over the years. Despite other analysts' stories of needing thousands of passes over the data, in my experience (with many thousands of training observations and over one hundred inputs), training with early stopping will often complete in less than 50 runs. Your mileage, of course, may vary.
What I've been reading lately: Common Errors in Statistics (and How to Avoid Them) Second Edition by Phillip I. Good and James W. Hardin (ISBN: 0471794317).
Sunday, November 05, 2006
Consider the solution described in the paper, Fuzzy Decision System for Threshold Selection to Cluster Cauliflower Plant Blobs From Field Visual Images, by L. García-Pérez, M. C. García-Alegre, J. Marchant and T. Hague, which involves an agricultural machine vision problem. One fundamental step in the solution of this particular problem is segmentation of the digitized image into two classes: plants and background. In this case, the authors have elected to employ a single threshold of brightness to be applied to the entire image, above which pixels will be regarded as plants, and below which pixels will be regarded as background.
A complete solution to this problem accepts image data as input and produces a suitable threshold value as output. Having gotten in the habit of viewing the model as the solution, it is easy for the modeler to prematurely conclude that his or her model should, by itself, bridge this entire gap.
In the paper mentioned above, however, the authors took a different perspective. They constructed a model (a "policy", really- in this case composed of fuzzy logic rules) which indicates only by how much a given threshold should be incremented or decremented to improve performance. The threshold is initialized and updated by iteratively firing the model to modify the threshold. This is essentially a search across candidate threshold values. When the threshold value stabilizes (when the model indicates zero suggested change in the threshold), the process is finished and the final threshold is output. Note that the model does not directly determine the threshold. This model accepts image data and a candidate threshold as input and produces a suggested change for the threshold value as output. The total solution, which does accept image data as input and produces a suitable threshold value as output is composed of the fuzzy logic model as well as a simple looping search mechanism.
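The looping search mechanism is only a few lines. In this MATLAB sketch, fuzzyDelta is a hypothetical stand-in for the authors' fuzzy rule model; as in the paper, it returns only a suggested change to the threshold, never the threshold itself:

    t = 128;                             % initial threshold (8-bit brightness assumed)
    for iter = 1:100                     % cap the search, just in case
        delta = fuzzyDelta(img, t);      % model output: how much to move t
        if abs(delta) < 0.5              % stabilized: zero suggested change
            break
        end
        t = t + delta;
    end
    plants = img > t;                    % pixels above t regarded as plants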
I suggest that it is worth trying to find the best way to leverage the power of modeling, whether as a complete solution unto itself or otherwise. Re-arranging what, exactly, is treated as model input and model output is one way to accomplish this.
Friday, November 03, 2006
I'll never forget the hubbub that was generated by a talk by Pedro Domingos, I think at the 1998 KDD conference in New York City. His talk, "Occam's Two Razors: The Sharp and the Blunt", turned the usually calm and measured data mining crowd into an agitated horde! I mean, there was even yelling (and if you've never been to a conference of statisticians or data miners, you won't know how unbelievable this scene was). This is from the abstract, also posted at Citeseer in the link above:
Occam's razor has been the subject of much controversy. This paper argues that this is partly because it has been interpreted in two quite different ways, the first of which (simplicity is a goal in itself) is essentially correct, while the second (simplicity leads to greater accuracy) is not. The paper reviews the large variety of theoretical arguments and empirical evidence for and against the "second razor," and concludes that the balance is strongly against it.
John Elder provided a very useful and reasoned explanation for the apparent difficulties with Occam's Razor here (in a summary of KDD '98 presented at my favorite conference, the Symposium on the Interface). To quote,
Jianming Ye presented an intriguing alternative metric, Generalized Degrees of Freedom, which finds the sensitivity of the fitted values to perturbations in the outputs. This allows complexity to be measured and compared in an entirely experimental manner [Ye, J. (March 1998). On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93, no. 441, pp. 120-131.]
The key idea is this: just because a model has lots of weights or splits doesn't mean it is acting in a complex way. John has shown subsequently (at the most recent Salford Systems conference in San Diego, March 2006) that, using the GDoF idea, ensembles of trees can actually behave in a less complex manner than the individual trees within the ensemble.
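For readers who want to experiment, Ye's measure can be estimated by Monte Carlo: perturb the targets slightly, refit, and measure how much each fitted value moves with its own perturbation. A minimal MATLAB sketch, where fitAndPredict is a hypothetical helper that refits your model on the supplied targets and returns the fitted values:

    n = numel(y);
    sigma = 0.05 * std(y);               % small perturbation scale
    T = 100;                             % number of perturbation rounds
    E = sigma * randn(n, T);             % random target perturbations
    Yhat = zeros(n, T);
    for t = 1:T
        Yhat(:, t) = fitAndPredict(X, y + E(:, t));   % refit on perturbed targets
    end
    % per-observation slope of fitted value vs. its own perturbation, summed
    gdf = sum(sum(E .* (Yhat - repmat(mean(Yhat, 2), 1, T)), 2) ./ sum(E.^2, 2));

A perfectly rigid model yields a GDoF near zero, while ordinary linear regression recovers roughly its number of parameters, which is what makes the measure a sensible "experimental" complexity scale.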
So I agree with the idea that simpler models are usually better, and for that reason, I always try linear models first. However, sometimes the best models in terms of accuracy are complex looking, but in reality behave in relatively simple ways.
1. Simple models are typically very fast to train, allowing more time for handling other aspects of the modeling problem, such as attribute selection.
2. Most non-technical people are more comfortable with explanations of linear-based models than any other kind. In regulated industries, this is of tremendous benefit.
3. When most modelers think of "simple models", linear regression, linear discriminants and logistic regression come to mind, but there are other, less well known options, such as extreme-value regression (also called complementary log-log regression; see the sketch after this list). Indeed, the transfer function, where one is needed, can be any monotonic function. Further, the linear portion of the model need not be fit by traditional methods: I trained a linear model via a global optimizer to maximize class separation (AUC) for a commercial application. This model was found to be highly effective and has now been in service for 18 months.
4. Complex models can be built out of collections of simple ones. One of my recent responsibilities was to create and maintain a predictive model used in the management of several billion dollars worth of assets. Any individual case falls into one of 20 mutually exclusive segments, each with its own logistic regression.
I recommend Generalized Linear Models by McCullagh and Nelder for readers interested in exploring this subject further.
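As a concrete example of item 3, extreme-value (complementary log-log) regression is a one-line fit with MATLAB's Statistics Toolbox; here X is a design matrix and y is a 0/1 response (illustrative names):

    b = glmfit(X, y, 'binomial', 'link', 'cloglog');   % fit the cloglog model
    p = glmval(b, Xnew, 'cloglog');                    % predicted probabilities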
Friday, October 27, 2006
The dark side is the responsibility. I have to do all the things which the commercial shells do, such as manage the data. Occasionally I even need to manage the RAM, on really big problems. My current work machine is a Windows workstation with 2GB RAM (soon to move to a faster machine with 4GB RAM). While I have much more flexibility than the commercial tools provide, sometimes my fingers bleed (figuratively, not literally, yuck) taking care of all the details.
Still, once a decent code base is established, it isn't so bad. For instance, my feature selection process at this point is fairly efficient and robust, being implemented as a few MATLAB functions.
I don't code anymore, at least not seriously. One reason for that is that data mining software environments have progressed to the point that I don't need to dust off my C/C++ skills (or lack thereof). And in those relatively rare cases where I do need to program, I can use 4th generation languages, which are quite powerful (if I can ever remember the syntax, but that's another story altogether). Nearly every data mining software package has its own language: the S-Plus command line for S-Plus and Insightful Miner, Clem for Clementine, CART has its own language, Matlab, Visual Basic for Statistica, and of course SAS in Enterprise Miner. This is just naming a few, of course.
It is, of course, the logical extension of the ensemble techniques that have been used for the past decade. The method that I found most accessible was to (1) resample the data with bootstrap samples, (2) create k-means cluster models for each sample, and (3) attach the cluster labels from each model to each record (at this point, you have R records, M fields used to build the clusters, and P cluster models, one new field per model). Finally, you can build a hierarchical clustering model based on the records and the new P fields.
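A hedged MATLAB sketch of that recipe (Statistics Toolbox assumed, recent release; X is the R-by-M data matrix and k the number of clusters, both illustrative):

    P = 25;                                  % number of bootstrap cluster models
    R = size(X, 1);
    labels = zeros(R, P);
    for p = 1:P
        idx = randsample(R, R, true);        % bootstrap sample of the records
        [~, C] = kmeans(X(idx, :), k);       % cluster the sample; keep centroids
        [~, labels(:, p)] = min(pdist2(X, C), [], 2);   % label every record
    end
    % hierarchical clustering of the records on the P new label fields
    Z = linkage(pdist(labels, 'hamming'), 'average');
    finalLabels = cluster(Z, 'maxclust', k);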
More on this after some experiments.
Thursday, October 19, 2006
I find that data mining and predictive analytics fall into the same category--they are the same basic technology but described from different perspectives. Sometimes colleagues have tried to point out distinctions, and I think one of the better ones was posted by Eric King here, where my definition of "better" means simple and clear.
Predictive analytics is a term I see more in the CRM and database worlds (TDWI conferences come to mind). Perhaps some of this is due to the encroachment of BI into the data mining world, where queries and OLAP are sometimes called data mining (after all, you are "drilling" down into the data!). This would necessitate creating further distinctions in terminology.
However, I don't see data mining losing hold on the style of predictive modeling that is largely empirical and data driven. So I include predictive analytics in the title of this blog as an alternative to data mining in name only, not in purpose.
These are usually of little importance to the business objective, which may be to find a population of customers who will purchase at least $100 of goods, or respond at a rate greater than 8% to a campaign. In these cases, a model that performs "well" in the algorithm's view may not be particularly good at identifying the top-tier responders. Therefore, the problem should be set up with the business objective in mind, not the data mining algorithm's objective, and the models should be assessed using a metric that matches the business objective as closely as possible.
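For instance, to judge a model against the second objective above, one can compute the response rate in the best-scoring 10% of the file directly. A sketch in MATLAB, assuming model scores in score and a 0/1 outcome in response (illustrative names):

    [~, order] = sort(score, 'descend');      % rank the file by model score
    nTop = ceil(0.10 * numel(score));         % the best 10%
    topRate  = mean(response(order(1:nTop))); % response rate in the top decile
    baseRate = mean(response);                % response rate overall
    lift10 = topRate / baseRate;              % lift in the top decile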