Friday, December 14, 2007

Three Critical Junctures

I don't know that it's possible to say that any single part of the data mining process is the "most important", but there are three junctures which are absolutely critical to successful data mining: 1. problem definition, 2. data acquisition and 3. model validation. Failures at other points will more often lead to loss in the form of missed opportunities.


Problem Definition

Problem definition means understanding the "real-world" or "business" problem, as opposed to the technical modeling or segmentation problem. In some cases, deliberation on the nature of the business problem may reveal that an empirical model or other sophisticated analysis is not needed at all. In most cases, the model will be only one part of a larger solution. This point is worth elaborating. Saying that the model is only part of a larger solution is not merely a nod to the database which feeds the model and the reporting system which summarizes model performance in the field. The point is that a predictive model or clustering mechanism must somehow be fit into the architecture of the solution. The important question is: "How?" Models sometimes solve the whole (technical) problem, but in other situations, optimizers are run over models, or models are used to guide a separate search process. Deciding exactly how the model will be used within the total solution is not always trivial.
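
To make "optimizers are run over models" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the logistic response curve stands in for a fitted model, its coefficients are invented, and the margin figure is an assumption. The only point is that the model supplies predictions while a separate search step picks the action.

```python
import math

def predicted_response_rate(discount):
    # Stand-in for a fitted predictive model (hypothetical coefficients):
    # probability that a customer accepts an offer at this discount level.
    return 1.0 / (1.0 + math.exp(-(-3.0 + 8.0 * discount)))

def expected_profit(discount, margin=100.0):
    # Business objective built on top of the model's prediction.
    # `margin` is an assumed full-price profit per accepted offer.
    return predicted_response_rate(discount) * margin * (1.0 - discount)

def best_discount(candidates):
    # The "optimizer": an exhaustive search over candidate discount levels.
    return max(candidates, key=expected_profit)

# Candidate discounts from 0% to 90% in 1% steps.
candidates = [i / 100 for i in range(0, 91)]
chosen = best_discount(candidates)
print(chosen, round(expected_profit(chosen), 2))
```

Notice that the model alone never answers the business question here. It only predicts response; the decision comes from the search wrapped around it.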

Also: attacking the wrong business problem all but ensures failure, since the chances of being able to quickly and inexpensively "re-engineer" a fully-constructed technical solution for the real business problem are slim.


Data Acquisition

Data acquisition refers to the actual collection of whatever data is to be used to build the model. If, for instance, sampling is not representative of the statistical universe to which the model will be applied, all bets are off. More than once, I have received analytical extracts of databases from other individuals which, for instance, contained no accounts with last names starting with the letters 'P' through 'Z'! Clearly, a very arbitrary sample had been drawn. The same thing happens all the time when database programmers naively query for limited ranges of account numbers or other record index values ("all account numbers less than 140000").
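
The account-number problem is easy to reproduce. The sketch below uses made-up data in which account numbers are assigned sequentially, so low numbers mean older accounts, and older accounts are assumed to carry larger balances. The field names, the cutoff and the balance formula are all invented for illustration.

```python
import random

def make_accounts(n=10000, seed=0):
    # Hypothetical population: sequential account numbers, with balance
    # rising with account age (lower number = older account) plus noise.
    rng = random.Random(seed)
    accounts = []
    for i in range(n):
        tenure_years = (n - i) / 1000.0
        balance = 500.0 + 200.0 * tenure_years + rng.gauss(0.0, 100.0)
        accounts.append({"account_number": 100000 + i, "balance": balance})
    return accounts

def mean_balance(rows):
    return sum(r["balance"] for r in rows) / len(rows)

accounts = make_accounts()

# The naive extract: "all account numbers less than 101000".
range_sample = [a for a in accounts if a["account_number"] < 101000]

# A simple random sample of the same size.
random_sample = random.Random(1).sample(accounts, len(range_sample))

population_mean = mean_balance(accounts)
range_error = abs(mean_balance(range_sample) - population_mean)
random_error = abs(mean_balance(random_sample) - population_mean)
print(round(range_error, 1), round(random_error, 1))
```

Both samples contain the same number of records. Only the range query is badly off, because the account number is correlated with the quantity being measured.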

With larger and larger data sets being examined by data miners, the need for sampling will not go away in the foreseeable future. Sampling has long been studied within statistics and there are far too many pitfalls in this area to ignore the issue. My strong recommendation is to learn about it, and I suggest a book like Sampling: Design and Analysis by Sharon L. Lohr (ISBN-13: 978-0534353612).


Model Validation

Model validation gets my vote for "most important step in any data mining project". This is where, to the extent it's possible, the data miner determines how much the model really has learned. As I write this, it is the end of the year 2007, yet, amazingly, people who call themselves "analysts" continue to produce models without delivering any sort of serious evidence that their models work. Years after the publication of "Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000" by Charles Elkan, in which the dangers of testing on the training set were (yet again!) demonstrated, models are not receiving the rigorous testing they need.
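
A small sketch of why testing on the training set proves nothing. It is not taken from Elkan's paper; it is just an illustration with synthetic data. The labels are coin flips with no relationship to the feature, and the "model" is a one-nearest-neighbor memorizer written out by hand.

```python
import random

def nearest_neighbor_predict(train, x):
    # 1-nearest-neighbor on a single numeric feature: return the label
    # of the training row whose feature value is closest to x.
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, test):
    hits = sum(1 for x, y in test if nearest_neighbor_predict(train, x) == y)
    return hits / len(test)

rng = random.Random(42)
# Pure noise: the label is a coin flip, independent of the feature.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(400)]
train, holdout = data[:200], data[200:]

resubstitution_accuracy = accuracy(train, train)  # tested on training set
holdout_accuracy = accuracy(train, holdout)       # tested on unseen data
print(resubstitution_accuracy, holdout_accuracy)
```

The memorizer looks perfect on the data it was built from, since every training row is its own nearest neighbor. On held-out rows it can do no better than guessing, because there is nothing to learn.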

"Knowing what you know" (and what you don't know) is critical. No model is perfect, and understanding the limits of likely performance is crucial. This requires the use of error resampling methods, such as holdout testing, k-fold cross-validation and bootstrapping. Performance of models, once deployed, should not be a surprise, nor a matter of faith.
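
For readers who have not used these methods, here is a rough sketch of k-fold cross-validation and a bootstrap with out-of-bag testing, using only the Python standard library. The threshold classifier and the synthetic two-class data are assumptions chosen to keep the example short, and the out-of-bag variant is only one of several ways to apply the bootstrap to error estimation.

```python
import random
import statistics

def fit_threshold(train):
    # Toy model: cut point halfway between the two class means.
    # Assumes class 1 tends to have the larger feature value.
    m0 = statistics.mean(x for x, y in train if y == 0)
    m1 = statistics.mean(x for x, y in train if y == 1)
    return (m0 + m1) / 2.0

def accuracy(threshold, test):
    return sum(1 for x, y in test if (x > threshold) == (y == 1)) / len(test)

def k_fold_indices(n, k, rng):
    # Shuffle row indices, then deal them into k disjoint folds.
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, fit, score, rng):
    scores = []
    for fold in k_fold_indices(len(data), k, rng):
        held = set(fold)
        train = [row for i, row in enumerate(data) if i not in held]
        test = [data[i] for i in fold]
        scores.append(score(fit(train), test))
    return scores

def bootstrap_oob(data, rounds, fit, score, rng):
    # Train on a resample drawn with replacement; test on the rows
    # that were never drawn (the "out-of-bag" rows).
    n = len(data)
    scores = []
    for _ in range(rounds):
        drawn = [rng.randrange(n) for _ in range(n)]
        drawn_set = set(drawn)
        oob = [data[i] for i in range(n) if i not in drawn_set]
        if oob:
            scores.append(score(fit([data[i] for i in drawn]), oob))
    return scores

rng = random.Random(7)
# Synthetic data: two overlapping classes on one feature.
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(150)]
        + [(rng.gauss(2.0, 1.0), 1) for _ in range(150)])

cv_scores = cross_validate(data, 10, fit_threshold, accuracy, rng)
boot_scores = bootstrap_oob(data, 50, fit_threshold, accuracy, rng)
print(round(statistics.mean(cv_scores), 3), round(statistics.stdev(cv_scores), 3))
print(round(statistics.mean(boot_scores), 3), round(statistics.stdev(boot_scores), 3))
```

The spread of the scores is as informative as their average. It gives a rough sense of how much performance might vary once the model is deployed, which is the "limits of likely performance" mentioned above.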