Sunday, November 12, 2006

Seeking the "Best" Model or, Data Mining's El Dorado

As their explorations of South America in the 1500s progressed, Spanish explorers encountered stories of El Dorado, originally "a man covered in gold", but eventually coming to mean a lost city of gold. These stories varied in the details, but at the center of all of them was the gold. Such legends circulated widely and provoked considerable interest. Significant energy (and time and money and lives) was spent in an effort to discover the location of El Dorado, without success.

There apparently was a real ritual in part of South America involving a king covered in gold dust, which perhaps gave rise to the legend. It has also been theorized that the people inhabiting lands being consumed by the Spanish empire essentially told the Spanish whatever they wanted to hear: whatever it took to get them to go away. Either way, the Spanish searched in vain for something that wasn't there.

Today, organizations search for models which approximate reality. A frequent feature of this search, particularly prevalent among novices, is the search for the "best" model. Models vary in their performance, with some clearly being "better" than others. It is natural enough, then, to think that data mining is an undertaking whose ideal outcome is the "best" model.

Some time ago, I was being interviewed by auditors regarding methods I used in building models of customer behavior. They asked questions about various aspects of data gathering, model building and testing methods. Their questions eventually came around to the issue of how I knew that my models were the best possible. I told them that I did not know that my models were the best possible. I told them that, in fact, I knew that my models were not the best possible. I said that I knew that my models were technically sound, had historically been reliable over long periods of time and that they answered the needs of the business. Searching for a "best possible" model, I asserted, was an academic exercise.

How much time a given modeling effort deserves is an open question, and I am not advocating doing a bad job. Remember, though, that with enough data, a search for better models could continue indefinitely, but this would be a fool's errand. There will always be modeling techniques and derived variables not yet explored. That a deployed model fails to meet some unknown and theoretical ideal is only important as a missed opportunity (but a potentially very expensive one to explore). Much more important is whether a model meets a business need and whether it is likely to continue to do so. This implies that time in data mining is better spent validating models, rather than exhaustively chasing trivial improvements in apparent performance.
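To make the point concrete, here is a minimal holdout-validation sketch. It is not the author's actual procedure: the data, the 10% label-noise rate, and the threshold "model" are all invented for illustration. Exhaustively tuning a threshold on the training set squeezes out the "best" apparent fit, but the number that matters is performance on data the model never saw.

```python
import random

random.seed(0)

# Synthetic "customer" data: one predictor x and a binary response y
# that mostly follows x > 0.5, with 10% label noise (all hypothetical).
rows = []
for _ in range(200):
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.1:   # label noise
        y = 1 - y
    rows.append((x, y))

train, holdout = rows[:150], rows[150:]

def accuracy(threshold, data):
    return sum(int(x > threshold) == y for x, y in data) / len(data)

# Exhaustively tuning the threshold on the training set finds the
# "best" apparent fit...
best_t = max((t / 100 for t in range(1, 100)),
             key=lambda t: accuracy(t, train))

# ...but the holdout numbers are what speak to future performance.
print("tuned:  ", accuracy(best_t, train), accuracy(best_t, holdout))
print("default:", accuracy(0.5, train), accuracy(0.5, holdout))
```

The tuned threshold is guaranteed to look at least as good as the sensible default on the training data; the holdout set shows how little of that advantage survives.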

Small News Item: I have started another, tool-specific Web log, Data Mining in MATLAB.


mfbkb said...

What you say is true. It is impossible to find the “best model”, at least with real data, but information theory (IT) gives us a method for finding an optimal model. Hopefully, a data set contains enough information to solve our problem. Using IT, it is possible to measure both the maximum amount of information a data set could contain and the actual amount of information it does contain about the dependent variable. That assures the creation of an optimal model given the available data.

It is important to note that using information theory makes the preparation of the variables unnecessary.

Dorian Pyle adapted information theory to the needs of data mining; the initial discussion can be found in Chapter 11 of “Data Preparation for Data Mining”. Powerhouse is the program that applies these ideas, and it shows that it is possible to get an optimal model in the shortest possible time with the least technical effort.

tombreur said...

Information Theory makes it possible to determine the information content present "within" a given data set.

What this allows is seeing how well any given model performs on "this" data set, relative to the theoretical ceiling set by the information content of the data set.

It is NOT true that with IT you would not need to prepare variables; quite the contrary. What IT enables is OPTIMAL transformation of input variables, GIVEN an output variable one is trying to predict (by maximizing the information transmission between input variables and the target variable). The confusion may have arisen because Powerhouse, the tool that Dorian Pyle has been developing in the past years, has these preparation routines built in, which require absolutely minimal effort and/or intervention from the miner.
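The "ceiling" idea above can be sketched with the standard entropy and mutual-information formulas; this is textbook information theory, not Powerhouse's actual routines, and the toy data is invented for illustration.

```python
import math
from collections import Counter

def entropy(ys):
    """H(Y) in bits for a discrete sequence."""
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def mutual_information(xs, ys):
    """I(X;Y) in bits for two discrete sequences."""
    n = len(ys)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy data: a binary target and an input that partially predicts it.
ys = [0, 0, 0, 1, 1, 1, 0, 1]
xs = [0, 0, 1, 1, 1, 1, 0, 0]

# The target holds at most entropy(ys) bits; the input transmits only
# mutual_information(xs, ys) of them.  The gap is information that no
# model built on xs alone can recover.
print(mutual_information(xs, ys), "of a ceiling of", entropy(ys), "bits")
```

Any model using only `xs` is bounded by that transmitted information, which is the sense in which the data set, not the modeling technique, sets the performance ceiling.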

Dean Abbott said...

Can you give some specifics on how IT helps here (as I am unfamiliar with PTI and the techniques that Pyle uses)? This isn't my field, but the way you are describing what IT does makes me think of information-theoretic criteria, such as MDL, AIC, PSE, BIC, etc., that provide tradeoffs between fitting accuracy and complexity in models. One can then optimize models by selecting only those variables whose presence is justified by a sufficient reduction in fitting error.
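For a rough illustration of the criteria Dean mentions, here is AIC in its common Gaussian-residual form, AIC = n·ln(RSS/n) + 2k, compared across two nested linear models. The synthetic data and both models are invented for this sketch; the slope always lowers the residual sum of squares, but AIC rewards it only if the fit improves enough to justify the extra parameter.

```python
import math
import random

random.seed(1)
n = 50
xs = [random.random() for _ in range(n)]
# Weak true slope, strong noise -- so the extra parameter is marginal.
ys = [2.0 + 0.05 * x + random.gauss(0, 1.0) for x in xs]

def aic(rss, k):
    """Gaussian AIC: n*ln(RSS/n) + 2k, where k counts fitted parameters."""
    return n * math.log(rss / n) + 2 * k

# Model 1: intercept only.
mean_y = sum(ys) / n
rss1 = sum((y - mean_y) ** 2 for y in ys)

# Model 2: intercept + slope, by ordinary least squares.
mean_x = sum(xs) / n
beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs))
alpha = mean_y - beta * mean_x
rss2 = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

# rss2 <= rss1 always; the 2k penalty decides whether the slope "earns
# its keep" in AIC terms.
print("AIC intercept-only:", aic(rss1, 1))
print("AIC with slope:    ", aic(rss2, 2))
```

MDL and BIC work the same way with different penalty terms, which is why they tend to select sparser models than raw fitting error would.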

mfbkb said...

Well, if you create an information map using IT to analyse data, one of the consequences is that the variables are prepared automatically in the process, so what I wanted to say was that the analyst can skip the data preparation (DP) stage.

Once the information map is created, it is possible to select variables one by one. The first variable to choose is the one carrying the most information. The next variable to select is the one carrying the most additional information. The process continues until there are no more variables available or including any additional variable would cause more loss of representativeness than information gain.

These selected variables carry the maximum information for the least loss of representativeness and can be used to create an optimal model.
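The greedy procedure described above can be sketched as follows. The mutual-information formula is standard; `min_gain` is an invented stand-in for the representativeness criterion (the real tradeoff mfbkb describes is subtler), and the toy variables are hypothetical.

```python
import math
from collections import Counter

def mi(xs, ys):
    """I(X;Y) in bits; xs may be a sequence of tuples (joint variables)."""
    n = len(ys)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def forward_select(variables, ys, min_gain=0.05):
    """Greedy forward selection: repeatedly add the variable carrying the
    most ADDITIONAL information about ys, stopping when the incremental
    gain falls below min_gain (a crude proxy for the point where lost
    representativeness outweighs gained information)."""
    n = len(ys)
    selected, remaining, current = [], set(variables), 0.0
    def joint(names):
        return [tuple(variables[v][i] for v in names) for i in range(n)]
    while remaining:
        name, score = max(((v, mi(joint(selected + [v]), ys)) for v in remaining),
                          key=lambda p: p[1])
        if score - current < min_gain:
            break
        selected.append(name)
        remaining.discard(name)
        current = score
    return selected

# Toy data: 'a' predicts ys perfectly, 'b' duplicates 'a' (no additional
# information), and 'c' is unrelated noise.
ys = [0, 0, 1, 1, 0, 0, 1, 1]
variables = {
    'a': [0, 0, 1, 1, 0, 0, 1, 1],
    'b': [0, 0, 1, 1, 0, 0, 1, 1],
    'c': [0, 1, 0, 1, 0, 1, 0, 1],
}
print(forward_select(variables, ys))  # one of the duplicates, alone
```

Note that the duplicate variable is never added: it carries plenty of information, but no *additional* information, which is exactly the distinction the selection rule turns on.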