As their explorations of South America in the 1500s progressed, Spanish explorers encountered stories of El Dorado, originally "a man covered in gold", but eventually coming to mean a lost city of gold. These stories varied in the details, but at the center of all of them was the gold. Such legends circulated widely and provoked considerable interest. Significant energy (and time and money and lives) were spent in an effort to discover the location of El Dorado, without success.
There apparently was a real ritual in part of South America involving a king covered in gold dust, which perhaps gave rise to the legend. It has also been theorized that the people inhabiting lands being consumed by the Spanish empire essentially told the Spanish whatever they wanted to hear- whatever it took to get them to go away. Either way, the Spanish searched in vain for something that wasn't there.
Today, organizations search for models which approximate reality. A frequent feature of this search, particularly prevalent among novices is the search for the "best" model. Models vary in their performance, with some clearly being "better" than others. It is natural enough, then, to think that data mining is an undertaking whose ideal outcome is the "best" model.
Some time ago, I was being interviewed by auditors regarding methods I used in building models of customer behavior. They asked questions about various aspects data gathering, model building and testing methods. Their questions eventually came around to the issue of how I knew that my models were the best possible. I told them that I did not know that my models were the best possible. I told them that, in fact, I knew that my models were not the best possible. I said that I knew that my models were technically sound, had historically been reliable over long periods of time and that they answered the needs of the business. Searching for a "best possible" model, I asserted, was an academic exercise.
How much time a given modeling effort deserves is an open question, and I am not advocating doing a bad job. Remember, though, that with enough data, a search for better models could continue indefinitely, but this would be a fool's errand. There will always be modeling techniques and derived variables not yet explored. That a deployed model fails to meet some unknown and theoretical ideal is only important as a missed opportunity (but a potentially very expensive one to explore). Much more important is whether a model meets a business need and whether it is likely to continue to do so. This implies that time in data mining is better spent validating models, rather than exhaustively chasing trivial improvements in apparent performance.
Small News Item: I have started another, tool-specific Web log, Data Mining in MATLAB.