Thursday, May 05, 2005

Gain Insights by Building Models from Several Algorithms

A frequent question in data mining is "which algorithm performs best?" While I would argue that getting the data right is far more important than the choice of algorithm, the algorithms do differ in ways that the analyst can use to his or her advantage.

Decision trees build rules that identify homogeneous groups of data (essentially, little rectangular chunks), so a tree is in effect a whole set of local descriptions of the data. In contrast, regression builds a single global model of all the data. Neural networks are somewhere in between: a single global model is built, but it can be highly nonlinear and can pick out local pockets of homogeneous data. It can be useful, therefore, to look at these different styles of models to understand general trends (regression) and local behaviour (trees). Different algorithms often arrive at the same level of performance in different ways. Seeing these differences can provide additional insight into the business processes underlying the models.
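To make the global-versus-local contrast concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from a real project: the toy data (a gentle upward trend with a jump at x = 12), the helper names fit_line and fit_stump, and the use of a one-split "stump" as a stand-in for a full decision tree.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x: one global model of all the data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def fit_stump(xs, ys):
    """One-split regression tree: choose the threshold that minimises squared error.

    Returns (threshold, mean of y left of it, mean of y right of it).
    """
    pts = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot split between two identical x values
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (pts[i - 1][0] + pts[i][0]) / 2
            best = (sse, threshold, ml, mr)
    return best[1], best[2], best[3]


# Hypothetical toy data: a mild trend of 0.5 per unit, plus a jump of 10 at x >= 12.
xs = [float(x) for x in range(20)]
ys = [0.5 * x + (10.0 if x >= 12 else 0.0) for x in xs]

intercept, slope = fit_line(xs, ys)
threshold, left_mean, right_mean = fit_stump(xs, ys)

print(f"regression: y = {intercept:.2f} + {slope:.2f} * x")
print(f"stump: x < {threshold} -> {left_mean:.2f}, otherwise -> {right_mean:.2f}")
```

On this data the regression line reports a single overall upward slope, which absorbs the jump and so overstates the underlying trend, while the stump places its split between x = 11 and x = 12 and describes each side by its own average. Neither view is wrong; they are different descriptions of the same data, which is exactly why it is worth looking at both.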