Occasionally, I come across descriptions of clustering or modeling techniques which include mention of "assumptions" being made by the algorithm. The "assumption" of normal errors from the linear model in least-squares regression is a good example. The "assumption" of Gaussian-distributed classes in discriminant analysis is another. I imagine that such assertions must leave novices with some questions and hesitation. What happens if these assumptions are not met? Can techniques ever be used if their assumptions are not tested and met? How badly can the assumption be broken before things go horribly wrong? It is important to understand the implications of these assumptions, and how they affect analysis.
In fact, these assumptions are made by the theorist who designed the algorithm, not by the algorithm itself. Most often, they are necessary for some proof of optimality to hold. Considering myself the practical sort, I do not worry too much about them. What matters to me and my clients is how well the model works in practice (which can be assessed via test data), not how well its assumptions are met. Such assumptions are rarely, if ever, strictly met in practice, and most of these algorithms do reasonably well even so. A particular modeling algorithm may well be the best one available, despite not having its assumptions met.
My advice is to be aware of these assumptions to better understand the behavior of the algorithms one is using. Evaluate the performance of a specific modeling technique, not by looking back to its assumptions, but by looking forward to expected behavior, as indicated by rigorous out-of-sample and out-of-time testing.
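As a rough illustration of looking forward rather than back, here is a minimal sketch that fits a least-squares line to data whose errors are deliberately not normal, then judges the model on held-out data. Everything in it is an assumption made for the example: the synthetic data, the true line y = 2 + 3x, the exponential noise, and the helper names `fit_line` and `rmse`. For out-of-time testing, one would split on a time stamp instead of on position in the list.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for the line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def rmse(a, b, xs, ys):
    """Root-mean-square error of the line on the given data."""
    sq = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (sq / len(xs)) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: true line y = 2 + 3x, with skewed (exponential)
# noise shifted to have mean zero. The normality "assumption" is violated.
xs = [rng.uniform(0, 10) for _ in range(400)]
ys = [2 + 3 * x + (rng.expovariate(1.0) - 1.0) for x in xs]

# Out-of-sample split: fit on the first 300 cases, test on the last 100.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

a, b = fit_line(train_x, train_y)
train_rmse = rmse(a, b, train_x, train_y)
test_rmse = rmse(a, b, test_x, test_y)

print("slope:", b, "intercept:", a)
print("train RMSE:", train_rmse, "test RMSE:", test_rmse)
```

If the held-out error is acceptable for the application and close to the training error, that is the evidence that counts, whatever the shape of the error distribution.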
I think you have a point there. A 'pure' statistician would definitely disagree, but the algorithm might work regardless of whether the assumptions are met. The guilt stays with the analyst all the way to interpretation. If we could just get rid of a priori assumptions...
I fully agree, Will. Knowing the assumptions gives you clues to why algorithms may or may not work well. For example, outliers can severely impact a linear regression model because of the squared-error criterion. If one knows this, and identifies the outliers (and removes them or mitigates their influence), the overall model can be improved.
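A small sketch of the outlier effect, using made-up numbers: ten points lying exactly on the line y = 1 + 2x, then the same points with a single response shifted by 50. The data and the helper `fit_line` are hypothetical, chosen only to make the effect easy to see.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical clean data lying exactly on y = 1 + 2x.
xs = [float(x) for x in range(10)]
ys = [1 + 2 * x for x in xs]
a_clean, b_clean = fit_line(xs, ys)

# Same data, but the last response is pushed up by 50 (one outlier).
ys_out = list(ys)
ys_out[-1] += 50
a_out, b_out = fit_line(xs, ys_out)

print("slope without outlier:", b_clean)
print("slope with one outlier:", b_out)
```

In this toy case, one bad point out of ten moves the fitted slope from 2 to about 4.73, because the squared-error criterion weights large residuals so heavily.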
Also, knowing how forgiving an algorithm is of violations of its assumptions is extremely important, but unfortunately I don't know how to quantify this well prior to modeling. For me, it is more of a "feel" from seeing lots of models on lots of data sets. How big of an outlier can a regression model include? Is it balancing vis-a-vis another outlier? You can measure these effects after the model is built, but I don't know of a way to do it before.
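One simple way to measure such effects after the fact is to refit the model with each case left out in turn and see how much the slope moves. The sketch below is only one possible approach, on hypothetical data; `fit_line` and `slope_influence` are names invented for the example, not functions from any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def slope_influence(xs, ys):
    """Change in fitted slope caused by each point (full fit minus leave-one-out fit)."""
    _, b_full = fit_line(xs, ys)
    changes = []
    for i in range(len(xs)):
        _, b_without = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        changes.append(b_full - b_without)
    return changes

# Hypothetical data: points on y = 1 + 2x, with the last response shifted by 50.
xs = [float(x) for x in range(10)]
ys = [1 + 2 * x for x in xs]
ys[-1] += 50

influence = slope_influence(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: abs(influence[i]))

for i, change in enumerate(influence):
    print(i, round(change, 3))
print("most influential case:", most_influential)
```

Cases whose removal shifts the slope far more than the rest are the ones worth a closer look. Two outliers that balance each other would each show up here, even if the full fit looks unremarkable.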
Nice post, which reminded me of a problem I once had: I was unable to normalize a positively skewed distribution, whether with sqr, log, 1/x, you name it! What is your advice in a situation like this?
Thanks!