Recently I was working with an organization, building estimation models (rather than classification). They were interested in using linear regression, so I dutifully looked at the distribution,
as shown to the left (all pictures were generated by Clementine, and I also scaled the distribution to protect the data even more, but didn't change the shape of the data).
There were approximately 120,000 examples. If this were a typical skewed transformation, I would log transform it and be done with it. However, in this distribution there are three interesting problems:
1) skew is 57--heavy positive skew
2) kurtosis is 6180--heavily peaked
3) about 15K of these had value 0, contributing to the kurtosis value
So what to do? One answer is to create the log transform, but maintain sign, using sgn(x)*log10( 1 + abs(x) ). This picture looks like this:
This takes care of the summary statistics problems, as skew became 0.6 and kurtosis -0.14. But it doesn't look right--the spike at 0 looks problematic (and turned out that it was). Also, the distribution actually ends up with two ~normal distributions of different variance, one to the left and one to the right of 0.
Another approach to this is to use the logistic transform 1 / ( 1 + exp(-x/A) ) where A is a scaling factor. Here are the distributions for the original distribution (baddist), the log-transformed version (baddist_nlog10), and the logistic transformed with 3 values of A: 5, 10, and 20, with the corresponding pictures for the three logistic transformed versions.
Of course, going solely on the basis of the summary statistics, I might have a mild preference for the nlog10 version. As it turned out, the logistic transform produced "better" scores (we measure model accuracy by how well the model rank-ordered the predicted amounts, and I'll leave it at that). That was interesting in of itself since none of the distributions really looked very good. However, another interesting question was which value of "A" to use: 5, 10, 20 (or some other value I don't show here). We found the value that worked best for us, but because of the severity of the logistic transform in how it scales the tails of the distribution, the selection of "A" depended on which range of the target values we were most interested in rank-ordering well. The smaller values of A produced bigger spikes at the extremes, and therefore the model did not rank-order these values well (these models did better on the lower end of distribution magnitudes). If we wanted to identify the tails better, we should increase the scaling factor "A" and it did in fact improve the rank-ordering at the extremes.
So, in the end, the scaling of the target value depends on the business question being answered (no surprises here). So now I open it up to all of you--what would you do? And, if you are interested in this data, I have it on my web site that you can access here.