Friday, April 18, 2008

When Distributions Go Bad

Recently I was working with an organization, building estimation models (rather than classification models). They were interested in using linear regression, so I dutifully looked at the distribution of the target variable,
as shown to the left (all pictures were generated by Clementine, and I also rescaled the distribution to protect the data even more, but didn't change its shape).
There were approximately 120,000 examples. If this were a typical skewed distribution, I would log transform it and be done with it. However, this distribution has three interesting problems:

1) skew is 57--heavy positive skew
2) kurtosis is 6180--heavily peaked
3) about 15K of these had value 0, contributing to the kurtosis value
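
For readers who want to reproduce summary statistics of this kind, here is a minimal sketch using the plain moment-based definitions of skewness and excess kurtosis. It assumes these definitions are close to what Clementine reports (its exact formulas may include small-sample corrections), and the sample values are made up purely for illustration.

```python
def skew_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (normal distribution -> 0, 0)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Hypothetical data: a pile of zeros plus one large positive value,
# loosely mimicking a spike at 0 with a long right tail.
sample = [0.0] * 15 + [1.0, 2.0, 3.0, 5.0, 1000.0]
skew, kurt = skew_and_excess_kurtosis(sample)
print(f"skew={skew:.2f}, excess kurtosis={kurt:.2f}")
```

A single extreme value is enough to drive both numbers up sharply, which is why a long tail combined with a large block of zeros produces such large values.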

So what to do? One answer is to create the log transform, but maintain sign, using sgn(x)*log10( 1 + abs(x) ). The transformed distribution looks like this:

This takes care of the summary statistics problems, as skew became 0.6 and kurtosis -0.14. But it doesn't look right--the spike at 0 looks problematic (and it turned out that it was). Also, the transformed data actually ends up as two roughly normal distributions of different variance, one to the left and one to the right of 0.
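
The sign-preserving log transform above can be sketched in a few lines of Python. The function name `signed_log10` is my own label for illustration, not something from Clementine, and the input values are made up.

```python
import math


def signed_log10(x):
    """sgn(x) * log10(1 + |x|): compresses large magnitudes, keeps the sign."""
    return math.copysign(math.log10(1.0 + abs(x)), x)


# Hypothetical values spanning negatives, zero and a long positive tail.
for x in (-99, -9, 0, 9, 99, 99999):
    print(x, round(signed_log10(x), 4))
```

Note that 0 maps to 0, so a block of exact zeros in the original data stays a spike at 0 after the transform; the transform only reshapes the values on either side of it.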

Another approach is to use the logistic transform 1 / ( 1 + exp(-x/A) ), where A is a scaling factor. Here are the distributions for the original data (baddist), the log-transformed version (baddist_nlog10), and the logistic-transformed versions with three values of A (5, 10, and 20), along with the corresponding pictures for the three logistic-transformed versions.

Of course, going solely on the basis of the summary statistics, I might have a mild preference for the nlog10 version. As it turned out, the logistic transform produced "better" scores (we measure model accuracy by how well the model rank-ordered the predicted amounts, and I'll leave it at that). That was interesting in and of itself, since none of the distributions really looked very good. However, another interesting question was which value of "A" to use: 5, 10, 20 (or some other value I don't show here). We found the value that worked best for us, but because the logistic transform compresses the tails of the distribution so severely, the selection of "A" depended on which range of the target values we were most interested in rank-ordering well. The smaller values of A produced bigger spikes at the extremes, and therefore the model did not rank-order these values well (these models did better on the lower end of distribution magnitudes). If we wanted to identify the tails better, we needed to increase the scaling factor "A", and doing so did in fact improve the rank-ordering at the extremes.
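
To make the effect of "A" concrete, here is a small sketch of the logistic transform with the three scaling factors mentioned above. The helper names (`logistic_scale`, `tail_gap`) and the sample target values are hypothetical, chosen only to show how the separation between two large values shrinks as A gets smaller.

```python
import math


def logistic_scale(x, a):
    """Logistic transform 1 / (1 + exp(-x / a)), written to avoid overflow."""
    z = x / a
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so this cannot overflow
    return ez / (1.0 + ez)


def tail_gap(a, lo=100.0, hi=200.0):
    """Distance between two large (made-up) target values after transforming."""
    return logistic_scale(hi, a) - logistic_scale(lo, a)


for a in (5, 10, 20):
    transformed = [round(logistic_scale(x, a), 6) for x in (0, 10, 50, 100, 200)]
    print(f"A={a}: {transformed}  gap(100, 200)={tail_gap(a):.3g}")
```

With a small A, large values all land almost on top of 1, so a model has very little signal left to separate them; a larger A spreads the tail out at the cost of squeezing the values near 0 closer together.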

So, in the end, the scaling of the target value depends on the business question being answered (no surprises here). So now I open it up to all of you--what would you do? And, if you are interested in this data, I have it on my web site that you can access here.

Thursday, April 17, 2008

Data Mining survey

Karl Rexer of Rexer Analytics conducted an extensive survey of data miners in 2007, and reported on those results here at (a site I had never heard of before--unfortunately, you have to register to see it).

This is not to be confused with their 2008 survey, whose results I would expect to be out soon.

A few interesting items in the survey results:

• Correspondingly, the most commonly used algorithms are regression (79 percent), decision trees (77 percent) and cluster analysis (72 percent). Again, this reflects what we have seen in our own work. Regression certainly remains the algorithm of choice for large sections of the academic community and within the financial services sector. More and more data miners, however, are using decision trees, and cluster analysis has long been the bedrock of the marketing community.

I find it interesting in and of itself that academics are participating in a data mining survey, and I don't mean that in a negative way. I have viewed data mining more as a business-centric way of thinking, and to have regression advocates participate in a survey of this type is a good sign. Of course, it could also mean that business folks don't have the time to fill out surveys :)

• SPSS, SPSS Clementine, and SAS are the three most frequently utilized analytic tools and were each used in 2006 by more than 40 percent of data miners. Forty-five percent of data miners also employed their own code in 2006. Respondents were asked about 26 different software packages from the powerhouses above to less-visible and -utilized packages such as Chordiant, Fair Isaac and KXEN.

Clementine usually shows up at the top of the KDNuggets survey, and I've never been sure whether that was because of the typical KDNuggets user or whether it reflected true general use in the data mining community. This gives further evidence that its use is more widespread. The fact that SPSS and SAS are the other two shows the dominance of statisticians or academicians in the survey. I rarely find heavy SPSS or SAS users among technical business analysts.

• Comparisons of reported 2006 use and planned 2007 use show that there is increasing interest in the Oracle Data Mining tool, and decreasing interest in C4.5/C5.0/See5. It will be interesting to see how these trends develop over time and if other tools find greater prominence in the future.

I concur from my experience. I would put SQL Server in that category as well. I think the C4.5 popularity was largely due to licensing.

• The primary factors data miners consider when selecting an analytic tool are: 1) the dependability and stability of software, 2) the ability to handle large data sets, and 3) data manipulation capabilities. Data miners were least interested in the reputation of the software and the software’s compatibility either with other programs or with software used by colleagues.

This looks like the responses of technical people--very much common sense. I wonder what decision makers would say? I would think reputation ranks much higher among these people.

• The top challenges facing data miners are dirty data, data access and explaining data mining to others. Over three-quarters of data miners listed dirty data as one of the major challenges that they face. This is again consistent with our own experience and the conventional wisdom discussed at data mining conferences: a significant proportion of most projects consist of data understanding, data cleaning and data preparation.

No surprises here! However, once one goes through this process, its importance is reduced (because it is solved).

Thanks to Rexer Analytics for putting this and the 2008 survey together. I'm looking forward to those results.

Wednesday, April 16, 2008

Data Mining Data Sets

Every once in a while I receive a request, or see one posted on some bulletin board, about data mining data sets. I have to say, I have little patience for many of these requests because a simple Google (or Clusty) search will solve the problem. Nevertheless, here are four sites I've used in the past to grab data for testing algorithms or software packages:

There are several sites for data, including:

UC Irvine Machine Learning Repository:

Carnegie Mellon Statlib Archive:

DELVE Datasets:

MIT Broad Institute Cancer Datasets:

Tuesday, April 15, 2008

DM Radio and Text Mining

I'll be interviewed on the topic of text mining this coming Thursday, April 17th at 3pm EDT on DM Radio, along with Barry DeVille of SAS and Jeff Catlin of Lexalytics. The title of this entry links to the DM Review site.

I think you have to register to listen.

The schedule will go something like this:

3:00 PM
Hosts Eric Kavanagh and Jim Ericson frame the argument: What is text analytics, and how can it be used to find those golden needles in the haystack?

3:12 PM
Hosts interview Barry DeVille of SAS Institute: What are some good examples of customer success? What are some common mistakes?

3:24 PM
Hosts interview Jeff Catlin, CEO of Lexalytics: How does his application work? What are some examples of text mining at work?

3:36 PM
Hosts interview Dean Abbott of The Modeling Agency: We heard what the vendors said, but what does that all really mean?

3:48 PM
Roundtable discussion: All bets are off! Guests are encouraged to engage in open dialogue, and listeners can email their questions to

Thursday, April 10, 2008

Data Mining: Widespread Acceptance When?

Data mining is widely accepted today among industries which have a history of "management by numbers", such as banking, pure science and market research. Data mining is easily viewed by management in such industries as a logical extension of less sophisticated quantitative analysis which already enjoys currency there. Further, information infrastructure necessary to feed the data mining process is typically already present.

It seems likely that at least some (if not many) other industries could realize a significant benefit from data mining, yet this has emerged in practice only sporadically. The question is: Why?

Under what organizational conditions will data mining spread to a broader audience?

Friday, April 04, 2008

Data modeling infrastructure in data mining

I've had two inquiries in the last day relating to the building of data infrastructure between the database and predictive modeling tool, which I find to be an interesting coincidence. I hadn't even thought about a need here before (perhaps because I wasn't aware of the vendors that address this issue), but am curious if others have thought through this issue/problem.

I have seen situations where the analyst and DBA need to coordinate but, due to the politics or personalities in an organization, do not. In these cases, a data miner may need tables that actually exist, but the miner doesn't have permission to access them, or perhaps doesn't have the expertise to know how to join all the requisite tables. In these cases, I can imagine this middleware, if you will, could be quite useful if it were more user-friendly. However, I'm not yet convinced this is a real issue for most organizations.

Any thoughts?

Wednesday, April 02, 2008

Another Moneyball quote

Gotta get back in the habit of posting...

A quick way is to post another quote from Moneyball that I really liked:

Intelligence about baseball statistics had become equated in the public mind with the ability to recite arcane baseball stats. What James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible; and that point, somehow, had been lost. "I wonder," James wrote, "if we haven't become so numbed by all these numbers that we are no longer capable of truly assimilating any knowledge which might result from them."

What I like about this quote is that it is something many of us in the analytics world have experienced: losing the point of the modeling or summary statistics by forgetting why we are doing the analysis in the first place. Or, as my good friend John Elder used to describe it, "rapture of the depths."