Saturday, July 28, 2007

NY Times Defines Data Mining

In their article here, the NY Times defines data mining in this way:

It is not known precisely why searching the databases, or data mining, raised such a furious legal debate. But such databases contain records of the phone calls and e-mail messages of millions of Americans, and their examination by the government would raise privacy issues.

While I recognize that the NYT is not a technical body, and reporters often get the gist of technology wrong, this particular kind of definition has swept the media to such a degree that the term "data mining" may never recover.

The definition itself has problems, such as
1) searching databases per se is surely not what they mean by data mining; almost certainly the problem they have in mind is programs that automatically search the databases to find interesting patterns (and presumably horribly overfit in the process, registering many false positives). After all, a Nexis search searches a database and no one raises an eyebrow at that.

2) the problem with the searching is not the searching (or the data mining, in their terminology), but the data that is being searched. Therefore the headline of the story, "Mining of Data Prompted Fight Over Spying," should more accurately read something like "Data Allowed to be Mined Prompted Fight Over Spying"
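The overfitting worry in point 1 can be made concrete with a minimal simulation (all numbers here are made up for illustration, nothing is from the article): search enough random "patterns" against a purely random outcome and some will look significant by chance alone.

```python
import random

random.seed(0)

N_RECORDS = 500      # hypothetical call records
N_PATTERNS = 1000    # hypothetical candidate patterns to search for

# A purely random "suspicious" label: no pattern is genuinely predictive.
label = [random.random() < 0.5 for _ in range(N_RECORDS)]

false_positives = 0
for _ in range(N_PATTERNS):
    # Each candidate pattern is also pure noise.
    pattern = [random.random() < 0.5 for _ in range(N_RECORDS)]
    matches = sum(p == l for p, l in zip(pattern, label))
    # Flag the pattern if it agrees with the label noticeably more than
    # chance (roughly two standard deviations above the expected 250).
    if matches > 272:
        false_positives += 1

print(false_positives)  # dozens of spurious "discoveries" from pure noise
```

With no real signal anywhere in the data, an automated search over a thousand candidate patterns still flags a batch of them, which is exactly the false-positive behavior the automated-search definition of data mining invites.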

It is this second point that I have argued about with others who are concerned about privacy and have therefore become anti-data-mining. It is the data that is the problem, not the mining (regardless of the definition of mining). But I think the term "data mining" resonates well and generates a clear mental image of what is going on, which is why it gained popularity in the first place.

So I predict that within 5 years, few data miners (and I consider myself one of them) will refer to themselves as data miners, nor will we describe what we do as data mining. Predictive Analytics, anyone?

Saturday, July 21, 2007

Idempotent Capable Modeling Algorithms

In Idempotent-capable Predictors, the Jul-06-2007 posting to the Machine Learning (Theory) Web log, the author suggests the importance of empirical models being idempotent (in this case, meaning that they can reproduce one of the input variables as the model output).

This is of interest since: 1. One would like to believe that the modeling process could generate the right answer once it had actually been given the right answer, and 2. It is not uncommon for analysts to design inputs to models that give "hints" (partial solutions of the problem). In the article mentioned above, it is noted that some typical modeling algorithms, such as logistic regression, are not idempotent capable. The author wonders how important this property is, and I do, too. Thoughts?
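A toy sketch of the idea (my own illustration, not from the cited post, using a hand-rolled gradient-descent logistic regression): hand a model an input column that *is* the correct answer and see whether it can output that column exactly. A pass-through rule can; a sigmoid output can only approach 0 and 1 asymptotically, so logistic regression gets close but never exact.

```python
import math

# Toy data: the single input column is the correct answer itself.
X = [0.0, 0.0, 1.0, 1.0]
y = [0.0, 0.0, 1.0, 1.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain stochastic gradient descent for logistic regression: one weight + bias.
w, b = 0.0, 0.0
for _ in range(5000):
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi + b)
        grad = p - yi
        w -= 0.1 * grad * xi
        b -= 0.1 * grad

logistic_preds = [sigmoid(w * xi + b) for xi in X]

# An idempotent-capable "model": simply pass the hinted input through.
passthrough_preds = [xi for xi in X]

exact = all(p == yi for p, yi in zip(logistic_preds, y))
print(logistic_preds)  # very close to [0, 0, 1, 1], but not exactly
print(exact)           # False: a finite-weight sigmoid never reaches 0 or 1
```

The pass-through predictor is trivially idempotent capable, while the logistic model's predictions remain strictly inside (0, 1) for any finite weights, which is the sense in which logistic regression fails the property.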

Tuesday, July 17, 2007

More Statistics Humor

In February of this year, Dean posted a witty comment regarding statistics which ignited an amusing exchange of comments (Quote of the day). Readers who found that item entertaining may also appreciate the quotes listed at the bottom of The Jesus Tomb Math.

Wednesday, July 04, 2007

When Data and Decisions Don't Match--Little League Baseball

Maybe it's because I used to pitch in Little League when I was a kid, but this article in the July 1 Union Tribune really struck me. It describes how injuries to Little League pitchers have increased significantly over the past 10 years, from about one a week to 3-4 a day, mostly elbow and/or shoulder injuries. What's the cause? Apparently, as the article indicates, it is "overuse" (i.e., pitchers pitching too much). And here is the key statistic:
young pitchers who pitch more than 8 months a year are 5 times as likely to need surgery as those who pitch 5 1/2 months a year.

In San Diego, where I'm located, this can be a big problem because there is baseball going on all year round (even in Little League, where there are summer and fall leagues, plus the ever-present year-round traveling teams).

So what's the solution? A year ago or so they instituted an 85-pitch limit per game. Now, this may be a good thing to do, but I have great difficulty seeing a direct connection. Here's why.

With any decision inference (classification), there are two questions to be asked:
1) what patterns are related to the outcome of interest
2) are there differences between patterns related to the outcome of interest and those related to another outcome?

Here's my problem: I have seen no data (in the article) to indicate that pitchers today throw more pitches per game than boys did 10 years ago. And I see no evidence in particular that boys today exceed 85 pitches more frequently than boys did 10 years ago. If this isn't the case, then why would the new limit have any effect at all? It could only work through a cause that is not directly addressed here: if by limiting pitches in a game (and therefore in any given week) the boys throw fewer pitches in a year, there might be an effect.

But based on the evidence that is known rather than speculation, wouldn't it make more sense to limit pitchers to five months of pitching per calendar year? That, after all, is the policy with direct empirical evidence of tangible results behind it.
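A back-of-the-envelope sketch makes the point (every number here is a made-up assumption; the article gives none of them): a per-game cap leaves annual pitch volume untouched unless it also reduces games or months played, whereas a months limit cuts volume directly.

```python
# Hypothetical numbers purely for illustration -- none appear in the article.
PITCHES_PER_GAME = 85  # the new per-game cap, taken as the typical game load
GAMES_PER_MONTH = 6    # assumed schedule for an active pitcher

def annual_pitches(months_playing, pitches_per_game=PITCHES_PER_GAME):
    """Total pitches in a year under a given number of playing months."""
    return months_playing * GAMES_PER_MONTH * pitches_per_game

# The per-game cap alone does nothing about months played:
year_round = annual_pitches(12)    # San Diego-style year-round baseball
month_limited = annual_pitches(5)  # a five-month season, same per-game cap

print(year_round, month_limited)  # 6120 vs 2550 pitches per year
```

Under these assumed numbers, the year-round pitcher still throws well over twice as many pitches annually as the season-limited one, even with the identical 85-pitch cap in force, which is why the months played, not the per-game count, is the lever the surgery statistic actually points at.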

I see this happen in the business world as well, where despite empirical evidence that indicates "Procedure A", the decision makers go with "Procedure B" for a variety of reasons unrelated to the data. And sometimes there is good reason to do so despite the data, but at least we should know that in these cases we are ignoring the data.

I suspect one reason this strikes me is that I used to pitch on traveling teams in my Little League years, back before anyone cared about pitch counts (30+ years ago). I'm sure I regularly pitched games well over 85 pitches, and probably 100+. One difference was that I lived in New England, where you were fortunate to play March through August, and so we all had a good period of time to recover.