Dean has written recently about confusion surrounding the term "data mining" (see his Jan-11-2007 posting, Will the term "Data Mining" survive?). Clearly, this has muddled much of the debate surrounding things like the privacy and security implications of data mining by government.
Setting other definitions aside, though, there remain issues of data mining acceptance in the world of business. A short, interesting item on this subject is Sam Batterman's Jan-19-2007 posting, Interesting Thread on Why Data Mining (DM) is not used more in business. My response, which is in the Comments section there, is:
"While it is frequently lamented that technology advances much more quickly than government, especially law enforcement and the judiciary, it is clearly the case that businesses are only better by comparison. Even in industries with a long-established and accepted need for sophisticated statistical analysis, managers display a shocking lack of understanding of what is possible with newer tools from fields like data mining. Further, this ignorance is not the exclusive domain of executive or senior management, who are somewhat removed from the people and systems which perform data mining. Managers whose immediate subordinates do the actual data mining frequently require education, as any statistical knowledge they possess seems typically stuck in the late 1970s. In my experience, upward lobbying efforts on the part of the data miner are only sometimes effective. The argument to recalcitrant management which I have found most effective is 'If we only do what you did at your last company, and what everyone else in the industry is doing, where will our competitive advantage come from?' Sadly, it is my expectation that data mining will only catch on in individual industries after some intrepid manager demonstrates conclusively the money that data mining can return, and the others follow like sheep."
I'd be curious to learn what readers' thoughts on this are.
Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Sunday, January 28, 2007
Saturday, January 13, 2007
Do and Do Not
There's too many men, too many people making too many problems, and not much love to go 'round. Can't you see? This is the land of confusion.
-Genesis, Land of Confusion
In my travels, I have encountered a wide variety of people who use mathematics to analyze data and make predictions. They go by a variety of titles and work in many different fields. My first job out of college was working in an econometrics group for the Port Authority of New York and New Jersey, in the Twin Towers. The emphasis there was on traditional econometric techniques. Later in my career, I worked as a consultant for SKF, a large manufacturing firm, with engineers who emphasized quality control techniques. Most recently, I have been working with bankers doing credit scoring and the like. Surprise, surprise: the bankers have their own way of doing things, too. I won't bore the reader with the myriad other diverse quantitative analysts I've met in between, because you probably already get the idea.
These industry-specific sub-disciplines of analysis developed largely in isolation and, unfortunately, most are quite parochial. For the most part, technique has become stagnant, reflecting old rules of thumb which are outdated, if they weren't invalid in the first place.
Many people say that data mining (modeling, forecasting, etc.) is "part art, part science". I agree, but the science should give parameters to the art. Creativity in the combined discipline of quantitative model-building does not give license to venture beyond the absolutes that statistical science has provided. From this perspective, there are some things which should always be practiced, and some which should never be practiced: Do and Do Not. Everything in between is up to the taste of the analyst.
Sadly, many practitioners and even entire industries have become arthritic by establishing new, would-be "absolutes" beyond the dictates of probability theory. Some of these rules attempt to expand the Do by setting capricious limits on modeling which are not theoretically justified. The Director of risk management at one credit card company once told me that a "good" model had about 8 or 10 inputs. Naturally, that is nonsense. The number of input variables should be determined by the data via appropriate testing, not some rule-of-thumb. Others of these rules try to expand the Do Not by prohibiting practices which are well established by both theory and experiment.
As a data miner ("statistician", "econometrician", "forecaster", "meteorologist", "quality technician", "direct marketer", etc.), it is one's responsibility to continue to study the latest literature to understand how the collective knowledge of Do and Do Not has progressed. This is the only way to avoid arbitrary processes which both hold back empirical modeling and push it to make serious mistakes.
Thursday, January 11, 2007
Will the term "Data Mining" survive?
I used to argue that data mining as a field would survive because it was tied so closely to the bottom line--CFOs and stakeholders were involved with data mining applications, and therefore the field would avoid the hype that crippled neural networks, AI and prior pattern recognition-like technologies. These achieved a buzzword status that unfortunately surpassed their successful practical applications.
However, it appears that the term data mining is being tied more and more to the process of data collection from multiple sources (and the subsequent analysis of that data), such as here and here and here. I try to argue with critics that the real problem is not with the algorithms, but with the combining of the data sets to begin with. Once the data is joined, whether you use data mining, OLAP, or just simple Excel reports, there is a possible privacy concern. Data mining per se has little to do with this; it can only be used to describe what data is there.
However, the balance may be tipping. Data mining (whether related to government programs or internet cookies) has become the term associated with all that is bad about combining personal information sources, so its days, I think, are numbered. Maybe it's time to move on to the next term or phrase, and then the next phrase, and so on, and so on, and so on...
Special Issue on Data Mining
The International Journal of Computer Applications has a new issue out on data mining applications. I didn't recognize anyone on the list of authors, but there was an interesting looking paper on a new boosting algorithm applied to intrusion detection (and using the KDDCup 99 intrusion detection data set, they claim it was better than the winning algorithm).
(HT Inderscience News)
Viewing PPT created on Mac on a PC
I know this isn't about data mining, but I had to vent on this one...
So my daughter created a PPT presentation on a Mac, and I tried to print it from my laptop. We copied the file over to my PC, and I got the dreaded "QuickTime and a TIFF (LZW) decompressor are needed to see this picture" error for all the graphics. I did a Google search, and most of the solutions were "you messed up doing drag&drop on your Mac--you MUST save the images to a file and then do a Picture->From File import of the images into the presentation". Now I've messed with computers for a lot of years, and this just isn't the way things should be done. The other solutions were things like "create a web page, uncompress the compressed images, and then reimport the images into PPT". Well, it's already 1am and I'm not in much of a mood to redo my daughter's presentation (while she blissfully sleeps).
So, there's another solution (there had to be an easier way). I just exported the file on the Mac as a TIFF file (multipage). Voila: it saves all the images as ...well...TIFF (probably not some funky image format within the TIFF wrapper) rather than compressed PICT, and it worked like a charm. (I suspect that there are other exports that would work as well.) Now why wasn't that on the web as a solution....?
Wednesday, January 10, 2007
Data Visualization: the good, the bad, and the complex
I have found that data visualization for the purposes of explaining results is often done poorly. I am not a fan of the pie chart, for example, and am nearly always against the use of 3-D charts when shown on paper or a computer screen (where it appears as a 2-D entity anyway). That said, this doesn't mean that charts and graphs need to be boring. If you would like to see some interesting examples of obtuse charts and figures, go to Stephen Few's web site to look at the examples--they are very interesting.
I like in particular this one, which also contains a good example of humility on the part of the chart designer, along with their improvement on the original.
However, even well-designed charts are not always winners if they don't communicate the ideas effectively to the intended audience. One of my favorite charts from my work, created for a health club, is on my web site and is reproduced here:
The question here was this: based on a survey given to members of the clubs, which characteristics expressed in the survey were most related to the members with the highest value? I have always liked it because it combines simplicity (it is easy to see the balls and understand that higher is better for each of them, showing which characteristics for the club are better than the peer average) with a wealth of information. There are at least four dimensions of information (arguably six). The figure of merit for judging 'good' is a combination of questions on the club survey related to overall satisfaction, likelihood to recommend the club to a friend, and the individual's interest in renewing their membership--this was called the 'Index of Excellence'.
- the seven most significant survey questions are plotted in order right to left (rightmost is the most important). Significance was determined by a combination of factor analysis and linear regression models
- the relative performance of each club compared to the others in its peer group is shown by the y-axis, with the average of clubs.
- the relative difference between results from the years 2003 and 2002 is shown in two ways: first with the color of the ball (green for better, yellow for about the same, and red for worse), and also by comparing the big ball to the dot in the same relative position (up and down) in the importance axis.
- finally, the size of the ball indicated the relative importance of the survey question for that club--bigger meant more important.
Each bullet was a dimension represented in the plot, but note that bullets 2 and 3 were relative values and really represent two dimensions. Regardless of how many dimensions you would count, the chart I think is visually appealing and information rich. One could simplify it by removing the small dots, but that's about all I would do to it. My web site also has this picture there, but it was recolored to fit the color scheme of the web site, and I think it loses some of its visual intuitive feel as a result.
However, much to my dismay, the end customer found it too complex, and we (Seer Analytics, LLC and I) created another rule-based solution that turned out to be more appealing.
Opinions on the graphic are appreciated as well--maybe Seer and I just missed something here :) But at this point it is all academic anyway, since the time for modifying this solution has long passed.
Tuesday, January 09, 2007
Free Data Mining Software Poll Results, and notes on Sample Size
I inadvertently closed the poll and couldn't figure out how to reopen it. Since it had already been up for a week, I decided to leave it closed.
The results are:
WEKA: 11 (55%)
YALE: 4 (20%)
R: 3 (15%)
Custom: 1 (5%)
Other: 1 (5%)
Total Votes: 20
But is there anything significant? Is WEKA significantly more popular than YALE or R? Well, this is outside of my expertise--after all, the word "significant" is rarely used in data mining circles :)--but it seems to me that the answer is "yes". Why?
By starting with the standard sample size formula, and using the WEKA percentage as the hypothesis (55%, or 0.55), we are only 68% confident that this 55% can be achieved with a sample size of 25 (larger than I used). So, taken on its own, WEKA's 55% share is not a particularly significant finding.
Plugging in the numbers for just WEKA and YALE (if that were the extent of the survey, forcing everyone to vote between just those two, which of course did not happen, but play along for a bit...), where the difference was 55% to 20%, we find that for a sample size of 15 (11 votes + 4 votes), we would have been more than 99% confident that the 55% +/- 35% can be achieved.
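For readers who want to check the arithmetic, here is a rough sketch of the calculation as I read it. It assumes the "standard sample size formula" means the usual normal-approximation formula for a proportion, n = z^2 * p * (1 - p) / E^2. The post never states the margin of error E for the first case; a margin of +/- 10 points with z = 1 (about 68% confidence) gives a sample size of roughly 25, which matches the figure quoted, so that value is an assumption. The function names are made up for illustration.

```python
import math


def required_sample_size(p, margin, z):
    """Normal-approximation sample size for estimating a proportion p
    to within +/- margin at the confidence level implied by z."""
    return z ** 2 * p * (1 - p) / margin ** 2


def confidence_level(p, margin, n):
    """Invert the same formula: two-sided confidence that the proportion
    lies within +/- margin of p, given n responses."""
    z = margin * math.sqrt(n / (p * (1 - p)))
    # For a standard normal variable Z, P(|Z| < z) = erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2))


# Case 1: WEKA at 55%, assumed margin of +/- 10 points, z = 1 (~68%).
print(required_sample_size(0.55, 0.10, 1.0))

# Case 2: WEKA vs. YALE only, 15 votes, margin of +/- 35 points.
print(confidence_level(0.55, 0.35, 15))
```

The first call comes out just under 25, and the second comes out above 0.99, in line with the numbers quoted above. Keep in mind that the normal approximation is shaky for samples this small, so treat both results as back-of-the-envelope figures.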
I'll try another poll once the numbers coming to this blog go up a bit. Thanks for participating!
Tuesday, January 02, 2007
First Poll--Free data mining software
Monday, January 01, 2007
For the Best Answer, Ask the Best Question
A subject of great interest to data mining novices is the selection of data mining software. Frequently these interests are expressed in terms of what is "the best" software to buy. On-line, such queries are often met with quick and eager responses (and not just from vendors). In a way, this mimics the much more common (and much more incendiary) question about which programming language is "the best".
Notwithstanding myriad fast answers, the answer to such questions is, of course, "It depends". What is the problem you are trying to solve? What is your familiarity with any of the available alternatives? How large is your budget? How large is your budget for ongoing subscription costs? How do you intend to deploy the result of your data mining effort?
Vendors, naturally, have an incentive to emphasize any feature which they believe will move product. Some vendors are worse about this than others. Years ago, one neural network shell vendor touted the fact that their software used "32-bit math", without ever demonstrating the benefit of this feature. In truth, competing software, which ran 16-bit fixed-point arithmetic, was much faster, gave accurate results, and did not require 32-bit hardware.
The problem of irrelevant features is exacerbated by the presence of individuals in the customer organization who buy into this stuff. Some use this as political leverage on their unaware peers. I once attended a vendor presentation with a banking client in which one would-be expert asked whether the vendor's computers were SIMD or MIMD. This was like asking whether the vendor's cafeteria served this or that brand of coffee and could not have been less relevant to the conversation. The asking of such a question was clearly a power play and served only as a distraction.
When confronted with unfamiliar features, my recommendation is to ask as many questions as it takes to understand why said features are of benefit. Don't stop with the vendor. Ask associates at other firms what they know about the subject. Try on-line discussion groups. Keep asking "Why?" until you are satisfied. Joe Pesci's character in "My Cousin Vinny" is a good model: "Why does SIMD vs. MIMD matter?" "Is one better than the other?" "Exactly how is it better?" "Is it faster? How much faster?" "Does it cost more?" Remember that diligence is the responsibility of the customer.
Some things to consider when framing the question "What is the best data mining software for my purposes?":
-Up front software cost
-Up front hardware cost, if any
-Continuing software costs (subscription prices)
-Training time for users
-Algorithms which match your needs
-Effective data capacity in variables
-Effective data capacity in examples
-Testing capabilities
-Model deployment options (source code, libraries, etc.)
-Model deployment costs (licensing costs, if any)
-Ease of interface with your data sources
-Ability to deal with missing values, special values, outliers, etc.
-Data preparation capabilities (generation of derived or transformed variables)
-Automatic attribute selection / Data reduction