Tuesday, December 29, 2009

2009 Retrospective

I was thinking about top data mining trends in 2009, and searched for what others thought about it. I'll combine a few 2009 "top 3" lists here, including top trends (as described at Enterprise Regulars here), and posts here that generated the most buzz.

First, the top data mining news story was IBM's purchase of SPSS. It will be very interesting to see if this continues the trend toward integration of Business Intelligence and Predictive Analytics that one sees with SAS, Tibco and now IBM/SPSS.

The Enterprise Regulars post included a few interesting 2010 trends (and since data mining is all about using historical data to predict future behavior, I'll assume past trends will continue). In particular, four of them were of interest to me:
  1. The holy grail of the predictive, real-time enterprise (his #2)
  2. SaaS / Cloud BI Tools will steal significant revenue from on-premise vendors but also fight for limited oxygen amongst themselves. (his #5)
  3. Advanced Visualization will continue to increase in depth and relevance to broader audiences. (his #7)
  4. Open Source offerings will continue to make in-roads against on-premise offerings. (his #8)
I agree with his #2 and #7 (integration of BI/PA and visualization). Several customers I work with are trying to integrate predictive analytics into the database to make better decisions. The difference now is that there is also interest in integrating this process with other data-centric (BI) operations to provide the right information to decision-makers with the right level of granularity (detail). This is typically a combination of creating the ability to perform ad hoc queries along with examining the results (rankings and projections) from predictive analytics.

However, I have not seen cloud computing and open source take off from the perspective of the customers I work with. Both have certainly generated buzz, and in the courses I teach there is considerable interest in open source computing (R in particular), but so far it has been interest rather than action. I expect, though, that as the allure of data mining and predictive analytics extends its reach deeper into organizations, the need for inexpensive tools (in dollars) will result in increased use of open source and free tools such as R, RapidMiner, Weka, Tanagra, Orange, KNIME, and others.

Lastly, from this blog, the top posts of 2009 were
  1. Why normalization matters with K-Means
  2. How many software packages are too much?
  3. Data Mining: Does it get any better than this?
  4. Text Mining and Regular Expressions

Happy New Year!

Tuesday, December 15, 2009

Overlap in the Business Intelligence / Predictive Analytics Space

I've received considerable feedback on the post Business Intelligence vs. Business Analytics, which has also caused me to think more about the BI space and its overlap with data mining (DM) / predictive analytics (PA) / business analytics (BA). One place to look, of course, is Gartner: how they define Business Intelligence, and which vendors overlap between these industries. (I think of this in much the same way as I do DM; I look to data miners to define themselves and what they do rather than to other industries and how they define data mining.)

I found the Gartner Magic Quadrant for Business Intelligence in 2009 here, and was very curious to understand (1) how they define BI, and (2) which BI players are also big players in the data mining space. Answering the first question: data analysis in the BI world is defined there as comprising four parts: OLAP, visualization, scorecards, and data mining. So DM in this view is a subset of BI.

Second, the key players in the quadrant interestingly include only a few vendors I would consider to be top data mining vendors: SAS, Oracle, IBM (Cognos), and Microsoft in the "Leaders" category, and Tibco in the Visionaries category. Of these, only SAS (with Enterprise Miner) and Microsoft (SQL Server) showed up in the top 10 of the Rexer Analytics 2008 software tool survey, though Tibco showed up in the top 20 (with Tibco Spotfire Miner).

I think this emphasizes again that BI and DM/PA/BA approach analysis differently, even if the end result is the same (a scorecard, dashboard, report, or transactional decisioning system).

Sunday, December 06, 2009

Business Analytics vs. Business Intelligence

I used to be one who thought the term "data mining" would stay as the description of the kind of analytic work I do. To a large degree it has, but there are always new spins on things, and it seems that quite often in the business world, Predictive Analytics or Business Analytics are the terms of the day.

I just came across this post from the Smart Data Collective: OLAP is Dead (Long Live Analytics), which had some fascinating graphs on hits related to the phrases OLAP and Analytics. The first shows the steady decline of OLAP as a searched term to the point where even the OLAP report has been renamed to The BI Verdict. Meanwhile, "analytics" has been increasing steadily in hits. SAS even touts themselves as leaders in "Business Analytics" now.

Which brings me to the question in the title of this post. It seems to me that Business Intelligence has taken over the role that OLAP and dashboarding used to take on (at least in the circles I worked in). Is there a difference between Business Intelligence and Business Analytics? James Taylor, someone whom I respect tremendously, doesn't think so.
As SAS talked about its business analytics framework it became clear that they envision the results of data mining and predictive analytics (where they genuinely have offerings superior to almost everyone) will be delivered in reports or dashboards. This is what I have somewhat dismissively called "predictive reporting" and while it is better than purely historical reporting, it does not do much to make every decision analytically based as it leaves out the decisions made by machines (which don't read reports) and those made by people with too little time to read a report (most call center or retail staff, for instance) or no skill at interpreting it.

I guess I just don't see the difference between BI and BA...

If all of business analytics is reduced to "predictive reporting", then I can see why some might consider it no more than business intelligence. But even so, are they the same? I don't mean whether the results are the same; for that matter, the final decisions from analytics for, say, classification look just like a human decision (buy or not buy? fraud or not?). But is the process the same? I would argue "no". Much of the power of predictive analytics comes from automating the search for and assessment of nonlinearities, interaction effects, and combinations of variables relating observables to outcomes. Rather than assessing these manually, one automates the process through the use of "decision trees", "neural networks", or some other algorithm. So the difference lies in the efficiency of the process.
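A toy sketch of that automation (hypothetical simulated data and an exhaustive split search standing in for any particular tree algorithm): the target is a pure interaction effect, so no single variable predicts it on its own, yet a mechanical two-level search over splits finds it with no human inspection at all.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: an XOR-style target -- a pure interaction effect.
# Neither input predicts the outcome alone; only the combination does.
n = 2000
X = rng.random((n, 2))
y = ((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)).astype(int)

def split_error(y_subset):
    """Misclassification count if we predict the majority class."""
    ones = y_subset.sum()
    return min(ones, len(y_subset) - ones)

def best_split(X, y):
    """Search every variable and a grid of thresholds for the best split."""
    best = (None, None, len(y))
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
            mask = X[:, j] <= t
            err = split_error(y[mask]) + split_error(y[~mask])
            if err < best[2]:
                best = (j, t, err)
    return best

# Depth 1: the best single split barely beats guessing the majority class
err1 = best_split(X, y)[2] / n

# Depth 2: search over root splits, then search again inside each branch
err2 = 1.0
for j in range(2):
    for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
        mask = X[:, j] <= t
        err = (best_split(X[mask], y[mask])[2]
               + best_split(X[~mask], y[~mask])[2]) / n
        err2 = min(err2, err)

print(err1, err2)  # depth-1 error stays near 0.5; depth-2 error drops near 0
```

The point is not this particular search (real tree algorithms use greedy impurity measures), but that the combinatoric hunt for interactions is done by the machine, not the analyst.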

Now how the predictive information is used, in a report, as part of an automated system or in some other way, is a critically important question, but independent of how the decisions are generated.

Tuesday, December 01, 2009

Computer Science and Theology

I have been reading a book by Don Knuth called Things a Computer Scientist Rarely Talks About (Center for the Study of Language and Information - Lecture Notes)--a very good read for those of you interested in theology as well as analytics. This post is not about the theology of the book (as interesting as that is to me), but rather the reason described in this book for his writing of another book called 3:16, a study of all the 3:16 verses in the Bible. In his chapter on randomized testing (I like to think of model ensembles here), he describes how random sampling is a good way to get an idea of the content of "stuff", whether computer science assignments (he actually does this--randomly take page X of a project and look at that in depth), or understanding books (like the Bible). His 3:16 book takes this verse from every book in the Bible to get a sense of the overall message of the Bible. He admittedly chose 3:16 because of John 3:16 so that he would get at least one great verse, but this was a concession to making the book marketable.

At first I wasn't a big fan of this idea. After all, it is a small sample. But he describes how he then studied these verses in depth. Whereas his prior understanding of the Bible was vague and general (which has its positive points), this exercise also led to a deeper (albeit narrower) understanding. I recommend this approach very much.

What does this have to do with analytics? Data Mining often is viewed as a way to get the gist of your data, see the big picture, understand patterns through summarized views. But just as important is the deep view, looking at a few examples (prototypes) in depth. In the text mining project I'm working on right now, while we extract "concepts", much of our time is also spent tracing a few text blocks through the processing to understand in detail why the analytics is working the way it does. I'm a "both / and" kind of guy, so this suits me well; big picture analytics as well as deep dives into record-level descriptions.

Monday, November 23, 2009

Stratified Sampling vs. Posterior Probability Thresholds

One of the great things about conferences like the recent Predictive Analytics World is how many technical interactions one has with top practitioners; this past October was no exception. One such interaction was with Tim Manns, who blogs here. We were talking about Clementine and what to do with small populations of 1s in the target variable, which prompted me to jump onto my soapbox about an issue that I had never read about, but which occurs commonly in data mining problems such as response modeling and fraud detection.

The setup goes something like this: you have 1% responders, you build models, and the model "says" every record is a 0. My explanation for this was always that errors in classification models take place when the same pattern of inputs can produce both outcomes. In this situation, what is the best guess? The most commonly occurring output variable value. If you have 99% 0s, that is most likely a 0, and therefore data mining tools will produce the answer "0". The common solution to this is to resample the data (stratify) so that one has equal numbers of 0s and 1s in the data, and then rebuild the model. While this is true, it misses an important factor.

I can't claim credit for this insight (thanks, Marie!). I was working on a consulting project with a statistician, and when we were building logistic regression models, I recommended resampling so we wouldn't have the "model calls everything a 0" problem. She seemed puzzled and asked why not simply threshold at the prior probability level. It was clear right away that she was right, and I've been doing it ever since (with logistic regression and neural networks in particular).

What was she saying? First, it needs to be stated that no algorithm produces "decisions". Logistic regression produces probabilities. Neural networks produce confidence values (though I just had a conversation with one of the smartest machine learning guys I know, who talked about neural networks producing true probabilities--maybe I'll blog on this another time). The decisions one sees ("all records are called 0s") are produced by the software, which interprets the probabilities or confidence values by thresholding them at 0.5; in fact, I don't think I've ever found a data mining software package that doesn't threshold at 0.5. So the software expects the prior probabilities of 0s and 1s to be equal. When they are not (as with 99% 0s and 1% 1s), this threshold is completely inappropriate: the distribution of predicted probabilities will center roughly on the prior probability level (0.01 for the 1% response-rate problem). I show some examples of this in my data mining course that make this clearer.

So what can one do? If one thresholds at 0.01 rather than 0.5, one gets a nice confusion matrix out of the classification problem. Of course if you use a ROC curve, Lift Chart or Gains Chart to assess your model, you don't worry about thresholding anyway.
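To make the point concrete, here is a small sketch with simulated scores (hypothetical numbers, not from any real model): the predicted probabilities cluster near the prior, so the default 0.5 cutoff calls every record a 0, while cutting at the training proportion yields a usable confusion matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1% response-rate problem
n = 100_000
y = (rng.random(n) < 0.01).astype(int)          # ~1% responders

# Simulated model scores: centered near the prior (0.01), shifted up for 1s
scores = np.clip(0.01 + 0.004 * rng.standard_normal(n) + 0.008 * y, 0, 1)

def confusion(y_true, y_score, cutoff):
    """Rows = actual class (0, 1); columns = predicted class (0, 1)."""
    pred = (y_score >= cutoff).astype(int)
    return np.array([[np.sum((y_true == a) & (pred == b))
                      for b in (0, 1)] for a in (0, 1)])

cm_half  = confusion(y, scores, 0.5)        # default software cutoff
cm_prior = confusion(y, scores, y.mean())   # cutoff at training proportion

print(cm_half)   # second column all zeros: every record called a 0
print(cm_prior)  # both classes now appear among the predictions
```

The same idea applies whether the scores come from logistic regression, a neural network, or any other probability-producing model; only the cutoff changes.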

Which brings me to the conversation with Tim Manns. I'm glad he tried it out himself, though I don't think one has to make the target variable continuous to make this work. Tim did his testing in Clementine, but the same holds for any other data mining software tool. Tim's trick is sound: if you make the [0,1] target variable numeric, you can build a neural network just fine and the predicted value is "exposed". In Clementine, if you keep it as a "flag" variable, you would instead threshold the propensity value ($NRP-target).

So, read Tim's post (and his other posts!). This trick can be used with nearly any tool--I've done it with Matlab and Tibco Spotfire Miner, among others.

Now, if tools would only include an option to threshold the propensity at 0.5 or the prior probability (or more precisely, the proportion in the training data).

Thursday, November 12, 2009

San Diego Forum on Analytics -- review

I just got back from the half-day Forum on Analytics in San Diego, which included a keynote by Wayne Peacock (now with Inevit, but formerly VP of BI at Netflix). He spoke on how pervasive analytics was and is at Netflix, covering areas as diverse as finance, customer service, marketing, network optimization, operations, and product development. It was particularly interesting to me that as of 2006 their data warehouse was not in place; instead, they had a "data landfill" (term of the day for me!). The other quote from his talk that I found provocative was about their web site: "If the web site doesn't go down once a year, we aren't pushing hard enough." However, this is changing somewhat because of their online content delivery, where websites going down have a much bigger downside!

The rest of the morning contained 3 panel discussions, which was interesting in and of itself to see which topics were considered most important: Mining Biodata, Web 3.0, and Job Opportunities in Analytics.

During the Biodata panel, Nancy Miller Latimer of Accelrys, Inc. mentioned in passing a software tool that they have developed to do essentially visual programming of biodata; it looks like the typical Clementine/Enterprise Miner/Tibco Spotfire Miner/Polyanalyst interface (seen in so many other tools, including Statistica and Weka) for doing data prep, but their tool is specific to biodata, including loading technical papers, chemical structure data, etc. I've been fascinated for years by the relatively parallel paths taken by the bioinformatics/cheminformatics world and the data mining world: very similar ideas, but very different toolsets because of the very different characteristics of the data. Much was said about the future of sequencing of the human genome: 2 humans in 2007, 6+ in 2008, perhaps 150 in 2009, and growing exponentially (faster than Moore's law). There was talk of the $1000 human genome sequence coming soon.

The Web 3.0 panel included 2 folks from Intuit touting a facebook campaign done to grow use of Turbotax virally. Interesting stuff, but I'm still dubious of the effect of social networking on all but the under 30 crowd. I think I'll finally begin to tweet, but only out of curiosity, not because I expect anything of business value from it. Is it inevitable that Facebook, Twitter, and Youtube will become mainstream ways to develop business? For me? I don't see how for me yet.

Lastly, on the analytics jobs in San Diego...there are over 100 analytics companies in San Diego (most of them undoubtedly small or micro, like me), and there was an evangelistic cry for San Diego to become an analytics cluster in the U.S. I think this is actually possible, and it has been the case to some degree for some time now. I had forgotten about Keylime (a San Diego web company) being purchased by Yahoo, and WebSideStory being purchased by Omniture. Of course Fair Isaac and HNC were discussed as well. Time will tell, and right now things are tough all around, though Kanani Masterson of TriStaff Group said there were currently 225 analytics / web analytics job openings, so things aren't completely dead.

All in all, it was a lot to pack into a morning.

Wednesday, October 28, 2009

Predictive Analytics World, part 1

After attending Predictive Analytics World (PAW) last week, I must say that I'm still impressed with the conference, especially for practitioners.

Eric Siegel's description of uplift modeling in the opening session was another example of a practical (and in this case relatively new) approach to predictive modeling. I only heard about uplift modeling for the first time (to my discredit) at the February PAW, and one company would have implemented it this past summer were it not for a re-org that killed the modeling efforts.

The R community had another strong showing, with REvolution there and another R useR meeting. I'm amazed at the influence of R in the data mining world. It makes me want to become fluent in R! It's on the list.

The keynotes by Usama Fayyad and Stephen Baker were every bit as good as one would expect, but it was the interactions with attendees that impressed me most. The talk I gave received great questions about the practice of using ensembles from several folks who were planning on using the technique with their own data. It's this practical side of the conference that I liked.

Friday, July 17, 2009

For Do-It-Yourself Types

Recently, I came across the Web site of mloss.org ("machine learning open source software"), which houses a collection of software components which will be of interest to inventive data miners. Spanning a variety of languages and algorithm types, the collection can be filtered and searched from the Web site. Good hunting!

Tuesday, June 30, 2009

New Data Mining Book Out

The new Nisbet, Elder, and Miner book is out now, and has been receiving good reviews on Amazon. A sampling of the 6 reviews so far (all 5 stars):

The "Handbook of Statistical Analysis & Data Mining Applications" is the finest book I have seen on the subject. It is not only a beautifully crafted book, with numerous color graphs, chart, tables, and screen shots, but the statistical discussion is both clear and comprehensive.

This is an extraordinary book. So often within this field books are offered as bibles only to fall short. This book does not and delivers a wide array of information and useful tips for the beginner and veteran data miner.

What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining writ large, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art.

This is one of the few, of many, data mining books that delivers what it promises.

It has a great mix of data mining principles with step-by-step solutions (case studies) using data mining software, such as Clementine, Enterprise Miner and Statistica. It is this practical approach to data mining that fills a void in the current selection of books in the marketplace (and there are many great data mining books out there).

For some, the benefit of the book will be the case studies on Fraud Detection or Text Mining. For others, seeing how to solve problems using Enterprise Miner (or Clementine or Statistica) will be of most benefit, operating almost like a users manual. I most appreciated the first chapter on the history of statistics (Nisbet), Model Complexity and Ensembles (Elder) and the 10 Data Mining Mistakes (Elder).

One more quote, this from the second foreword in the book:

This volume is not a theoretical treatment of the subject -- the authors themselves recommend other books for this -- but rather contains a description of data mining principles and techniques in a series of “knowledge-transfer” sessions, where examples from real data mining projects illustrate the main ideas. This aspect of the book makes it most valuable for practitioners, whether novice or more experienced.

The Handbook of Statistical Analysis and Data Mining Applications is an exceptional book that should be on every data miner's bookshelf, or better yet, found lying open next to the computer.

-- Dean Abbott, Abbott Analytics

Monday, May 18, 2009

Is analytics a winner in a recession?

Even in a recession, analytics can (and should) do well. I am often asked how the economy has affected me, and my quick answer is that "it doesn't affect me," mostly because I am a small sole proprietorship. In general, though, bad economic times can be good for consultants, as corporations shed employees and look for ways to perform their analytics tasks efficiently without having to take on longer-term commitments.

The way it is put in a recent Business Week article is this (they describe Business Intelligence software rather than data mining software, but the principles are certainly similar):

Interest in business intelligence software is on the rise, analysts say, as economic woes force companies to pursue profit by delving deeper into the information already at their fingertips. "There's a tremendous pressure on cost containment, on developing accurate forecasts of sales and expenses and trying to align the expense stream with projected revenue stream," says John Van Decker, research vice-president at research firm Gartner (IT).

And where software is purchased, companies typically spend many times the cost of the software on training and consulting to help them understand how to use it better:

Add in other essential services, and a company can expect to spend more on BI than for other types of software, Evelson says. "For every dollar you spend on business intelligence software, you better expect to spend five to seven times as much on services," such as ensuring it jells with the rest of the company's software, he says.

But even with software, unless there is clear thinking about the problems that need to be solved, and which ones can be solved realistically (or impacted) with analytics, the software will just sit, doing nothing useful. This is surely a factor in the divide between potential capabilities in analytics (i.e., software on the shelf) and benefits attained by analytics:

Still, about two-thirds of large U.S. companies believe they need to improve their analytical capabilities and only half believe they are spending enough on business analytics, according to an Accenture (ACN) survey of 250 executives that was released in December. In it, about 57% of companies said they don't have a beneficial, consistently updated, companywide analytical capability, and 72% are working to increase their company's use of business analytics. Today, only 60% of major decisions are based on analytics, according to the survey, while 40% are based on intuition.

The better consultants work themselves out of jobs, rather than perpetuating the problems. (Check out despair.com for tons of hilarious posters.)

Just more information that these are good times for data mining.

Saturday, April 25, 2009

Taking Assumptions With A Grain Of Salt

Occasionally, I come across descriptions of clustering or modeling techniques which include mention of "assumptions" being made by the algorithm. The "assumption" of normal errors from the linear model in least-squares regression is a good example. The "assumption" of Gaussian-distributed classes in discriminant analysis is another. I imagine that such assertions must leave novices with some questions and hesitation. What happens if these assumptions are not met? Can techniques ever be used if their assumptions are not tested and met? How badly can the assumption be broken before things go horribly wrong? It is important to understand the implications of these assumptions, and how they affect analysis.

In fact, the assumptions being made are made by the theorist who designed the algorithm, not the algorithm itself. Most often, such assumptions are necessary for some proof of optimality to hold. Considering myself the practical sort, I do not worry too much about these assumptions. What matters to me and my clients is how well the model works in practice (which can be assessed via test data), not how well its assumptions are met. Generally, such assumptions are rarely, if ever, strictly met in practice, and most of these algorithms do reasonably well even under such circumstances. A particular modeling algorithm may well be the best one available, despite not having its assumptions met.

My advice is to be aware of these assumptions to better understand the behavior of the algorithms one is using. Evaluate the performance of a specific modeling technique, not by looking back to its assumptions, but by looking forward to expected behavior, as indicated by rigorous out-of-sample and out-of-time testing.
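As a minimal sketch of that forward-looking evaluation (simulated data with assumed coefficients, not a real project): a least-squares fit whose normal-errors "assumption" is badly violated, judged not by checking the assumption but by its holdout performance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: errors are exponential (strongly skewed), so the
# normality "assumption" behind least squares is clearly broken.
n = 5000
x = rng.uniform(0, 10, n)
noise = rng.exponential(2.0, n) - 2.0          # skewed, zero-mean errors
y = 3.0 + 1.5 * x + noise                      # assumed true coefficients

# Train / holdout split: judge the model on data it never saw
tr = np.arange(n) < 4000
te = ~tr
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A[tr], y[tr], rcond=None)

pred = A[te] @ coef
rmse = np.sqrt(np.mean((y[te] - pred) ** 2))
print(coef, rmse)   # coefficients near (3, 1.5); holdout RMSE near the noise scale
```

Despite the violated assumption, the out-of-sample error lands near the irreducible noise level, which is exactly the evidence a practitioner needs.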

Thursday, April 02, 2009

Why normalization matters with K-Means

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and looked at just four fields: AGE, NUMCHILD, TARGET_D (the amount the recaptured lapsed donors gave), and LASTGIFT. I took only four to keep the problem simple, and chose variables with relatively large differences in mean values (where normalization might matter). The two monetary variables have the additional problem of being severely positively skewed.

The following image shows the results of two clustering runs: the first with raw data, the second with normalized data using the Clementine K-Means algorithm. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) and z-scores for all (the log transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.

Here are the results in tabular form. Note that I'm reporting unnormalized values for the "normalized" clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.

Note that:
1) the results are different, as measured by counts in each cluster
2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT--one cluster contains the large values and the remaining have little variance.
3) AGE and NUMCHILD have some similar breakouts (40s with more children and 40s with fewer children for example).

So, the conclusion (to answer the original question) is that K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. However, whether one desires this removal of bias depends on what one wants to find: if one wants a variable to influence the clusters more, one can manipulate the clusters in precisely this way, by increasing its relative magnitude.
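The effect can be seen in a self-contained way with synthetic stand-ins for the fields above (made-up data, not the KDD-Cup set, and a bare-bones K-Means rather than Clementine's): one field has genuine two-group structure, the other has large magnitude but no structure at all. Raw K-Means follows the big field; z-scored K-Means recovers the groups.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: an AGE-like field with two real groups, and a
# LASTGIFT-like field with large magnitude but no group structure.
n = 400
group = np.repeat([0, 1], n // 2)
age = np.where(group == 0, 30.0, 60.0) + rng.normal(0, 2, n)
gift = rng.uniform(0, 10_000, n)           # dominates raw Euclidean distance
X = np.column_stack([age, gift])

def kmeans(X, k, iters=50, n_init=10):
    """Plain Lloyd's algorithm; keep the best of n_init random starts."""
    best_labels, best_inertia = None, np.inf
    for seed in range(n_init):
        r = np.random.default_rng(seed)
        centers = X[r.choice(len(X), k, replace=False)]
        for _ in range(iters):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        inertia = d.min(axis=1).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

def agreement(labels, truth):
    a = (labels == truth).mean()
    return max(a, 1 - a)                   # cluster label order is arbitrary

raw_acc = agreement(kmeans(X, 2), group)   # raw data: clusters track gift

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each field
norm_acc = agreement(kmeans(Z, 2), group)  # normalized: clusters track age

print(raw_acc, norm_acc)
```

On the raw data the clusters split along the big-magnitude field and agree with the true groups only at chance level; after z-scoring they recover the groups almost perfectly.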

One last issue I didn't explore here is the effect of correlated variables (LASTGIFT and TARGET_D, to some degree). It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never measured the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).

Wednesday, April 01, 2009

Graphing Considered Dangerous

In my posting of Jun-25-2007, To Graph Or Not To Graph, I made the case (tentatively) that graphs weren't all they're cracked up to be, and provoked some lively discussion in the Comments section here. In his Apr-01-2009 posting, Why tables are really much better than graphs, on the Statistical Modeling, Causal Inference, and Social Science Web log, Andrew Gelman makes a much more forceful case against graphs. Readers may find Gelman's arguments of interest.

I am not "anti-graph", but do think that graphs are often used when other tools (test statistics, tables, etc.) would have been a better choice, and graphs are certainly frequently misused. Thoughts?

Thursday, March 19, 2009

How many software packages are too much?

I just saw a question at SmartDataCollective about how many data mining packages one needs. He writes,
we found out that a particular client is using THREE Data Mining softwares. Not statistical softwares or the base versions, but the complete, very expensive Data Mining softwares – SAS EM, SPSS Clementine and KXEN.

I was like, “Wow!!! But do you really need 3 Data Mining softwares???” Our initial questions and the client’s answers confirmed that inconsistent data formats was not the reason as the client already has a BI/DW system. Their reason? Well, they have the opinion that some algorithms/techniques in a particular DM software is much better and accurate than the same algorithms/techniques in another DM software.

I believe there are truly good reasons to have more than one data mining software package. Each tool has its own strengths and weaknesses. As one example, Affinium Model is very good at building hundreds or even thousands of models automatically, whereas Tibco S+ (formerly Insightful Miner) only builds one model at a time. On the other hand, the flexibility of Miner in data preparation, sampling, and settings for building models is much richer than Model. I like to have several tools around for these kind of reasons.

A second reason to have (or to be proficient in) multiple tools as an analytics consultant is that you can plug into nearly any organization that has tools they want you to use. Currently, I'm working on projects that use Clementine, Matlab, Statistica, and Insightful Miner. Last year I worked with a customer using CART (Salford Systems), Oracle Data Miner, Polyanalyst, and even, briefly, IBM Intelligent Miner.

However, except for very rare circumstances, the algorithms themselves are not appreciably different from tool to tool. Yes, I know that some tools have extra knobs and options, but backprop is backprop, the Gini index is the Gini index, and entropy is entropy. The only reason I would have both KXEN and SAS/EM or Clementine is if I sometimes wanted the automation of KXEN, and other times the full control of EM or Clementine (it is hard for me to imagine why I would want both Clementine and EM--any takers on this one?).
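To the point that these measures don't vary by vendor, a quick sketch in plain Python (no particular tool's implementation) of the two impurity formulas themselves:

```python
import numpy as np

# The impurity measures are fixed formulas, identical in any tool
# that implements them.
def gini(p):
    """Gini index of a class-probability vector: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5  (maximally impure two-class node)
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.9, 0.1]))     # 0.18 (nearly pure node)
```

Whatever knobs a vendor adds around them, these are the same numbers in every package.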

Monday, March 16, 2009

eMetrics Conference

Early-bird pricing ends Friday for the May 4-7 eMetrics conference in San Jose. You get a 12% discount if you use the promo code ABBOTT12 (don't worry, I don't get anything except the satisfaction that a reader of this blog got a discount). I can't go, but hope to get to one before too long.

Predictive Analytics Webinar

I'm participating in a free webinar through The Modeling Agency tomorrow at 4pm EDT (1pm PDT) for anyone interested in listening in. Tony Rathburn is doing the first technical part, and I follow with about 20 minutes of vignettes. If you do listen in, feel free to post comments here on the content (all critiques welcomed!). We'll repeat the webinar on April 7th and April 22nd.

Sunday, March 08, 2009

Some Interesting Analyses

I find it interesting to learn what other people are working on. To me, the applications can be as interesting as the technology, even if they're not saving millions or curing cancer. Some of these analyses could be a bit more rigorous, but they do suggest avenues for further research, and at least they aren't boring! Here are a few things I've run across in cyberspace recently:

Is Warhammer Balanced?

MLB Payroll Efficiency, 2006-2008

Wired magazine: issue 17.02

Analysis of the price of a piece of a lego set

Modeling Win Probability for a College Basketball Game

Saturday, March 07, 2009

Data Mining: Does It Get Any Better Than This?

The article Doing the Math to Find the Good Jobs appeared in the Jan-26-2009 issue of The Wall Street Journal, listing the top 3 "best" jobs (of 200 studied) as:

1. Mathematician
2. Actuary
3. Statistician

I assume that "data miner" fits in somewhere among these. Yippee!

Tuesday, February 17, 2009

Maybe these will be great days for data miners!

While perusing the NC State Institute for Advanced Analytics site (to follow up on the previous post on data mining education), I noticed a link to the US News and World Report career guide, one entry of which describes data mining as an "ahead of the curve" career for 2009. While the example mentioned is quite limited, it is interesting that data mining is getting such national recognition. Maybe we're in the right industry after all!

Saturday, February 14, 2009

Could these be great days for data miners?

In a recent article on cfo.com, Data Mining in the Meltdown: the Last, Best Hope?, the author describes how data quality is the key to the future success of businesses. But data quality by itself is not enough:
Of course, data quality matters little if a company is focusing on the wrong measures. The best companies adopt a customer-oriented definition of data quality and recognize that all items of data are not created equal...
In other words, the business objective phase (in the CRISP-DM way of viewing things) is critical. I would add that building models that are assessed in a manner commensurate with the business objective is every bit as important. If you build a series of regression models and take the one with the best R^2, you have very little idea from that metric whether or not the model will do anything productive. One must score and assess the model to reflect the business objective.

The author gets at this idea indirectly with this comment:
For every key performance indicator (KPI), for example, companies should be tracking a key risk indicator (KRI), Friend says. "You plan not just for results, but for contingencies. What happens if sales are down 20 percent?"
In other words, there may well be significant asymmetric costs to incorporate in the scoring of models. I'll be bringing this up at Predictive Analytics World this week; it is arguably one of the biggest mistakes made by modelers.
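A minimal sketch of what asymmetric-cost scoring looks like in practice. The cost values and toy predictions here are invented for illustration (in a real project the false-negative/false-positive cost ratio would come from the business objective, exactly as the CRISP-DM business understanding phase demands):

```python
def expected_cost(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """Average cost per case when a missed positive (false negative)
    is five times as costly as a false alarm (illustrative values)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += cost_fn   # missed a real positive
        elif t == 0 and p == 1:
            total += cost_fp   # false alarm
    return total / len(y_true)

y_true  = [1, 0, 1, 0, 0, 1]
model_a = [0, 0, 0, 0, 0, 1]  # 2 errors, but both are expensive false negatives
model_b = [1, 1, 1, 1, 1, 1]  # 3 errors, but all are cheap false positives
print(expected_cost(y_true, model_a))  # 10/6 ~ 1.67 per case
print(expected_cost(y_true, model_b))  # 3/6 = 0.50 per case
```

Note that model A makes fewer errors (and would win on accuracy, just as a model can win on R^2) yet loses badly once the asymmetric costs are applied--which is the whole point of assessing models against the business objective rather than a generic fit statistic.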

Tuesday, February 10, 2009

Can you learn data mining in undergraduate or graduate school?

I was recently asked by a former student from one of my data mining courses if a particular program was a good one to learn data mining (it happened to be this one, from NC State). It raises an interesting question: how much can data mining be learned from a book or a course?

Some of the best data miners I have met did not have any statistics course in their past, nor (for some) any higher-level mathematics. For my part, I was a computational mathematics major as an undergrad and studied applied math for my master's, but never took a stats course either (though I did take and TA a probability course). That stated, I always recommend in my courses that folks become familiar with basic statistics; one book I have recommended is linked in the book recommendations section--The Cartoon Guide to Statistics. Since I have never taken a college or graduate data mining course, I can't comment directly. My concern is that such courses are too theoretical (how the algorithms work) rather than practical (how to handle data problems, how to pose proper questions to be addressed by data mining, etc.).

I'm willing to be persuaded though, so if you have experience with good, practical data mining curricula, please let me know.

Saturday, January 31, 2009

Predictive Analytics World

There is a new predictive analytics conference coming up Feb 18-19 in San Francisco called Predictive Analytics World. I'm very much looking forward to it in the hopes that it will appeal to the data mining / predictive analytics practitioner.

I'll be presenting a case study I worked on with TN Marketing using ensembles of logistic regression models. Also, I'll be on a panel discussion on Cross-Industry Challenges and Solutions in Predictive Analytics.

Hope to see some of you there!

Text Mining and Regular Expressions

I've been spending quite a lot of time in the bowels of a text mining project recently, mostly in the text/concept extraction phase. We're using the SPSS Text Mining tool for the work so far. (As a quick aside, the text mining book I've enjoyed reading the most in recent months is the one by Weiss, Indurkhya, Zhang, and Damerau.)

The most difficult part of the project has been that all of the text is really customized lingo--a language of its own as presented in the notes sections of the documents we are reading. Therefore, we can't use the typical linguistic extraction techniques, and instead are relying heavily on regular expressions. That certainly takes me back a few years! I used to use regular expressions mostly in shell programming (Bourne, C shell, Korn shell, and later Bash).
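To give a flavor of the approach (the note text, abbreviations, and patterns below are invented for illustration--the actual project's lingo is different), regex-based extraction amounts to building a small dictionary of patterns for the concepts you care about and sweeping them over each note:

```python
import re

# Hypothetical maintenance-note lingo; abbreviations here are made up
# for illustration, not taken from the actual project.
note = "RPLCD PMP SN#4471; obs vib hi; chk brg nxt PM"

patterns = {
    "action":  re.compile(r"\b(RPLCD|CHK|INSP|ADJ)\b", re.IGNORECASE),
    "serial":  re.compile(r"SN#(\d+)"),
    "symptom": re.compile(r"\bvib\s+(hi|lo)\b", re.IGNORECASE),
}

# findall returns the captured group for each match
extracted = {name: pat.findall(note) for name, pat in patterns.items()}
print(extracted)
# {'action': ['RPLCD', 'chk'], 'serial': ['4471'], 'symptom': ['hi']}
```

The appeal is that each pattern encodes one rule of the "language" explicitly, so when the analysts identify a new abbreviation or phrasing, it becomes one more entry in the dictionary rather than a retraining exercise.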

I must say it has been very productive, though it also makes me appreciate consistent language rules--which simply don't exist in our notes. As I am able, I'll post more specifics on this project.

Regarding books on regular expressions, I found the Unix books weren't quite so good on this topic. However, the O'Reilly Mastering Regular Expressions book is quite good.