Tuesday, December 29, 2009

2009 Retrospective

I was thinking about top data mining trends in 2009, and searched for what others thought about it. I'll combine a few 2009 "top 3" lists here, including top trends (as described at Enterprise Regulars here), and posts here that generated the most buzz.

First, the top data mining news story was IBM's purchase of SPSS. It will be very interesting to see if this continues the trend toward integration of Business Intelligence and Predictive Analytics that one sees with SAS, Tibco and now IBM/SPSS.

The Enterprise Regulars post included a few interesting 2010 trends (though since data mining is all about using historical data to predict future behavior, I'll assume past behavior will continue). In particular, four of them were of interest to me:
  1. The holy grail of the predictive, real-time enterprise (his #2)
  2. SaaS / Cloud BI Tools will steal significant revenue from on-premise vendors but also fight for limited oxygen amongst themselves. (his #5)
  3. Advanced Visualization will continue to increase in depth and relevance to broader audiences. (his #7)
  4. Open Source offerings will continue to make in-roads against on-premise offerings. (his #8)
I agree with his #2 and #7 (integration of BI/PA and visualization). Several customers I work with are trying to integrate predictive analytics into the database to make better decisions. The difference now is that there is also interest in integrating this process with other data-centric (BI) operations to provide the right information to decision-makers with the right level of granularity (detail). This is typically a combination of creating the ability to perform ad hoc queries along with examining the results (rankings and projections) from predictive analytics.

However, I have not seen cloud computing and open source take off from the perspective of the customers I work with. Both have certainly generated buzz, and in the courses I teach there is considerable interest in open source computing (R in particular), but it has still been interest rather than action. I expect, though, that as the allure of data mining and predictive analytics extends its reach deeper into organizations, the need for inexpensive (in dollars) tools will result in increased use of open source and free tools such as R, RapidMiner, Weka, Tanagra, Orange, Knime, and others.

Lastly, from this blog, the top posts of 2009 were
  1. Why normalization matters with K-Means
  2. How many software packages are too much?
  3. Data Mining: Does it get any better than this?
  4. Text Mining and Regular Expressions

Happy New Year!

Tuesday, December 15, 2009

Overlap in the Business Intelligence / Predictive Analytics Space

I've received considerable feedback on the post Business Intelligence vs. Business Analytics, which has also caused me to think more about the BI space and its overlap with data mining (DM) / predictive analytics (PA) / business analytics (BA). One place to look for this, of course, is Gartner: how they define Business Intelligence, and which vendors overlap between these industries. (I think of this in much the same way as I do DM; I look to data miners to define themselves and what they do rather than to other industries and how they define data mining.)

I found the Gartner Magic Quadrant for Business Intelligence in 2009 here, and was very curious to understand (1) how they define BI, and (2) which BI players are also big players in the data mining space. Answering the first question, data analysis in the BI world is defined here as comprising four parts: OLAP, visualization, scorecards, and data mining. So DM in this view is a subset of BI.

Second, the quadrant interestingly contains only a few vendors I would consider to be top data mining vendors: SAS, Oracle, IBM (Cognos), and Microsoft in the "Leaders" category, and Tibco in the Visionaries category. Of these, only SAS (with Enterprise Miner) and Microsoft (SQL Server) showed up in the top 10 of the Rexer Analytics 2008 software tool survey, though Tibco showed up in the top 20 (with Tibco Spotfire Miner).

I think this emphasizes again that BI and DM/PA/BA approach analysis differently, even if the end result is the same (a scorecard, dashboard, report, or transactional decisioning system).

Sunday, December 06, 2009

Business Analytics vs. Business Intelligence

I used to be one that thought the term "data mining" would stay as the description of the kind of analytic work I do. To a large degree it has, but there are always new spins on things, and it seems that quite often in the business world, Predictive Analytics or Business Analytics are the terms of the day.

I just came across this post from the Smart Data Collective: OLAP is Dead (Long Live Analytics), which had some fascinating graphs on hits related to the phrases OLAP and Analytics. The first shows the steady decline of OLAP as a searched term, to the point where even The OLAP Report has been renamed The BI Verdict. Meanwhile, "analytics" has been increasing steadily in hits. SAS even touts itself as a leader in "Business Analytics" now.

Which brings me to the question in the title of this post. It seems to me that Business Intelligence has taken over the role that OLAP and dashboarding used to take on (at least in the circles I worked in). Is there a difference between Business Intelligence and Business Analytics? James Taylor, someone whom I respect tremendously, doesn't think so.
As SAS talked about its business analytics framework it became clear that they envision the results of data mining and predictive analytics (where they genuinely have offerings superior to almost everyone) will be delivered in reports or dashboards. This is what I have somewhat dismissively called "predictive reporting" and while it is better than purely historical reporting, it does not do much to make every decision analytically based as it leaves out the decisions made by machines (which don't read reports) and those made by people with too little time to read a report (most call center or retail staff, for instance) or no skill at interpreting it.

I guess I just don't see the difference between BI and BA...

If all of business analytics is reduced to "predictive reporting", then I can see why some might consider it no more than business intelligence. But even so, are they the same? I don't mean whether the results are the same; for that matter, the final decisions from analytics for, say, classification look just the same as a human decision (buy or not buy? fraud or not?). But is the process the same? I would argue "no". Much of the power of predictive analytics comes from the automation of searching for and assessing the nonlinearities, interaction effects, and combinatorics relating observables to outcomes. Rather than assessing these manually, one automates the process through the use of "decision trees", "neural networks", or some other algorithm. So the difference lies in the efficiency of the process.
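To make the automation point concrete, here is a minimal sketch of my own (using scikit-learn and synthetic data, neither of which appears in this post): a pure interaction effect (XOR) that a linear model cannot capture, but which a depth-2 decision tree discovers automatically, with no manual specification of the interaction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two binary inputs; the outcome is their XOR -- a pure interaction,
# with no main effect for either input on its own
X = rng.integers(0, 2, size=(2000, 2)).astype(float)
y = (X[:, 0] != X[:, 1]).astype(int)

lin = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The linear model scores near chance; the shallow tree finds the
# interaction automatically by splitting on one input, then the other
print("logistic regression accuracy:", lin.score(X, y))
print("decision tree accuracy:      ", tree.score(X, y))
```

The point is not that trees beat regression in general, but that the search for interactions is part of the algorithm rather than part of the analyst's manual workload.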

Now how the predictive information is used, in a report, as part of an automated system or in some other way, is a critically important question, but independent of how the decisions are generated.

Tuesday, December 01, 2009

Computer Science and Theology

I have been reading a book by Don Knuth called Things a Computer Scientist Rarely Talks About (Center for the Study of Language and Information - Lecture Notes)--a very good read for those of you interested in theology as well as analytics. This post is not about the theology of the book (as interesting as that is to me), but rather the reason described in this book for his writing of another book called 3:16, a study of all the 3:16 verses in the Bible. In his chapter on randomized testing (I like to think of model ensembles here), he describes how random sampling is a good way to get an idea of the content of "stuff", whether computer science assignments (he actually does this--randomly take page X of a project and look at that in depth), or understanding books (like the Bible). His 3:16 book takes this verse from every book in the Bible to get a sense of the overall message of the Bible. He admittedly chose 3:16 because of John 3:16 so that he would get at least one great verse, but this was a concession to making the book marketable.

At first I wasn't a big fan of this idea. After all, it is a small sample. But he describes how he then studied these verses in depth. Whereas his prior understanding of the Bible was vague and general (which has its positive points), this exercise also led to a deeper (albeit narrower) understanding. I recommend this approach very much.

What does this have to do with analytics? Data Mining often is viewed as a way to get the gist of your data, see the big picture, understand patterns through summarized views. But just as important is the deep view, looking at a few examples (prototypes) in depth. In the text mining project I'm working on right now, while we extract "concepts", much of our time is also spent tracing a few text blocks through the processing to understand in detail why the analytics is working the way it does. I'm a "both / and" kind of guy, so this suits me well; big picture analytics as well as deep dives into record-level descriptions.

Monday, November 23, 2009

Stratified Sampling vs. Posterior Probability Thresholds

One of the great things about conferences like the recent Predictive Analytics World is how many technical interactions one has with top practitioners; this past October was no exception. One such interaction was with Tim Manns, who blogs here. We were talking about Clementine and what to do with small populations of 1s in the target variable, which prompted me to jump onto my soapbox with an issue that I had never read about, but which occurs commonly in data mining problems such as response modeling and fraud detection.

The setup goes something like this: you have 1% responders, you build models, and the model "says" every record is a 0. My explanation for this was always that errors in classification models take place when the same pattern of inputs can produce both outcomes. In this situation, what is the best guess? The most commonly occurring output variable value. If you have 99% 0s, that is most likely a 0, and therefore data mining tools will produce the answer "0". The common solution to this is to resample the data (stratify) so that one has equal numbers of 0s and 1s in the data, and then rebuild the model. While this is true, it misses an important factor.
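As a quick sketch of that common stratification step (my own illustration with NumPy and synthetic data, not anything from the post), downsampling the 0s to match the 1s looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target with roughly 1% responders (1s) and 99% non-responders (0s)
y = (rng.random(10000) < 0.01).astype(int)
X = rng.normal(size=(10000, 5))

ones = np.flatnonzero(y == 1)
zeros = np.flatnonzero(y == 0)

# Randomly downsample the 0s to match the number of 1s (a 50/50 stratified sample)
keep_zeros = rng.choice(zeros, size=len(ones), replace=False)
idx = np.concatenate([ones, keep_zeros])

X_bal, y_bal = X[idx], y[idx]
print("balanced class proportion of 1s:", y_bal.mean())
```

After rebuilding the model on the balanced sample, the default 0.5 threshold behaves sensibly again, which is exactly why this is the textbook remedy.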

I can't claim credit for this (thanks Marie!). I was working on a consulting project with a statistician, and when we were building logistic regression models, I recommended resampling so we wouldn't have the "model calls everything a 0" problem. She seemed puzzled and asked why not just threshold at the prior probability level. It was clear right away that she was right, and I've been doing it ever since (with logistic regression or neural networks in particular).

What was she saying? First, it needs to be stated that no algorithm produces "decisions". Logistic regression produces probabilities. Neural networks produce confidence values (though I just had a conversation with one of the smartest machine learning guys I know who talked about neural networks producing true probabilities--maybe I'll blog on this more another time). The decisions that one sees ("all records are called 0s") are produced by the software, which interprets the probabilities or confidence values by thresholding them at 0.5. It is always 0.5; I don't think I've ever found a data mining software package that doesn't threshold at 0.5, in fact. So the software expects the prior probabilities of 0s and 1s to be equal. When they are not (like with 99% 0s and 1% 1s), this threshold is completely inappropriate; the distribution of predicted probabilities will center roughly on the prior probability level (0.01 for the 1% response rate problem). I show some examples of this in my data mining course that make this clearer.

So what can one do? If one thresholds at 0.01 rather than 0.5, one gets a nice confusion matrix out of the classification problem. Of course if you use a ROC curve, Lift Chart or Gains Chart to assess your model, you don't worry about thresholding anyway.
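Here is a hedged sketch of the idea using scikit-learn on synthetic data (my own illustration; the tools discussed in this post are Clementine and the like): the same fitted model, with decisions read off at the prior rather than at the default 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with about 1% positives, mimicking a response-modeling problem
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]  # posterior probability of class 1

# Default 0.5 threshold: nearly every record is called a 0
default_calls = (proba >= 0.5).astype(int)

# Threshold at the prior instead (the proportion of 1s in the training data)
prior = y.mean()
prior_calls = (proba >= prior).astype(int)

print("positives flagged at 0.5 threshold:  ", default_calls.sum())
print("positives flagged at prior threshold:", prior_calls.sum())
```

No resampling, no model changes; the only thing that moves is the cutoff applied to the probabilities, and the confusion matrix becomes useful again.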

Which brings me to the conversation with Tim Manns. I'm glad he tried it out himself, though I don't think one has to make the target variable continuous to make this work. Tim did his testing in Clementine, but the same holds for any other data mining software tool. Tim's trick is correct: if you make the [0,1] target variable numeric, you can build a neural network just fine and the predicted value is "exposed". In Clementine, if you keep it as a "flag" variable, you would threshold the propensity value ($NRP-target).

So, read Tim's post (and his other posts!). This trick can be used with nearly any tool (I've done it with Matlab and Tibco Spotfire Miner, among others).

Now, if only tools would include an option to threshold the propensity at either 0.5 or the prior probability (or, more precisely, the proportion in the training data)...

Thursday, November 12, 2009

San Diego Forum on Analytics -- review

I just got back from the half-day Forum on Analytics in San Diego, which included a keynote by Wayne Peacock (now with Inevit, but formerly VP of BI at Netflix), who spoke on how pervasive analytics was and is at Netflix, covering areas as diverse as finance, customer service, marketing, network optimization, operations, and product development. It was particularly interesting to me that as of 2006, their data warehouse was not in place; instead they had a "data landfill" (term of the day for me!). The other quote from his talk that I found provocative was related to their web site: "If the web site doesn't go down once a year, we aren't pushing hard enough." However, this is changing somewhat because of their online content delivery, where websites going down have a much bigger downside!

The rest of the morning contained three panel discussions, which was interesting in and of itself to see what topics were considered most important: Mining Biodata, Web 3.0, and Job Opportunities in Analytics.

During the Biodata panel, Nancy Miller Latimer of Accelrys, Inc. mentioned in passing a software tool that they have developed to do essentially visual programming of biodata; it looks like the typical Clementine/Enterprise Miner/Tibco Spotfire Miner/Polyanalyst (and so many other tools, including Statistica and Weka) interface for doing data prep, but their tool is specific to biodata, including loading technical papers, chemical structure data, etc. I've been fascinated for years by the relatively parallel paths taken by the bioinformatics/cheminformatics world and the data mining world: very similar ideas, but very different toolsets because of the very different characteristics of the data. Much was said about the future of sequencing of the human genome: 2 humans in 2007, 6+ in 2008, perhaps 150 in 2009, and growing exponentially (faster than Moore's law). There was talk of the $1000 human sequence soon.

The Web 3.0 panel included two folks from Intuit touting a Facebook campaign done to grow use of TurboTax virally. Interesting stuff, but I'm still dubious of the effect of social networking on all but the under-30 crowd. I think I'll finally begin to tweet, but only out of curiosity, not because I expect anything of business value from it. Is it inevitable that Facebook, Twitter, and YouTube will become mainstream ways to develop business? For me? I don't see how yet.

Lastly, on the analytics jobs in San Diego...there are over 100 analytics companies in San Diego (most of them undoubtedly small or micro, like me), and there was an evangelistic cry for San Diego to become an analytics cluster in the U.S. I think this is actually possible, and has been the case to some degree for some time now. I had forgotten about Keylime (a San Diego web company) being purchased by Yahoo, and Websidestory being purchased by Omniture. Of course Fair Isaac and HNC were discussed as well. Time will tell, and right now, things are tough all around, though Kanani Masterson of TriStaff Group said there were currently 225 analytics / web analytics job openings, so things aren't completely dead.

All in all, it was a lot to pack into a morning.

Wednesday, October 28, 2009

Predictive Analytics World, part 1

After attending Predictive Analytics World (PAW) last week, I must say that I'm still impressed with the conference, especially for practitioners.

Eric Siegel's description of uplift modeling in the opening session was another example of a practical (and in this case, relatively new) approach to predictive modeling. I only heard about uplift modeling for the first time (to my discredit) at the February PAW, and would have had a company implement it this past summer were it not for a re-org that killed the modeling efforts.

The R community had another strong showing, with REvolution being there, and another R useR meeting. I'm amazed at the influence of R in the data mining world. It makes me want to become fluent in R; it's on the list!

The keynotes by Usama Fayyad and Stephen Baker were every bit as good as one would expect, but it was the interactions with attendees that impressed me most. The talk I gave received great questions about the practice of using ensembles by several folks who were planning on using this technique with their own data. It's this practical side to the conference that I liked.

Friday, July 17, 2009

For Do-It-Yourself Types

Recently, I came across the Web site of mloss.org ("machine learning open source software"), which houses a collection of software components which will be of interest to inventive data miners. Spanning a variety of languages and algorithm types, the collection can be filtered and searched from the Web site. Good hunting!

Tuesday, June 30, 2009

New Data Mining Book Out

The new Nisbet, Elder, and Miner book is out now, and has been receiving good reviews on Amazon. A sampling of the 6 reviews so far (all 5 stars):

The "Handbook of Statistical Analysis & Data Mining Applications" is the finest book I have seen on the subject. It is not only a beautifully crafted book, with numerous color graphs, charts, tables, and screen shots, but the statistical discussion is both clear and comprehensive.


This is an extraordinary book. So often within this field books are offered as bibles only to fall short. This book does not and delivers a wide array of information and useful tips for the beginner and veteran data miner.


What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining writ large, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art.


This is one of the few, of many, data mining books that delivers what it promises.


It has a great mix of data mining principles with step-by-step solutions (case studies) using data mining software, such as Clementine, Enterprise Miner and Statistica. It is this practical approach to data mining that fills a void in the current selection of books in the marketplace (and there are many great data mining books out there).

For some, the benefit of the book will be the case studies on Fraud Detection or Text Mining. For others, seeing how to solve problems using Enterprise Miner (or Clementine or Statistica) will be of most benefit, operating almost like a user's manual. I most appreciated the first chapter on the history of statistics (Nisbet), Model Complexity and Ensembles (Elder), and the 10 Data Mining Mistakes (Elder).

One more quote, this from the second foreword in the book:

This volume is not a theoretical treatment of the subject -- the authors themselves recommend other books for this -- but rather contains a description of data mining principles and techniques in a series of “knowledge-transfer” sessions, where examples from real data mining projects illustrate the main ideas. This aspect of the book makes it most valuable for practitioners, whether novice or more experienced.


The Handbook of Statistical Analysis and Data Mining Applications is an exceptional book that should be on every data miner's bookshelf, or better yet, found lying open next to the computer.

-- Dean Abbott, Abbott Analytics

Monday, May 18, 2009

Is analytics a winner in a recession?

Even in a recession, analytics can (and should) do well. I am often asked how the economy has affected me, and my quick answer is that "it doesn't affect me", mostly because I am a small, sole proprietorship. In general, though, bad economic times can be good for consultants, as corporations shed employees and look for a way to perform their analytics tasks efficiently without having to take on longer-term commitments.

The way it is put in a recent Business Week article is this (they describe Business Intelligence software rather than data mining software, but the principles are certainly similar):

Interest in business intelligence software is on the rise, analysts say, as economic woes force companies to pursue profit by delving deeper into the information already at their fingertips. "There's a tremendous pressure on cost containment, on developing accurate forecasts of sales and expenses and trying to align the expense stream with projected revenue stream," says John Van Decker, research vice-president at research firm Gartner (IT).

And where software is purchased, companies usually spend many times the cost of the software on training and consulting to help them understand better how to use it:

Add in other essential services, and a company can expect to spend more on BI than for other types of software, Evelson says. "For every dollar you spend on business intelligence software, you better expect to spend five to seven times as much on services," such as ensuring it jells with the rest of the company's software, he says.

But even with software, unless there is clear thinking about the problems that need to be solved, and which ones can be solved realistically (or impacted) with analytics, the software will just sit, doing nothing useful. This is surely a factor in the divide between potential capabilities in analytics (i.e., software on the shelf) and benefits attained by analytics:

Still, about two-thirds of large U.S. companies believe they need to improve their analytical capabilities and only half believe they are spending enough on business analytics, according to an Accenture (ACN) survey of 250 executives that was released in December. In it, about 57% of companies said they don't have a beneficial, consistently updated, companywide analytical capability, and 72% are working to increase their company's use of business analytics. Today, only 60% of major decisions are based on analytics, according to the survey, while 40% are based on intuition.



The better consultants work themselves out of jobs, rather than perpetuating the problems. (Check out despair.com for tons of hilarious posters.)



Just more evidence that these are good times for data mining.