Thursday, November 12, 2009

San Diego Forum on Analytics -- review

I just got back from the half-day Forum on Analytics in San Diego, which included a keynote by Wayne Peacock (now with Inevit, but formerly VP of BI at Netflix). He spoke on how pervasive analytics was and is at Netflix, covering areas as diverse as finance, customer service, marketing, network optimization, operations, and product development. It was particularly interesting to me that as of 2006, their data warehouse was not in place; instead they had a "data landfill" (term of the day for me!). The other quote from his talk that I found provocative was related to their web site: "If the web site doesn't go down once a year, we aren't pushing hard enough." However, this is changing somewhat because of their online content delivery, where the web site going down has a much bigger downside!

The rest of the morning consisted of 3 panel discussions, and it was interesting in and of itself to see which topics were considered most important: Mining Biodata, Web 3.0, and Job Opportunities in Analytics.

During the Biodata panel, Nancy Miller Latimer of Accelrys, Inc. mentioned in passing a software tool that they have developed to do essentially visual programming of biodata. It looks like the typical data prep interface found in Clementine, Enterprise Miner, Tibco Spotfire Miner, Polyanalyst, and so many other tools (including Statistica and Weka), but their tool is specific to biodata, including loading technical papers, chemical structure data, etc. I've been fascinated for years by the relatively parallel paths taken by the bioinformatics/cheminformatics world and the data mining world: very similar ideas, but very different toolsets because of the very different characteristics of the data. Much was said about the future of sequencing of the human genome: 2 humans in 2007, 6+ in 2008, perhaps 150 in 2009 and growing exponentially (faster than Moore's law). There was talk of the $1000 human sequence arriving soon.

The Web 3.0 panel included 2 folks from Intuit touting a Facebook campaign done to grow use of TurboTax virally. Interesting stuff, but I'm still dubious of the effect of social networking on all but the under-30 crowd. I think I'll finally begin to tweet, but only out of curiosity, not because I expect anything of business value from it. Is it inevitable that Facebook, Twitter, and YouTube will become mainstream ways to develop business? For me? I don't see how yet.

Lastly, on the analytics jobs in San Diego: there are over 100 analytics companies in San Diego (most of them undoubtedly small or micro, like me), and there was an evangelistic cry for San Diego to become an analytics cluster in the U.S. I think this is actually possible, and has been the case to some degree for some time now. I had forgotten about Keylime (a San Diego web company) being purchased by Yahoo, and WebSideStory being purchased by Omniture. Of course Fair Isaac and HNC were discussed as well. Time will tell, and right now, things are tough all around, though Kanani Masterson of TriStaff Group said there were currently 225 analytics / web analytics job openings, so things aren't completely dead.

All in all, it was a lot to pack into a morning.

Wednesday, October 28, 2009

Predictive Analytics World, part 1

After attending Predictive Analytics World (PAW) last week, I must say that I'm still impressed with the conference, especially for practitioners.

Eric Siegel's description of uplift modeling in the opening session was another example of a practical (and in this case, relatively new) approach to predictive modeling. I only heard about uplift modeling for the first time (to my discredit) at the February PAW, and almost had a company implement it this past summer were it not for a re-org that killed the modeling efforts.

The R community had another strong showing, with REvolution being there, and another R useR meeting. I'm amazed at the influence of R in the data mining world. It makes me want to become fluent in R! It's on the list.

The keynotes by Usama Fayyad and Stephen Baker were every bit as good as one would expect, but it was the interactions with attendees that impressed me most. The talk I gave received great questions about the practice of using ensembles by several folks who were planning on using this technique with their own data. It's this practical side to the conference that I liked.

Friday, July 17, 2009

For Do-It-Yourself Types

Recently, I came across the Web site of mloss.org ("machine learning open source software"), which houses a collection of software components which will be of interest to inventive data miners. Spanning a variety of languages and algorithm types, the collection can be filtered and searched from the Web site. Good hunting!

Tuesday, June 30, 2009

New Data Mining Book Out

The new Nisbet, Elder, and Miner book is out now, and has been receiving good reviews on Amazon. A sampling of the 6 reviews so far (all 5 stars):

The "Handbook of Statistical Analysis & Data Mining Applications" is the finest book I have seen on the subject. It is not only a beautifully crafted book, with numerous color graphs, chart, tables, and screen shots, but the statistical discussion is both clear and comprehensive.


This is an extraordinary book. So often within this field books are offered as bibles only to fall short. This book does not and delivers a wide array of information and useful tips for the beginner and veteran data miner.


What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining writ large, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art.


This is one of the few, of many, data mining books that delivers what it promises.


It has a great mix of data mining principles with step-by-step solutions (case studies) using data mining software, such as Clementine, Enterprise Miner and Statistica. It is this practical approach to data mining that fills a void in the current selection of books in the marketplace (and there are many great data mining books out there).

For some, the benefit of the book will be the case studies on Fraud Detection or Text Mining. For others, seeing how to solve problems using Enterprise Miner (or Clementine or Statistica) will be of most benefit, operating almost like a user's manual. I most appreciated the first chapter on the history of statistics (Nisbet), Model Complexity and Ensembles (Elder) and the 10 Data Mining Mistakes (Elder).

One more quote, this from the second foreword in the book:

This volume is not a theoretical treatment of the subject -- the authors themselves recommend other books for this -- but rather contains a description of data mining principles and techniques in a series of “knowledge-transfer” sessions, where examples from real data mining projects illustrate the main ideas. This aspect of the book makes it most valuable for practitioners, whether novice or more experienced.


The Handbook of Statistical Analysis and Data Mining Applications is an exceptional book that should be on every data miner's bookshelf, or better yet, found lying open next to the computer.

-- Dean Abbott, Abbott Analytics

Monday, May 18, 2009

Is analytics a winner in a recession?

Even in a recession, analytics can (and should) do well. I am often asked how the economy has affected me, and my quick answer is that "it doesn't affect me", mostly because I am a small, sole proprietorship. In general, though, bad economic times can be good for consultants as corporations shed employees and look for a way to perform their analytics tasks efficiently without having to take on longer-term commitments.

The way it is put in a recent Business Week article is this (they describe Business Intelligence software rather than data mining software, but the principles are certainly similar):

Interest in business intelligence software is on the rise, analysts say, as economic woes force companies to pursue profit by delving deeper into the information already at their fingertips. "There's a tremendous pressure on cost containment, on developing accurate forecasts of sales and expenses and trying to align the expense stream with projected revenue stream," says John Van Decker, research vice-president at research firm Gartner (IT).

And where software is purchased, the cost of training and consulting to help people understand how to use it is usually many times the cost of the software itself:

Add in other essential services, and a company can expect to spend more on BI than for other types of software, Evelson says. "For every dollar you spend on business intelligence software, you better expect to spend five to seven times as much on services," such as ensuring it jells with the rest of the company's software, he says.

But even with software, unless there is clear thinking about the problems that need to be solved, and which ones can be solved realistically (or impacted) with analytics, the software will just sit, doing nothing useful. This is surely a factor in the divide between potential capabilities in analytics (i.e., software on the shelf) and benefits attained by analytics:

Still, about two-thirds of large U.S. companies believe they need to improve their analytical capabilities and only half believe they are spending enough on business analytics, according to an Accenture (ACN) survey of 250 executives that was released in December. In it, about 57% of companies said they don't have a beneficial, consistently updated, companywide analytical capability, and 72% are working to increase their company's use of business analytics. Today, only 60% of major decisions are based on analytics, according to the survey, while 40% are based on intuition.



The better consultants work themselves out of jobs, rather than perpetuating the problems. (check out despair.com for tons of hilarious posters).



Just more information that these are good times for data mining.

Saturday, April 25, 2009

Taking Assumptions With A Grain Of Salt

Occasionally, I come across descriptions of clustering or modeling techniques which include mention of "assumptions" being made by the algorithm. The "assumption" of normal errors from the linear model in least-squares regression is a good example. The "assumption" of Gaussian-distributed classes in discriminant analysis is another. I imagine that such assertions must leave novices with some questions and hesitation. What happens if these assumptions are not met? Can techniques ever be used if their assumptions are not tested and met? How badly can the assumption be broken before things go horribly wrong? It is important to understand the implications of these assumptions, and how they affect analysis.

In fact, the assumptions being made are made by the theorist who designed the algorithm, not the algorithm itself. Most often, such assumptions are necessary for some proof of optimality to hold. Considering myself the practical sort, I do not worry too much about these assumptions. What matters to me and my clients is how well the model works in practice (which can be assessed via test data), not how well its assumptions are met. Generally, such assumptions are rarely, if ever, strictly met in practice, and most of these algorithms do reasonably well even under such circumstances. A particular modeling algorithm may well be the best one available, despite not having its assumptions met.

My advice is to be aware of these assumptions to better understand the behavior of the algorithms one is using. Evaluate the performance of a specific modeling technique, not by looking back to its assumptions, but by looking forward to expected behavior, as indicated by rigorous out-of-sample and out-of-time testing.
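
As a minimal sketch of that forward-looking evaluation (Python, with made-up data and helper names of my own, not anything from a particular tool): fit a least-squares line to data whose errors are deliberately skewed rather than normal, then judge the model only by its error on data held out from the fit. Out-of-time testing would follow the same pattern, with the held-out records drawn from a later period.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of the fitted line on (xs, ys)."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    # Hypothetical data: a linear signal plus skewed (exponential) noise,
    # so the normal-errors assumption is violated on purpose.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.expovariate(1.0) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)  # held out; never used in the fit

model = fit_line(train_x, train_y)
print("train MAE:", mean_abs_error(model, train_x, train_y))
print("test MAE: ", mean_abs_error(model, test_x, test_y))
```

If the held-out error is acceptable for the problem at hand, the broken assumption has not cost much; if it is not, that is the signal to try something else.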

Thursday, April 02, 2009

Why normalization matters with K-Means

A question about K-means clustering in Clementine was posted here. I thought I knew the answer, but took the opportunity to prove it to myself.

I took the KDD-Cup 98 data and just looked at four fields: Age, NumChild, TARGET_D (the amount the recaptured lapsed donors gave) and LASTGIFT. I took only four to make the problem simpler, and chose variables that had relatively large differences in mean values (where normalization might matter). Also, another problem with the two monetary variables is that they are both skewed positively (severely so).

The following image shows the results of two clustering runs: the first with raw data, the second with normalized data using the Clementine K-Means algorithm. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) and z-scores for all (the log transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.

Here are the results in tabular form. Note that I'm reporting unnormalized values for the "normalized" clusters even though the actual clusters were formed by the normalized values. This is purely for comparative purposes.

Note that:
1) the results are different, as measured by the counts in each cluster.
2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT: one cluster contains the large values and the remaining clusters have little variance.
3) AGE and NUMCHILD have some similar breakouts (40s with more children and 40s with fewer children, for example).

So, the conclusion (to answer the original question) is that K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. However, whether or not one wants this bias removed depends on what one wants to find: if one wants a variable to influence the clusters more, one can manipulate the clusters in precisely this way, by increasing the relative magnitude of that field.
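
To see why magnitude matters, here is a small sketch in Python. The records are invented for illustration (they are not the KDD-Cup 98 data), and this is not Clementine's code; it only measures how much of the total variance, which is what a squared-Euclidean-distance method like K-Means works on, each field carries before and after the log transform and z-scoring described above.

```python
import math

# Hypothetical records: (age, number of children, last gift in dollars).
records = [
    (45, 1, 10.0),
    (47, 3, 12.0),
    (72, 0, 15.0),
    (38, 2, 500.0),  # one very large gift
]

def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def normalize(rows):
    """Log-transform the monetary field, then z-score every field."""
    ages = zscores([r[0] for r in rows])
    kids = zscores([r[1] for r in rows])
    gifts = zscores([math.log(r[2]) for r in rows])
    return list(zip(ages, kids, gifts))

def variance_shares(rows):
    """Fraction of the total variance (summed over fields) carried by each field."""
    variances = []
    for column in zip(*rows):
        mean = sum(column) / len(column)
        variances.append(sum((v - mean) ** 2 for v in column) / len(column))
    total = sum(variances)
    return [v / total for v in variances]

raw_shares = variance_shares(records)
norm_shares = variance_shares(normalize(records))

print("raw (age, kids, gift):       ", [round(s, 3) for s in raw_shares])
print("normalized (age, kids, gift):", [round(s, 3) for s in norm_shares])
```

On the raw values the gift field carries nearly all of the variance, so distances are effectively gift distances; after normalization each field carries an equal third.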

One last issue that I didn't explore here is the effect of correlated variables (LASTGIFT and TARGET_D, to some degree, here). It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never proved the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).
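
The limiting case, at least, is easy to check: a perfectly correlated field is a duplicate, and a duplicate counts twice in squared Euclidean distance. The Python sketch below uses made-up, already-standardized points and a hypothetical helper to show it; it says nothing about how large the effect is for partial correlation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def with_duplicate_first_field(row):
    """Hypothetical helper: prepend a copy of the first field to the row."""
    return (row[0],) + row

# Hypothetical standardized records with two fields: (gift, age).
p = (1.0, 0.0)  # differs from r only in gift
q = (0.0, 1.0)  # differs from r only in age
r = (0.0, 0.0)

# With two fields, r is equally far from p and q.
print(squared_distance(p, r), squared_distance(q, r))

# With gift entered twice, the gift difference weighs double.
p3, q3, r3 = (with_duplicate_first_field(t) for t in (p, q, r))
print(squared_distance(p3, r3), squared_distance(q3, r3))
```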

Wednesday, April 01, 2009

Graphing Considered Dangerous

In my posting of Jun-25-2007, To Graph Or Not To Graph, I made the case (tentatively) that graphs weren't all they're cracked up to be, and provoked some lively discussion in the Comments section here. In his Apr-01-2009 posting, Why tables are really much better than graphs on the Statistical Modeling, Causal Inference, and Social Science Web log, Andrew Gelman makes a much more forceful case against graphs. Readers may find Gelman's arguments of interest.

I am not "anti-graph", but do think that graphs are often used when other tools (test statistics, tables, etc.) would have been a better choice, and graphs are certainly frequently misused. Thoughts?

Thursday, March 19, 2009

How many software packages are too many?

I just saw a question at SmartDataCollective about how many data mining packages one needs. The poster writes:
we found out that a particular client is using THREE Data Mining softwares. Not statistical softwares or the base versions, but the complete, very expensive Data Mining softwares – SAS EM, SPSS Clementine and KXEN.

I was like, “Wow!!! But do you really need 3 Data Mining softwares???” Our initial questions and the client’s answers confirmed that inconsistent data formats was not the reason as the client already has a BI/DW system. Their reason? Well, they have the opinion that some algorithms/techniques in a particular DM software is much better and accurate than the same algorithms/techniques in another DM software.

I believe there are truly good reasons to have more than one data mining software package. Each tool has its own strengths and weaknesses. As one example, Affinium Model is very good at building hundreds or even thousands of models automatically, whereas Tibco S+ (formerly Insightful Miner) only builds one model at a time. On the other hand, Miner's flexibility in data preparation, sampling, and settings for building models is much richer than that of Model. I like to have several tools around for these kinds of reasons.

A second reason to have (or to be proficient in) multiple tools as an analytics consultant is that you can plug into nearly any organization that already has tools it wants you to use. Currently, I'm working on projects that are using Clementine, Matlab, Statistica, and Insightful Miner. Last year I worked with a customer that was using CART (Salford Systems), Oracle Data Miner, Polyanalyst, and even, briefly, IBM Intelligent Miner.

However, except for very rare circumstances, the algorithms themselves are not appreciably different from tool to tool. Yes, I know that some tools have extra knobs and options, but backprop is backprop, the Gini index is the Gini index, Entropy is Entropy. The only reason I would have both KXEN and SAS/EM or Clementine is if I wanted the automation of KXEN sometimes, and the full control of EM or Clementine at other times (it is hard for me to imagine why I would want both Clementine and EM--any takers on this one?).
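
To make the point concrete, here is a minimal sketch in Python of the two splitting criteria as they are usually defined on a node's class proportions. The function names are mine, not any vendor's API; whatever the tool, the arithmetic is the same.

```python
import math

def gini_index(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# An evenly mixed two-class node is as impure as a two-class node can be.
print(gini_index([0.5, 0.5]))  # 0.5
print(entropy([0.5, 0.5]))     # 1.0

# A pure node scores zero on both measures.
print(gini_index([1.0, 0.0]))  # 0.0
print(entropy([1.0, 0.0]))     # 0.0
```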

Monday, March 16, 2009

eMetrics Conference

Early-bird pricing ends Friday for the May 4-7 eMetrics conference in San Jose. You get a 12% discount if you use the promo code ABBOTT12 (don't worry, I don't get anything except the satisfaction that a reader of this blog got a discount). I can't go, but hope to get to one before too long.