<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5652924</id><updated>2012-01-28T20:23:18.851-08:00</updated><category term='ethics'/><category term='Do'/><category term='data mining software'/><category term='data mining training'/><category term='data mining'/><category term='documentation'/><category term='books'/><category term='DIY'/><category term='theology'/><category term='Rexer Analytics'/><category term='privacy'/><category term='art'/><category term='analytics'/><category term='algorithms'/><category term='eigenanalysis'/><category term='open source'/><category term='data mining data'/><category term='bioinformatics'/><category term='organizational'/><category term='data mining perceptions'/><category term='trends'/><category term='classification'/><category term='data mining survey'/><category term='practice'/><category term='job'/><category term='link analysis'/><category term='spam'/><category term='data reduction'/><category term='special values'/><category term='missing values'/><category term='Do Not'/><category term='Data evaluation'/><category term='humor'/><category term='baseball'/><category term='torture'/><category term='business'/><category term='plot'/><category term='idempotnence'/><category term='Mac-PC compatibility'/><category term='predictive analytics'/><category term='data mining conferences'/><category term='survey analysis'/><category term='outliers'/><category term='graphics'/><category term='parody'/><category term='predictive'/><category term='hate'/><category term='ensembles'/><category term='decisions'/><category term='error metrics'/><category term='whimsical'/><category term='business 
understanding'/><category term='critical junctures'/><category term='integration'/><category term='problem deifition'/><category term='software'/><category term='insurance'/><category term='regular expressions'/><category term='modeling'/><category term='statistics'/><category term='testing'/><category term='data mining degree'/><category term='data mining books'/><category term='conferences'/><category term='competitions'/><category term='Data understanding'/><category term='model performance'/><category term='MSE'/><category term='data mining users'/><category term='rare events'/><category term='text mining'/><category term='graphs'/><category term='Dorian Pyle'/><category term='business intelligence'/><category term='graph'/><category term='orgranizations'/><category term='risk'/><category term='data preparation'/><category term='Business analytics'/><category term='data selection'/><category term='decision trees'/><category term='data visualization'/><category term='survey'/><category term='model selection'/><category term='graphing'/><category term='feature selection'/><category term='Powerpoint'/><category term='prediction'/><category term='science'/><category term='principal components analysis'/><category term='computer science'/><category term='logistic regression'/><category term='theory'/><category term='idempotent'/><category term='distributions'/><category term='programming'/><category term='random'/><category term='practitioner'/><category term='careers'/><category term='chart'/><category term='webinars'/><category term='literature'/><category term='principal component'/><category term='data mining education'/><category term='economics'/><category term='jobs'/><category term='model assessment'/><category term='KDD'/><category term='data mining vs. 
statistics'/><category term='career'/><category term='missing data'/><category term='theorist'/><category term='ROC'/><category term='machine learning'/><category term='uplift'/><category term='Predictive Analytics World'/><category term='R'/><category term='PCA'/><category term='sampling'/><category term='dm radio'/><category term='morality'/><title type='text'>Data Mining and Predictive Analytics</title><subtitle type='html'>Tips, tricks, and comments in data mining and predictive analytics, including data preprocessing, visualization, modeling, and model deployment.

Hosted by Dean Abbott, Abbott Analytics</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default?start-index=101&amp;max-results=100'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>155</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5652924.post-4047565637769503625</id><published>2012-01-05T16:44:00.000-08:00</published><updated>2012-01-05T16:44:42.277-08:00</updated><title type='text'>Top 5 Posts from 2011</title><content type='html'>By far, the most visited post of 2011 was the "&lt;a href="http://abbottanalytics.blogspot.com/2011/06/what-do-data-miners-need-to-learn.html"&gt;What Do Data Miners Need to Learn&lt;/a&gt;" post from June. &lt;br /&gt;&lt;br /&gt;The top five visited posts that were first posted in 2011 are (with actual ranks for all posts):&lt;br /&gt;1. &lt;a href="http://abbottanalytics.blogspot.com/2011/06/what-do-data-miners-need-to-learn.html"&gt;What Do Data Miners Need to Learn&lt;/a&gt;&lt;br /&gt;2. &lt;a href="http://abbottanalytics.blogspot.com/2011/11/statistical-rules-of-thumb-part-iii.html"&gt;Statistical Rules of Thumb, Part III&lt;/a&gt;&lt;br /&gt;3. 
&lt;a href="http://abbottanalytics.blogspot.com/2011/04/statistical-rules-of-thumb-part-ii.html"&gt;Statistical Rules of Thumb, Part II&lt;/a&gt;&lt;br /&gt;4. &lt;a href="http://abbottanalytics.blogspot.com/2011/05/number-of-hidden-layer-neurons-to-use.html"&gt;Number of Hidden Layer Neurons to Use&lt;/a&gt;&lt;br /&gt;5. &lt;a href="http://abbottanalytics.blogspot.com/2011/03/statistics-need-for-integration.html"&gt;Statistics: The Need for Integration&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The six most-viewed posts in 2011 that were originally created before 2011 were:&lt;br /&gt;1. &lt;a href="http://abbottanalytics.blogspot.com//2009/04/why-normalization-matters-with-k-means.html"&gt;Why Normalization Matters with K-Means&lt;/a&gt; (2009)&lt;br /&gt;2. &lt;a href="http://abbottanalytics.blogspot.com/2006/11/free-and-inexpensive-data-mining.html"&gt;Free and Inexpensive Data Mining Software&lt;/a&gt; (2006)&lt;br /&gt;3. &lt;a href="http://abbottanalytics.blogspot.com/2008/04/data-mining-data-sets.html"&gt;Data Mining Data Sets&lt;/a&gt; (2008)&lt;br /&gt;4. &lt;a href="http://abbottanalytics.blogspot.com/2009/02/can-you-learn-data-mining-in.html"&gt;Can you Learn Data Mining in Undergraduate or Graduate School&lt;/a&gt; (2009)&lt;br /&gt;5. &lt;a href="http://abbottanalytics.blogspot.com/2007/05/quotes-from-moneyball.html"&gt;Quotes from Moneyball&lt;/a&gt; (2007)&lt;br /&gt;6. &lt;a href="http://abbottanalytics.blogspot.com/2009/12/business-analytics-vs-business.html"&gt;Business Analytics vs. Business Intelligence&lt;/a&gt; (2009)&lt;br /&gt;&lt;br /&gt;The "Free Data Mining Tools" post is understandably still relatively popular, even after 5 years. The Moneyball quotes post has a particularly high bounce rate. 
I'm most surprised that the K-Means normalization post has remained popular for so long.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4047565637769503625?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4047565637769503625/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4047565637769503625' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4047565637769503625'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4047565637769503625'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2012/01/top-5-posts-from-2011.html' title='Top 5 Posts from 2011'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7559753344179494645</id><published>2011-12-28T18:25:00.000-08:00</published><updated>2011-12-28T21:42:36.777-08:00</updated><title type='text'>Models Behaving Badly</title><content type='html'>I just read a fascinating book review in the Wall Street Journal &lt;a href="http://online.wsj.com/article/SB10001424052970203430404577094760894401548.html?KEYWORDS=Physics+Envy"&gt;Physics Envy: Models Behaving Badly&lt;/a&gt;. 
The author of the book, Emanuel Derman (former head of Quantitative Analysis at Goldman Sachs) argues that the financial models involved human beings and therefore were inherently brittle: as human behavior changed, the models failed. "in physics you're playing against God, and He doesn't change His laws very often. In finance, you're playing against God's creatures." &lt;br /&gt;&lt;br /&gt;I'll agree with Derman that whenever human beings are in the loop, data suffers. People change their minds based on information not available to the models.&lt;br /&gt;&lt;br /&gt;I also agree that human behavioral modeling is not the same as physical modeling. We can use the latter to provide motivation and even mathematics for human behavioral modeling, but we should not take this too far. A simple example is this: purchase decisions sometimes depend not on the person's propensity to purchase alone, but also on whether or not they had an argument that morning, or if they just watched a great movie. There is an emotional component that data cannot reflect. People therefore behave in ways that on the surface are contradictory, seemingly "random", which is why response rates of 1% can be "good".&lt;br /&gt;&lt;br /&gt;However, I bristle a bit at the emphasis on the physics analogy. In closed systems, models can explain everything. But once one opens up the world, even physical models are imperfect because they often do not incorporate &lt;i&gt;all&lt;/i&gt; the information available. For example, missile guidance is based on pure physics: move a surface on a wing and one can change the trajectory of the missile. There are equations of motion that describe exactly where the missile will go. There is no mystery here. &lt;br /&gt;&lt;br /&gt;However, all operational missile guidance systems are "closed loop"; the guidance command sequence is not completely scheduled but is updated throughout the flight. Why? 
To compensate for unexpected effects of the guidance commands, often due to ballistic winds, thermal gradients, or other effects on the physical system. It is the closed-loop corrections that make missile guidance work. The exact same principle applies to your car's cruise control, chasing down a fly ball in baseball, or even just walking down the street. &lt;br /&gt;&lt;br /&gt;For a predictive model to be useful long-term, it needs updating to correct for changes in the population the models are applied to, whether the models be for customer acquisition, churn, fraud detection, or any other purpose. The "closed-loop" typical in data mining is called "model updating" and is critical for long-term modeling success. &lt;br /&gt;&lt;br /&gt;The question then becomes this: can the models be updated quickly enough to compensate for changes in the population? If a missile can only be updated at 10Hz (10x / sec.) but uncertainties affect the trajectory significantly in milliseconds, the closed-loop actions may be insufficient to compensate. If your predictive model can only be updated monthly, but your customer behavior changes significantly on a weekly basis, your models will perpetually lag behind. Measuring the effectiveness of model predictions is therefore critical in determining the frequency of model updating necessary in your organization.&lt;br /&gt;&lt;br /&gt;To be fair, until I read the book I can have no quibble with its arguments. The arguments here are based solely on the book review and some ideas they prompted in my mind. 
I'd welcome comments from anyone who has read the book already.&lt;br /&gt;&lt;br /&gt;The book can be found on amazon &lt;a href="http://www.amazon.com/Models-Behaving-Badly-Confusing-Illusion-Reality-Disaster/dp/1439164983/ref=sr_1_1?ie=UTF8&amp;qid=1325124923&amp;sr=8-1"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;UPDATE: Aaron Lai &lt;a href="https://sites.google.com/site/aaron00page/home/mypapers/ParadigmLost.pdf?attredirects=0&amp;d=1"&gt;wrote an article for CFA Magazine&lt;/a&gt; on the same topic, also quoting Derman. I commend the article to all (note: this is a PDF file download).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7559753344179494645?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7559753344179494645/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7559753344179494645' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7559753344179494645'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7559753344179494645'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/12/models-behaving-badly.html' title='Models Behaving Badly'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7514769420610119420</id><published>2011-11-04T15:36:00.000-07:00</published><updated>2011-11-04T15:36:41.741-07:00</updated><title 
type='text'>Statistical Rules Of Thumb, part III: Always Visualize the Data</title><content type='html'>As I perused &lt;a href="http://www.amazon.com/Statistical-Rules-Thumb-Probability-Statistics/dp/0470144483/ref=sr_1_1?ie=UTF8&amp;qid=1320444722&amp;sr=8-1"&gt;Statistical Rules of Thumb&lt;/a&gt; again, as I do from time to time, I came across this gem. (note: I live in CA, so get no money from these amazon links).&lt;br /&gt;&lt;br /&gt;Van Belle uses the term "Graph" rather than "Visualize", but it is the same idea. The point is to visualize &lt;i&gt;in addition to&lt;/i&gt; computing summary statistics. Summaries are useful, but can be deceiving; any time you summarize data you will lose some information unless the distributions are well behaved. The scatterplot, histogram, box and whiskers plot, etc. can reveal ways the summaries can fool you. I've seen these as well, especially variables with outliers or that are bi- or tri-modal.&lt;br /&gt;&lt;br /&gt;One of the most famous examples of this effect is &lt;a href="http://en.wikipedia.org/wiki/Anscombe's_quartet"&gt;Anscombe's Quartet&lt;/a&gt;. 
I'm including the Wikipedia image of the plots here:&lt;br /&gt;&lt;br /&gt; &lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-v4Q0d2DdoMg/TrRnqRP6osI/AAAAAAAAAIc/Dh2Uoe6XM0Q/s1600/800px-Anscombe%2527s_quartet_3.svg.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="291" width="400" src="http://2.bp.blogspot.com/-v4Q0d2DdoMg/TrRnqRP6osI/AAAAAAAAAIc/Dh2Uoe6XM0Q/s400/800px-Anscombe%2527s_quartet_3.svg.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;All four datasets have essentially the same mean of x, mean of y, x standard deviation, y standard deviation, x-y Pearson correlation coefficient, and regression line of y on x, so the summaries don't reveal the differences in the data.&lt;br /&gt;&lt;br /&gt;I use correlations a lot to get the gist of the relationships in the data, and I've seen how correlations can deceive. In one project, we had 30K data points with a correlation of 0.9+. When we removed just 100 of these data points (the largest magnitudes of x and y), the correlation shrank to 0.23.&lt;br /&gt;&lt;br /&gt;Most data mining software has ways to visualize data easily now. 
Avail yourself of them to avoid subsequent surprises in your data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7514769420610119420?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7514769420610119420/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7514769420610119420' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7514769420610119420'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7514769420610119420'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/11/statistical-rules-of-thumb-part-iii.html' title='Statistical Rules Of Thumb, part III: Always Visualize the Data'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-v4Q0d2DdoMg/TrRnqRP6osI/AAAAAAAAAIc/Dh2Uoe6XM0Q/s72-c/800px-Anscombe%2527s_quartet_3.svg.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-2516252329487969788</id><published>2011-07-29T17:02:00.000-07:00</published><updated>2011-07-29T21:23:41.334-07:00</updated><title type='text'>Yet another "Wisdom of Crowds" success</title><content type='html'>I was at the Federal Building in downtown San Diego for a consulting job, and met some representatives for a life and disability insurance company who were giving away a big-screen HD TV for the 
individual who came closest to guessing the number of M&amp;Ms (chocolate and peanut butter filled) in a container. Because they do this often, I won't show the specific container they use. &lt;br /&gt;&lt;br /&gt;I offered to make a guess of the total, but only if I could see all of the guesses so far. I was drawing from the &lt;a href="http://www.amazon.com/Wisdom-Crowds-James-Surowiecki/dp/0385721706"&gt;Wisdom of Crowds&lt;/a&gt; example from Chapter 1 of the book where a set of independent guesses tend to outperform even an expert's best guess. I've done the same experiment many times in data mining courses I've taught, and have found the same phenomenon.&lt;br /&gt;&lt;br /&gt;I collected data from 77 individuals (including myself) shown here (sorted for convenience, but this makes no difference in the analysis):&lt;br /&gt;37&lt;br /&gt;625&lt;br /&gt;772&lt;br /&gt;784&lt;br /&gt;875&lt;br /&gt;888&lt;br /&gt;903&lt;br /&gt;929&lt;br /&gt;983&lt;br /&gt;987&lt;br /&gt;1001&lt;br /&gt;1015&lt;br /&gt;1040&lt;br /&gt;1080&lt;br /&gt;1080&lt;br /&gt;1124&lt;br /&gt;1245&lt;br /&gt;1250&lt;br /&gt;1450&lt;br /&gt;1500&lt;br /&gt;1536&lt;br /&gt;1596&lt;br /&gt;1600&lt;br /&gt;1774&lt;br /&gt;1875&lt;br /&gt;1929&lt;br /&gt;1972&lt;br /&gt;1976&lt;br /&gt;1995&lt;br /&gt;2000&lt;br /&gt;2012&lt;br /&gt;2033&lt;br /&gt;2143&lt;br /&gt;2150&lt;br /&gt;2200&lt;br /&gt;2221&lt;br /&gt;2235&lt;br /&gt;2251&lt;br /&gt;2321&lt;br /&gt;2331&lt;br /&gt;2412&lt;br /&gt;2500&lt;br /&gt;2500&lt;br /&gt;2550&lt;br /&gt;2571&lt;br /&gt;2599&lt;br /&gt;2672&lt;br /&gt;2714&lt;br /&gt;2735&lt;br /&gt;2777&lt;br /&gt;2777&lt;br /&gt;2803&lt;br /&gt;2832&lt;br /&gt;2873&lt;br /&gt;2931&lt;br /&gt;3001&lt;br /&gt;3101&lt;br /&gt;3250&lt;br /&gt;3333&lt;br /&gt;3362&lt;br /&gt;3500&lt;br /&gt;3500&lt;br /&gt;3501&lt;br /&gt;3501&lt;br /&gt;3583&lt;br /&gt;3661&lt;br /&gt;3670&lt;br /&gt;3697&lt;br /&gt;3832&lt;br /&gt;3872&lt;br /&gt;4280&lt;br /&gt;4700&lt;br /&gt;4797&lt;br 
/&gt;5205&lt;br /&gt;5225&lt;br /&gt;5257&lt;br /&gt;9886&lt;br /&gt;10000&lt;br /&gt;187952&lt;br /&gt;&lt;br /&gt;Note there are a few flaky ones in the lot. The last two were easy to spot (so I put them at the bottom of my list). The idea of course is to just take the average of the guesses.&lt;br /&gt;&lt;br /&gt;Average all: 4932&lt;br /&gt;Average all without 37 and 187952: 2626&lt;br /&gt;&lt;br /&gt;Then I looked at the histogram and decided that the guesses close to 10000 were also too flaky to include:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-NXcDTuExiB4/TjNHy3qpeGI/AAAAAAAAAH8/N2BR8GVsy_g/s1600/mnms_histogram1.png" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="378" width="400" src="http://2.bp.blogspot.com/-NXcDTuExiB4/TjNHy3qpeGI/AAAAAAAAAH8/N2BR8GVsy_g/s400/mnms_histogram1.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;So I removed all data points greater than 8000, which took away 2 samples, leaving this histogram and a mean of 2436.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-h68PznfHHOs/TjNINiiLgiI/AAAAAAAAAIE/f9IasMygFRg/s1600/mnms_histogram2.png" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="378" width="400" src="http://3.bp.blogspot.com/-h68PznfHHOs/TjNINiiLgiI/AAAAAAAAAIE/f9IasMygFRg/s400/mnms_histogram2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;So now for the outcome:&lt;br /&gt;Actual Count: 2464&lt;br /&gt;Average of trimmed sample: 2436 (error 28)&lt;br /&gt;Best individual guess: 2500 (error 36)&lt;br /&gt;&lt;br /&gt;So amazingly, the average won, though I wouldn't have been disappointed at all if it finished 3rd or 4th because it still would have been a great guess.&lt;br /&gt;&lt;br /&gt;Wisdom of Crowds wins again!&lt;br /&gt;&lt;br /&gt;PS I 
reported to the insurance agents a guess of 2423 because I had omitted my original guess (provided before looking at any other guesses--2550 if you must know) and my co-worker's guess of 3250, so these helped bring up the mean a bit. The Average would have lost (barely) if I had not included them.&lt;br /&gt;&lt;br /&gt;PPS So how will they split the winnings since two guessed the same value? I won't recommend the saw approach. I hope they ask each of the two guessers to modify their guess, and require that each guess change by at least one.&lt;br /&gt;&lt;br /&gt;PPPS Note: the charts were done using &lt;a href="http://www.jmp.com/software/pro/"&gt;JMP Pro 9&lt;/a&gt; for the Macintosh&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-2516252329487969788?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/2516252329487969788/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=2516252329487969788' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2516252329487969788'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2516252329487969788'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/07/yet-another-wisdom-of-crowds-success.html' title='Yet another &quot;Wisdom of Crowds&quot; success'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/-NXcDTuExiB4/TjNHy3qpeGI/AAAAAAAAAH8/N2BR8GVsy_g/s72-c/mnms_histogram1.png' height='72' width='72'/><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1809764257291892486</id><published>2011-06-13T15:56:00.000-07:00</published><updated>2011-06-27T17:09:55.598-07:00</updated><title type='text'>What do Data Miners Need to Learn?</title><content type='html'>I've been asked by several folks recently what they need to learn to succeed in data mining and predictive analytics. This is a different twist on the question I also get, namely what degree should one get to be a good (albeit "green") data miner. Usually, the latter question gets the answer "it doesn't matter" because I know so many great data miners without a statistics or mathematics degree. Understandably, there are many non-stats/math degrees that have a very strong statistics or mathematics component, such as psychology, political science, and engineering to name a few. But then again, you don't necessarily have to load up on the stats/math courses in these disciplines either.&lt;br /&gt;&lt;br /&gt;So the question of "what to learn" applies across majors whether undergraduate or graduate. Of course statistics and machine learning courses are directly applicable. However, the answer I've been giving recently to the question of what new data miners need to learn (assuming they will learn algorithms) has centered on two other topics: databases and business. &lt;br /&gt;&lt;br /&gt;I had no specific coursework or experience in either when I began my career. In the 80s, databases were not as commonplace in the DoD world where I began my career; we usually worked with flat files provided to us by a customer, even if these files were quite large. Now, most customers I work with have their data stored in databases or data marts, and as a result, we data miners often must lean on DBAs or an IT layer of people to get at the data. 
This would be fine except that (1) the data that is provided to data miners is often not the complete data we need or at least would like to have before building models, (2) we sometimes won't know how valuable data is until we look at it, and (3) communication with IT is often slow and laden with political issues inherent in many organizations. &lt;br /&gt;&lt;br /&gt;On the other hand, IT is often reluctant to give analysts significant freedom to query databases because of the harm they can do (wise!), since data miners in general have a poor understanding of how databases work and which queries are dangerous or computationally expensive.&lt;br /&gt;&lt;br /&gt;Therefore, I am increasingly of the opinion that a master's program in data mining, or a data mining certificate program, should contain at least one course on databases (which should include some database design component, but for the most part should emphasize a user's perspective). It is probably more realistic to require this for a degree than a certificate, but it could be included in both. I know that for me, in considering new hires, SQL or SAS experience would give a candidate an advantage.&lt;br /&gt;&lt;br /&gt;For the second issue, business experience, there are some that might be concerned that "experience" is too narrow for a degree program. After all, if someone has experience in building response models, what good would that do for Paypal if they are looking to build fraud models? My reply is "a lot"! Building models on real data (meaning messy) to solve a real problem (meaning identifying a target variable that conveys the business decision to be improved) requires a thought process that isn't related to knowing algorithms or data. 
&lt;br /&gt;&lt;br /&gt;Building "real-world" models requires a translation of business objectives to data mining objectives (as described in the Business Understanding section of &lt;a href="http://www.crisp-dm.org"&gt;CRISP-DM&lt;/a&gt;, &lt;a href="http://www.crisp-dm.org/CRISPWP-0800.pdf"&gt;pdf here&lt;/a&gt;). When I have interviewed young data miners in the past, it is those who have had to go through this process that are better prepared to begin the job right away, and it is those who recognize the value here who do better at solving problems in a way that impacts decisions rather than finding cool, innovative solutions that never see the light of day. (UPDATE: the crisp-dm.org site is no longer up--see comments section. The CRISP-DM 1.0 document however can still be downloaded &lt;a href="ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf"&gt;here&lt;/a&gt;, with higher resolution graphics, by the way!)&lt;br /&gt;&lt;br /&gt;My challenge to the universities who are adding degree programs in data mining and predictive analytics, or are offering Certificate programs is then to include courses on how to access data (databases), and how to solve problems (business objectives, perhaps by offering a practicum with a local company).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1809764257291892486?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1809764257291892486/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1809764257291892486' title='15 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1809764257291892486'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5652924/posts/default/1809764257291892486'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/06/what-do-data-miners-need-to-learn.html' title='What do Data Miners Need to Learn?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>15</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-2384287810495853873</id><published>2011-05-05T21:27:00.000-07:00</published><updated>2011-05-05T21:27:27.605-07:00</updated><title type='text'>Number of Hidden Layer Neurons to Use</title><content type='html'>In the &lt;a href="http://linkd.in/k3OCAf"&gt;linkedin.com Artificial Neural Networks &lt;/a&gt; group, a question arose about how many hidden neurons one should choose. I've never found a fully satisfactory answer to this, but there are quite a lot of guesses and rules of thumb out there.&lt;br /&gt;&lt;br /&gt;I've always liked Warren Sarle's &lt;a href="ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu"&gt;neural network FAQ&lt;/a&gt; that includes a discussion on this topic. &lt;br /&gt;&lt;br /&gt;There is another reference on the web that I agree with only about 50%, but the references are excellent: &lt;a href="http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html"&gt;http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;My personal preference is to use software that experiments with multiple architectures and selects the one that performs best on held-out data. Better still are the algorithms that also select (i.e. prune) inputs as well. 
As I teach in my courses, I've spent far too many hours in my life selecting neural network architectures and re-training, so I'd much rather let the software do it for me.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-2384287810495853873?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/2384287810495853873/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=2384287810495853873' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2384287810495853873'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2384287810495853873'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/05/number-of-hidden-layer-neurons-to-use.html' title='Number of Hidden Layer Neurons to Use'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-6274289495370726630</id><published>2011-04-25T22:34:00.000-07:00</published><updated>2011-04-25T22:34:49.415-07:00</updated><title type='text'>Statistical Rules of Thumb, part II</title><content type='html'>A while back, &lt;a href="http://abbottanalytics.blogspot.com/2008/10/two-books-of-interest.html"&gt;Will Dwinnell posted&lt;/a&gt; on two books, one of which is one of my favorites as well: &lt;br /&gt;&lt;blockquote&gt;&lt;iframe 
src="http://rcm.amazon.com/e/cm?lt1=_blank&amp;bc1=000000&amp;IS2=1&amp;bg1=FFFFFF&amp;fc1=000000&amp;lc1=0000FF&amp;t=dataminiandpr-20&amp;o=1&amp;p=8&amp;l=as4&amp;m=amazon&amp;f=ifr&amp;ref=ss_til&amp;asins=0470144483" style="width:120px;height:240px;" scrolling="no" marginwidth="0" marginheight="0" frameborder="0"&gt;&lt;/iframe&gt;&lt;/blockquote&gt;&lt;br /&gt;Will mentioned a few general topics covered in the book, but I thought I would mention a few specific ones that I agree with wholeheartedly.&lt;br /&gt;&lt;br /&gt;7.3: Always Graph the Data&lt;br /&gt;In this section he quotes &lt;a href="http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003mW&amp;topic_id=1"&gt;E.R. Tufte&lt;/a&gt; as follows (Abbott quoting van Belle quoting Tufte):&lt;br /&gt;&lt;blockquote&gt;Graphical Excellence is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the shortest space.&lt;/blockquote&gt;&lt;br /&gt;I'm not so sure I agree with the superlatives, but I certainly agree with the gist that excellence in graphics is parsimonious, clear, insightful, and informationally rich. Contrast this with another rule of thumb:&lt;br /&gt;&lt;br /&gt;7.4: Never use a Pie Chart&lt;br /&gt;Well, that's not exactly rocket science; pie charts have lots of detractors... The only thing worse than a pie chart is a 3-D pie chart!&lt;br /&gt;&lt;br /&gt;7.6: Stacked Barcharts are Worse than Bargraphs.&lt;br /&gt;Perhaps the biggest problem with stacked bar graphs (such as the one here) is that you cannot clearly see the comparison between the colored values in the bins. 
&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-zGa1sgmkKPQ/TbZXsZ49tJI/AAAAAAAAAGQ/Own8nk14CDs/s1600/Histogram%2Bof%2BTransactionDate.jpg" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="204" width="400" src="http://3.bp.blogspot.com/-zGa1sgmkKPQ/TbZXsZ49tJI/AAAAAAAAAGQ/Own8nk14CDs/s400/Histogram%2Bof%2BTransactionDate.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;(A good summary of why they are problematic is in Stephen Few's newsletter, which you can download &lt;a href="http://www.perceptualedge.com/articles/visual_business_intelligence/displays_for_combining_time-series_and_part-to-whole.pdf"&gt;here&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;I have found that data shown in a chart like this can be shown better in a table, perhaps with some conditional formatting (in Excel) or other color coding to push the eye toward the key differences in values. For continuous data, this often means binning a variable (akin to the histogram) and creating a cross-tab. 
The key is clarity--make the table so that the key information is obvious.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-6274289495370726630?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/6274289495370726630/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=6274289495370726630' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6274289495370726630'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6274289495370726630'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/04/statistical-rules-of-thumb-part-ii.html' title='Statistical Rules of Thumb, part II'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-zGa1sgmkKPQ/TbZXsZ49tJI/AAAAAAAAAGQ/Own8nk14CDs/s72-c/Histogram%2Bof%2BTransactionDate.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3060120501331422949</id><published>2011-04-19T13:58:00.000-07:00</published><updated>2011-04-19T13:59:29.450-07:00</updated><title type='text'>Rexer Analytics data mining survey</title><content type='html'>Rexer Analytics, a data mining consulting firm, is conducting their 5th annual survey of the analytic behaviors, views and preferences of data mining professionals. 
I urge all of you to respond to the survey and help us all better understand the nature of the data mining and predictive analytics industry. The following text contains their instructions and overview.&lt;br /&gt;&lt;br /&gt;If you want to skip the verbiage and just get on with the survey, use code RL3X1 and go &lt;a href="http://www.RexerAnalytics.com/Data-Miner-Survey-2011-Intro2.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Your responses are completely confidential: no information you provide on the survey will be shared with anyone outside of Rexer Analytics.  All reporting of the survey findings will be done in the aggregate, and no findings will be written in such a way as to identify any of the participants.  This research is not being conducted for any third party, but is solely for the purpose of Rexer Analytics to disseminate the findings throughout the data mining community via publication, conference presentations, and personal contact. &lt;br /&gt;&lt;br /&gt;To participate, please click on the link below and enter the access code in the space provided.  The survey should take approximately 20 minutes to complete.  Anyone who has had this email forwarded to them should use the access code in the forwarded email.&lt;br /&gt;&lt;br /&gt;Survey Link:  &lt;a href="http://www.RexerAnalytics.com/Data-Miner-Survey-2011-Intro2.html"&gt;www.RexerAnalytics.com/Data-Miner-Survey-2011-Intro2.html&lt;/a&gt;&lt;br /&gt;Access Code:  RL3X1 &lt;br /&gt;&lt;br /&gt;If you would like a summary of last year’s or this year’s findings emailed to you, there will be a place at the end of the survey to leave your email address.  You can also email us directly (DataMinerSurvey@RexerAnalytics.com) if you have any questions about this research or to request research summaries.  Here are links to the highlights of the previous years’ surveys.  
Contact us if you want summary reports from any of these years.&lt;br /&gt;-- 2010 survey highlights:  http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html&lt;br /&gt;-- 2009 survey highlights:  http://www.rexeranalytics.com/Data-Miner-Survey-Results-2009.html&lt;br /&gt;-- 2008 survey highlights:  http://www.rexeranalytics.com/Data-Miner-Survey-Results-2008.html&lt;br /&gt;-- 2007 survey highlights:  http://www.rexeranalytics.com/Data-Miner-Survey-Results.html&lt;br /&gt;&lt;br /&gt;Thank you for your time.  We hope this research program continues to provide useful information to the data mining community.  &lt;br /&gt;&lt;br /&gt;Sincerely,&lt;br /&gt;&lt;br /&gt;Karl Rexer, PhD&lt;br /&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3060120501331422949?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3060120501331422949/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3060120501331422949' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3060120501331422949'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3060120501331422949'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/04/rexer-analytics-data-mining-survey.html' title='Rexer Analytics data mining survey'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3276486919908409952</id><published>2011-04-11T16:36:00.000-07:00</published><updated>2011-04-11T16:36:15.916-07:00</updated><title type='text'>Predictive Models are not Statistical Models — JT on EDM</title><content type='html'>This post first appeared on &lt;a href="http://jtonedm.com/2011/04/11/predictive-models-are-not-statistical-models/"&gt;Predictive Models are not Statistical Models — JT on EDM&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;My friend and colleague James Taylor asked me last week to comment on a question regarding statistics vs. predictive analytics. The bulk of my reply is on &lt;a href="http://jtonedm.com/2011/04/11/predictive-models-are-not-statistical-models/"&gt;James' blog&lt;/a&gt;; my full reply is here, re-worked from my initial response to clarify some points further.&lt;br /&gt;&lt;br /&gt;I have always loved reading the green "Sage" books, such as &lt;a href="http://www.amazon.com/gp/product/080394263X/ref=as_li_ss_tl?ie=UTF8&amp;tag=dataminiandpr-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=080394263X"&gt;Understanding Regression Assumptions (Quantitative Applications in the Social Sciences)&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=&amp;l=as2&amp;o=1&amp;a=080394263X" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /&gt;&lt;br /&gt;or &lt;a href="http://www.amazon.com/gp/product/0761916725/ref=as_li_ss_tl?ie=UTF8&amp;tag=dataminiandpr-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0761916725"&gt;Missing Data (Quantitative Applications in the Social Sciences)&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=&amp;l=as2&amp;o=1&amp;a=0761916725" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;"/&gt; because they are brief, cover a single 
topic, and are well-written. As a data miner, though, I am also somewhat amused reading them because they are obviously written by statisticians with the mindset that the &lt;i&gt;model is king&lt;/i&gt;. This means that we either pre-specify a model (the hypothesis) or require the model be fully interpretable, fully representing the process we are modeling. When the model is king, it's as if there is a model in the ether that we as modelers must find, and if we get coefficients in the model "wrong", or if the model errors are "wrong", we have to rebuild the data and then the model to get it all right.&lt;br /&gt;&lt;br /&gt;In data mining and predictive analytics, the &lt;i&gt;data is king&lt;/i&gt;. Here the algorithms often derive the model itself from the data (decision trees do this), and even when they only fit coefficients (like neural networks), it's the accuracy that matters rather than the coefficients. Often, in the data mining world, we won't have to explain precisely &lt;i&gt;why&lt;/i&gt; individuals behave as they do so long as we can explain generally &lt;i&gt;how&lt;/i&gt; they will behave. 
Model interpretation is often related to describing trends (sensitivity or importance of variables).&lt;br /&gt;&lt;br /&gt;I have always found David Hand's summaries of the two disciplines very useful, such as this one &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.5404&amp;rep=rep1&amp;type=pdf"&gt;here&lt;/a&gt;; I found that he had a healthy respect for both disciplines.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3276486919908409952?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3276486919908409952/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3276486919908409952' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3276486919908409952'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3276486919908409952'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/04/predictive-models-are-not-statistical.html' title='Predictive Models are not Statistical Models — JT on EDM'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-576044526316813303</id><published>2011-03-29T09:43:00.000-07:00</published><updated>2011-03-29T10:07:23.921-07:00</updated><title type='text'>Analyzing the Results of Analysis</title><content type='html'>Sometimes, the output of analytical tools can be voluminous and complicated.  
Making sense of it sometimes requires, well, analysis.  Following are two examples of applying our tools to their own output.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Model Deployment Verification&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;From time to time, I have deployed predictive models on a vertical application in the finance industry which is not exactly "user friendly".  I have virtually no access to the actual deployment and execution processes, and am largely limited to examining the production-mode output, as implemented on the system in question.  &lt;br /&gt;&lt;br /&gt;As sometimes happens, the model output does not match my original specification.  While the actual deployment is not my individual responsibility, it very much helps if I can indicate where the likely problem is.  As these models are straightforward linear or generalized linear models (with perhaps a few input data transformations), I have found it useful to calculate the correlation between each of the input variables and the difference between the deployed model output and my own calculated model output.  The logic is that input variables with a higher correlation with the deployment error are more likely to be calculated incorrectly.  While this trick is not a cure-all, it quickly identifies the culprit data elements in 80% or more of cases.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Model Stability Over Time&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A bedrock premise of all analytical work is that the future will resemble the past.  After all, if the rules of the game keep changing, then there's little point in learning them.  
Specifically in predictive modeling, this premise requires that the relationship between input and output variables must remain sufficiently stable for discovered models to continue to be useful in the future.&lt;br /&gt;&lt;br /&gt;In a recent analysis, I discovered that models universally exhibited a substantial drop in test performance, when comparing out-of-time to (in-time) out-of-sample.  The relationships between at least some of my candidate input variables and the target variable are presumably changing over time.  In an effort to minimize this issue, I attempted to determine which variables were most susceptible.  I calculated the correlation between each candidate predictor and the target, both for an early time-frame and for a later one.&lt;br /&gt;&lt;br /&gt;My thinking was that variables whose correlation changed the most across time were the least stable and should be avoided.  Note that I was looking for changes in correlation, and not whether correlations were strong or weak.  Also, I regarded strengthening correlations just as suspect as weakening ones: The idea is for the model to perform consistently over time.&lt;br /&gt;&lt;br /&gt;In the end, avoiding the use of variables which exhibited "correlation slide" did weaken model performance, but did ensure that performance did not deteriorate so drastically out-of-time.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Final Thought&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It is interesting to see how useful analytical tools can be when applied to the analytical process itself.  
I note that solutions like the ones described here need not use fancy tools: Often simple calculations of means, standard deviation and correlations are sufficient.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-576044526316813303?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/576044526316813303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=576044526316813303' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/576044526316813303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/576044526316813303'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/03/analyzing-results-of-analysis.html' title='Analyzing the Results of Analysis'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8117537045903714336</id><published>2011-03-06T04:40:00.000-08:00</published><updated>2011-03-10T16:30:36.825-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='integration'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='orgranizations'/><category scheme='http://www.blogger.com/atom/ns#' 
term='organizational'/><category scheme='http://www.blogger.com/atom/ns#' term='business'/><title type='text'>Statistics: The Need for Integration</title><content type='html'>I'd like to revisit an issue we covered here, way back in 2007: &lt;a href="http://abbottanalytics.blogspot.com/2007/10/statistics-why-do-so-many-hate-it.html"&gt;Statistics: Why Do So Many Hate It?&lt;/a&gt;.  Recent comments made to me, both in private conversation ("Statistics?  I hated that class in college!") and in print, prompt me to reconsider this issue.&lt;br /&gt;&lt;br /&gt;One thing which occurs to me is that many people have a tendency to think of statistics in an isolated way.  This world view keeps statistics at bay, as something which is done separately from other business activities, and, importantly, which is done and &lt;b&gt;understood&lt;/b&gt; only by the statisticians.  This is very far from the ideal which I suggest, in which statistics (including data mining) is much more integrated with the business processes of which it is a part.&lt;br /&gt;&lt;br /&gt;In my opinion, this is a strange way to frame statistics.  As an analogy, imagine if, when asked to produce a report, a business team turned to their "English guy", with the expectation that he would do &lt;b&gt;all the writing&lt;/b&gt;.  I am not suggesting that everyone needs to do the heavy lifting that data miners do, but that people who don't should accept some responsibility for data mining's contribution to the business process.  Managers, for example, who throw up their hands with the excuse that "they are not numbers people" forfeit control over an important part of their business function.  
It is healthier for everyone involved, I submit, if statistics moves away from being a black art, and statisticians become less of an arcane priesthood.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8117537045903714336?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8117537045903714336/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8117537045903714336' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8117537045903714336'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8117537045903714336'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/03/statistics-need-for-integration.html' title='Statistics: The Need for Integration'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-185844777966817260</id><published>2011-02-23T22:44:00.000-08:00</published><updated>2011-02-23T22:44:26.351-08:00</updated><title type='text'>The Power of Prescience: Achieving Lift with Predictive Analytics</title><content type='html'>I'll be participating in the DM Radio broadcast tomorrow, &lt;a href="http://www.information-management.com/dmradio/-10019256-1.html"&gt;The Power of Prescience: Achieving Lift with Predictive Analytics&lt;/a&gt;, Thursday, Feb 24 at 3pm ET. 
The best practices that we will be discussing include:&lt;br /&gt;&lt;blockquote&gt;1) properly define the problem to be solved (don’t shoot in the dark); 2) identify a key target variable to predict (must be a good decision-making metric in the company); 3) determine what “good” means, success-wise (what is the baseline for success?); 4) identify the appropriate data that can aid in prediction. There’s also: 5) finding the right algorithms, but this doesn’t matter unless 1-4 are nailed. &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I also plan on talking about the importance of proper perspective in building models. While we &lt;span style="font-style:italic;"&gt;want&lt;/span&gt; predictive models to be good, even excellent, in the end we &lt;span style="font-style:italic;"&gt;need&lt;/span&gt; the models to improve decision-making over what is done currently. I'm not advocating low expectations, just reasonable expectations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-185844777966817260?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/185844777966817260/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=185844777966817260' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/185844777966817260'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/185844777966817260'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/02/power-of-prescience-achieving-lift-with.html' title='The Power of Prescience: Achieving Lift with Predictive Analytics'/><author><name>Dean 
Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-6591436701248665564</id><published>2011-02-16T20:23:00.000-08:00</published><updated>2011-02-17T07:04:09.591-08:00</updated><title type='text'>The Judgement of Watson: Mathematics Wins!</title><content type='html'>Tom Davenport argues in this HBR article &lt;a href="http://blogs.hbr.org/davenport/2011/02/why_im_pulling_for_watson.html"&gt;Why I'm Pulling for Watson - Tom Davenport - Harvard Business Review&lt;/a&gt; that &lt;br /&gt;&lt;blockquote&gt;I want Watson to win. Why? It's elementary: my dear Watson is a triumph of human ingenuity. In other words, there is no way humans can lose this competition. Watson also illustrates that the knowledge, judgment, and insights of the smartest humans can be embedded into automated systems. I suspect that those automated systems will ultimately be used to make better decisions in many domains, and interact with humans in a much more intelligent way. If computers can persuade Alex Trebek that they're very smart—and that's what he said about Watson—they'll be able to interact effectively with almost any human with a problem to solve.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;While this is true, I don't agree that Watson itself is using "judgement" or "making decisions". It appears to me that it is a very nice search engine that incorporates NLP to make these searches more relevant. It isn't giving opinions, synthesizing information to create innovative ideas, or making inferences through extrapolation, all things humans do on a regular basis. This has long been one of my complaints about the way neural networks were described: they "learn", they "think", they "make inferences". 
No, they are nonlinear functions whose weights are found via gradient descent searches. They no more "learn" than logistic regression "learns". &lt;br /&gt;&lt;br /&gt;A lot of the hype gets back to the old "hard AI" vs. "soft AI" debates that have been going on for decades. I very much appreciated the book by Roger Penrose on this subject, &lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;location=http%3A%2F%2Fwww.amazon.com%2Freview%2F0195106466%3Fie%3DUTF8%26ref_%3Dpd_sim_b_cm_cr_acr_img_1%26showViewpoints%3D1&amp;tag=dataminiandpr-20&amp;linkCode=ur2&amp;camp=1789&amp;creative=390957"&gt;Shadows of the Mind: A Search for the Missing Science of Consciousness&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;This isn't to minimize the incredible feat IBM has accomplished with Watson, or on a simpler level, the feats of decision-making that can be performed with nonlinear mathematics in neural networks or support vector machines. These are phenomenal accomplishments that are awe-inspiring mathematically, and on a more practical level will assist us all in the future with improved ability to automate decision-making. Of course, these kinds of decisions are those that do not require innovation or judgement, but can be codified mathematically. Every time I check out at an automatic teller at Home Depot, deposit checks at an ATM, or even make an Amazon purchase, I'm reminded of the depth of technology that makes these complex transactions simple to the user. Watson is the beginning of the next leap in this ongoing technological march forward, all created by enterprising humans who have been able to break down complex behavior into repeatable, reliable, and flexible algorithmic steps.&lt;br /&gt;&lt;br /&gt;In the end, I agree with Mr. 
Davenport, "So whether the humans or Watson win, it means that humans have come out on top."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-6591436701248665564?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/6591436701248665564/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=6591436701248665564' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6591436701248665564'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6591436701248665564'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/02/judgement-of-watson-mathematics-wins.html' title='The Judgement of Watson: Mathematics Wins!'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-6298251570849255739</id><published>2011-02-08T06:29:00.000-08:00</published><updated>2011-02-08T06:29:18.892-08:00</updated><title type='text'>Predictive Analytics Innovation</title><content type='html'>The &lt;a href="http://www.theiegroup.com/Predictive_Analytics/Overview.html"&gt;Predictive Analytics Summit&lt;/a&gt;, a relative newcomer to the Predictive Analytics conference circuit, will be held in San Diego on Feb 24-25. At the first Summit in San Francisco last Fall, I enjoyed several of the talks and the networking. 
This time I will be presenting a fraud detection case study.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-6298251570849255739?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/6298251570849255739/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=6298251570849255739' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6298251570849255739'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6298251570849255739'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/02/predictive-analytics-innovation.html' title='Predictive Analytics Innovation'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-9131985789620829156</id><published>2011-02-07T23:36:00.000-08:00</published><updated>2011-02-07T23:36:05.464-08:00</updated><title type='text'>Webinar with James Taylor -- 10 Best Practices in Operational Analytics</title><content type='html'>I'll be presenting a webinar with &lt;a href="http://decisionmanagementsolutions.com/index.php?option=com_content&amp;view=section&amp;layout=blog&amp;id=4&amp;Itemid=103"&gt;James Taylor&lt;/a&gt; this Wednesday at 10AM PST entitled &lt;a href="http://decisionmanagementsolutions.com/index.php?option=com_content&amp;view=article&amp;id=126&amp;Itemid=134"&gt;"10 best practices in operational 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-2538411581278329842</id><published>2011-01-28T21:15:00.000-08:00</published><updated>2011-01-28T21:15:17.163-08:00</updated><title type='text'>Predictive Analytics World Early-bird ends Monday</title><content type='html'>The earlybird special for Predictive Analytics World / San Francisco ends January 31, 2011 which saves you $200 on the conference rate and $100 on any workshop, including my &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2011/handson_predictive_analytics.php"&gt;Hands-On Predictive Analytics using SAS Enterprise Miner&lt;/a&gt; on March 17th.&lt;br /&gt;&lt;br /&gt;More details on the 7 workshops can be found &lt;a href="http://www.predictiveanalyticsworld.com/blog/?p=325"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Hope to see you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-2538411581278329842?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/2538411581278329842/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=2538411581278329842' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2538411581278329842'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2538411581278329842'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/01/predictive-analytics-world-early-bird.html' title='Predictive Analytics World Early-bird ends Monday'/><author><name>Dean 
Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7995378147213823286</id><published>2011-01-27T20:33:00.000-08:00</published><updated>2011-01-27T20:41:18.385-08:00</updated><title type='text'>Do analytics books sell?</title><content type='html'>Kevin Hillstrom has a fascinating post on brief, technical ebooks (Amazon singles) sold on &lt;a href="http://www.businesswire.com/news/home/20110126006018/en/Kindle-Singles"&gt;Amazon&lt;/a&gt; here: &lt;a href="http://blog.minethatdata.com/2011/01/amazon-singles.html?utm_source=feedburner&amp;amp;utm_medium=twitter&amp;amp;utm_campaign=Feed%3A+MineThatData+%28Kevin+Hillstrom%27s+MineThatData%29&amp;amp;utm_content=Twitter"&gt;Kevin Hillstrom: MineThatData: Amazon Singles&lt;/a&gt;. His points: interesting content is what sells. Length doesn't matter, but these ebooks are typically  less than 50 pages. Price doesn't matter. &lt;br /&gt;&lt;br /&gt;Should I jump in? 
Should you?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7995378147213823286?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7995378147213823286/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7995378147213823286' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7995378147213823286'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7995378147213823286'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/01/do-analytics-books-sell.html' title='Do analytics books sell?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5472144605991158165</id><published>2011-01-22T08:41:00.000-08:00</published><updated>2011-01-22T08:42:53.521-08:00</updated><title type='text'>Doing Data Mining Out of Order</title><content type='html'>I like the &lt;a href="http://www.crisp-dm.org/"&gt;CRISP-DM&lt;/a&gt; process model for data mining, teach from it, and use it on my projects. I commend it to practitioners and managers routinely as an aid during any data mining project. However, while the process sequence is generally the one I use, I don't always; data mining often requires more creativity and "art" to re-work the data than we would like; it would be very nice if we could create a checklist and just run through the list on every project! 
But unfortunately data doesn't always cooperate in this way, and we therefore need to adapt to the specific data problems so that the data is better prepared.&lt;br /&gt;&lt;br /&gt;For example, on a current financial risk project I am working on, the customer is building data for predictive analytics for the first time. The customer is data savvy, but new to predictive analytics, so we've had to iterate several times on how the data is pulled and rolled up out of the database. In particular, the target variable has had to be cleaned up because of historic coding anomalies. &lt;br /&gt;&lt;br /&gt;One primary question to resolve for this project is an all-too-common debate over the right level of aggregation: do we use transactional data even though some customers have many transactions and some have few, or do we roll data up to the customer level to build customer risk models? (A transaction-based model will score each transaction for risk, whereas a customer-based model will score, daily, the risk associated with each customer given the new transactions that have been added.) There are advantages and disadvantages to both, but in this case, we are building a customer-centric risk model for reasons that make sense in this particular business context. &lt;br /&gt;&lt;br /&gt;Back to the CRISP-DM process and why it can be advantageous to deviate from it. In this project, we jumped from Business Understanding and the beginnings of Data Understanding straight to Modeling. In this case, I would call it "modeling" (small 'm') because we weren't building models to predict risk, but rather to understand the target variable better. We were not sure exactly how clean the data was to begin with, especially the definition of the target variable, because no one had ever looked at the data in aggregate before, only on a customer-by-customer basis. 
By building models, and seeing some fields that predict the target variable "too well", we have been able to identify historic data inconsistencies and miscoding. &lt;br /&gt;&lt;br /&gt;Now that we have the target variable better defined, I'm going back to the data understanding and data prep stages to complete those stages properly, and this is changing how the data will be prepped in addition to modifying the definition of the target variable. It's also much more enjoyable to build models than do data prep, so for me this was a "win-win" anyway!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5472144605991158165?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5472144605991158165/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5472144605991158165' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5472144605991158165'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5472144605991158165'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2011/01/doing-data-mining-out-of-order.html' title='Doing Data Mining Out of Order'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-9205274869894594915</id><published>2010-11-04T10:17:00.000-07:00</published><updated>2010-11-04T12:01:10.703-07:00</updated><title type='text'>Predictive Analytics 
Summit - Analytics Titles</title><content type='html'>I'm at the &lt;a href="http://www.theiegroup.com/Predictive_Analytics/Overview.html"&gt;Predictive Analytics Summit&lt;/a&gt; in San Francisco. It is interesting to see the titles of Analytics people at the conference (&lt;a href="http://www.theiegroup.com/Predictive_Analytics/Speakers.html"&gt;here&lt;/a&gt;). They include CTO/Senior/Manager/VP of a variety of analytics variants: Predictive Analytics, Marketing Analytics, just Analytics, Data Analytics, Research &amp; Analytics, Quant Research, etc. Others not here but that I've seen include Business Analytics and the variety of Data Mining titles. &lt;br /&gt;&lt;br /&gt;There has been a lot of hype about data mining and predictive analytics being a great field to be in. Two things interest me: (1) Predictive Analytics is so often part of the title now, lending credence to this term becoming a standard term companies use, and (2) quantitative modeling is described in such a variety of ways. &lt;br /&gt;&lt;br /&gt;This conference is just one of many taking place in a short time period, including &lt;a href="http://predictiveanalyticsworld.com/"&gt;Predictive Analytics World&lt;/a&gt;, &lt;a href="http://www.sas.com/events/dmconf/"&gt;SAS M2010&lt;/a&gt;, &lt;a href="http://www-01.ibm.com/software/data/2010-conference/business-analytics/"&gt;IBM Information on Demand&lt;/a&gt;, the &lt;a href="http://www.teradata-partners.com/Conference"&gt;Teradata Partners conference&lt;/a&gt;, the &lt;a href="http://www.sdsic.org/supermath-conference.aspx"&gt;SuperMath Conference&lt;/a&gt; in San Diego, and the &lt;a href="http://www.sfbayacm.org/?p=1854"&gt;ACM Data Mining Bootcamp&lt;/a&gt; in San Jose. 
Too many to attend all of them!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-9205274869894594915?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/9205274869894594915/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=9205274869894594915' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9205274869894594915'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9205274869894594915'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/11/predictive-analytics-summit-analytics.html' title='Predictive Analytics Summit - Analytics Titles'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-2506367606464804808</id><published>2010-10-28T21:17:00.000-07:00</published><updated>2010-10-28T21:17:48.160-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='humor'/><title type='text'>A humorous explanation of p-values</title><content type='html'>After Will's great post on sample sizes that referenced the youtube video entitled &lt;a href="http://www.youtube.com/watch?v=3opThn_v6rs&amp;feature=related"&gt;Statistics vs. 
Marketing&lt;/a&gt;, I found an equally funny and informative explanation on p-values &lt;a href="http://www.youtube.com/watch?v=ax0tDcFkPic&amp;feature=related"&gt;here&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Aside from the esoteric explanations of what a p-value is, there is a point that I make often with customers that statistical significance (from p-values) is not the same thing as operational significance; just because you find a p-value of less than 0.05 doesn't mean the result is useful for anything! Enjoy.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-2506367606464804808?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/2506367606464804808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=2506367606464804808' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2506367606464804808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2506367606464804808'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/10/humorous-explanation-of-p-values.html' title='A humorous explanation of p-values'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4538774729117816715</id><published>2010-10-28T20:57:00.000-07:00</published><updated>2010-10-28T20:57:17.140-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='humor'/><category 
scheme='http://www.blogger.com/atom/ns#' term='programming'/><title type='text'>From the Archives: A Synopsis of Programming Languages</title><content type='html'>A departure from the usual data mining and predictive analytics posts...&lt;br /&gt;&lt;br /&gt;I was looking at old articles I clipped from the 80s, and came across my favorite programming article from the days I used to program a lot (mostly C, some FORTRAN, sh, csh, tcsh). This one from the C Advisor by &lt;a href="http://en.wikipedia.org/wiki/Ken_Arnold"&gt;Ken Arnold&lt;/a&gt; I found funny then, and still do now. I don't know where these are archived, so I'll just quote an excerpt here:&lt;br /&gt;&lt;br /&gt;C advisor article by Ken Arnold from years and years ago quoting Richard Curtis&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;• FORTRAN was like the fifties: It's rigid and procedural, and doesn't even distinguish between cases. It's motto is "Do my thing".&lt;br /&gt;• C is a real sixties language, because it just doesn't care. It doesn't type check, and it lets you get into as much trouble as you can--you own your own life. C's motto: "Do your own thing".&lt;br /&gt;• Pascal is the seventies. It tries to seize control of the wild and woolly sixties, without getting too restrictive. It thus ends up pleasing no one. It's full of self-justification and self-importance--going from C to Pascal is like going from Janis Joplin to Donna Summer. It is smooth and flashy and useless for major work--truly the John Travolta of programming languages. The Pascal motto is: "Do your thing  my way".&lt;br /&gt;• ADA is the eighties. There is no overarching philosophy; everything is possible, but there is no ethical compass to tell you what ought to be done. (Actually, I know of two things you can't do in ADA, but I'm not telling for fear they'll be added.) It reflects the eighties notion of freedom, which is that you are free to do anything, as long as you do it the way the government wants you to--that is, in ADA. 
It's credo: "Do anything anyway you want".&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4538774729117816715?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4538774729117816715/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4538774729117816715' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4538774729117816715'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4538774729117816715'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/10/from-archives-synopsis-of-programming.html' title='From the Archives: A Synopsis of Programming Languages'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8708924091072333439</id><published>2010-10-24T10:43:00.000-07:00</published><updated>2010-10-24T15:36:03.225-07:00</updated><title type='text'>The Data Budget</title><content type='html'>Larger quantities of data permit greater precision, greater certainty and more detail in analysis.  As observation counts increase, standard errors decrease and the opportunity for more detailed- perhaps more segmented- analysis rises.  
These are things which are obvious to even junior analysts: The standard error of the mean is calculated as the standard deviation divided by the square root of the observation count.&lt;br /&gt;&lt;br /&gt;This general idea may seem obvious when spoken aloud, but it is something to which many non-technical people seem to give little thought.  Ask any non-technical client whether more data will provide a better answer, and the response will be in the affirmative.  It is a simple trend to understand.&lt;br /&gt;&lt;br /&gt;However, people who do not analyze data for a living do not necessarily think about such things in precise terms.  On too many occasions, I have listened to managers or other customers indicate that they wanted to examine data set X and test Y things.  Without performing any calculations, I had strong suspicions that it would not be feasible to test Y things, given the meager size of data set X.  Attempts to explain this have been met with various responses.  To be fair, some of them were constructive acknowledgments of this unfortunate reality, and new expectations were established.  In other cases, I was forced to be the insistent bearer of bad news.&lt;br /&gt;&lt;br /&gt;In one such &lt;a href="http://www.youtube.com/watch?v=3opThn_v6rs"&gt;situation&lt;/a&gt;, a data set with fewer than twenty thousand observations was to be divided among about a dozen direct mail treatments.  Expected response rates were typically in the single-digit percentages, meaning that only a few hundred observations would be available for analysis.  Treatments were to be compared based on various business metrics (customer spending, etc.).  Given the small number of respondents and high variability of this data, I realized that this was unlikely to be productive.  I eventually gave up trying to explain the futility of this exercise, and resigned myself to listening to biweekly explanations of the noisy graphs and summaries.  
One day, though, I noticed that one of the cells contained a single observation!  Yes, much energy and attention was devoted to tracking this "cell" of one individual, which of course would have no predictive value whatsoever.&lt;br /&gt;&lt;br /&gt;It is important for data analysts to make clear the limitations of our craft.  One such limitation is the necessity of sufficient data from which to draw reasonable and useful conclusions.  It may be helpful to indicate this important requirement as the &lt;i&gt;data budget&lt;/i&gt;: "Given the quality and volume of our historical data, we only have the data budget to answer questions about 3 segments, not 12."  Simply saying "We don't have enough data" is not effective (so I have learned through painful experience).  Referring to this issue in terms which others can appreciate may help.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8708924091072333439?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8708924091072333439/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8708924091072333439' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8708924091072333439'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8708924091072333439'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/10/data-budget.html' title='The Data Budget'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' 
src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7363590301435224671</id><published>2010-10-21T14:41:00.000-07:00</published><updated>2010-10-21T14:41:11.122-07:00</updated><title type='text'>Predictive Analytics World Addresses Risk and Fraud Detection</title><content type='html'>&lt;style type="text/css"&gt;p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica}p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; min-height: 14.0px}&lt;/style&gt;   &lt;br /&gt;&lt;div class="p1"&gt;Eric Siegel focused his plenary session on predicting and assessing risk in the enterprise, and in his usual humorous way, described how, although big, macro or catastrophic risk often dominates thinking, micro or transactional risk can cost organizations more than macro risk. The micro risk is where predictive analytics is well suited, what he called data-driven micro risk management.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;The point is well-taken because the most commonly used PA techniques work better with larger data than with "one of a kind" events. Micro risk can be quantified well in a PA framework.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;During the second day, an excellent talk described a fraud assessment application in the insurance industry. While the entire CRISP-DM process was covered in this talk (from Business Understanding through Deployment), there was one aspect that struck me in particular, namely the definition of the target variable to predict. Of course, the most natural target variable for fraud detection is a label indicating whether a claim has been shown to be fraudulent. Fraud often has a legal aspect to it, where a claim can only be truly "fraud" after it has been prosecuted and the case closed. 
This has at least two difficulties for analytics. First, it can take quite some time for a case to close, making the data one has for building fraud models lag by perhaps years from when the fraud was perpetrated. Patterns of fraud change, and thus models may perpetually be behind in identifying the fraud patterns.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;Second, there are far fewer actual proven fraud cases compared to those that are suspicious and worthy of investigation. Cases may be dismissed or "flushed" for a variety of reasons, ranging from lack of resources to investigate to statutory restrictions and legal loopholes, which do not reduce the risk for a particular claim at all, but rather just change the target variable (to 0), making these cases appear the same as benign cases.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;In this case study, the author described a process where another label for risk was used, a human-generated label that only indicated a high-enough level of suspicious behavior rather than only using actual claims fraud, a good idea in my opinion.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7363590301435224671?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7363590301435224671/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7363590301435224671' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7363590301435224671'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5652924/posts/default/7363590301435224671'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/10/predictive-analytics-world-addresses.html' title='Predictive Analytics World Addresses Risk and Fraud Detection'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4331008440058235958</id><published>2010-10-15T21:58:00.001-07:00</published><updated>2010-10-15T21:58:12.428-07:00</updated><title type='text'>MicroPoll : Predictive Analytics World:What is most compelling?</title><content type='html'>&lt;img style="visibility:hidden;width:0px;height:0px;" border=0 width=0 height=0 src="http://counters.gigya.com/wildfire/IMP/CXNID=2000002.0NXC/bHQ9MTI4NzIwNTAxOTUwMiZwdD*xMjg3MjA1MDc4MDY5JnA9ODAwMTEmZD*mbj1ibG9nZ2VyJmc9MSZvPWFiYzllNDlkMTNjODQy/YTViMGQ2N2E*NzRmZGM5NjdkJm9mPTA=.gif" /&gt;&lt;iframe frameborder="0" width="100%" height="300" src="http://www.micropoll.com/a/MicroPoll?id=204182&amp;mode=html"&gt;&lt;/iframe&gt;&lt;div&gt;&lt;a href="http://www.micropoll.com/a/mpview/669955-204182"&gt;View Poll&lt;/a&gt; |&lt;a href="http://www.micropoll.com/a/mpresult/669955-204182"&gt;View Results&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.micropoll.com"&gt;Website Polls&lt;/a&gt; PoweredBy &lt;a href="http://www.micropoll.com"&gt;MicroPoll&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4331008440058235958?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4331008440058235958/comments/default' 
title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4331008440058235958' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4331008440058235958'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4331008440058235958'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/10/micropoll-predictive-analytics-world.html' title='MicroPoll : &lt;a href=&quot;http://www.predictiveanalyticsworld.com/&quot;&gt;Predictive Analytics World&lt;/a&gt;:&lt;br&gt;&lt;b&gt;What is most compelling?&lt;/b&gt;'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3243282151052514597</id><published>2010-10-07T21:34:00.000-07:00</published><updated>2010-10-07T21:35:32.981-07:00</updated><title type='text'>A little math humor, and achieving clarity in explaining solutions</title><content type='html'>This is still one of my favorite cartoons of all time (by &lt;a href="http://www.sciencecartoonsplus.com/pages/gallery.php"&gt;S. Harris&lt;/a&gt;). 
I think we've all been there before, trying to wave our hands in place of providing a good reason for the procedures we use.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.sciencecartoonsplus.com/images/miracle_sharris.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://www.sciencecartoonsplus.com/images/miracle_sharris.gif" width="246" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;A closely related phenomenon is when you receive an explanation for a business process that is "proof by confusion", whereby the person explaining the process uses lots of buzz words and complex terminology in place of clarity, probably because that person doesn't really understand it.&lt;br /&gt;&lt;br /&gt;This is why clarifying questions are so key. I remember a &lt;a href="http://www.rpi.edu/dept/math/"&gt;professor of mathematics of mine at Rensselaer Polytechnic Institute&lt;/a&gt; named &lt;a href="http://www.rpi.edu/~isaacd/"&gt;David Isaacson&lt;/a&gt; who told a story of a graduate seminar. If you have ever experienced these seminars, there are two distinguishing features: the food, which goes quickly to those who arrive on time, and the game of the speaker trying to lose the graduate students during the lecture (an overstatement, but a frequently occurring outcome). Prof. Isaacson told us of a guy there who would ask dumb questions from the get-go: questions that we all knew the answer to and most folks thought were obvious. But as the lecture continued, this guy was the only one left asking questions, and of course was the only one who truly understood the lecture. What was happening is that he was constantly aligning what he thought he heard by asking for clarification. 
The rest &amp;nbsp;of those in the room &lt;i&gt;thought&lt;/i&gt;&amp;nbsp;they understood, but in reality did not.&lt;br /&gt;&lt;br /&gt;It reminds me to ask questions, even the dumb ones if it means forcing the one who is teaching or explaining to restate their point in different words, thus providing better opportunity for true communication.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3243282151052514597?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3243282151052514597/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3243282151052514597' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3243282151052514597'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3243282151052514597'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/10/little-math-humor-and-achieving-clarity.html' title='A little math humor, and achieving clarity in explaining solutions'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-2749605993401067081</id><published>2010-09-24T03:00:00.000-07:00</published><updated>2010-09-24T03:30:37.156-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='theorist'/><category scheme='http://www.blogger.com/atom/ns#' 
term='practice'/><category scheme='http://www.blogger.com/atom/ns#' term='practitioner'/><title type='text'>Theory vs. Practice</title><content type='html'>In many fields, it is common to find a gap between theorists and practitioners.  As stereotypes, theorists have a reputation for sniffing at anything which has not been optimized and proven to the nth degree, while practitioners show little interest in theory, as it "only ever works on paper".&lt;br /&gt;&lt;br /&gt;I have been amazed at both extremes of this spectrum.  Academic and standards journals seem to publish mostly articles which solve theoretical problems which will never arise in practice (but which permit solutions which are elegant or which can be optimized to some ridiculous level), or solutions which are trivial variations on previous work.  The same goes for most masters and doctoral theses.  On the other hand, I was shocked when software development colleagues (consultants: the last word in practice over theory) were unfamiliar with two's complement arithmetic.&lt;br /&gt;&lt;br /&gt;Data mining is certainly not immune to this problem.  Not long ago, I came upon technical documentation for a linear regression which had been "fixed" by a logarithmic transformation of the dependent variable.  (There is a correct way to fit coefficients in this circumstance, but that was not done in this case.)  Even more astounding was the polynomial curve fit which was applied to "undo" the log transformation, to get back to the original units!  Sadly, the practitioners in question did not even recognize the classic symptom of their error:  residuals were much larger at the high end of their plots.&lt;br /&gt;&lt;br /&gt;Data miners (statisticians, quantitative analysts, forecasters, etc.) come from a variety of fields, and enjoy diverse levels of formal training.  Grounding in theory follows suit.  The people we work for typically are capable of identifying only the most egregious technical errors in our work.  
This sets the stage for potential problems.&lt;br /&gt;&lt;br /&gt;As a practitioner, I have found much that is useful in theory and suggest that it is a fountain which is worth returning to, from time to time.  Reviewing new developments in our field, searching for useful techniques and guidance will benefit data miners, regardless of their seniority.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-2749605993401067081?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/2749605993401067081/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=2749605993401067081' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2749605993401067081'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2749605993401067081'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/09/theory-vs-practice.html' title='Theory vs. 
Practice'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-6328583143743421770</id><published>2010-09-07T09:48:00.000-07:00</published><updated>2010-09-07T09:48:39.833-07:00</updated><title type='text'>DM Radio - Predictive Analytics and Fraud Detection</title><content type='html'>I'll be on &lt;a href="http://www.information-management.com/dmradio/-10017521-1.html"&gt;DM Radio Thursday September 9&lt;/a&gt; at 3pm EDT. Here's the blurb:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 12px; line-height: 15px;"&gt;How many ways to catch a thief? More and more, thanks to predictive analytics, data-as-a-service and other clever computing tricks. Stopping fraud in its tracks can save customers, money and more. Tune into this episode of DM Radio to find out how. 
We'll hear from Eric Siegel, Prediction Impact; Erick Brethenoux, SPSS; Jason Trunk, Quest Software and Dean Abbott, Abbott Analytics.&lt;/span&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-6328583143743421770?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/6328583143743421770/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=6328583143743421770' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6328583143743421770'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6328583143743421770'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/09/dm-radio-predictive-analytics-and-fraud.html' title='DM Radio - Predictive Analytics and Fraud Detection'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5618009279326048270</id><published>2010-09-02T22:37:00.000-07:00</published><updated>2010-09-02T22:37:17.482-07:00</updated><title type='text'>Leo Breiman quote about statisticians</title><content type='html'>One nice thing about having to move offices is that it forces you to go through old papers and folders. 
I found my folder containing &lt;a href="http://www.kdd.org/kdd/1997/"&gt;KDD 97 conference&lt;/a&gt; notes, including quotes in the tutorial by &lt;a href="http://www.rss.org.uk/main.asp?page=2779"&gt;David Hand&lt;/a&gt;&amp;nbsp;from Leo Breiman (1995):&lt;br /&gt;&lt;blockquote&gt;One problem in the field of statistics has been that everyone wants to be a theorist. Part of this is envy - the real sciences are based on mathematical theory. In the universities for this century, the glamor and prestige has been in mathematical models and theorems, no matter how irrelevant.&lt;/blockquote&gt;I love this quote because it highlights the divide between the practical and the elegant or sophisticated. Data mining and predictive analytics are "low-brow" sciences, empirical, and practical. That doesn't mean that the mathematics aren't important; they are very much so. But while we wait for the elegances of a theory to trickle down to us, we still need solutions.&lt;br /&gt;&lt;br /&gt;In courses I teach, one of my objectives is to take the mathematics of the algorithms and translate the practical meaning of what they do into understandable pieces so that practitioners can manipulate learning rates and hidden units, gini and two-ing, radial kernels and polynomial kernels. 
Understanding backprop isn't important to most&amp;nbsp;practitioners, but understanding how one can improve the performance of backprop is very much a key topic for practitioners.&lt;br /&gt;&lt;br /&gt;We need more Breimans to pave the way toward practical innovations in predictive modeling.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5618009279326048270?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5618009279326048270/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5618009279326048270' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5618009279326048270'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5618009279326048270'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/09/leo-breiman-quote-about-statisticians.html' title='Leo Breiman quote about statisticians'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8215707726530686345</id><published>2010-08-24T23:39:00.000-07:00</published><updated>2010-08-24T23:39:09.626-07:00</updated><title type='text'>Predictive Models are Only as Good as Their Acceptance by Decision-Makers</title><content type='html'>I have been reminded in the past couple weeks working with customers that in many applications of data mining and predictive analytics, unless the stakeholders of 
predictive models understand what the models are doing, the models are utterly useless. When rules from a decision tree, no matter how statistically significant, don't resonate with domain experts, they won't be believed. Arguments that "the model wouldn't have picked this rule if it wasn't really there in the data" make no difference when the rule doesn't make sense.&lt;br /&gt;&lt;br /&gt;There is always a tradeoff in these cases between the "best" model (i.e., most accurate by some measure) and the "best understood" model (i.e., the one that gets the "ahhhs" from the domain experts). We can coerce models toward the transparent rather than the statistically significant by removing fields that perform well but don't contribute to the story the models tell about the data.&lt;br /&gt;&lt;br /&gt;I know what some of you are thinking: if the rule or pattern found by the model is that good, we must try to find the reason for its inclusion, make the case for it, find a surrogate meaning, or just demand it be included because it is so good! I trust the algorithms and our ability to assess if the algorithms are finding something "real" compared with those "happenstance" occurrences. 
But not all stakeholders share our trust, and it is our job to translate the message for them so that their confidence in the models approaches our own.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8215707726530686345?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8215707726530686345/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8215707726530686345' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8215707726530686345'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8215707726530686345'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/08/predictive-models-are-only-as-good-as.html' title='Predictive Models are Only as Good as Their Acceptance by Decision-Makers'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4761815328755556755</id><published>2010-08-19T17:58:00.000-07:00</published><updated>2010-08-19T17:58:33.632-07:00</updated><title type='text'>Building Correlations in Clementine / Modeler</title><content type='html'>I just responded to this question on LinkedIn, Clementine group, and thought it might be of interest to a broader audience.&lt;br /&gt;&lt;br /&gt;Q:&amp;nbsp;Hi,&lt;br /&gt;Does anyone have any suggestion or any knowledge on how to make cross-correlation in the Modeler/Clementine?&lt;br /&gt;&lt;br 
/&gt;A:&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Arial, Helvetica, 'Nimbus Sans L', sans-serif; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px; line-height: 15px;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Arial, Helvetica, 'Nimbus Sans L', sans-serif; font-size: small;"&gt;&amp;nbsp;I'm not so familiar with Modeler 14, but in prior versions, there was no good correlation matrix option (the Statistics node does correlations, but it is not easy to build an entire matrix with it).&lt;br /&gt;&lt;br /&gt;The way I do it is with the Regression node. In the expert tab, click on the Expert radio button, then the Output... button, make sure the "Descriptions" box is checked, and run the regression with all the inputs (Direction-&amp;gt;In) you want in the correlation matrix. Don't worry about having an output that is useful--if you don't have one, create a random number (Range) and use that as the output. After you Execute this, look in the Advanced tab of the gem and you will find a correlation matrix there. 
I usually then export it and re-import it into Excel (as an HTML file) where it is much easier to read and do things like color code big correlations.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4761815328755556755?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4761815328755556755/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4761815328755556755' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4761815328755556755'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4761815328755556755'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/08/building-correlations-in-clementine.html' title='Building Correlations in Clementine / Modeler'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-523790452486114446</id><published>2010-08-13T08:25:00.000-07:00</published><updated>2010-08-13T08:25:41.925-07:00</updated><title type='text'>IBM and Unica, Affinium Model and Clementine</title><content type='html'>After seeing that &lt;a href="http://bit.ly/90EBIs"&gt;IBM has purchased Unica&lt;/a&gt;&amp;nbsp;I have to wonder how this will affect Affinium Model and Clementine (I revert to the names that were used for so long here, now PredictExpress and Modeler, respectively). 
They are so very different in interfaces, features and deployment options that it is hard to see how they will be "joined": the big-button wizard interface vs. the block-diagram flow interface.&lt;br /&gt;&lt;br /&gt;One thing I always liked about Affinium Model was the ability to automate the building of thousands of models. Clementine now has that same capability, so that advantage is lost. To me, that leaves the biggest advantage of Affinium Model being its language and wizards. Because it uses the language of customer analytics rather than the more technical language of data mining / predictive analytics, it is easier to teach to new analysts. Because it makes generally good decisions on data prep and preprocessing, the analyst doesn't need to know a lot about sampling and data transformations to get a model out (we won't dive into &lt;i&gt;how&lt;/i&gt; good here, or how much better experts could do the data transformations and sampling).&lt;br /&gt;&lt;br /&gt;My fear is that Affinium Model will just be dropped, going the way of Darwin, PRW (the predecessor to Affinium Model), and other data mining tools that were good ideas. 
Time will tell.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-523790452486114446?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/523790452486114446/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=523790452486114446' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/523790452486114446'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/523790452486114446'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/08/ibm-and-unica-affinium-model-and.html' title='IBM and Unica, Affinium Model and Clementine'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5593367303801994339</id><published>2010-08-02T19:52:00.000-07:00</published><updated>2010-08-02T19:52:32.746-07:00</updated><title type='text'>Is there too much data?</title><content type='html'>I was reading back over some old blog posts, and came across this quote from&amp;nbsp;&lt;a href="http://www.amazon.com/gp/product/0393324818?ie=UTF8&amp;amp;tag=dataminiandpr-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0393324818"&gt;Moneyball: The Art of Winning an Unfair Game&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=dataminiandpr-20&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=0393324818" style="border: none 
!important; margin: 0px !important;" width="1" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Intelligence about baseball statistics had become equated in the public mind with the ability to recite arcane baseball stats. What [Bill] James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible; and that point, somehow, had been lost. &lt;i&gt;'I wonder,' James wrote, 'if we haven't become so numbed by all these numbers that we are no longer capable of truly assimilating any knowledge which might result from them.'&lt;/i&gt; [italics mine]&lt;/blockquote&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Georgia, serif; font-size: 13px; line-height: 16px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Georgia, serif; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px; line-height: 16px;"&gt;I see this&amp;nbsp;phenomenon&amp;nbsp;often these days; we have so much data that we build models without thinking, hoping that the sheer volume of data and sophisticated algorithms will be enough to find the solution. But even with mounds of data, the insight still occurs often on the micro level, with individual cases or customers. 
The data must tell a story.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Georgia, serif; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px; line-height: 16px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Georgia, serif; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px; line-height: 16px;"&gt;The quote is a good reminder that no matter the size of the data, we are in the business of decisions, knowledge, and insight. Connecting the big picture (lots of data) to decisions takes more than analytics.&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5593367303801994339?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5593367303801994339/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5593367303801994339' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5593367303801994339'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5593367303801994339'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/08/is-there-too-much-data.html' title='Is there too much data?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4226735954961941850</id><published>2010-07-08T20:20:00.000-07:00</published><updated>2010-07-08T20:24:38.182-07:00</updated><title type='text'>Neural Network books</title><content type='html'>I was talking with a colleague today who is taking a business-oriented data mining course, and there was a list of &lt;a href="http://en.wikipedia.org/wiki/Neural_network"&gt;neural network&lt;/a&gt; books recommended by the instructor. It was fascinating looking at the books in the list because I didn't know several of them. When I examined several of the recommended books on amazon.com, I found they contained what I would call "academic" treatments of neural networks. That means they covered all kinds of varieties of neural networks, including brain-state-in-a-box, &lt;a href="http://en.wikipedia.org/wiki/Boltzmann_machine"&gt;Boltzmann machines&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Hebbian_learning"&gt;Hebbian&lt;/a&gt; networks, &lt;a href="http://en.wikipedia.org/wiki/ADALINE"&gt;Adaline&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Adaptive_resonance_theory"&gt;ART1, ART2, &lt;/a&gt;and many more. Now I have nothing against learning about these techniques on the graduate school level, or even on the undergraduate level. But for practitioners, I see absolutely no advantage here because they aren't used in practice. Nearly always, when someone says they are building a "neural network" they mean a &lt;a href="http://en.wikipedia.org/wiki/Multi-layer_perceptron"&gt;Multi-layered perceptron&lt;/a&gt; (MLP).&lt;br /&gt;&lt;br /&gt;When I use neural networks in major software packages, such as IBM-SPSS Modeler, Statistica, Tibco Spotfire Miner, SAS Enterprise Miner, JMP, Affinium Predictive Insight, and I can go on... I am building MLPs, not ART3 models. So why teach professionals how these other algorithms work? 
I don't know.&lt;br /&gt;&lt;br /&gt;Now neural network experts I'm sure will find times and places to build esoteric varieties of neural nets. But because of the way most practitioners actually build neural networks, I recommend sticking with the MLP, and understanding the vast numbers of options one has just with this algorithm. This is one reason I like the Christopher Bishop&amp;nbsp;&lt;a href="http://www.amazon.com/gp/product/0198538642?ie=UTF8&amp;amp;tag=dataminiandpr-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0198538642"&gt;Neural Networks for Pattern Recognition&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=dataminiandpr-20&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=0198538642" style="border: none !important; margin: 0px !important;" width="1" /&gt;. Check out the table of contents--I think these topics are more helpful to understand than learning more neural network algorithms.&lt;br /&gt;&lt;br /&gt;Another option for spinning up on neural nets is the excellent &lt;a href="http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html"&gt;SAS Neural Network FAQ&lt;/a&gt;&amp;nbsp;which is old, but still a very clear introduction to the subject. 
Finally, for backpropagation, I also like the Richard Lippmann 1987 classic "An Introduction to Computing with Neural Nets" (8MB &lt;a href="http://www.cs.sfu.ca/CC/414/li/material/refs/Lippmann-ASSP-87.pdf"&gt;here&lt;/a&gt;).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4226735954961941850?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4226735954961941850/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4226735954961941850' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4226735954961941850'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4226735954961941850'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/07/neural-network-books.html' title='Neural Network books'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8492117614907804091</id><published>2010-06-22T15:57:00.000-07:00</published><updated>2010-06-22T15:57:33.592-07:00</updated><title type='text'>Salford to Launch New Integrated Data Mining Suite</title><content type='html'>Tomorrow night is the launch of &lt;a href="http://www.salford-systems.com/"&gt;SPM (Salford Predictive Miner).&lt;/a&gt; If you are in San Diego, give them a holler to let them know you are coming. 
See you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8492117614907804091?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8492117614907804091/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8492117614907804091' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8492117614907804091'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8492117614907804091'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/06/salford-to-launch-new-integrated-data.html' title='Salford to Launch New Integrated Data Mining Suite'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5422954206238532075</id><published>2010-06-22T15:45:00.000-07:00</published><updated>2010-06-22T15:45:27.030-07:00</updated><title type='text'>A/B Testing and the Need for Clear Business Objectives</title><content type='html'>The website&amp;nbsp;http://videolectures.net/ contains a wealth of interesting lectures on a wide variety of topics, including data mining. 
I was reminded of one today by &lt;a href="http://videolectures.net/ronny_kohavi/"&gt;Ronny Kohavi&lt;/a&gt;&amp;nbsp;entitled &amp;nbsp;"&lt;a href="http://demo.viidea.com/cikm08_kohavi_pgtce/"&gt;Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO&lt;/a&gt;" It's short (only 23 minutes) and filled with some very good common-sense principles.&lt;br /&gt;&lt;br /&gt;First, it is a talk about the importance of A/B testing, or in other words, constructing experiments to learn customer behavior rather than having the experts make a best guess at how people will behave. He gives some good examples from Microsoft and the sometimes non-intuitive results from actual testing. A book he recommends is&amp;nbsp;&lt;a href="http://www.amazon.com/gp/product/0471697710?ie=UTF8&amp;amp;tag=dataminiandpr-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0471697710"&gt;Breakthrough Business Results With MVT: A Fast, Cost-Free, Secret Weapon for Boosting Sales, Cutting Expenses, and Improving Any Business Process&lt;/a&gt;&lt;img alt="" border="0" height="1" src="http://www.assoc-amazon.com/e/ir?t=dataminiandpr-20&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=0471697710" style="border: none !important; margin: 0px !important;" width="1" /&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: Helvetica; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 12px;"&gt;&lt;span class="Apple-style-span" style="font-family: Times;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;The second part of the lecture I found particularly interesting is what Kohavi calls the Overall Evaluation Criterion (OEC), or what I usually call business objectives. He included the great Lewis Carroll quote, "If you don't know where you are going, any road will take you there." 
I find this a common problem as well: if we don't define a business objective that truly measures the impact of the predictive models we build, we have no way of determining if they are effective or not. &amp;nbsp;This objective must be tied to the business itself. For example, Kohavi argues for using Customer Lifetime Value (CLV) rather than click-through rates as they are more tied to the bottom line.&lt;br /&gt;&lt;br /&gt;I would add that it can be useful to have two objectives that are measurable, especially if two objectives better measure the value. For example, in collections risk models, the age of the debt and the amount of the debt are both important components to risk. These are difficult to put into a single number in general, so the two-dimensional risk score can be helpful operationally.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5422954206238532075?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5422954206238532075/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5422954206238532075' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5422954206238532075'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5422954206238532075'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/06/ab-testing-and-need-for-clear-business.html' title='A/B Testing and the Need for Clear Business Objectives'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-6994729890300271022</id><published>2010-06-02T19:39:00.000-07:00</published><updated>2010-06-02T19:39:45.243-07:00</updated><title type='text'>Embedded Analytics and Business Rules: The Holy Grail?</title><content type='html'>Tomorrow (Thursday) at 3pm EDT I'll be on DM Radio for the broadcast "&lt;a href="http://www.information-management.com/dmradio/-10017503-1.html"&gt;Embedded Analytics and Business Rules: The Holy Grail?&lt;/a&gt;". &amp;nbsp;I'm not sure what the other guests are going to talk about, but my comments will resemble the talk I gave at &lt;a href="http://www.predictiveanalyticsworld.com/"&gt;Predictive Analytics World&lt;/a&gt; in February 2010,&amp;nbsp;&lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1-11"&gt;Rules Rule: Inductive Business-Rule Discovery in Text Mining&lt;/a&gt;. In this help-desk case study, we used decision trees to cherry-pick interesting rules, converted them to SQL, and deployed them in a rule system that was applied transactionally, online. I emphasized the text mining portion at PAW, but the methodology was independent of that. In 2002-2003, researchers at the IRS and I applied the same kind of approach to rule discovery in selecting returns for audit: use trees to find interesting rules. &lt;br /&gt;&lt;br /&gt;The reason we liked the approach was that it was a fast way to overcome two problems. First, a decision tree finds only the single best solution to a problem (according to its measure of "good"). To obtain a richer set of terminal nodes, one can build ensembles of trees, but then one loses the interpretation. On the other hand, one can build association rules, but then one is left with perhaps thousands to tens of thousands of rules that have to be pruned back to get the gist of the key ideas. 
Many of the rules will be redundant (some completely identical in which records are "hit" by the rule), and it's easy to become lost in the sheer number of rules.&lt;br /&gt;&lt;br /&gt;For the Fortune 500 company, we used &lt;a href="http://salford-systems.com/cart.php"&gt;CART&lt;/a&gt;&amp;nbsp;with the &lt;a href="http://salford-systems.com/blog/view-by-tag/14-battery/"&gt;battery option&lt;/a&gt; to generate a sequence of trees (we iterated on "priors" and misclassification costs, and I think some more options as well to generate variety), and took only those terminal nodes that had sufficiently high classification accuracy. I think we could have used their hotspot analysis for this too, but I wasn't sufficiently well-versed in it at that time.&lt;br /&gt;&lt;br /&gt;If you can't join in on the radio broadcast, you can always download the mp3 later.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-6994729890300271022?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/6994729890300271022/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=6994729890300271022' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6994729890300271022'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6994729890300271022'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/06/embedded-analytics-and-business-rules.html' title='Embedded Analytics and Business Rules: The Holy Grail?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3632131964350791208</id><published>2010-05-27T18:20:00.000-07:00</published><updated>2010-05-27T18:20:22.163-07:00</updated><title type='text'>PAKDD-10 Data Mining Competition Winner: Ensembles Again!</title><content type='html'>The &lt;a href="http://sede.neurotech.com.br/PAKDD2010/result.do?method=load"&gt;PAKDD-10 Data Mining Competition&lt;/a&gt; results are in, and ensembles occupied the top 4 positions, and I think the top 5. The winner used Stochastic Gradient Boosting and Random Forests in &lt;a href="http://statsoft.com/products/"&gt;Statistica&lt;/a&gt;, second place a combination of logistic regression and Stochastic Gradient Boosting (and &lt;a href="http://salford-systems.com/cart.php"&gt;Salford Systems CART&lt;/a&gt; for some feature extraction). Interestingly to me, the 5th place finisher used &lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;WEKA&lt;/a&gt;, an open source software tool.&lt;br /&gt;&lt;br /&gt;The problem was &lt;a href="http://sede.neurotech.com.br/PAKDD2010/login.do?method=redirecionar&amp;amp;vs_Pagina=problemcharacterization"&gt;credit risk with biased data&lt;/a&gt; for building the models, a good way to do the competition because this is the problem we usually face anyway: data was collected based on historic interactions with the company, biased by the approaches the company has used in the past rather than having a pure random sample to build models. 
Model performance was judged based on &amp;nbsp;&lt;a href="http://www.anaesthetist.com/mnm/stats/roc/"&gt;Area under the Curve (AUC)&lt;/a&gt;, with the &lt;a href="http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test"&gt;KS&lt;/a&gt; distance as the tie breaker (it's not every day I hear folks pull out the KS distance!).&lt;br /&gt;&lt;br /&gt;One submission in particular commented on the difference between how algorithms build models and the metric used to evaluate them. CART uses the Gini Index, logistic regression fits the log-odds by maximum likelihood, and neural networks (usually) minimize mean squared error; none of these directly maximizes AUC. But this topic is worthy of another post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3632131964350791208?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3632131964350791208/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3632131964350791208' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3632131964350791208'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3632131964350791208'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/05/pakdd-10-data-mining-competition-winner.html' title='PAKDD-10 Data Mining Competition Winner: Ensembles Again!'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-915634790763000848</id><published>2010-05-25T09:24:00.000-07:00</published><updated>2010-05-27T18:12:00.731-07:00</updated><title type='text'>The Trimmed Mean has Intuitive Appeal</title><content type='html'>I was listening to &lt;a href="http://sports.espn.go.com/espnradio/show?showId=theherd"&gt;Colin Cowherd of ESPN radio&lt;/a&gt; this morning and he made a very interesting observation that we data miners know, or at least should know and make good use of. The context was evaluating teams and programs: are they dynasties, or built on one great player or coach? Lakers? Dynasty. Celtics? Dynasty. Bulls? Without Jordan, they have been a mediocre franchise. The Lakers without Magic are still a dynasty. The Celtics without Bird are still a dynasty.&lt;br /&gt;&lt;br /&gt;So his rule of thumb that he applied to college football programs was this: remove the best coach and the worst coach, and then assess the program. If they are still a great program, they are truly a dynasty.&lt;br /&gt;&lt;br /&gt;This is the &lt;a href="http://en.wikipedia.org/wiki/Truncated_mean"&gt;trimmed (truncated) mean&lt;/a&gt; idea; he was applying it intuitively, but it is quite valuable in practice. When we assess customer lifetime value, if a small percentage of the customers generate 95% of the profits, examining those outliers or the long tail, while valuable, does not get at the general trend. When I was analyzing IRS corporate tax returns, the correlation between two line items (that I won't identify here!) was more than 90% over the 30K+ returns. But when we removed the largest 50 corporations, the correlation between these line items dropped to under 30%. Why? Because the tail drove the relationship; the overall trend didn't apply to the entire population. 
It is easy to be fooled by summary statistics for this reason: they assume characteristics about the data that may not be true.&lt;br /&gt;&lt;br /&gt;This all gets back to nonlinearity in the data: if outliers behave differently than the general population, assess them based on the truncated populations. If outliers exist in your data, get the gist from the trimmed mean or median to reduce the bias from the outliers. We know this intuitively, but sometimes we forget to do it and make misleading inferences.&lt;br /&gt;&lt;br /&gt;[UPDATE] I neglected to reference a former post that shows the problem of outliers in computing correlation coefficients:&amp;nbsp;&lt;a href="http://abbottanalytics.blogspot.com/2005/01/beware-of-outliers-in-computing.html"&gt;Beware of Outliers in Computing Correlations&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-915634790763000848?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/915634790763000848/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=915634790763000848' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/915634790763000848'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/915634790763000848'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/05/trimmed-mean-has-intuitive-appeal.html' title='The Trimmed Mean has Intuitive Appeal'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1229609669057523651</id><published>2010-05-23T21:36:00.000-07:00</published><updated>2010-05-23T21:37:46.116-07:00</updated><title type='text'>Upcoming DMRadio Interview: Analytics and Business Rules</title><content type='html'>On June 3rd, a week from this Thursday, I'll be participating in my third &lt;a href="http://www.information-management.com/dmradio/"&gt;DMRadio interview&lt;/a&gt;, this time on business &lt;span id="goog_1309563069"&gt;&lt;/span&gt;&lt;span id="goog_1309563070"&gt;&lt;/span&gt;&lt;a href="http://www.blogger.com/"&gt;&lt;/a&gt;rules (the first two were related to text mining, including &lt;a href="http://www.information-management.com/dmradio/-10016035-1.html"&gt;this one last year&lt;/a&gt;). I always have found these interviews enjoyable to do. I'll probably be discussing an inductive rule discovery process I participated in with a Fortune 500 company (and &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1-11"&gt;described at last February's Predictive Analytics World Conference&lt;/a&gt; in San Francisco). 
&lt;br /&gt;&lt;br /&gt;Even if you can't be there "live", you can download the interview later.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1229609669057523651?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1229609669057523651/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1229609669057523651' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1229609669057523651'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1229609669057523651'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/05/upcoming-dmradio-interview.html' title='Upcoming DMRadio Interview: Analytics and Business Rules'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4772088486518588011</id><published>2010-05-20T23:42:00.000-07:00</published><updated>2010-05-23T21:42:00.148-07:00</updated><title type='text'>Data Mining as a Top Career</title><content type='html'>More good news for data miners:&amp;nbsp;&lt;a href="http://www.signonsandiego.com/news/2010/may/19/hot-career-trends-for-college-grads-listed-in/"&gt;http://www.signonsandiego.com/news/2010/may/19/hot-career-trends-for-college-grads-listed-in/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span class="Apple-style-span" style="color: #222222; font-family: Arial, Helvetica, sans-serif; 
font-size: 12px; line-height: 16px;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 16px;"&gt;&lt;div style="list-style-image: initial; list-style-position: initial; list-style-type: none; margin-bottom: 16px; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"&gt;Data mining. The field involves extracting specific information or patterns from large databases. Career prospects are available in areas including advertising technology, scientific research and law enforcement.&lt;/div&gt;&lt;/span&gt;&lt;/blockquote&gt;I think they got it right: data mining (and its siblings Predictive Analytics and Business Analytics) are growing in their appeal. But more importantly, I see organizations believing they &lt;i&gt;can&lt;/i&gt;&amp;nbsp;do it.&lt;br /&gt;&lt;br /&gt;Of course, time will tell. One sign will be how many more resumes (unsolicited) I get!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4772088486518588011?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4772088486518588011/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4772088486518588011' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4772088486518588011'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4772088486518588011'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/05/data-mining-as-top-career.html' title='Data Mining as a Top Career'/><author><name>Dean 
Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4885698512205720483</id><published>2010-05-11T16:17:00.000-07:00</published><updated>2010-05-11T16:17:22.423-07:00</updated><title type='text'>web analytics and predictive analytics: comments from emetrics</title><content type='html'>I just got back from the latest (and my first) &lt;a href="http://emetrics.org/sanjose/"&gt;eMetrics conference&lt;/a&gt; in San Jose, CA last week, and was very impressed by the practical nature of the conference. It was also quite a different experience for me to be in a setting where I knew very few people. I was there to co-present, with Angel Morales, the talk "&lt;a href="http://emetrics.org/sanjose/2010/tracks/advanced.php#waa03"&gt;Behavioral Driven Marketing Attribution&lt;/a&gt;". Angel and I are co-founders of &lt;a href="http://smarterremarketer.com/"&gt;SmarterRemarketer&lt;/a&gt;, a new web analytics company, and the solution we described is just one nut we are trying to crack in the industry.&lt;br /&gt;&lt;br /&gt;This post, though, is about the overlap between web analytics and predictive analytics: there is very little right now. It really is a different world, and for many I spoke with, the mere mention of "predictive analytics" resulted in one of those unknowing looks back at me. In fairness, much that was spoken to me resulted in the same look!&lt;br /&gt;&lt;br /&gt;One such topic was that of "use cases", a term used over and over in talks, but one that I don't encounter in the data mining world. 
We describe "case studies", but a "use case" is a smaller and more specific example of something interesting or unusual in how individuals or groups of individuals interact with web sites (I hope I got that right). The key though is that this is a thread of usage. In data mining, it is more typical that predictive models are built, and then to understand why the models are the way they are, one might trace through some of the more interesting branches of a tree or unusual variable combinations in something similar to this "use case" idea.&lt;br /&gt;&lt;br /&gt;First, what to commend... The analyses I saw were quite good: customer segmentation, A/B testing, web page layout, some attribution, etc. There was a great keynote by Joe Megibow of Expedia describing how Expedia's entire web presence has changed in the past year. One of my favorite bloggers, Kevin Hillstrom of &lt;a href="http://minethatdata.blogspot.com/"&gt;MineThatData&lt;/a&gt; fame, gave a presentation praising the power of conditional probabilities (very nice!). &amp;nbsp;Lastly, there was one more keynote by someone I had never heard of (not to my credit), but who is obviously a great communicator and is well-known in the web analytics world,&amp;nbsp;&lt;a href="http://emetrics.org/sanjose/2010/speakers.php#kaushik"&gt;Avinash Kaushik&lt;/a&gt;. One idea I liked very much from his keynote was the long tail: the tail of the distribution of keywords that lead visitors to his website accounts for many times more visits than his top 10 keywords. In the data mining world, of course, this would push us to characterize these sparsely populated items differently so they produce more influence in any predictive models. Lots to think about.&lt;br /&gt;&lt;br /&gt;But I digress. The lack of data mining and predictive analytics at this conference begs (at least from me) the question: why not? 
They are swimming in data, have important business questions that need to be solved, and clearly not all of these are being solved well enough. That will be the subject of my next post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4885698512205720483?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4885698512205720483/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4885698512205720483' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4885698512205720483'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4885698512205720483'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/05/web-analytics-and-predictive-analytics.html' title='web analytics and predictive analytics: comments from emetrics'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-745679834730239242</id><published>2010-05-10T17:40:00.000-07:00</published><updated>2010-05-10T17:40:30.573-07:00</updated><title type='text'>Rexer Analytics Data Mining Survey</title><content type='html'>Calling all data miners! I encourage all to fill out the survey--it is the most complete survey of the data mining world that I am aware of. 
Use the link and code below, and stay tuned to see the results later in the year.&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Helvetica;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;table border="0" cellpadding="0" cellspacing="0"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="font: inherit;" valign="top"&gt;&lt;div&gt;&amp;nbsp;&lt;/div&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Survey Link:&amp;nbsp;&lt;/span&gt;&lt;b&gt;&lt;span style="font-family: Arial, sans-serif; font-size: 10pt;"&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a href="http://www.rexeranalytics.com/Data-Miner-Survey-2010-Intro2.html"&gt;www.RexerAnalytics.com/Data-Miner-Survey-2010-Intro2.html&lt;/a&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style="color: #333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Access Code:&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&lt;span style="font-family: Arial, sans-serif; font-size: 10pt;"&gt;RS2458&lt;/span&gt;&lt;/b&gt;&lt;span style="color: #333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Arial, sans-serif; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Arial, sans-serif; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Arial, sans-serif; font-size: 
small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px;"&gt;The full description sent by Karl Rexer is below:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span class="Apple-style-span" style="color: #333333; font-family: Arial, sans-serif; font-size: small;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Rexer Analytics, a data mining consulting firm, is conducting our fourth annual survey of the analytic behaviors, views and preferences of data mining professionals.&amp;nbsp; We would greatly appreciate it if you would:&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0.75in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;1)&lt;/span&gt;&lt;span style="color: #333333; font-size: 7pt;"&gt;&lt;span style="font-family: 'Times New Roman';"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Participate in this survey, and&lt;/span&gt;&lt;span style="color: #333333;"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0.75in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;2)&lt;/span&gt;&lt;span style="color: #333333; font-size: 7pt;"&gt;&lt;span 
style="font-family: 'Times New Roman';"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Tell other data miners about the survey (forward this email to them).&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0.75in; margin-right: 0in; margin-top: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Thank you.&lt;span&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;Forwarding the survey to others is invaluable for our “snowball sample methodology”.&lt;span&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;It helps the survey reach a wide and diverse group of data miners.&amp;nbsp;&amp;nbsp; Thank you also to everyone who participated in previous Data Miner Surveys, and especially to the people who provided suggestions for new questions and other survey modifications.&lt;span&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;This year’s survey incorporates many ideas from survey participants.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Your responses are completely confidential: no information you provide on the survey will be shared with anyone outside of Rexer Analytics.&amp;nbsp; All reporting of the survey findings will be done in the aggregate, and no findings will be written in such a way as to identify any of the participants.&amp;nbsp; This research is not being conducted for any third party, but is 
solely for the purpose of Rexer Analytics to disseminate the findings throughout the data mining community via publication, conference presentations, and personal contact.&amp;nbsp;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;If you would like a summary of last year’s or this year’s findings emailed to you, there will be a place at the end of the survey to leave your email address.&amp;nbsp; You can also email us directly (&lt;/span&gt;&lt;span style="color: #003399; font-family: Arial, sans-serif; font-size: 10pt;"&gt;&lt;a href="mailto:DataMinerSurvey@RexerAnalytics.com"&gt;&lt;span style="color: blue;"&gt;DataMinerSurvey@RexerAnalytics.com&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;) if you have any questions about this research or to request research summaries.&lt;/span&gt;&lt;span style="color: #333333;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;To participate, please click on the link below and enter the access code in the space provided.&amp;nbsp; The survey should take approximately 20 minutes to complete. 
&amp;nbsp;Anyone who has had this email forwarded to them should use the access code in the forwarded email.&lt;/span&gt;&lt;span style="color: #333333;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Survey Link:&amp;nbsp;&lt;/span&gt;&lt;b&gt;&lt;span style="font-family: Arial, sans-serif; font-size: 10pt;"&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;a href="http://www.rexeranalytics.com/Data-Miner-Survey-2010-Intro2.html"&gt;www.RexerAnalytics.com/Data-Miner-Survey-2010-Intro2.html&lt;/a&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style="color: #333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Access Code:&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;b&gt;&lt;span style="font-family: Arial, sans-serif; font-size: 10pt;"&gt;RS2458&lt;/span&gt;&lt;/b&gt;&lt;span style="color: #333333;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="MsoNormal" style="margin-bottom: 0pt; margin-left: 0in; margin-right: 0in; margin-top: 0in;"&gt;&lt;span style="color: #333333; font-family: Arial, sans-serif; font-size: 10pt;"&gt;Thank you for your time.&amp;nbsp; We hope the results from this survey provide useful information to the data mining community.&lt;/span&gt;&lt;span style="color: #333333;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-745679834730239242?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/745679834730239242/comments/default' title='Post Comments'/><link rel='replies' 
type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=745679834730239242' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/745679834730239242'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/745679834730239242'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/05/rexer-analytics-data-mining-survey.html' title='Rexer Analytics Data Mining Survey'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-6202912131084372738</id><published>2010-02-17T22:08:00.000-08:00</published><updated>2010-02-17T22:08:42.441-08:00</updated><title type='text'>Predictive Analytics World Recap</title><content type='html'>Predictive Analytics World (PAW) just ended today, and here are a few thoughts on the conference.&lt;br /&gt;&lt;br /&gt;PAW was a bigger conference than October's or last February's and it definitely felt bigger. It seemed to me that there was a larger international presence as well.&lt;br /&gt;&lt;br /&gt;Major data mining software vendors included the ones you would expect (in alphabetical order to avoid any appearance of favoritism):&amp;nbsp;Salford Systems, SAS, SPSS (an IBM company), Statsoft, and Tibco. Others who were there included Netezza (a new one for me--they have an innovative approach to data storage and retrieval), SAP, Florio (another new one for me--a drag-and-drop simulation tool) and REvolution.&lt;br /&gt;&lt;br /&gt;One surprise to me was how many text mining case studies were presented. 
John Elder rightly described text mining as "the wild west" of analytics in his talk, and SAS introduced a new initiative in text analytics (including sentiment analysis, a topic that came up in several discussions I had with other attendees).&lt;br /&gt;&lt;br /&gt;A second theme, emphasized by Eric Siegel in the keynote and discussed in a technical manner by Day 2 keynote speaker Kim Larsen, was uplift modeling, or as Larsen described it, Net Lift modeling. This approach makes a great deal of sense: rather than considering only responders, one should set up the data to identify those individuals who respond &lt;i&gt;because of the marketing campaign&lt;/i&gt;&amp;nbsp;and not bother those who would respond anyway. I'm interested in understanding the particular way that Larsen approaches Net Lift models with variable selection and a variant of Naive Bayes.&lt;br /&gt;&lt;br /&gt;But for me, the key is setting up the data right, and Larsen described the data particularly well. A good campaign will have a treatment set and a control set, where the treatment set gets the promotion or mailing, and the control set does not. There are several possible outcomes here. First, in the treatment set, there are those individuals who would have responded anyway, those who respond because of the campaign, and those who do not respond. For the control set, there are those who respond despite not receiving a mailing, and those who do not. The problem, of course, is that in the treatment set, you don't know which individuals would have responded if they had not been mailed, but you suspect that they look like those in the control set who responded.&lt;br /&gt;&lt;br /&gt;A third area that struck me was that of big data. There was a session (that I missed, unfortunately) on in-database vs. in-cloud computing (by Neil Raden of Hired Brains), as well as Robert Grossman's talk on building and maintaining 10K predictive models. 
I believe this latter application is the approach that we will move toward as data size increases, where multiple models are customized by geography, product, demographic group, etc.&lt;br /&gt;&lt;br /&gt;I enjoyed the conference tremendously, including the conversations with attendees. One of note concerned the use of ensembles of clustering models, which I hope will be presented at a future PAW.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-6202912131084372738?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/6202912131084372738/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=6202912131084372738' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6202912131084372738'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6202912131084372738'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/02/predictive-analytics-world-recap.html' title='Predictive Analytics World Recap'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5906726274944549964</id><published>2010-02-17T16:41:00.000-08:00</published><updated>2010-02-17T18:28:07.278-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data reduction'/><category scheme='http://www.blogger.com/atom/ns#' term='principal components analysis'/><category 
scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='eigenanalysis'/><category scheme='http://www.blogger.com/atom/ns#' term='principal component'/><category scheme='http://www.blogger.com/atom/ns#' term='PCA'/><title type='text'>Principal Components for Modeling</title><content type='html'>&lt;b&gt;Problem Statement&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Analysts constructing predictive models frequently encounter the need to reduce the size of the available data, both in terms of variables and observations.  One reason is that data sets are now available which are far too large to be modeled directly in their entirety using contemporary hardware and software.  Another reason is that some data elements (variables) have an associated cost.  For instance, medical tests bring an economic and sometimes human cost, so it would be ideal to minimize their use if possible.  Another problem is overfitting: Many modeling algorithms will eagerly consume however much data they are fed, but increasing the size of this data will eventually produce models of increased complexity without a corresponding increase in quality.  Model deployment and maintenance, too, may be encumbered by extra model inputs, in terms of both execution time and required data preparation and storage.&lt;br /&gt;&lt;br /&gt;Naturally, the goal in &lt;i&gt;data reduction&lt;/i&gt; is to decrease the size of needed data while maintaining (as much as is possible) model performance, so this process must be performed carefully.  &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A Solution: Principal Components&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Selection of candidate predictor variables to retain (or to eliminate) is the most obvious way to reduce the size of the data.  
If model performance is not to suffer, though, then some effective measure of each variable's usefulness in the final model must be employed, a task which is complicated by the correlations among predictors.  Several important procedures have been developed along these lines, such as &lt;i&gt;forward selection&lt;/i&gt;, &lt;i&gt;backward selection&lt;/i&gt; and &lt;i&gt;stepwise selection&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;Another possibility is &lt;i&gt;principal components analysis&lt;/i&gt; ("PCA" to its friends), a procedure from multivariate statistics that yields a new set of variables (the same number as before), called the &lt;i&gt;principal components&lt;/i&gt;.  Conveniently, all of the principal components are simply linear functions of the original variables.  As a side benefit, all of the principal components are completely uncorrelated with one another.  The technical details will not be presented here (see the reference, below), but suffice it to say that if 100 variables enter PCA, then 100 new variables (the &lt;i&gt;principal components&lt;/i&gt;) come out.  You may now be wondering where the "data reduction" is.  Simple: PCA constructs the new variables so that the first principal component exhibits the largest variance, the second principal component exhibits the second largest variance, and so on.&lt;br /&gt;&lt;br /&gt;How well this works in practice depends completely on the data.  In some cases, though, a large fraction of the total variance in the data can be compressed into a very small number of principal components.  The data reduction comes when the analyst decides to retain only the first &lt;i&gt;n&lt;/i&gt; principal components.&lt;br /&gt;&lt;br /&gt;Note that PCA does not eliminate the need for the original variables: they are all still used in the calculation of the principal components, no matter how few of the principal components are retained.  
Also, statistical variance (which is what is concentrated by PCA) may not correspond perfectly to "predictive information", although it is often a reasonable approximation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Last Words&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Many statistical and data mining software packages will perform PCA, and it is not difficult to write one's own code.  If you haven't tried this technique before, I recommend it: It is truly impressive to see PCA squeeze 90% of the variance in a large data set into a handful of variables.&lt;br /&gt;&lt;br /&gt;Note: Related terms from the engineering world: &lt;i&gt;eigenanalysis&lt;/i&gt;, &lt;i&gt;eigenvector&lt;/i&gt; and &lt;i&gt;eigenfunction&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Reference&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;For the down-and-dirty technical details of PCA (with enough information to allow you to program PCA), see:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Multivariate Statistical Methods: A Primer&lt;/i&gt;, by Manly (ISBN: 0-412-28620-3)&lt;br /&gt;&lt;br /&gt;Note: The first edition is adequate for coding PCA, and is at present &lt;u&gt;much&lt;/u&gt; cheaper than the second or third editions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5906726274944549964?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5906726274944549964/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5906726274944549964' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5906726274944549964'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5906726274944549964'/><link rel='alternate' type='text/html' 
href='http://abbottanalytics.blogspot.com/2010/02/prinicpal-components-for-modeling.html' title='Principal Components for Modeling'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1184692711214744405</id><published>2010-02-12T07:48:00.000-08:00</published><updated>2010-02-12T08:58:25.967-08:00</updated><title type='text'>Predictive Analytics World - San Francisco</title><content type='html'>The next &lt;a href="http://www.predictiveanalyticsworld.com/"&gt;Predictive Analytics World&lt;/a&gt; is coming up next week. This is a conference I look forward to very much because of the attendees; I have found that at the first two PAWs, there has been a good mix of folks who are experts and those who are spinning up on Predictive Analytics. I'll be teaching a &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/handson_predictive_analytics.php"&gt;hands-on workshop&lt;/a&gt; Monday (using Enterprise Miner), and &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1-11"&gt;presenting a talk&lt;/a&gt; on using trees to generate business rules for a help-desk text analytics application on Tuesda&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;y the 16th. 
You can still get the 15% discount if you use the registration code&amp;nbsp;DEANABBOTT010 in the &lt;a href="https://www.eiseverywhere.com/ereg/index.php?eventid=7934&amp;amp;"&gt;registration process&lt;/a&gt; (this is not a sales plug--I won't receive any benefit from this).&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;Look me up if you are going; I will be there both days (16th and 17th).&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Helvetica;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Helvetica;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1184692711214744405?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1184692711214744405/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1184692711214744405' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1184692711214744405'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1184692711214744405'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/02/predictive-analytics-san-franciso.html' title='Predictive Analytics World - San Francisco'/><author><name>Dean 
Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7539025166584132236</id><published>2010-01-19T15:05:00.000-08:00</published><updated>2010-01-19T15:38:45.033-08:00</updated><title type='text'>Is there anything new in Predictive Analytics?</title><content type='html'>&lt;a href="http://fcw.com/"&gt;Federal Computer Week&lt;/a&gt;'s John Zyskowski posted an article on Jan 8, 2010 on Predictive Analytics entitled "&lt;a href="http://fcw.com/articles/2010/01/11/tech-briefing-fuss-about-analytics.aspx?sc_lang=en"&gt;Deja vu all over again: Predictive analytics look forward into the past&lt;/a&gt;". (Kudos for the great &lt;a href="http://www.yogiberra.com/yogi-isms.html"&gt;Yogi Berra quote&lt;/a&gt;! But beware, as Berra stated himself, "I really didn't say everything I said".)&lt;br /&gt;&lt;br /&gt;Back to Predictive Analytics... Pieter Mimno is quoted as stating:&lt;br /&gt;&lt;blockquote&gt;There's nothing new about this (Predictive Analytics). It's just old techniques that are being done better.&lt;br /&gt;&lt;/blockquote&gt;To support this argument, John &lt;a href="http://fcw.com/Articles/2010/01/11/TECH-BRIEFING-fuss-about-analytics.aspx?sc_lang=en&amp;amp;Page=2"&gt;quotes me&lt;/a&gt;&amp;nbsp;regarding work done at DFAS 10 years ago.&amp;nbsp;Is this true? Is there nothing new in predictive analytics? If it isn't true, what is new?&lt;br /&gt;&lt;br /&gt;I think what is new is not algorithms, but a better integration of data mining software in the business environment, primarily in two places: on the front end and on the back end. 
On the front end, data mining tools are better at connecting to databases now compared to 10 years ago, and provide the analyst better tools for assessing the data coming into the software. This has always been a big hurdle, and was the reason that at KDD 1999 in San Diego, the panel discussion on "&lt;a href="http://www.google.com/url?q=http://www.sigkdd.org/explorations/issue1-2/kohavi.pdf&amp;amp;ei=GzdWS6GYDJOIsgOm9Lz5AQ&amp;amp;sa=X&amp;amp;oi=nshc&amp;amp;resnum=1&amp;amp;ct=result&amp;amp;cd=1&amp;amp;ved=0CAsQzgQoAA&amp;amp;usg=AFQjCNF37lUUdA6PHECCs6xb55iJ889plA"&gt;Data Mining into Vertical Solutions&lt;/a&gt;" concluded that data mining functionality would be integrated into the database to a large degree. While that hasn't happened quite the way it was envisioned 10 years ago, connecting to the data is clearly much easier now.&lt;br /&gt;&lt;br /&gt;On the back end, I believe the most significant step forward in data mining tools has been giving the analyst the ability to assess models in a manner consistent with the business objectives of the model. So rather than comparing models based on R^2 or overall classification accuracy, most tools give you the ability to generate an ROI chart, or a ROC curve, or build a custom model assessment engine based on rank-ordered model predictions. This means that when we convey what models are doing to decision makers, we can do so in the language they understand and not force them to understand how good an R^2 of 0.4 really is. In addition, data mining tools are to a greater degree producing scoring code that is usable outside of the tool itself by creating SQL code, SAS code, C or Java, or &lt;a href="http://dmg.org/"&gt;PMML&lt;/a&gt;. 
What I'm waiting for next is for vendors to provide PMML or other code for all the data prep one does in the tool prior to the model itself; typically, PMML code is generated only for the model itself.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7539025166584132236?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7539025166584132236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7539025166584132236' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7539025166584132236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7539025166584132236'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/01/is-there-anything-new-in-predictive.html' title='Is there anything new in Predictive Analytics?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4702946964690657979</id><published>2010-01-10T01:42:00.000-08:00</published><updated>2010-01-23T07:10:12.480-08:00</updated><title type='text'>Counting Observations</title><content type='html'>Data is fodder for the data mining process.  One fundamental aspect of the data we analyze is its size, which is most often characterized by the number of observations and the number of variables in the given set of data- typically measured as counts of "rows and columns", respectively.  
It is worth taking a closer look at this, though, as questions such as "Do we have enough data?" depend on an apt measure of how much data we have.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Outcome Distributions&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;In many predictive modeling situations, cases are spread fairly evenly among the possible outcomes, but this is not always true.  Many fraud detection problems, for instance, involve extreme class imbalance: target class cases (known frauds) may represent a small fraction of 1% of the available records.  Despite there being many &lt;i&gt;total observations&lt;/i&gt; of customer behavior, observations of &lt;i&gt;fraudulent behavior&lt;/i&gt; may be rather sparse.  Data miners who work in the fraud detection field are acutely aware of this issue and characterize their data sets not just by 'total number of observations', but also by 'observations of the behavior of interest'.  When assessing an existing data set, or specifying a new one, such analysts generally employ both counts.&lt;br /&gt;&lt;br /&gt;Numeric outcome variables may also suffer from this problem.  Most numeric variables are not uniformly distributed, and areas in which outcome data is sparse (for instance, long tails of high personal income) are areas which may be poorly represented in models derived from that data.&lt;br /&gt;&lt;br /&gt;With both class and numeric outcomes, it might be argued that outcome values which are infrequent are, by definition, less important.  This may or may not be so, depending on the modeling process and our priorities.  If the model is expected to perform well on the top personal income decile, then data should be evaluated by how many cases fall in that range, not just on the total observation count.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Predictor Distributions&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Issues of coverage occur on the input variable side, as well.  
Keeping in mind that generalization is the goal of discovered models, the total record count by itself seems inadequate when, for example, data are drawn from a process which has (or may have) a seasonal component.  Having 250,000 records in a single data set sounds like a lot, but if they are only drawn from October, November and December, then one might reasonably take the perspective that only 3 "observations" of monthly behavior are represented, out of 12 possibilities.  In fact (assuming some level of stability from year to year), one could argue that not only should all 12 calendar months be included, but that they should be drawn from multiple historical years, so that there are multiple observations for each calendar month.&lt;br /&gt;&lt;br /&gt;Other groupings of cases in the input space may also be important.  For instance, hundreds of observations of retail sales may be available, but if they come from only 25 salespeople out of a sales force of 300, then treating the simple record count as the "observation count" may be deceiving.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Validation Issues&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Observations as aggregates of single records should be considered during the construction of train/test data, as well.  When pixel-level data are drawn from images for the construction of a pixel-level classifier, for instance, it makes sense to avoid having some pixels from a given image serve as training observations while other pixels from that same image serve as validation observations.  
Entire images should be labeled as "train" or "test", and pixels drawn as observations accordingly, to avoid "cheating" during model construction based on the inherent redundancy in image data.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This posting has only briefly touched on some of the issues which arise when attempting to measure the volume of data in one's possession, and has not explored yet more subtle concepts such as sampling techniques, observation weighting or model performance measures.  Hopefully, though, it gives the reader some things to think about when assessing data sets in terms of their size and quality.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4702946964690657979?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4702946964690657979/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4702946964690657979' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4702946964690657979'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4702946964690657979'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/01/counting-observations.html' title='Counting Observations'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' 
src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3581747450603427236</id><published>2010-01-06T18:04:00.001-08:00</published><updated>2010-01-06T18:32:06.228-08:00</updated><title type='text'>Data Mining and Terrorism... Counterpoint</title><content type='html'>In a recent posting to this Web log (&lt;a href="http://abbottanalytics.blogspot.com/2010/01/data-mining-and-privacyagain.html"&gt;Data Mining and Privacy...again, Jan-04-2010&lt;/a&gt;), Dean Abbott made several points regarding the use of data mining to counter terrorism, and related privacy issues.  I'd like to address the question of the usefulness of data mining in this application.&lt;br /&gt;&lt;br /&gt;Dean quoted Bruce Schneier's argument against data mining's use in anti-terrorism programs.  The specific technical argument that Schneier has made (and he is not alone in this) is: Automatic classification systems are unlikely to be effective at identifying individual terrorists, since terrorists are so rare.  Schneier concludes that the rate of "false positives" could never be made low enough for such a system to work effectively.&lt;br /&gt;&lt;br /&gt;As far as this &lt;u&gt;specific&lt;/u&gt; technical line of thought goes, I agree absolutely, and doubt that any competent data analyst would disagree.  It is the extension of this argument to the much broader conclusion that data mining is not a fruitful line of inquiry for those seeking to oppose terrorists that I take issue with.&lt;br /&gt;&lt;br /&gt;Many (most?) computerized classification systems in practice output probabilities, as opposed to simple class predictions.  Owners of such systems use them to &lt;b&gt;prioritize&lt;/b&gt; their efforts (think of database marketers who sort name lists to find those who are most likely to respond to an offer).  
Classifiers need not be perfect to be useful, and portraying them as such is what I call the "&lt;i&gt;Minority Report&lt;/i&gt; strawman".&lt;br /&gt;&lt;br /&gt;Beyond this, data mining has been used to great effect in rooting out other criminal behaviors, such as money laundering, which are associated with terrorism.  While those who practice our art against terrorism are unlikely to be forthcoming about their work, it is not difficult to imagine data mining systems other than classifiers being used in this struggle, such as analysis on networks of associates of terrorists.&lt;br /&gt;&lt;br /&gt;It would take considerable naivety to believe that present computer systems could be trained to throw up red flags on a small number of individuals, previously unknown to be terrorists, with any serious degree of reliability.  Given the other chores which data mining systems may perform in this fight, I think it is equally naive to abandon that promise for an overextended technical argument.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3581747450603427236?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3581747450603427236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3581747450603427236' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3581747450603427236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3581747450603427236'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/01/data-mining-and-terrorism-counterpoint.html' title='Data Mining and Terrorism... 
Counterpoint'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5556876815381425970</id><published>2010-01-04T21:53:00.000-08:00</published><updated>2010-01-04T21:53:41.292-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Predictive Analytics World'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining conferences'/><title type='text'>The Next Predictive Analytics World</title><content type='html'>Just a reminder that the next &lt;a href="http://www.predictiveanalyticsworld.com/"&gt;Predictive Analytics World&lt;/a&gt; is coming in another 6 weeks--Feb 16-17 in San Francisco. &lt;br /&gt;&lt;br /&gt;I'll be teaching a pre-conference &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/handson_predictive_analytics.php"&gt;Hands-On Predictive Analytics&lt;/a&gt; workshop using SAS Enterprise Miner on the 15th, and presenting a &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1-11"&gt;text mining case study&lt;/a&gt; on the 16th. 
&lt;br /&gt;&lt;br /&gt;For any readers here who may be going, feel free to use this discount code during registration to get a 15% discount off the 2-day conference: DEANABBOTT010&lt;br /&gt;&lt;br /&gt;Hope to see you there.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5556876815381425970?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.predictiveanalyticsworld.com/' title='The Next Predictive Analytics World'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5556876815381425970/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5556876815381425970' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5556876815381425970'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5556876815381425970'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/01/next-predictive-analytics-world.html' title='The Next Predictive Analytics World'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4159218861584783289</id><published>2010-01-04T21:41:00.000-08:00</published><updated>2010-01-04T21:42:51.899-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='link analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='rare events'/><category scheme='http://www.blogger.com/atom/ns#' term='privacy'/><title type='text'>Data Mining and 
Privacy...again</title><content type='html'>A Google search tonight on "data mining" referred to the latest &lt;a href="http://www.dhs.gov/xlibrary/assets/privacy/privacy_rpt_datamining_2009_12.pdf"&gt;DHS Privacy Office 2009 Data Mining Report to Congress.&lt;/a&gt; I'm always nervous when I see "data mining" in titles like this, especially when linked to privacy because of the misconceptions about what data mining is and does. I have long contended that data mining only does what humans would do manually if they had enough time to do it. What most privacy advocates are really complaining about is the &lt;i&gt;data that one has available&lt;/i&gt; to make the inferences from; data mining just makes those inferences more efficiently.&lt;br /&gt;&lt;br /&gt;What I like about this article is its common-sense comments. Data mining on extremely rare events (such as terrorist attacks) is very difficult because there are not enough examples of the patterns to have high confidence that the predictions are not by chance. Or as it is stated in the article:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Security expert Bruce Schneier explains well. When searching for a needle in a haystack, adding more "hay" does not good at all. Computers and data mining are useful only if they are looking for something relatively common compared to the database searched. For instance, out of 900 million credit card in the US, about 1% are stolen or fraudulently used every year. One in a hundred is certainly the exception rather than the rule, but it is a common enough occurrence to be worth data mining for. By contrast, the 9-11 hijackers were a 19-man needle in a 300 million person haystack, beyond the ken of even the finest super computer to seek out. 
Even an extremely low rate of false alarms will swamp the system.&lt;/blockquote&gt;&lt;br /&gt;Now this is true for the most commonly used data mining techniques (&lt;a href="http://en.wikipedia.org/wiki/Predictive_modelling"&gt;predictive models&lt;/a&gt; like &lt;a href="http://en.wikipedia.org/wiki/Decision_trees"&gt;decision trees&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Linear_regression"&gt;regression&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Neural_networks"&gt;neural nets&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Support_vector_machines"&gt;SVM&lt;/a&gt;). However, there are other techniques that are used to find links between interesting entities that are extremely unlikely to occur by chance. This isn't foolproof, of course, but while there will be lots of false alarms, they can still be useful. Again from the enlightened layperson:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;An NSA data miner &lt;a href="http://www.washingtonpost.com/wp-dyn/content/article/2006/02/04/AR2006020401373_4.html"&gt;acknowledged&lt;/a&gt;, "Frankly, we'll probably be wrong 99 percent of the time . . . but 1 percent is far better than 1 in 100 million times if you were just guessing at random."&lt;/blockquote&gt;&lt;br /&gt;It's not as if this were a new topic. From the Cato Institute, &lt;a href="http://www.cato-at-liberty.org/2007/04/19/link-analysis-and-911/"&gt;this article&lt;/a&gt; describes the same phenomenon, and links to a &lt;a href="http://jeffjonas.typepad.com/SRD-911-connections.pdf"&gt;Jeff Jonas presentation&lt;/a&gt; that describes how good investigation would have linked the 9/11 terrorists (rather than using data mining). Fair enough, but analytic techniques are still valuable in removing the chaff--those individuals or events that are very uninteresting. 
In fact, I have found this to be a very useful approach to handling difficult problems.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4159218861584783289?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://enlightenedlayperson.blogspot.com/2010/01/data-mining-needle-in-haystack-problem.html' title='Data Mining and Privacy...again'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4159218861584783289/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4159218861584783289' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4159218861584783289'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4159218861584783289'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2010/01/data-mining-and-privacyagain.html' title='Data Mining and Privacy...again'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5766403634177460233</id><published>2009-12-29T08:03:00.000-08:00</published><updated>2009-12-29T08:03:00.985-08:00</updated><title type='text'>2009 Retrospective</title><content type='html'>I was thinking about top data mining trends in 2009, and searched for what others thought about it. 
I'll combine a few 2009 "top 3" lists here, including top trends (as described at Enterprise Irregulars &lt;a href="http://www.enterpriseirregulars.com/5706/the-top-10-trends-for-2010-in-analytics-business-intelligence-and-performance-management/"&gt;here&lt;/a&gt;), and posts here that generated the most buzz.&lt;br /&gt;&lt;br /&gt;First, the top data mining news story was IBM's purchase of SPSS. It will be very interesting to see if this continues the trend toward integration of Business Intelligence and Predictive Analytics that one sees with SAS, Tibco and now IBM/SPSS.&lt;br /&gt;&lt;br /&gt;The Enterprise Irregulars post included a few interesting 2010 trends (and data mining is, after all, about using historical data to predict future behavior, assuming past behavior will continue). In particular, there are 4 mentioned that were of interest to me:&lt;br /&gt;&lt;OL&gt;&lt;blockquote&gt;&lt;LI&gt;The holy grail of the predictive, real-time enterprise (his #2)&lt;br /&gt;&lt;LI&gt;SaaS / Cloud BI Tools will steal significant revenue from on-premise vendors but also fight for limited oxygen amongst themselves.  (his #5)&lt;br /&gt;&lt;LI&gt;Advanced Visualization will continue to increase in depth and relevance to broader audiences.  (his #7)&lt;br /&gt;&lt;LI&gt;Open Source offerings will continue to make in-roads against on-premise offerings. (his #8)&lt;/blockquote&gt;&lt;/OL&gt;I agree with his #2 and #7 (integration of BI/PA and visualization). Several customers I work with are trying to integrate predictive analytics into the database to make better decisions. The difference now is that there is also interest in integrating this process with other data-centric (BI) operations to provide the right information to decision-makers with the right level of granularity (detail). 
This is typically a combination of creating the ability to perform ad hoc queries along with examining the results (rankings and projections) from predictive analytics.&lt;br&gt;&lt;br&gt;However, I have not seen Cloud computing and Open source take off &lt;i&gt;from the perspective of customers I work with&lt;/i&gt;. The latter two certainly have generated buzz, and in the courses I teach, there is considerable interest in open source computing (R in particular), but it has still been &lt;i&gt;interest&lt;/i&gt; rather than &lt;i&gt;action&lt;/i&gt;. I expect though that as the allure of data mining and predictive analytics extends its reach deeper into organizations, the need for inexpensive tools (in dollars) will result in increased use of the open source and free tools, such as &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, &lt;a href="http://rapid-i.com/content/blogsection/7/82/lang,en/"&gt;RapidMiner&lt;/a&gt;, &lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;Weka&lt;/a&gt;, &lt;a href="http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html"&gt;Tanagra&lt;/a&gt;, &lt;a href="http://www.ailab.si/orange/"&gt;Orange&lt;/a&gt;, &lt;a href="http://www.knime.org/"&gt;Knime&lt;/a&gt;, and others. 
Lastly, from this blog, the top posts of 2009 were&lt;OL&gt;&lt;LI&gt; &lt;a href="http://abbottanalytics.blogspot.com/2009/04/why-normalization-matters-with-k-means.html"&gt;Why normalization matters with K-Means&lt;/a&gt;&lt;br /&gt;&lt;LI&gt; &lt;a href="http://abbottanalytics.blogspot.com/2009/03/how-many-software-packages-is-too-much.html"&gt;How many software packages are too much?&lt;/a&gt; &lt;br /&gt;&lt;LI&gt; &lt;a href="http://abbottanalytics.blogspot.com/2009/03/data-mining-does-it-get-any-better-than.html"&gt;Data Mining: Does it get any better than this?&lt;/a&gt;&lt;br /&gt;&lt;LI&gt; &lt;a href="http://abbottanalytics.blogspot.com/2009/01/text-mining-and-regular-expressions.html"&gt;Text Mining and Regular Expressions&lt;/a&gt;&lt;br /&gt;&lt;/OL&gt;&lt;br /&gt;Happy New Year!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5766403634177460233?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5766403634177460233/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5766403634177460233' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5766403634177460233'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5766403634177460233'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/12/2009-retrospective.html' title='2009 Retrospective'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1212583816330789856</id><published>2009-12-15T12:45:00.000-08:00</published><updated>2010-01-04T21:43:31.675-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Business analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Overlap in the Business Intelligence / Predictive Analytics Space</title><content type='html'>I've received considerable feedback on the post &lt;a href="http://abbottanalytics.blogspot.com/2009/12/business-analytics-vs-business.html"&gt;Business Intelligence vs. Business Analytics&lt;/a&gt;, which has also caused me to think more about the BI space and its overlap with data mining (DM) / predictive analytics (PA) / business analytics (BA).  One place to look for this, of course, is with &lt;a href="http://www.gartner.com/technology/home.jsp"&gt;Gartner&lt;/a&gt;, how they define Business Intelligence, and which vendors overlap between these industries. (I think of this in much the same way as I do DM; I look to data miners to define themselves and what they do rather than to other industries and how they define data mining). &lt;br /&gt;&lt;br /&gt;I found the Gartner Magic Quadrant for Business Intelligence in 2009 &lt;a href="http://analyst.gartner-bi.sapfarm.com/"&gt;here&lt;/a&gt;, and was very curious to understand (1) how they define BI, and (2) which BI players are also big players in the data mining space. Answering the first question, data analysis in the BI world is defined here as comprising four parts: OLAP, visualization, scorecards, and data mining. So DM in this view is a subset of BI. 
&lt;br /&gt;&lt;br /&gt;Second, the key players in the &lt;a href="http://imagesrv.gartner.com/media-products/reprints/images/oracle/163529_0001.png"&gt;quadrant&lt;/a&gt; interestingly include only a few vendors I would consider to be top data mining vendors: SAS, Oracle, IBM (Cognos), and Microsoft in the "Leaders" category, and Tibco in the visionaries category.  Of these, only SAS (with Enterprise Miner) and Microsoft (SQL Server) showed up in the top 10 of the &lt;a href="http://www.rexeranalytics.com"&gt;Rexer Analytics&lt;/a&gt; 2008 software tool survey, though Tibco showed up in the top 20 (with Tibco Spotfire Miner). &lt;br /&gt;&lt;br /&gt;I think this emphasizes again that BI and DM/PA/BA approach analysis differently, even if the end result is the same (a scorecard, dashboard, report, or transactional decisioning system).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1212583816330789856?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1212583816330789856/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1212583816330789856' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1212583816330789856'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1212583816330789856'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/12/overlap-in-business-intelligence.html' title='Overlap in the Business Intelligence / Predictive Analytics Space'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3032566945419760278</id><published>2009-12-06T23:38:00.000-08:00</published><updated>2009-12-08T18:00:34.490-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Business analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Business Analytics vs. Business Intelligence</title><content type='html'>I used to be one that thought the term "data mining" would stay as the description of the kind of analytic work I do. To a large degree it has, but there are always new spins on things, and it seems that quite often in the business world, Predictive Analytics or Business Analytics are the terms of the day. &lt;br /&gt;&lt;br /&gt;I just came across this post from the Smart Data Collective: &lt;a href="http://smartdatacollective.com/Home/22740?utm_source=sdc_newsletter&amp;utm_medium=email&amp;utm_campaign=newsletter"&gt;OLAP is Dead (Long Live Analytics)&lt;/a&gt;, which had some fascinating graphs on hits related to the phrases OLAP and Analytics. The first shows the steady decline of OLAP as a searched term to the point where even the OLAP report has been renamed to &lt;a href="http://www.bi-verdict.com/"&gt;The BI Verdict&lt;/a&gt;. Meanwhile, "analytics" has been increasing steadily in hits. SAS even touts themselves as &lt;a href="http://www.sas.com/businessanalytics/index.html"&gt;leaders in "Business Analytics"&lt;/a&gt; now. &lt;br /&gt;&lt;br /&gt;Which brings me to the question in the title of this post. It seems to me that Business Intelligence has taken over the role that OLAP and dashboarding used to take on (at least in the circles I worked in). Is there a difference between Business Intelligence and Business Analytics? 
James Taylor, someone whom I respect tremendously, &lt;a href="http://www.ebizq.net/blogs/decision_management/2009/03/business_intelligence_or_busin.php"&gt;doesn't think so&lt;/a&gt;. &lt;br /&gt;&lt;blockquote&gt;As SAS talked about its business analytics framework it became clear that they envision the results of data mining and predictive analytics (where they genuinely have offerings superior to almost everyone) will be delivered in reports or dashboards. This is what I have somewhat dismissively called "predictive reporting" and while it is better than purely historical reporting, it does not do much to make every decision analytically based as it leaves out the decisions made by machines (which don't read reports) and those made by people with too little time to read a report (most call center or retail staff, for instance) or no skill at interpreting it.&lt;br /&gt;&lt;br /&gt;I guess I just don't see the difference between BI and BA...&lt;/blockquote&gt;&lt;br /&gt;If all of business analytics is reduced to "predictive reporting", then I can see why some might consider it no more than business intelligence. But even so, are they the same? I don't mean are the results the same either. For that matter, the final decisions from analytics for say classification look just the same as a human decision (buy or not buy? fraud or not?). But is the process the same? I would argue "no". Much of the power of predictive analytics comes from the automation in searching for and assessing nonlinearities, interaction effects, and combinatorics relating observables to outcomes. So, rather than manually assessing these, one automates the process through the use of "decision trees", "neural networks", or some other algorithm. 
So the difference lies in efficiency in the process.&lt;br /&gt;&lt;br /&gt;Now how the predictive information is used, in a report, as part of an automated system or in some other way, is a critically important question, but independent of how the decisions are generated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3032566945419760278?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3032566945419760278/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3032566945419760278' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3032566945419760278'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3032566945419760278'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/12/business-analytics-vs-business.html' title='Business Analytics vs. 
Business Intelligence'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1885048679612527427</id><published>2009-12-01T06:55:00.000-08:00</published><updated>2009-12-08T18:01:13.879-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='theology'/><category scheme='http://www.blogger.com/atom/ns#' term='computer science'/><category scheme='http://www.blogger.com/atom/ns#' term='sampling'/><title type='text'>Computer Science and Theology</title><content type='html'>I have been reading a book by Don Knuth called &lt;a href="http://www.amazon.com/gp/product/157586326X?ie=UTF8&amp;tag=dataminiandpr-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=157586326X"&gt;Things a Computer Scientist Rarely Talks About (Center for the Study of Language and Information - Lecture Notes)&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=dataminiandpr-20&amp;l=as2&amp;o=1&amp;a=157586326X" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /&gt;--a very good read for those of you interested in theology as well as analytics. 
This post is not about the theology of the book (as interesting as that is to me), but rather the reason described in this book for his writing of another book called &lt;a href="http://www.amazon.com/gp/product/0895792524?ie=UTF8&amp;tag=dataminiandpr-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0895792524"&gt;3:16&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=dataminiandpr-20&amp;l=as2&amp;o=1&amp;a=0895792524" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /&gt;, a study of all the 3:16 verses in the Bible. In his chapter on randomized testing (I like to think of model ensembles here), he describes how random sampling is a good way to get an idea of the content of "stuff", whether computer science assignments (he actually does this--randomly take page X of a project and look at that in depth), or understanding books (like the Bible). His 3:16 book takes this verse from every book in the Bible to get a sense of the overall message of the Bible. He admittedly chose 3:16 because of John 3:16 so that he would get at least one great verse, but this was a concession to making the book marketable.&lt;br /&gt;&lt;br /&gt;At first I wasn't a big fan of this idea. After all, it is a small sample. But he describes how he then studied these verses in depth. Whereas his prior understanding of the Bible was vague and general (which has its positive points), this exercise led also to a deeper (albeit narrow) understanding as well. I recommend this approach very much.&lt;br /&gt;&lt;br /&gt;What does this have to do with analytics? Data Mining often is viewed as a way to get the gist of your data, see the big picture, understand patterns through summarized views. But just as important is the deep view, looking at a few examples (prototypes) in depth. 
In the text mining project I'm working on right now, while we extract "concepts", much of our time is also spent tracing a few text blocks through the processing to understand in detail why the analytics is working the way it does. I'm a "both / and" kind of guy, so this suits me well; big picture analytics as well as deep dives into record-level descriptions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1885048679612527427?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1885048679612527427/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1885048679612527427' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1885048679612527427'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1885048679612527427'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/12/computer-science-and-theology.html' title='Computer Science and Theology'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-69285083172218793</id><published>2009-11-23T22:28:00.000-08:00</published><updated>2009-12-08T18:02:14.916-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='classification'/><category scheme='http://www.blogger.com/atom/ns#' term='sampling'/><title type='text'>Stratified Sampling vs. 
Posterior Probability Thresholds</title><content type='html'>One of the great things about conferences like the recent &lt;a href="http://www.predictiveanalyticsworld.com"&gt;Predictive Analytics World&lt;/a&gt; is how many technical interactions one has with top practitioners; this past October was no exception. One such interaction was with Tim Manns who blogs &lt;a href="http://timmanns.blogspot.com/"&gt;here&lt;/a&gt;. We were talking about Clementine and what to do with small populations of 1s in the target variable, which prompted me to jump onto my soapbox with an issue that I had never read about, but which occurs commonly in data mining problems such as response modeling and fraud detection.&lt;br /&gt;&lt;br /&gt;The setup goes something like this: you have 1% responders, you build models, and the model "says" every record is a 0. My explanation for this was always that errors in classification models take place when the same pattern of inputs can produce both outcomes. In this situation, what is the best guess? The most commonly occurring output variable value. If you have 99% 0s, that is most likely a 0, and therefore data mining tools will produce the answer "0". The common solution to this is to resample the data (stratify) so that one has equal numbers of 0s and 1s in the data, and then rebuild the model. While this is true, it misses an important factor.&lt;br /&gt;&lt;br /&gt;I can't claim credit for this (thanks Marie!). I was working on a consulting project with a statistician, and when we were building logistic regression models, I recommended resampling so we don't have the "model calls everything a 0" problem. She seemed puzzled by this, and asked why not threshold at the prior probability level. It was clear right away that this is true, and I've been doing it ever since (with logistic regression or neural networks in particular).&lt;br /&gt;&lt;br /&gt;What was she saying? First, it needs to be stated that no algorithm produces "decisions". 
Logistic regression produces probabilities. Neural networks produce confidence values (though I just had a conversation with one of the smartest machine learning guys I know who talked about neural networks producing true probabilities--maybe I'll blog on this more another time). The decisions that one sees ("all records are called 0s") are produced by the software, interpreting the probabilities or confidence values by thresholding them at 0.5. It is always 0.5. I don't think I've ever found a data mining software package that doesn't threshold at 0.5, in fact. So the software expects the prior probabilities of 0s and 1s to be equal. When they are not (like with 99% 0s and 1% 1s), this threshold is completely inappropriate; the center of density of the distribution of probabilities will center roughly on the prior probability level (0.01 for the 1% response rate problem). I show some examples of this in my data mining course that make this clearer.&lt;br /&gt;&lt;br /&gt;So what can one do? If one thresholds at 0.01 rather than 0.5, one gets a nice confusion matrix out of the classification problem. Of course if you use a ROC curve, Lift Chart or Gains Chart to assess your model, you don't worry about thresholding anyway. &lt;br /&gt;&lt;br /&gt;Which brings me to the conversation with Tim Manns. I'm glad he tried it out himself, though I don't think one has to make the target variable continuous to make this work. Tim did his testing in Clementine, but the same holds for any other data mining software tool. What Tim's trick does is correct: if you make the [0,1] target variable numeric, you can build a neural network just fine and the predicted value is "exposed". In Clementine, if you keep it as a "flag" variable, you would threshold the propensity value ($NRP-target). &lt;br /&gt;&lt;br /&gt;So, read Tim's post (and his other posts!). This trick can be used with nearly any tool--I've done it with Matlab and Tibco Spotfire Miner, among others. 
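&lt;br /&gt;&lt;br /&gt;To make the thresholding idea concrete, here is a minimal Python sketch. The scores and labels are made up for illustration (they do not come from any real model or from Tim's tests), and the names scored_records, prior_rate and confusion_counts are hypothetical. It assumes a 1% response rate and compares a cutoff of 0.5 with a cutoff at the proportion of 1s in the training data.

```python
# Illustrative only: synthetic scores for a 1% response-rate problem.
# None of these numbers come from a real model.

# 990 non-responders (label 0) with scores cycling from 0.00025 to 0.01475.
zero_scores = [(1 + 2 * (i % 30)) / 4000 for i in range(990)]
# 10 responders (label 1), mostly scored above the 1% base rate.
one_scores = [0.004, 0.015, 0.02, 0.03, 0.04, 0.05, 0.08, 0.12, 0.2, 0.3]

scored_records = [(s, 0) for s in zero_scores] + [(s, 1) for s in one_scores]


def prior_rate(records):
    """Proportion of 1s in the (training) data."""
    return sum(label for _, label in records) / len(records)


def confusion_counts(records, threshold):
    """Call a record 1 when its score is at or above the threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in records:
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["fn" if label == 1 else "tn"] += 1
    return counts


prior = prior_rate(scored_records)
print("prior:", prior)
print("threshold 0.5:  ", confusion_counts(scored_records, 0.5))
print("threshold prior:", confusion_counts(scored_records, prior))
```

With these made-up scores, the 0.5 cutoff labels all 1,000 records as 0, while the prior-based cutoff catches 9 of the 10 responders at the cost of 330 false alarms. Where to set the cutoff in practice depends on the relative cost of the two kinds of error.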
&lt;br /&gt;&lt;br /&gt;Now, if tools would only include an option to threshold the propensity at 0.5 or the prior probability (or more precisely, the proportion in the training data).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-69285083172218793?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://timmanns.blogspot.com/2009/11/building-neural-networks-on-unbalanced.html' title='Stratified Sampling vs. Posterior Probability Thresholds'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/69285083172218793/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=69285083172218793' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/69285083172218793'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/69285083172218793'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/11/stratified-sampling-vs-posterior.html' title='Stratified Sampling vs. 
Posterior Probability Thresholds'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3878175530662824353</id><published>2009-11-12T14:52:00.000-08:00</published><updated>2009-12-08T18:02:37.762-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>San Diego Forum on Analytics -- review</title><content type='html'>I just got back from the 1/2 day &lt;a href="http://www.sdsic.org/events/forum-on-analytics-2009.aspx"&gt;Forum on Analytics&lt;/a&gt; in San Diego, which included a keynote by Wayne Peacock (now with Inevit, but formerly VP of BI at Netflix), who spoke on how pervasive analytics was and is at Netflix, covering areas as diverse as finance, customer service, marketing, network optimization, operations, and product development. It was particularly interesting to me that as of 2006, their data warehouse was not in place, but instead they had a "data landfill" (term of the day for me!). The other quote from his talk that I found provocative was related to their web site, "If the web site doesn't go down once a year, we aren't pushing hard enough." However, this is changing somewhat because of their online content delivery, where websites going down have a much bigger downside!&lt;br /&gt;&lt;br /&gt;The rest of the morning contained 3 panel discussions, which was interesting in and of itself to see what topics were considered most important: Mining Biodata, Web 3.0, and Job Opportunities in Analytics. &lt;br /&gt;&lt;br /&gt;During the Biodata panel, Nancy Miller Latimer of Accelrys, Inc. 
mentioned in passing a software tool that they have developed to do essentially visual programming of biodata; it looks like the typical visual interface for data prep found in Clementine/Enterprise Miner/Tibco Spotfire Miner/Polyanalyst (and so many other tools, including Statistica and Weka), but their tool is specific for biodata, including loading technical papers, chemical structure data, etc. I've been fascinated for years by the relatively parallel paths taken by the bioinformatics/cheminformatics world and the data mining world: very similar ideas, but very different toolsets because of the very different characteristics of the data. Much was said about the future of sequencing of the human genome: 2 humans in 2007, 6+ in 2008, perhaps 150 in 2009 and growing exponentially (faster than Moore's law). There was talk of the $1000 human sequence soon.&lt;br /&gt;&lt;br /&gt;The Web 3.0 panel included 2 folks from Intuit touting a facebook campaign done to grow use of Turbotax virally. Interesting stuff, but I'm still dubious of the effect of social networking on all but the under 30 crowd.  I think I'll finally begin to tweet, but only out of curiosity, not because I expect anything of business value from it. Is it inevitable that Facebook, Twitter, and Youtube will become mainstream ways to develop business? For me? I don't see how yet. &lt;br /&gt;&lt;br /&gt;Lastly, on the analytics jobs in San Diego...there are over 100 analytics companies in San Diego (most of them undoubtedly small or micro, like me), and there was an evangelistic cry for San Diego to become an analytics cluster in the U.S. I think this is actually possible, and has been the case to some degree for some time now. I had forgotten about Keylime (a San Diego web company) being purchased by Yahoo, and Websidestory being purchased by Omniture. Of course Fair Isaac and HNC were discussed as well. 
Time will tell, and right now, things are tough all around, though Kanani Masterson of TriStaff Group said there were currently 225 analytics / web analytics job openings, so things aren't completely dead.&lt;br /&gt;&lt;br /&gt;All in all, it was a lot to pack into a morning.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3878175530662824353?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3878175530662824353/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3878175530662824353' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3878175530662824353'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3878175530662824353'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/11/san-diego-forum-on-analytics-review.html' title='San Diego Forum on Analytics -- review'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5271187554514151031</id><published>2009-10-28T19:49:00.000-07:00</published><updated>2009-12-08T18:02:59.918-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='uplift'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Predictive Analytics World, part 1</title><content type='html'>After attending &lt;a 
href="http://www.predictiveanalyticsworld.com"&gt;Predictive Analytics World (PAW) &lt;/a&gt;last week, I must say that I'm still impressed with the conference, especially for practitioners. &lt;br /&gt;&lt;br /&gt;Eric Siegel's description of &lt;a href="http://www.portraitsoftware.com/resources/white_papers/optimal-targeting-through-uplift-modeling"&gt;uplift modeling&lt;/a&gt; in the opening session was another example of a practical (and in this case, relatively new) approach to predictive modeling. I first heard about uplift modeling (to my discredit) at the February PAW, and almost had a company implement it this past summer were it not for a re-org that killed the modeling efforts. &lt;br /&gt;&lt;br /&gt;The R community had another strong showing, with REvolution being there, and another R &lt;a href="http://www.predictiveanalyticsworld.com/dc/2009/agenda.php#day1-22"&gt;useR meeting&lt;/a&gt;. I'm amazed at the influence of R in the data mining world. It makes me want to become fluent in R! It's on the list. &lt;br /&gt;&lt;br /&gt;The keynotes by Usama Fayyad and Stephen Baker were every bit as good as one would expect, but it was the interactions with attendees that impressed me most.  The talk I gave received great questions about the practice of using ensembles from several folks who were planning on using this technique with their own data. 
It's this practical side to the conference that I liked.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5271187554514151031?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5271187554514151031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5271187554514151031' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5271187554514151031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5271187554514151031'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/10/predictive-analytics-world-part-1.html' title='Predictive Analytics World, part 1'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8445369433272354721</id><published>2009-07-17T02:25:00.001-07:00</published><updated>2009-07-31T10:23:55.994-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='DIY'/><category scheme='http://www.blogger.com/atom/ns#' term='software'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>For Do-It-Yourself Types</title><content type='html'>Recently, I came across the Web site of &lt;a href="http://mloss.org/software/"&gt;mloss.org ("machine learning open source software")&lt;/a&gt;, which houses a collection of 
software components which will be of interest to inventive data miners.  Spanning a variety of languages and algorithm types, the collection can be filtered and searched from the Web site.  Good hunting!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8445369433272354721?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8445369433272354721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8445369433272354721' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8445369433272354721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8445369433272354721'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/07/for-do-it-yourself-types.html' title='For Do-It-Yourself Types'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4357221591381442330</id><published>2009-06-30T12:41:00.000-07:00</published><updated>2009-12-08T18:03:13.818-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining books'/><title type='text'>New Data Mining Book Out</title><content type='html'>The new Nisbet, Elder, and Miner book is out now, and has been receiving good reviews on Amazon. 
A sampling of the 6 reviews so far (all 5 stars):&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;The "Handbook of Statistical Analysis &amp; Data Mining Applications" is the finest book I have seen on the subject. It is not only a beautifully crafted book, with numerous color graphs, chart, tables, and screen shots, but the statistical discussion is both clear and comprehensive.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;This is an extraordinary book. So often within this field books are offered as bibles only to fall short. This book does not and delivers a wide array of information and useful tips for the beginner and veteran data miner. &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;What I like about this book is that it embeds those methods in a broader context, that of the philosophy and structure of data mining writ large, especially as the methods are used in the corporate world. To me, it was really helpful in thinking like a data miner, especially as it involves the mix of science and art. &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;This is one of the few, of many, data mining books that delivers what it promises. &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;It has a great mix of data mining principles with step-by-step solutions (case studies) using data mining software, such as Clementine, Enterprise Miner and Statistica. It is this practical approach to data mining that fills a void in the current selection of books in the marketplace (and there are many great data mining books out there). &lt;br /&gt;&lt;br /&gt;For some, the benefit of the book will be the case studies on Fraud Detection or Text Mining. For others, seeing how to solve problems using Enterprise Miner (or Clementine or Statistica) will be of most benefit, operating almost like a user's manual. I most appreciated the first chapter on the history of statistics (Nisbet), Model Complexity and Ensembles (Elder) and the 10 Data Mining Mistakes (Elder). 
&lt;br /&gt;&lt;br /&gt;One more quote, this from the second foreword in the book:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;This volume is not a theoretical treatment of the subject -- the authors themselves recommend other books for this -- but rather contains a description of data mining principles and techniques in a series of “knowledge-transfer” sessions, where examples from real data mining projects illustrate the main ideas.  This aspect of the book makes it most valuable for practitioners, whether novice or more experienced.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The Handbook of Statistical Analysis and Data Mining Applications is an exceptional book that should be on every data miner's bookshelf, or better yet, found lying open next to the computer.  &lt;br /&gt;&lt;br /&gt;-- Dean Abbott, Abbott Analytics&lt;br /&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4357221591381442330?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.amazon.com/dp/0123747651?tag=dataminiandpr-20&amp;camp=213381&amp;creative=390973&amp;linkCode=as4&amp;creativeASIN=0123747651&amp;adid=13S2XXD6H1RPJFT959D2&amp;' title='New Data Mining Book Out'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4357221591381442330/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4357221591381442330' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4357221591381442330'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4357221591381442330'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/06/new-data-mining-book-out.html' title='New Data Mining Book 
Out'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4072900012235547844</id><published>2009-05-18T12:13:00.000-07:00</published><updated>2009-12-08T18:03:41.432-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='jobs'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Is analytics a winner in a recession?</title><content type='html'>Even in a recession, analytics can (and should) do well. I am often asked how the economy has affected me, and my quick answer is that "it doesn't affect me", mostly because I am a small, sole proprietorship. In general, though, bad economic times can be good for consultants as corporations shed employees and look for a way to perform their analytics tasks efficiently without having to take on longer-term commitments.&lt;br /&gt;&lt;br /&gt;The way it is put in a recent &lt;a href="http://www.businessweek.com/technology/content/mar2009/tc2009032_101762.htm"&gt;Business Week article&lt;/a&gt; is this (they describe Business Intelligence software rather than data mining software, but the principles are certainly similar):&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Interest in business intelligence software is on the rise, analysts say, as economic woes force companies to pursue profit by delving deeper into the information already at their fingertips. "There's a tremendous pressure on cost containment, on developing accurate forecasts of sales and expenses and trying to align the expense stream with projected revenue stream," says John Van Decker, research vice-president at research firm Gartner (IT). 
&lt;/blockquote&gt;&lt;br /&gt;And where software is purchased, the training and consulting needed to understand how to use it usually cost many times the price of the software itself:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Add in other essential services, and a company can expect to spend more on BI than for other types of software, Evelson says. "For every dollar you spend on business intelligence software, you better expect to spend five to seven times as much on services," such as ensuring it jells with the rest of the company's software, he says. &lt;/blockquote&gt;&lt;br /&gt;But even with software, unless there is clear thinking about the problems that need to be solved, and which ones can be solved realistically (or impacted) with analytics, the software will just sit, doing nothing useful. This is surely a factor in the divide between potential capabilities in analytics (i.e., software on the shelf) and benefits attained by analytics:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Still, about two-thirds of large U.S. companies believe they need to improve their analytical capabilities and only half believe they are spending enough on business analytics, according to an Accenture (ACN) survey of 250 executives that was released in December. In it, about 57% of companies said they don't have a beneficial, consistently updated, companywide analytical capability, and 72% are working to increase their company's use of business analytics. Today, only 60% of major decisions are based on analytics, according to the survey, while 40% are based on intuition. 
&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://images.despair.com/products/demotivators/consulting.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 402px; height: 337px;" src="http://images.despair.com/products/demotivators/consulting.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The better consultants work themselves out of jobs, rather than &lt;a href="http://despair.com/consulting.html"&gt;perpetuating the problems&lt;/a&gt;. (check out despair.com for tons of hilarious posters).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Just more information that these are good times for data mining.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4072900012235547844?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.businessweek.com/technology/content/mar2009/tc2009032_101762.htm' title='Is analytics a winner in a recession?'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4072900012235547844/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4072900012235547844' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4072900012235547844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4072900012235547844'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/05/is-analytics-winner-in-recession.html' title='Is analytics a winner in a recession?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5861702194488484094</id><published>2009-04-25T04:02:00.000-07:00</published><updated>2009-12-08T18:04:47.974-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='model assessment'/><title type='text'>Taking Assumptions With A Grain Of Salt</title><content type='html'>Occasionally, I come across descriptions of clustering or modeling techniques which include mention of "assumptions" being made by the algorithm.  The "assumption" of normal errors from the linear model in least-squares regression is a good example.  The "assumption" of Gaussian-distributed classes in discriminant analysis is another.  I imagine that such assertions must leave novices with some questions and hesitation.  What happens if these assumptions are not met?  Can techniques ever be used if their assumptions are not tested and met?  How badly can the assumption be broken before things go horribly wrong?  It is important to understand the implications of these assumptions, and how they affect analysis.&lt;br /&gt;&lt;br /&gt;In fact, the assumptions being made are made by the theorist who designed the algorithm, not the algorithm itself.  Most often, such assumptions are necessary for some proof of optimality to hold.  Considering myself the practical sort, I do not worry too much about these assumptions.  What matters to me and my clients is how well the model works in practice (which can be assessed via test data), not how well its assumptions are met.  Generally, such assumptions are rarely, if ever, strictly met in practice, and most of these algorithms do reasonably well even under such circumstances.  
A particular modeling algorithm may well be the best one available, despite not having its assumptions met.&lt;br /&gt;&lt;br /&gt;My advice is to be aware of these assumptions to better understand the behavior of the algorithms one is using.  Evaluate the performance of a specific modeling technique, not by looking back to its assumptions, but by looking forward to expected behavior, as indicated by rigorous out-of-sample and out-of-time testing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5861702194488484094?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5861702194488484094/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5861702194488484094' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5861702194488484094'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5861702194488484094'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/04/taking-assumptions-with-grain-of-salt.html' title='Taking Assumptions With A Grain Of Salt'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-2930356489729696620</id><published>2009-04-02T21:40:00.000-07:00</published><updated>2009-04-02T22:13:16.830-07:00</updated><title type='text'>Why normalization matters with 
K-Means</title><content type='html'>A question about K-means clustering in Clementine was posted &lt;a href="http://www.kdkeys.net/forums/thread/8777.aspx"&gt;here&lt;/a&gt;. I thought I knew the answer, but took the opportunity to prove it to myself.&lt;br /&gt;&lt;br /&gt;I took the &lt;a href="http://www.sigkdd.org/kddcup/index.php?section=1998&amp;amp;method=data"&gt;KDD-Cup 98 data&lt;/a&gt; and just looked at four fields: Age, NumChild, TARGET_D (the amount the recaptured lapsed donors gave) and LASTGIFT. I took only four to make the problem simpler, and chose variables that had relatively large differences in mean values (where normalization might matter). Also, another problem with the two monetary variables is that they are both skewed positively (severely so).&lt;br /&gt;&lt;br /&gt;The following image shows the results of two clustering runs: the first with raw data, the second with normalized data using the Clementine K-Means algorithm. The normalization consisted of log transforms (for TARGET_D and LASTGIFT) and z-scores for all (the log transformed fields, AGE and NUMCHILD). I used the default of 5 clusters.&lt;br /&gt;&lt;br /&gt;Here are the results in tabular form. Note that I'm reporting unnormalized values for the "normalized" clusters even though the actual clusters were formed by the normalized values. 
This is purely for comparative purposes.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_ehD0fJcPAoI/SdWYgvYSMEI/AAAAAAAAADY/CqGv7GSqhhY/s1600-h/Slide6.JPG"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 300px;" src="http://4.bp.blogspot.com/_ehD0fJcPAoI/SdWYgvYSMEI/AAAAAAAAADY/CqGv7GSqhhY/s400/Slide6.JPG" alt="" id="BLOGGER_PHOTO_ID_5320326223049666626" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Note that:&lt;br /&gt;1) the results are different, as measured by counts in each cluster&lt;br /&gt;2) the unnormalized clusters are dominated by TARGET_D and LASTGIFT--one cluster contains the large values and the remaining have little variance.&lt;br /&gt;3) AGE and NUMCHILD have some similar breakouts (40s with more children and 40s with fewer children for example).&lt;br /&gt;&lt;br /&gt;So, the conclusion (to answer the original question) is that K-Means in Clementine does not normalize the data. Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. However, whether or not one desires this removal of bias depends on what one wants to find: if one wants a variable to influence the clusters more, one can manipulate the clusters in precisely this way, by increasing the relative magnitude of that field.&lt;br /&gt;&lt;br /&gt;One last issue that I didn't explore here is the effect of correlated variables (LASTGIFT and TARGET_D to some degree here). 
It seems to me that correlated variables will artificially bias the clusters toward natural groupings of those variables, though I have never proved the extent of this bias in a controlled way (maybe someone can point to a paper that shows this clearly).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-2930356489729696620?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/2930356489729696620/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=2930356489729696620' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2930356489729696620'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2930356489729696620'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/04/why-normalization-matters-with-k-means.html' title='Why normalization matters with K-Means'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_ehD0fJcPAoI/SdWYgvYSMEI/AAAAAAAAADY/CqGv7GSqhhY/s72-c/Slide6.JPG' height='72' width='72'/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4417850192645533745</id><published>2009-04-01T14:16:00.000-07:00</published><updated>2009-04-01T14:30:12.223-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphs'/><category scheme='http://www.blogger.com/atom/ns#' term='graphics'/><category 
scheme='http://www.blogger.com/atom/ns#' term='graph'/><category scheme='http://www.blogger.com/atom/ns#' term='graphing'/><title type='text'>Graphing Considered Dangerous</title><content type='html'>In my posting of Jun-25-2007, &lt;a href="http://abbottanalytics.blogspot.com/2007/06/to-graph-or-not-to-graph.html"&gt;To Graph Or Not To Graph &lt;/a&gt;, I made the case (tentatively) that graphs weren't all they're cracked up to be, and provoked some lively discussion in the Comments section here.  In his Apr-01-2009 posting, &lt;a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2009/04/why_tables_are.html"&gt;Why tables are really much better than graphs&lt;/a&gt; on the &lt;a href="http://www.stat.columbia.edu/~cook/movabletype/mlm/"&gt;Statistical Modeling, Causal Inference, and Social Science&lt;/a&gt; Web log, Andrew Gelman makes a much more forceful case against graphs.  Readers may find Gelman's arguments of interest.&lt;br /&gt;&lt;br /&gt;I am not "anti-graph", but do think that graphs are often used when other tools (test statistics, tables, etc.) would have been a better choice, and graphs are certainly frequently misused.  
Thoughts?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4417850192645533745?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4417850192645533745/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4417850192645533745' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4417850192645533745'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4417850192645533745'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/04/graphing-considered-dangerous.html' title='Graphing Considered Dangerous'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4436080699545241771</id><published>2009-03-19T21:00:00.000-07:00</published><updated>2009-12-29T07:58:55.843-08:00</updated><title type='text'>How many software packages are too much?</title><content type='html'>I just saw a question at &lt;a href="http://smartdatacollective.com/Home/17243"&gt;SmartDataCollective &lt;/a&gt;about how many data mining packages one needs. He writes,&lt;br /&gt;&lt;blockquote&gt;we found out that a particular client is using THREE Data Mining softwares. 
Not statistical softwares or the base versions, but the complete, very expensive Data Mining softwares – SAS EM, SPSS Clementine and KXEN.&lt;p&gt;I was like, “Wow!!! But do you really need 3 Data Mining softwares???” Our initial questions and the client’s answers confirmed that inconsistent data formats was not the reason as the client already has a BI/DW system. Their reason? Well, they have the opinion that some algorithms/techniques in a particular DM software is much better and accurate than the same algorithms/techniques in another DM software.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;/p&gt;I believe there are truly good reasons to have more than one data mining software package. Each tool has its own strengths and weaknesses. As one example, Affinium Model is very good at building hundreds or even thousands of models automatically, whereas Tibco S+ (formerly Insightful Miner) only builds one model at a time. On the other hand, the flexibility of Miner in data preparation, sampling, and settings for building models is much richer than that of Model. I like to have several tools around for these kinds of reasons.&lt;br /&gt;&lt;br /&gt;A second reason to have (or to be proficient in) multiple tools as an analytics consultant is that you can plug into nearly any organization if they have tools they want you to use. Currently, I'm working on projects that are using Clementine, Matlab, Statistica, and Insightful Miner. Last year I worked with a customer that was using CART (Salford Systems) and Oracle Data Miner, Polyanalyst, and even briefly IBM Intelligent Miner.&lt;br /&gt;&lt;br /&gt;However, except for very rare circumstances, the algorithms themselves are not appreciably different from tool to tool. Yes, I know that some tools have extra knobs and options, but backprop is backprop, the Gini index is the Gini index, Entropy is Entropy. 
The only reason I would have both KXEN and SAS/EM or Clementine is if I wanted the automation of KXEN sometimes, and the full control of EM or Clementine at other times (it is hard for me to imagine why I would want both Clementine and EM--any takers on this one?).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4436080699545241771?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4436080699545241771/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4436080699545241771' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4436080699545241771'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4436080699545241771'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/03/how-many-software-packages-is-too-much.html' title='How many software packages are too much?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1213969469884738655</id><published>2009-03-16T16:23:00.000-07:00</published><updated>2009-12-08T18:05:02.229-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>eMetrics Conference</title><content type='html'>Early-bird pricing ends Friday for the May 4-7 &lt;a href="http://www.emetrics.org/sanjose"&gt;eMetrics&lt;/a&gt; conference in San Jose. 
You get a 12% discount if you use the promo code ABBOTT12 (don't worry, I don't get anything except the satisfaction that a reader of this blog got a discount). I can't go, but hope to get to one before too long.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1213969469884738655?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1213969469884738655/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1213969469884738655' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1213969469884738655'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1213969469884738655'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/03/emetrics-conference.html' title='eMetrics Conference'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5397569854567911931</id><published>2009-03-16T08:39:00.000-07:00</published><updated>2009-12-08T18:05:16.546-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='webinars'/><title type='text'>Predictive Analytics Webinar</title><content type='html'>I'm participating in a &lt;a href="http://www.the-modeling-agency.com/webinar/index.html"&gt;free webinar&lt;/a&gt; through The Modeling Agency tomorrow at 4pm EDT (1pm PDT) for anyone 
interested in listening in. Tony Rathburn is doing the first technical part, and I follow with about 20 minutes of vignettes. If you do listen in, feel free to post comments here on the content (all critiques welcomed!) We'll repeat the webinar on April 7th and April 22nd.&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5397569854567911931?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5397569854567911931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5397569854567911931' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5397569854567911931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5397569854567911931'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/03/predictive-analytics-webinar.html' title='Predictive Analytics Webinar'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5624122549325637062</id><published>2009-03-08T10:25:00.000-07:00</published><updated>2009-03-08T10:42:23.055-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='whimsical'/><title type='text'>Some Interesting Analyses</title><content type='html'>I find it interesting to learn what other people are working on.  
To me, the applications can be as interesting as the technology- even if they're not saving millions or curing cancer.  Some of these analyses could be a bit more rigorous, but they do suggest avenues for further research, and at least they aren't boring!  Here are a few things I've run across in cyberspace recently:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://confoundingblog.wordpress.com/2009/02/26/is-warhammer-balanced/"&gt;Is Warhammer Balanced?&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://baseballanalysts.com/archives/2009/03/20062008_payrol.php"&gt;MLB Payroll Efficiency, 2006-2008&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.wired.com/special_multimedia/2009/st_infoporn_1702"&gt;Wired magazine: issue 17.02&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://blog.dotphys.net/2009/02/the-price-of-a-piece-of-lego/"&gt;Analysis of the price of a piece of a lego set&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://dberri.wordpress.com/2009/03/05/modeling-win-probability-for-a-college-basketball-game-a-guest-post-from-brian-burke/"&gt;Modeling Win Probability for a College Basketball Game&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5624122549325637062?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5624122549325637062/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5624122549325637062' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5624122549325637062'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5624122549325637062'/><link rel='alternate' type='text/html' 
href='http://abbottanalytics.blogspot.com/2009/03/some-interesting-analyses.html' title='Some Interesting Analyses'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4790290022313650156</id><published>2009-03-07T03:31:00.000-08:00</published><updated>2009-03-07T03:38:29.266-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='careers'/><category scheme='http://www.blogger.com/atom/ns#' term='jobs'/><category scheme='http://www.blogger.com/atom/ns#' term='job'/><category scheme='http://www.blogger.com/atom/ns#' term='career'/><title type='text'>Data Mining: Does It Get Any Better Than This?</title><content type='html'>The article &lt;a href="http://online.wsj.com/article/SB123119236117055127.html"&gt;Doing the Math to Find the Good Jobs &lt;/a&gt; appeared in the Jan-26-2009 issue of &lt;i&gt;The Wall Street Journal&lt;/i&gt;, listing the top 3 "best" jobs (of 200 studied) as:&lt;br /&gt;&lt;br /&gt;1. Mathematician&lt;br /&gt;2. Actuary&lt;br /&gt;3. 
Statistician&lt;br /&gt;&lt;br /&gt;I assume that "data miner" fits in somewhere among these, yippee!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4790290022313650156?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4790290022313650156/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4790290022313650156' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4790290022313650156'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4790290022313650156'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/03/data-mining-does-it-get-any-better-than.html' title='Data Mining: Does It Get Any Better Than This?'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-592742523226930537</id><published>2009-02-17T06:21:00.000-08:00</published><updated>2009-02-17T14:02:38.374-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining education'/><title type='text'>Maybe these will be great days for data miners!</title><content type='html'>While perusing the &lt;a href="http://analytics.ncsu.edu/"&gt;NC State Institute for Advanced Analytics site&lt;/a&gt; (to follow up on the previous post on data mining education), I noticed a link to US News and World 
Report's career guides, one of which describes how data mining is an "&lt;a href="http://www.usnews.com/articles/business/best-careers/2008/12/04/ahead-of-the-curve-careers-2008.html"&gt;ahead of the curve&lt;/a&gt;" career for 2009. While the example mentioned is quite limited, it is interesting that data mining is getting such national recognition. Maybe we're in the right industry after all!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-592742523226930537?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/592742523226930537/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=592742523226930537' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/592742523226930537'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/592742523226930537'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/02/maybe-these-will-be-great-days-for-data.html' title='Maybe these will be great days for data miners!'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3256281535452803461</id><published>2009-02-14T21:21:00.000-08:00</published><updated>2009-02-14T21:33:18.653-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='model selection'/><category scheme='http://www.blogger.com/atom/ns#' term='business understanding'/><title 
type='text'>Could these be great days for data miners?</title><content type='html'>In a recent article on cfo.com, &lt;a href="http://www.cfo.com/article.cfm/13129558/?f=rsspage"&gt;Data Mining in the Meltdown: the Last, Best Hope?&lt;/a&gt; the author describes how data quality is the key to future success of businesses.  But data quality by itself is not enough,&lt;br /&gt;&lt;blockquote&gt;Of course, data quality matters little if a company is focusing on the wrong measures. The best companies adopt a customer-oriented definition of data quality and recognize that all items of data are not created equal...&lt;/blockquote&gt;In other words, the business objective phase (in the CRISP-DM way of viewing things) is critical. I would add that building models that are assessed in a manner commensurate with the business objective is every bit as important. If you build a series of regression models and take the one with the best R^2, you have very little idea from that metric whether or not the model will do anything productive. One must score and assess the model to reflect the business objective.&lt;br /&gt;&lt;br /&gt;The author gets at this idea indirectly with this comment:&lt;br /&gt;&lt;blockquote&gt;For every key performance indicator (KPI), for example, companies should be tracking a key risk indicator (KRI), Friend says. "You plan not just for results, but for contingencies. What happens if sales are down 20 percent?"&lt;br /&gt;&lt;/blockquote&gt;In other words, there may just be significant asymmetric costs to incorporate in the scoring of models. 
I'll be bringing this up at Predictive Analytics World this week; it is arguably one of the biggest mistakes made by modelers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3256281535452803461?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3256281535452803461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3256281535452803461' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3256281535452803461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3256281535452803461'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/02/could-these-be-great-days-for-data.html' title='Could these be great days for data miners?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4198595875934360590</id><published>2009-02-10T15:01:00.000-08:00</published><updated>2009-02-10T15:32:15.543-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining degree'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining training'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining books'/><title type='text'>Can you learn data mining in undergraduate or graduate school?</title><content type='html'>I was recently asked by a former student from one of my data mining courses if a particular program was a 
good one to learn data mining (it happened to be &lt;a href="http://analytics.ncsu.edu/?page_id=123"&gt;this one&lt;/a&gt;, from NC State). It raises an interesting question: how much can data mining be learned from a book or a course?&lt;br /&gt;&lt;br /&gt;Some of the best data miners I have met did not have any statistics course in their past, nor (for some) any higher level mathematics. For my part, I was a computational mathematics major undergrad, and applied math for my masters, but never took a stats course either (though I did take and TA a probability course). That stated, I always recommend in &lt;a href="http://abbottanalytics.com/data-mining-courses-and-seminars.php"&gt;my courses&lt;/a&gt; that folks become familiar with basic statistics; one book I have recommended is linked in the book recommendations section--The Cartoon Guide to Statistics. Since I have never taken a college or graduate data mining course, I can't comment directly. My concern is that they are too theoretical (how the algorithms work) rather than practical (how to handle data problems, how to pose proper questions to be addressed by data mining, etc.).&lt;br /&gt;&lt;br /&gt;I'm willing to be persuaded though, so if you have experience with good, practical data mining curricula, please let me know.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4198595875934360590?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4198595875934360590/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4198595875934360590' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4198595875934360590'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5652924/posts/default/4198595875934360590'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/02/can-you-learn-data-mining-in.html' title='Can you learn data mining in undergraduate or graduate school?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7600889891358320454</id><published>2009-01-31T20:36:00.000-08:00</published><updated>2009-02-10T15:27:46.618-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining conferences'/><title type='text'>Predictive Analytics World</title><content type='html'>There is a new predictive analytics conference coming up Feb 18-19 in San Francisco called &lt;a href="http://www.predictiveanalyticsworld.com/"&gt;Predictive Analytics World&lt;/a&gt;. I'm very much looking forward to it in the hopes that it will appeal to the data mining / predictive analytics practitioner.&lt;br /&gt;&lt;br /&gt;I'll be presenting a case study I worked on with &lt;a href="http://tnmarketing.com/"&gt;TN Marketing&lt;/a&gt; using ensembles of logistic regression models. 
Also, I'll be on a &lt;a href="http://www.predictiveanalyticsworld.com/agenda.php#expert"&gt;panel discussion&lt;/a&gt; on Cross-Industry Challenges and Solutions in Predictive Analytics.&lt;br /&gt;&lt;br /&gt;Hope to see some of you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7600889891358320454?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7600889891358320454/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7600889891358320454' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7600889891358320454'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7600889891358320454'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/01/predictive-analytics-world.html' title='Predictive Analytics World'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3917567966283746326</id><published>2009-01-31T20:24:00.000-08:00</published><updated>2009-02-10T15:28:02.832-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='regular expressions'/><title type='text'>Text Mining and Regular Expressions</title><content type='html'>I've been spending quite a lot of time in the bowels of a text mining project recently, mostly in the text/concept extraction 
phase. We're using the SPSS Text Mining tool for the work so far. (As a quick aside, the text mining book I've enjoyed reading the most in recent months is the &lt;a href="http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1233462271&amp;amp;sr=8-1"&gt;Weiss, Indurkhya, Zhang, and Damerau&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;The most difficult part of the project has been that all of the text is really customized lingo--a language of its own as presented in the notes sections of the documents we are reading. Therefore, we can't use the typical linguistic extraction techniques, and rather are relying heavily on regular expressions. That certainly takes me back a few years! I used to use regular expressions mostly in shell programming (Bourne, CShell, Korn Shell and later BASH).&lt;br /&gt;&lt;br /&gt;I must say it has been very productive, though it also makes me appreciate language rules that don't exist in any consistent way with our notes. As I am able, I'll post more specifics on this project.&lt;br /&gt;&lt;br /&gt;Regarding books on regular expressions, I found the unix books weren't quite so good on this topic. 
However, the O'Reilly &lt;a href="http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1233462898&amp;amp;sr=1-1"&gt;Mastering Regular Expressions&lt;/a&gt; book is quite good.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3917567966283746326?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3917567966283746326/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3917567966283746326' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3917567966283746326'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3917567966283746326'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2009/01/text-mining-and-regular-expressions.html' title='Text Mining and Regular Expressions'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-9022994062110319607</id><published>2008-11-22T23:02:00.001-08:00</published><updated>2011-10-15T20:10:53.530-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='predictive analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>What is Predictive Analytics?</title><content type='html'>I just saw this link about the difference between BI and Predictive Analytics. 
This comes on the heels of a meeting I had with UCSD Extension folks, talking about predictive analytics and data mining in the context of teaching courses for professionals, and this topic came up: how is predictive analytics different from BI?&lt;br /&gt;&lt;br /&gt;First, I'd like to applaud the author,  &lt;a href="http://it.toolbox.com/people/vlad_0424/"&gt;Vladimir Stojanovski&lt;/a&gt;, for concluding there are differences, and for trying to get at what those differences are.&lt;br /&gt;&lt;br /&gt;The article states this:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;To tie this all back to the question of BI vs. Predictive Analytics (PA), a metaphor I've heard used to describe the difference goes something like this: if BI is a look in the rearview mirror, predictive analytics is the view out the windshield.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;In my experience, this is a common definition. Predictive Analytics and Data Mining are seen as predicting future events, whereas OLAP looks at past data.&lt;br /&gt;&lt;br /&gt;While I'd love to jump on this bandwagon because it makes for a simple and compelling story, I cannot ride this one. And that's because both BI and PA look at historic data. PA isn't magic in coming up with predictions of the future. In fact, both BI and PA ultimately look at and use the same data (or variations of the same historic data). Both can predict the future, so long as the future is consistent with the past, either in a static sense, or in a dynamic sense (by extrapolating past data into the future).&lt;br /&gt;&lt;br /&gt;I think it is better to describe the difference in this way: BI reports on historical data based upon an analyst's perspective on which fields and statistics are interesting, whereas PA induces which fields, statistics and relationships are interesting from the data itself. 
I think it is the combinatorics, sifting, iterative nature of PA that gives it better predictive accuracy of the future (coupled with using business metrics to assess if the fields found truly are predictive or not).&lt;br /&gt;&lt;br /&gt;So let's not oversell--what PA does is reason enough for it to be an integral part of any analytics or BI group.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-9022994062110319607?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://it.toolbox.com/blogs/crm-realms/business-intelligence-and-predictive-analytics-28373' title='What is Predictive Analytics?'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/9022994062110319607/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=9022994062110319607' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9022994062110319607'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9022994062110319607'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/11/what-is-predictive-analytics.html' title='What is Predictive Analytics?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8226961811932609724</id><published>2008-10-20T14:28:00.000-07:00</published><updated>2009-02-10T15:28:51.082-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='KDD'/><category 
scheme='http://www.blogger.com/atom/ns#' term='data mining conferences'/><title type='text'>What topics would you like to see covered at a KDD conference?</title><content type='html'>This is your chance to voice your opinion!&lt;br /&gt;&lt;br /&gt;What topics, sessions, or tutorials would be most useful for you at a conference like KDD? Would a full industrial track be of interest, or are industries so diverse that we really need tracks to be narrowed to specific industries?&lt;br /&gt;&lt;br /&gt;Please--practitioners only. I'm defining practitioners as those who get paid to develop models that are actually used in industry.&lt;br /&gt;&lt;br /&gt;I'll kick it off with one idea:&lt;br /&gt;&lt;br /&gt;Tutorials (1/2 day) geared toward the practitioner. This means that if techniques are described (such as social networking), there must be implementations of the algorithmic ideas available in competitive commercial software. As great as R and Matlab are, for example, relatively few practitioners are programmers that can take advantage of these kinds of frameworks.&lt;br /&gt;&lt;br /&gt;I know there are tutorials at KDD every year. This year I didn't go because they were all on Sunday and I wasn't able to attend then, but would have wanted to go to the Text Mining tutorial as that is a topic that has become a significant part of my business over the past couple of years.&lt;br /&gt;&lt;br /&gt;One last thought: I think one thing that may happen (understandably) is that topics that have been covered in years past are not revisited. For those of us who live in the data mining world, it is far more interesting to continue to explore new ideas, especially those that build on ideas we have already explored in depth. However, as data mining increases in its use, we are bringing folks in who have not had that same benefit. 
For many, a tutorial on decision trees would be very useful and interesting (like the KDD 2001 tutorial--trees to my knowledge have not been revisited since except in the framework of ensembles in 2007).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8226961811932609724?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8226961811932609724/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8226961811932609724' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8226961811932609724'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8226961811932609724'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/10/what-topics-would-you-like-to-see.html' title='What topics would you like to see covered at a KDD conference?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-453378774088697036</id><published>2008-10-09T05:35:00.000-07:00</published><updated>2008-10-09T06:32:26.346-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='books'/><title type='text'>Two Books of Interest</title><content type='html'>Recently, I have been reading two books which may be of interest to data miners, &lt;span style="font-style:italic;"&gt;Statistical Rules of Thumb&lt;/span&gt; by Gerald Van Belle (ISBN-13: 978-0471402275) and &lt;span 
style="font-style:italic;"&gt;Common Errors in Statistics (and How to Avoid Them)&lt;/span&gt;, by Phillip I. Good and James W. Hardin (ISBN-13: 978-0471794318).  Both impart practical advice based on extensive experience and statistical rigor, yet avoid becoming hung up on academic issues.&lt;br /&gt;&lt;br /&gt;While both are written from the point of view of traditional statisticians, they do suggest the use of some less traditional techniques, such as the bootstrap and robust regression.  A wide range of topics is covered, such as sample size determination, hypothesis testing and treatment of missing values.  Both books also include some material written for audiences working in specific fields, such as environmental science and epidemiology.  Material in these two books will vary in applicability to data mining, given the traditional statistical focus on smaller data sets and parametric modeling.&lt;br /&gt;&lt;br /&gt;I highly recommend both of them.  Tables of contents can easily be found on-line, and an entire chapter of &lt;span style="font-style:italic;"&gt;Statistical Rules of Thumb&lt;/span&gt; is available at: &lt;a href="http://vanbelle.org/chapters%5Cwebchapter2.pdf"&gt;Chapter 2: Sample Size&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-453378774088697036?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/453378774088697036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=453378774088697036' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/453378774088697036'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5652924/posts/default/453378774088697036'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/10/two-books-of-interest.html' title='Two Books of Interest'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4744604239232743143</id><published>2008-09-25T14:47:00.000-07:00</published><updated>2009-02-10T15:29:46.804-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='KDD'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining conferences'/><title type='text'>KDD 2008</title><content type='html'>It's hard to believe that KDD2008 was the first KDD I've attended in seven years. It was striking how much has changed in that time, and that was one of the primary reasons I attended this past year--to see for myself if the reports I've heard are true. Sure enough, they are.&lt;br /&gt;&lt;br /&gt;These reports, primarily from colleagues in industry, were that KDD didn't have anything they could "take home and use". Many of these folks are analysts who are decidedly not academic, so I thought I had a sense for what they meant.&lt;br /&gt;&lt;br /&gt;I found their reports hit the mark. Seven years ago I was able to find (1) significant numbers of industry personnel at the conference and (2) many talks that were accessible enough for non-academics to understand. This time around there were few industry practitioners I met who were not PhDs. That's not to say there weren't interesting talks. 
Two I didn't see in person but read later were the &lt;a href="http://kdd2008.com/papers.html"&gt;Elkan&lt;/a&gt; paper on learning from positive and unlabelled examples and the &lt;a href="http://kdd2008.com/papers.html"&gt;Grossman&lt;/a&gt; paper on Data Clouds.  Thought-provoking both. The lunch talk by Trevor Hastie was very interesting in talking about regularization, but it was geared toward those who can digest his &lt;a href="http://www.amazon.com/Elements-Statistical-Learning-T-Hastie/dp/0387952845/ref=pd_bbs_2?ie=UTF8&amp;amp;s=books&amp;amp;qid=1222380759&amp;amp;sr=8-2"&gt;textbook&lt;/a&gt; (which is among the finest data mining / statistical learning texts out there). &lt;br /&gt;&lt;br /&gt;Social networking was a key theme of the conference, and such a dominant one that it deserves a separate post.&lt;br /&gt;&lt;br /&gt;Lastly, the decline in participation by the business community was nowhere more evident than in the vendors room--only a few data mining software vendors were there, which indicates to me that it isn't viewed as a place to increase sales: if I remember correctly, only Microsoft, Oracle, Statsoft, Salford Systems, and SAS were there. A quick look at the &lt;a href="http://www.kdnuggets.com/polls/2008/data-mining-software-tools-used.htm"&gt;kdnuggets software survey&lt;/a&gt; shows who wasn't there.&lt;br /&gt;&lt;br /&gt;So it seems that KDD has wandered from a business/academic mix to a more academic conference, which is, of course, the prerogative of the organizers. 
I'm still searching for a great conference for the data mining practitioner who has the level of understanding of data mining to read and absorb a book like the &lt;a href="http://www.amazon.com/Data-Mining-Practical-Techniques-Management/dp/0120884070/ref=sr_1_1?ie=UTF8&amp;amp;s=books&amp;amp;qid=1222380978&amp;amp;sr=1-1"&gt;Witten/Frank machine learning book&lt;/a&gt; but desires a more practical approach to the subject.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4744604239232743143?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4744604239232743143/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4744604239232743143' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4744604239232743143'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4744604239232743143'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/09/kdd-2008.html' title='KDD 2008'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1087745933366999374</id><published>2008-05-28T06:57:00.000-07:00</published><updated>2009-02-10T15:30:05.019-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining software'/><title type='text'>What data mining software to buy?</title><content type='html'>This post (&lt;a 
href="http://www.dmreview.com/issues/2007_46/10001040-1.html?portal=analytics"&gt;http://www.dmreview.com/issues/2007_46/10001040-1.html?portal=analytics&lt;/a&gt;) is an interesting example of the assessment of analytics software.  The key paragraph is the conclusion where Mr. Raab states&lt;br /&gt;&lt;blockquote&gt;Instead of a horserace between product features, this approach puts the focus where it should be: on value to your business. It recognizes that the value of a new tool depends on the other tools already available, and it forces evaluation teams to explicitly study the impact of different tools on different users. By creating a clearer picture of how each new tool will impact the way work actually gets done within the company, it leads to more realistic product assessments and ultimately to more productive selection choices.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I couldn't agree more. For the past 10 years, since the Elder and Abbott review of data mining software presented at KDD-98 (&lt;a href="http://www.abbottanalytics.com/assets/zip/Abbott-Analytics-Comparison-High-End-DM-Tools-1998.zip"&gt;on my web site&lt;/a&gt;)  I've tried to think of ways to summarize data mining software. The obvious way is by features, such as which algorithms a product has. The usability of a tool is another characteristic to add, as John, Philip Matkovsky and I wrote about in &lt;a href="http://www.abbottanalytics.com/assets/pdf/Abbott-Analytics-Evaluation-High-End-DM-Tools-1998.pdf"&gt;"An Evaluation of High-End Data Mining Tools for Fraud Detection"&lt;/a&gt;. I've also described  the different packages by the kind of interface (wizard, menu-driven, block-diagram, command line, etc.).&lt;br /&gt;&lt;br /&gt;It's not easy to provide a summary in this multi-dimensional view of data mining tools. 
Sounds like an opportunity for predictive modeling!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1087745933366999374?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1087745933366999374/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1087745933366999374' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1087745933366999374'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1087745933366999374'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/05/what-data-mining-software-to-buy.html' title='What data mining software to buy?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1294034520896173907</id><published>2008-05-26T21:34:00.000-07:00</published><updated>2009-02-10T15:30:36.688-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining perceptions'/><title type='text'>What Makes a Data Mining Skeptic?</title><content type='html'>I just found this post expressing skepticism about data mining (I'll let go the comment about predictive analytics being the holy grail of data mining--not sure what this means).&lt;br /&gt;&lt;br /&gt;The fascinating part for me was this paragraph:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Anyway. Lindy and I were a bit squirmy through the whole discussion. 
It seemed like so many hopes and dreams were being placed at the altar of the goddess Clementine... but I had to ask myself, could you REALLY get any more analysis out of it then you could get simply by asking your members what events they attend, plan to attend, ever attended, or might attend in the future, and why? Since when did we stop talking to our members about this stuff? A good internal marketing manager could give you all the answers you seek about which of your various audiences are likely to respond to which of your messages, who's going to engage with you, why and when, who's going to participate in which of your events, etcetera, and they would know these answers not through stats and charts (even if you ask for them) but through experience and listening.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;It is interesting on several fronts. First, there is a strong emphasis on personal expertise and experience. But at the heart of the critique is apparently a belief that the data cannot reveal insights, or in other words, a data-driven approach doesn't give you any "analysis". Why would one believe this? (and I do not doubt the sincerity of the comment--I take it at face value).&lt;br /&gt;&lt;br /&gt;One reason may be that this individual has never seen or experienced a predictive analytics solution. While this may be true, it also misses what I think is at the heart of the critique. There is a false dichotomy set up here between data analysis and individual expertise. Anyone who has built predictive models successfully knows that one usually must have both: expert knowledge and representative data (to build predictive models).&lt;br /&gt;&lt;br /&gt;While there are undoubtedly some individuals who can "give you all the answers you seek about which of your various audiences are likely to respond to which of your messages", 
relying on them alone usually falls short for two reasons:&lt;br /&gt;1) most individuals who have to deal with large quantities of data don't know as much as they think they know, and related to this&lt;br /&gt;2) it is difficult to impossible for anyone to sort through all the data with all of the permutations that exist.&lt;br /&gt;&lt;br /&gt;Data mining usually doesn't tell us things that experts scratch their heads at in amazement. It usually confirms what one suspects (or one of many possible conclusions one may have suspected), but with a few unexpected twists.&lt;br /&gt;&lt;br /&gt;So how can we persuade others that there is value in data mining? The first step is realizing there is value in the data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1294034520896173907?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.diaryofareluctantblogger.com/2008/05/data-mine-or-data-black-hole.html' title='What Makes a Data Mining Skeptic?'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1294034520896173907/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1294034520896173907' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1294034520896173907'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1294034520896173907'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/05/what-makes-data-mining-skeptic.html' title='What Makes a Data Mining Skeptic?'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-2214439529284752828</id><published>2008-04-18T09:16:00.000-07:00</published><updated>2008-04-18T10:35:52.680-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data preparation'/><category scheme='http://www.blogger.com/atom/ns#' term='distributions'/><title type='text'>When Distributions Go Bad</title><content type='html'>Recently I was working with an organization, building estimation models (rather than classification). They were interested in using linear regression, so I dutifully looked at the distribution, &lt;br /&gt; as shown to the left (all pictures were generated by Clementine, and I also scaled the distribution to protect the data even more, but didn't change the shape of the data). &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_ehD0fJcPAoI/SAjMqQZ5GbI/AAAAAAAAABE/tKYH4qDDbPI/s1600-h/Histogram+of+baddist.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_ehD0fJcPAoI/SAjMqQZ5GbI/AAAAAAAAABE/tKYH4qDDbPI/s400/Histogram+of+baddist.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190623596874635698" /&gt;&lt;/a&gt;&lt;br /&gt;There were approximately 120,000 examples. If this were a typical skewed distribution, I would log transform it and be done with it. However, in this distribution there are three interesting problems: &lt;br /&gt;&lt;br&gt;&lt;br /&gt;1) skew is 57--heavy positive skew&lt;br /&gt;2) kurtosis is 6180--heavily peaked&lt;br /&gt;3) about 15K of these had value 0, contributing to the kurtosis value&lt;br /&gt;&lt;br /&gt;So what to do? One answer is to apply a log transform that preserves sign, using sgn(x)*log10( 1 + abs(x) ).  
This picture looks like this:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_ehD0fJcPAoI/SAjRawZ5GdI/AAAAAAAAABU/LBtA2V38oCg/s1600-h/Histogram+of+baddist_nlog10+with+normal.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_ehD0fJcPAoI/SAjRawZ5GdI/AAAAAAAAABU/LBtA2V38oCg/s400/Histogram+of+baddist_nlog10+with+normal.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190628828144802258" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This takes care of the summary statistics problems, as skew became 0.6 and kurtosis -0.14. But it doesn't look right--the spike at 0 looks problematic (and turned out that it was).  Also, the distribution actually ends up with two ~normal distributions of different variance, one to the left and one to the right of 0.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_ehD0fJcPAoI/SAjTxQZ5GeI/AAAAAAAAABc/c1Y0wqMUzzk/s1600-h/Histogram+of+baddist_nlog10+neg+with+normal.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_ehD0fJcPAoI/SAjTxQZ5GeI/AAAAAAAAABc/c1Y0wqMUzzk/s400/Histogram+of+baddist_nlog10+neg+with+normal.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190631413715114466" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_ehD0fJcPAoI/SAjT6AZ5GfI/AAAAAAAAABk/eLEg2A3OWk8/s1600-h/Histogram+of+baddist_nlog10+%231+pos+with+normal.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_ehD0fJcPAoI/SAjT6AZ5GfI/AAAAAAAAABk/eLEg2A3OWk8/s400/Histogram+of+baddist_nlog10+%231+pos+with+normal.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190631564038969842" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br 
/&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br /&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br /&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br /&gt;Another approach to this is to use the logistic transform 1 / ( 1 + exp(-x/A) ) where A is a scaling factor. Here are the distributions for the original distribution (baddist), the log-transformed version (baddist_nlog10), and the logistic transformed with 3 values of A: 5, 10, and 20, with the corresponding pictures for the three logistic transformed versions.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_ehD0fJcPAoI/SAjWZgZ5GgI/AAAAAAAAABs/tc2gqyuSR-c/s1600-h/summary+statistics.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_ehD0fJcPAoI/SAjWZgZ5GgI/AAAAAAAAABs/tc2gqyuSR-c/s400/summary+statistics.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190634304228104706" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_ehD0fJcPAoI/SAjW6AZ5GhI/AAAAAAAAAB0/S5uVQar0Kb8/s1600-h/Histogram+of+baddist_logistic5.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_ehD0fJcPAoI/SAjW6AZ5GhI/AAAAAAAAAB0/S5uVQar0Kb8/s400/Histogram+of+baddist_logistic5.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190634862573853202" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_ehD0fJcPAoI/SAjXCAZ5GiI/AAAAAAAAAB8/BAP1z3-DVSk/s1600-h/Histogram+of+baddist_logistic10.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" 
src="http://bp3.blogger.com/_ehD0fJcPAoI/SAjXCAZ5GiI/AAAAAAAAAB8/BAP1z3-DVSk/s400/Histogram+of+baddist_logistic10.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190635000012806690" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_ehD0fJcPAoI/SAjXGAZ5GjI/AAAAAAAAACE/1T7kmzM_e8E/s1600-h/Histogram+of+baddist_logistic20.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_ehD0fJcPAoI/SAjXGAZ5GjI/AAAAAAAAACE/1T7kmzM_e8E/s400/Histogram+of+baddist_logistic20.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5190635068732283442" /&gt;&lt;/a&gt;&lt;br /&gt;Of course, going solely on the basis of the summary statistics, I might have a mild preference for the nlog10 version. As it turned out, the logistic transform produced "better" scores (we measured model accuracy by how well the model rank-ordered the predicted amounts, and I'll leave it at that). That was interesting in and of itself since none of the distributions really looked very good. However, another interesting question was &lt;span style="font-style:italic;"&gt;which&lt;/span&gt; value of "A" to use: 5, 10, 20 (or some other value I don't show here). We found the value that worked best for us, but because of the severity of the logistic transform in how it scales the tails of the distribution, the selection of "A" depended on which range of the target values we were most interested in rank-ordering well. The smaller values of A produced bigger spikes at the extremes, and therefore the model did not rank-order these values well (these models did better on the lower end of distribution magnitudes). To identify the tails better, we needed to increase the scaling factor "A", and doing so did in fact improve the rank-ordering at the extremes. 
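&lt;br /&gt;&lt;br /&gt;For anyone who wants to experiment with the two transforms discussed above, here is a minimal Python sketch. These are not the Clementine expressions used in the project: the function names and the sample values are made up for illustration, and the logistic is written in its algebraically equivalent tanh form only so that very large negative inputs cannot overflow exp().
&lt;pre&gt;
```python
import math


def signed_log10(x):
    """Sign-preserving log transform: sgn(x) * log10(1 + abs(x))."""
    return math.copysign(math.log10(1.0 + abs(x)), x)


def logistic(x, a):
    """Logistic transform 1 / (1 + exp(-x / a)), where a is the scaling factor "A".

    Written as 0.5 * (1 + tanh(x / (2 * a))), which is the same function
    but does not overflow for large negative x.
    """
    return 0.5 * (1.0 + math.tanh(x / (2.0 * a)))


# Hypothetical target values (NOT the baddist data): a couple of zeros,
# some small values, and large values of either sign.
sample = [-250.0, -3.0, 0.0, 0.0, 2.0, 40.0, 5000.0]

for x in sample:
    row = [f"{x:9.1f}", f"nlog10={signed_log10(x):7.3f}"]
    for a in (5, 10, 20):
        row.append(f"logistic{a}={logistic(x, a):.4f}")
    print("  ".join(row))
```
&lt;/pre&gt;
With these made-up values, the smaller "A" pushes large magnitudes toward 0 or 1 much sooner than the larger "A" does, which is the saturation at the extremes described above.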
&lt;br /&gt;&lt;br /&gt;So, in the end, the scaling of the target value depends on the business question being answered (no surprises here). So now I open it up to all of you--what would you do? And, if you are interested in this data, I have it on my web site that you can access &lt;a href="http://www.abbottanalytics.com/data/baddist_data.txt"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-2214439529284752828?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/2214439529284752828/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=2214439529284752828' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2214439529284752828'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/2214439529284752828'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/04/when-distributions-go-bad.html' title='When Distributions Go Bad'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp0.blogger.com/_ehD0fJcPAoI/SAjMqQZ5GbI/AAAAAAAAABE/tKYH4qDDbPI/s72-c/Histogram+of+baddist.jpg' height='72' width='72'/><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3181326723599752794</id><published>2008-04-17T21:54:00.000-07:00</published><updated>2009-02-10T15:31:02.451-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='data mining software'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining users'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining survey'/><title type='text'>Data Mining survey</title><content type='html'>Karl Rexer of Rexer Analytics conducted an extensive survey of data miners in 2007, and reported on those results &lt;a href="http://www.quirks.com/articles/2008/20080309.aspx?searchID=10852254"&gt;here&lt;/a&gt; at Quirks.com (a site I had never heard of before--unfortunately, you have to register to see it).&lt;br /&gt;&lt;br /&gt;This is not to be confused with their 2008 survey, whose results I expect will be out soon.&lt;br /&gt;&lt;br /&gt;A few interesting items in the survey results:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;• Correspondingly, the most commonly used algorithms are regression (79 percent), decision trees (77 percent) and cluster analysis (72 percent). Again, this reflects what we have seen in our own work. Regression certainly remains the algorithm of choice for large sections of the academic community and within the financial services sector. More and more data miners, however, are using decision trees, and cluster analysis has long been the bedrock of the marketing community.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I find it interesting in and of itself that academics are participating in a data mining survey, and I don't mean that in a negative way. I have viewed data mining more as a business-centric way of thinking, and to have regression advocates participate in a survey of this type is a good sign. Of course it could also mean that business folks don't have the time to fill out surveys :)&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;• SPSS, SPSS Clementine, and SAS are the three most frequently utilized analytic tools and were each used in 2006 by more than 40 percent of data miners. Forty-five percent of data miners also employed their own code in 2006. 
Respondents were asked about 26 different software packages from the powerhouses above to less-visible and -utilized packages such as Chordiant, Fair Isaac and KXEN.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;Clementine usually shows up at the top of the KDNuggets survey, and I've never been sure if it was because of the typical KDNuggets user, or if it reflected true general use in the data mining community. This gives further evidence that its use is more widespread. The fact that SPSS and SAS are the others shows the dominance in the survey of statisticians or academicians. I rarely find heavy SPSS or SAS users among technical business analysts.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;• Comparisons of reported 2006 use and planned 2007 use show that there is increasing interest in the Oracle Data Mining tool, and decreasing interest in C4.5/C5.0/See5. It will be interesting to see how these trends develop over time and if other tools find greater prominence in the future.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I concur from my experience. I would put SQL Server in that category as well. I think the C4.5 popularity was largely due to licensing.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;• The primary factors data miners consider when selecting an analytic tool are: 1) the dependability and stability of software, 2) the ability to handle large data sets, and 3) data manipulation capabilities. Data miners were least interested in the reputation of the software and the software’s compatibility either with other programs or with software used by colleagues.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;This looks like the responses of technical people--very much common sense. I wonder what decision makers would say? Reputation, I would think, ranks much higher among these people.&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;• The top challenges facing data miners are dirty data, data access and explaining data mining to others. 
Over three-quarters of data miners listed dirty data as one of the major challenges that they face. This is again consistent with our own experience and the conventional wisdom discussed at data mining conferences: a significant proportion of most projects consist of data understanding, data cleaning and data preparation.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;No surprises here! However, once one goes through this process, its importance is reduced (because it is solved).&lt;br /&gt;&lt;br /&gt;Thanks to Rexer Analytics for putting this and the 2008 survey together. I'm looking forward to those results.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3181326723599752794?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3181326723599752794/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3181326723599752794' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3181326723599752794'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3181326723599752794'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/04/data-mining-survey.html' title='Data Mining survey'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3392129639200727646</id><published>2008-04-16T13:23:00.000-07:00</published><updated>2009-02-10T15:31:16.769-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining data'/><title type='text'>Data Mining Data Sets</title><content type='html'>Every once in a while I receive a request or see one posted on some bulletin board about data mining data sets. I have to say, I have little patience for many of these requests because a simple Google (or Clusty) search will solve the problem. Nevertheless, here are four sites I've used in the past to grab data for some testing of algorithms or software packages:&lt;br /&gt;&lt;br /&gt;UC Irvine Machine Learning Repository: &lt;a href="http://archive.ics.uci.edu/ml/"&gt;http://archive.ics.uci.edu/ml/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Carnegie Mellon Statlib Archive: &lt;a href="http://lib.stat.cmu.edu/datasets/"&gt;http://lib.stat.cmu.edu/datasets/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;DELVE Datasets: &lt;a href="http://www.cs.utoronto.ca/%7Edelve/data/datasets.html"&gt;http://www.cs.utoronto.ca/~delve/data/datasets.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;MIT Broad Institute Cancer Datasets: &lt;a href="http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi"&gt;http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3392129639200727646?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3392129639200727646/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3392129639200727646' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3392129639200727646'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3392129639200727646'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/04/data-mining-data-sets.html' title='Data Mining Data Sets'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-5146626728353863496</id><published>2008-04-15T14:34:00.000-07:00</published><updated>2009-02-10T15:31:43.328-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text mining'/><category scheme='http://www.blogger.com/atom/ns#' term='dm radio'/><title type='text'>DM Radio and Text Mining</title><content type='html'>I'll be interviewed on the topic of text mining this coming Thursday, April 17th at 3pm EDT on DM Radio along with Barry DeVille of SAS and Jeff Catlin of Lexalytics. The title of this entry links to the DM Review site.&lt;br /&gt;&lt;br /&gt;I think you have to register to listen.&lt;br /&gt;&lt;br /&gt;The schedule will go something like this:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;3:00 PM&lt;br /&gt;Hosts Eric Kavanagh and Jim Ericson frame the argument: What is text analytics, and how can it be used to find those golden needles in the haystack?&lt;br /&gt;&lt;br /&gt;3:12 PM&lt;br /&gt;Hosts interview Barry DeVille of SAS Institute: What are some good examples of customer success? 
What are some common mistakes?&lt;br /&gt;&lt;br /&gt;3:24 PM&lt;br /&gt;Hosts interview Jeff Catlin, CEO of Lexalytics: How does his application work?  What are some examples of text mining at work?&lt;br /&gt;&lt;br /&gt;3:36 PM&lt;br /&gt;Hosts interview Dean Abbott of The Modeling Agency: We heard what the vendors said, but what does that all really mean?&lt;br /&gt;&lt;br /&gt;3:48 PM&lt;br /&gt;Roundtable discussion: All bets are off! Guests are encouraged to engage in open dialogue, and listeners can email their questions to dmradio@sourcemedia.com&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-5146626728353863496?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.dmreview.com/dmradio/10001193-1.html' title='DM Radio and Text Mining'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/5146626728353863496/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=5146626728353863496' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5146626728353863496'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/5146626728353863496'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/04/dm-radio-and-text-mining.html' title='DM Radio and Text Mining'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4427087900097790270</id><published>2008-04-10T18:14:00.000-07:00</published><updated>2008-04-10T18:22:12.062-07:00</updated><title type='text'>Data Mining: Widespread Acceptance When?</title><content type='html'>Data mining is widely accepted today among industries which have a history of "management by numbers", such as banking, pure science and market research.  Data mining is easily viewed by management in such industries as a logical extension of less sophisticated quantitative analysis which already enjoys currency there.  Further, information infrastructure necessary to feed the data mining process is typically already present.&lt;br /&gt;&lt;br /&gt;It seems likely that at least some (if not many) other industries could realize a significant benefit from data mining, yet this has emerged in practice only sporadically.  The question is: Why?&lt;br /&gt;&lt;br /&gt;Under what organizational conditions will data mining spread to a broader audience?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4427087900097790270?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4427087900097790270/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4427087900097790270' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4427087900097790270'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4427087900097790270'/><link rel='alternate' type='text/html' 
href='http://abbottanalytics.blogspot.com/2008/04/data-mining-widespread-acceptance-when.html' title='Data Mining: Widespread Acceptance When?'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-9088205887450765520</id><published>2008-04-04T14:52:00.000-07:00</published><updated>2008-04-04T14:57:48.421-07:00</updated><title type='text'>Data modeling infrastructure in data mining</title><content type='html'>I've had two inquiries in the last day relating to the building of data infrastructure between the database and the predictive modeling tool, which I find to be an interesting coincidence. I hadn't even thought about a need here before (perhaps because I wasn't aware of the vendors that address this issue), but am curious if others have thought through this issue/problem.&lt;br /&gt;&lt;br /&gt;I have seen situations where the analyst and DBA need to coordinate, but due to the politics or personalities in an organization, do not. In these cases, a data miner may need tables that actually exist, but the miner doesn't have permission to access the tables, or perhaps doesn't have the expertise to know how to join all the requisite tables. In these cases, I can imagine this middleware, if you will, could be quite useful if it were more user-friendly. However, I'm not yet convinced this is a real issue for most organizations. 
&lt;br /&gt;&lt;br /&gt;Any thoughts?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-9088205887450765520?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/9088205887450765520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=9088205887450765520' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9088205887450765520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9088205887450765520'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/04/data-modeling-infrastructure-in-data.html' title='Data modeling infrastructure in data mining'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4648571774700202887</id><published>2008-04-02T17:14:00.000-07:00</published><updated>2009-02-10T15:31:56.528-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining books'/><title type='text'>Another Moneyball quote</title><content type='html'>Gotta get back in the habit of posting...&lt;br /&gt;&lt;br /&gt;A quick way is to post another quote from Moneyball that I really liked&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Intelligence about baseball statistics had become equated in the public mind with the ability to recite arcane baseball stats. 
What James's wider audience had failed to understand was that the statistics were beside the point. The point was understanding; the point was to make life on earth just a bit more intelligible; and that point, somehow, had been lost. "I wonder," James wrote, "if we haven't become so numbed by all these numbers that we are no longer capable of truly assimilating any knowledge which might result from them."&lt;/blockquote&gt; (p.95)&lt;br /&gt;&lt;br /&gt;What I like about this quote is that it is something many of us in the analytics world have experienced: losing the point of the modeling or summary statistics by forgetting why we are doing the analysis in the first place. Or, as my good friend John Elder used to describe it, "rapture of the depths."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4648571774700202887?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4648571774700202887/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4648571774700202887' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4648571774700202887'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4648571774700202887'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/04/another-moneyball-quote.html' title='Another Moneyball quote'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8432530353147669721</id><published>2008-01-13T06:00:00.000-08:00</published><updated>2008-01-14T17:41:10.026-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ethics'/><category scheme='http://www.blogger.com/atom/ns#' term='risk'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='insurance'/><category scheme='http://www.blogger.com/atom/ns#' term='prediction'/><category scheme='http://www.blogger.com/atom/ns#' term='morality'/><category scheme='http://www.blogger.com/atom/ns#' term='predictive'/><title type='text'>Data Mining: Interesting Ethical Questions</title><content type='html'>Data mining permits useful extrapolation from sometimes obscure clues.  Information which human experts have ignored as irrelevant has been eagerly snapped up by data mining software.  This leads to interesting ethical questions.&lt;br /&gt;&lt;br /&gt;Consider the risk of selling an individual automobile insurance for one year.  Many factors are related to this risk.  Some are obvious, such as incidence of previous accidents, traffic violations or average number of miles driven per year.  Other risk factors may not be so obvious, but are nonetheless real.  Suppose that it could be shown statistically that, when added to information already in use, late payment of utility bills incrementally improved prediction.&lt;br /&gt;&lt;br /&gt;One might take the perspective that this is a business of prediction, not explanation, so- whatever the connection- this information should be added to the insurance risk model.  
This perspective reasons: if the connection is statistically significant, however strange it may seem, we should conclude that it is real and it should be exploited for business purposes.&lt;br /&gt;&lt;br /&gt;Obviously, there is a countervailing perspective which has the customer asking, "What the... ?  What do my utility bills have to do with my car insurance?"  Even extremely &lt;i&gt;laissez-faire&lt;/i&gt; governments may intervene in markets and forsake economic efficiency in favor of other priorities.  In the United States, for example, certain types of discrimination in lending are illegal.&lt;br /&gt;&lt;br /&gt;Another thing to consider (again, granting that the utility bill-automobile risk connection is real) is that prohibiting the use of utility bill payments in auto insurance risk prediction implies that less risky customers will be paying for riskier customers.&lt;br /&gt;&lt;br /&gt;Thoughts?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8432530353147669721?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8432530353147669721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8432530353147669721' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8432530353147669721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8432530353147669721'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2008/01/data-mining-interesting-ethical.html' title='Data Mining: Interesting Ethical Questions'/><author><name>Will 
Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-603321761440055287</id><published>2007-12-14T18:26:00.000-08:00</published><updated>2009-07-25T08:31:22.405-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='testing'/><category scheme='http://www.blogger.com/atom/ns#' term='critical junctures'/><category scheme='http://www.blogger.com/atom/ns#' term='problem deifition'/><title type='text'>Three Critical Junctures</title><content type='html'>I don't know that it's possible to say that any single part of the data mining process is the "most important", but there are three junctures which are absolutely critical to successful data mining: 1. problem definition, 2. data acquisition and 3. model validation.  Failures at other points will more often lead to loss in the form of missed opportunities.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Problem Definition&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Problem definition&lt;/i&gt; means understanding the "real-world" or "business" problem, as opposed to the technical modeling or segmentation problem.  In some cases, deliberation on the nature of the business problem may reveal that an empirical model or other sophisticated analysis is not needed at all.  In most cases, the model will only be one part of a larger solution.  This is a point worth elaboration.  Saying that the model is only part of a larger solution is not merely a nod to the database which feeds the model and the reporting system which summarizes model performance in the field.  
The point here is that a predictive model or clustering mechanism must be fit into the architecture of the solution &lt;i&gt;somehow&lt;/i&gt;.  The important question here is: "How?"  Models sometimes solve the whole (technical) problem, but in other situations, optimizers are run over models, or models are used to guide a separate search process.  Deciding exactly how the model will be used within the total solution is not always trivial.&lt;br /&gt;&lt;br /&gt;Also: attacking the wrong business problem all but ensures failure, since the chances of being able to quickly and inexpensively "re-engineer" a fully-constructed technical solution for the real business problem are slim.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Data Acquisition&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Data acquisition&lt;/i&gt; refers to the actual collection of whatever data is to be used to build the model.  If, for instance, sampling is not representative of the statistical universe to which the model will be applied, all bets are off.  More than once, I have received analytical extracts of databases from other individuals which, for instance, contained no accounts with last names starting with the letters 'P' through 'Z'!  Clearly, a very arbitrary sample had been drawn.  The same thing happens all the time when database programmers naively query for limited ranges of account numbers or other record index values ("all account numbers less than 140000").&lt;br /&gt;&lt;br /&gt;With larger and larger data sets being examined by data miners, the need for sampling will not go away in the foreseeable future.  Sampling has long been studied within statistics and there are far too many pitfalls in this area to ignore the issue.  My strong recommendation is to learn about it, and I suggest a book like &lt;i&gt;Sampling: Design and Analysis&lt;/i&gt; by Sharon L. 
Lohr (ISBN-13: 978-0534353612).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Model Validation&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Model validation&lt;/i&gt; gets my vote for "most important step in any data mining project".  This is where- to the extent it's possible- the data miner determines how much the model &lt;b&gt;really&lt;/b&gt; has learned.  As I write this, it is the end of the year 2007, yet, amazingly people who call themselves "analysts" continue to produce models without delivering any sort of serious evidence that their models work.  Years after the publication of &lt;a href="http://www-cse.ucsd.edu/users/elkan/kddcoil.pdf"&gt;"Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000"&lt;/a&gt; by Charles Elkan, in which the dangers of testing on the training set were (yet again!) demonstrated, models are not receiving the rigorous testing they need.&lt;br /&gt;&lt;br /&gt;"Knowing what you know" (and what you &lt;i&gt;don't&lt;/i&gt; know) is critical.  No model is perfect, and understanding the limits of likely performance is crucial.  This requires the use of error resampling methods, such as holdout testing, k-fold cross-validation and bootstrapping.  
Performance of models, once deployed, should not be a surprise, nor a matter of faith.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-603321761440055287?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/603321761440055287/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=603321761440055287' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/603321761440055287'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/603321761440055287'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/12/three-critical-junctures.html' title='Three Critical Junctures'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-1213925106410811287</id><published>2007-11-08T07:55:00.000-08:00</published><updated>2009-02-10T15:34:10.951-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='random'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining books'/><title type='text'>Random things...</title><content type='html'>I was just looking at my favorite economics blog, &lt;a href="http://www.optimist123.com/"&gt;The Skeptical Optimist&lt;/a&gt;, and saw a post on randomness based on two books the blog author, Steve Conover is reading called &lt;a 
href="http://www.amazon.com/Black-Swan-Impact-Highly-Improbable/dp/1400063515/ref=pd_bbs_2/103-7449907-5650235?ie=UTF8&amp;amp;s=books&amp;amp;qid=1194537613&amp;amp;sr=8-2"&gt;The Black Swan&lt;/a&gt; and &lt;a href="http://www.amazon.com/Fooled-Randomness-Hidden-Chance-Markets/dp/0812975219/ref=pd_bbs_sr_1/103-7449907-5650235?ie=UTF8&amp;amp;s=books&amp;amp;qid=1194537613&amp;amp;sr=8-1"&gt;Fooled by Randomness&lt;/a&gt;. This caught my eye--a quote from one of the two books (it was unclear to me which one):&lt;br /&gt;&lt;blockquote&gt;Here's an example of his point about randomness: How many times have you heard about mutual fund X's "superlative performance over the last five years"?  Our typical reaction to that message is that mutual fund X must have better managers than other funds.  Reason: Our minds are built to assign cause-and-effect whenever possible, in spite of the strong possibility that random chance played a big role in the outcome. &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;He then gives an example of two stock pickers, one of whom gets it "right" about 1/2 the time, and a second who gets it right 12 consecutive times. The punch line is this:&lt;br /&gt;&lt;blockquote&gt;Taleb's point: Randomness plays a much larger role in social outcomes than we are willing to admit—to ourselves, or in our textbooks.  Our minds, uncomfortable with randomness, are programmed to employ hindsight bias to provide retroactive explanations for just about everything.  Nonetheless, randomness is frequently the only "reason" for many events. &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I personally don't agree philosophically with the role of randomness (I would prefer to say that many outcomes are unexplained than say randomness is the "reason" or "cause"--randomness does nothing itself, it is our way of saying "I don't know why" or "it is too hard to figure out why").&lt;br /&gt;&lt;br /&gt;But that said, this is an extremely important principle for data miners. 
We have all seen predictive models that apparently do well on one data set, and then do poorly on another. Usually this is attributed to overfit, but it doesn't have to be solely an overfit problem. David Jensen of UMass described the phenomenon of &lt;span style="font-style: italic;"&gt;oversearching &lt;/span&gt;for models in the paper &lt;a href="http://citeseer.ist.psu.edu/jensen98multiple.html"&gt;Multiple Comparisons in Induction Algorithms&lt;/a&gt;, where you could happen upon a model that works well, but is just a happenstance find.&lt;br /&gt;&lt;br /&gt;The solution? One great help in overcoming these problems is through sampling--the train/test/validate subset method, or by resampling methods (like bootstrapping). But having the mindset of skepticism about models helps tremendously in digging to ensure the models truly are predictive and not just a random matching of the patterns of interest.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-1213925106410811287?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/1213925106410811287/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=1213925106410811287' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1213925106410811287'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/1213925106410811287'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/11/random-things.html' title='Random things...'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-9104136572301632952</id><published>2007-10-23T17:13:00.001-07:00</published><updated>2007-10-29T07:46:26.419-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='hate'/><title type='text'>Follow-Up to: Statistics: Why Do So Many Hate It?</title><content type='html'>In a question posted Oct-14-2007 to &lt;a href="http://answers.yahoo.com/"&gt;Yahoo! Answers&lt;/a&gt;, user &lt;i&gt;lifetimestudentofmath&lt;/i&gt; asked:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;b&gt;How would you run this regression?&lt;/b&gt;&lt;br /&gt;A relationship between beer expenditure and income was tested. The relationship may be qualitatively effected by gender. How would you test the hypothesis that women spend less money on beer than women?&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;My guess is that this is a homework question, and that the teacher wants students to use a dummy variable to represent gender, so that a simple interpretation of gender's coefficient will reveal the answer.&lt;br /&gt;&lt;br /&gt;In reality, of course, the interaction of income and gender may yield a more nuanced answer.  What if two regressions were performed, one for men and the other for women, with income as the predictor and beer expenditure as the target, and the regression lines crossed?  Such a result precludes so simple a response as "men spend more on beer".&lt;br /&gt;&lt;br /&gt;This question suggests another reason so many people hate statistics: its subtlety.  The annoying thing about reality (which is the subject of statistical study), is that it is so complicated.  Even things which seem simple will often reveal surprisingly complex behavior.  The problem is that people don't want complicated answers.  
Although my response is: It is foolish to expect simple solutions to complicated problems, the fundamental, irreducible complexity of reality- which is mirrored in statistics- also drives negative feelings toward statistics.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-9104136572301632952?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/9104136572301632952/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=9104136572301632952' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9104136572301632952'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9104136572301632952'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/10/follow-up-to-statistics-why-do-so-many.html' title='Follow-Up to: Statistics: Why Do So Many Hate It?'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8574816269515361363</id><published>2007-10-17T17:19:00.000-07:00</published><updated>2007-10-19T04:51:09.602-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='torture'/><title type='text'>Statistics: Why Do So Many Hate It?</title><content type='html'>In &lt;a 
href="http://www.statisticgraphs.com/applied-math/why-is-statistics-so-scary/"&gt;Why is Statistics So Scary?&lt;/a&gt;, the Sep-26-2007 posting to the &lt;a href="http://www.statisticgraphs.com/"&gt;Math Stats And Data Mining&lt;/a&gt; Web log, the author wonders why so many people exhibit negative reactions to statistics.&lt;br /&gt;&lt;br /&gt;I've had occasion to wonder about the same thing.  I make my living largely from statistics, and have frequently received unfavorable reactions when I explain my work to others.  Invariably, such respondents admit the great usefulness of statistics, so that is not the source of this negativity.  I am certain that individual natural aptitude for this sort of work varies, but I do not believe that this accounts for the majority of negative feelings towards statistics.&lt;br /&gt;&lt;br /&gt;Having received formal education in what I call "traditional" or "classical" statistics, and having since assisted others studying statistics in the same context, I suggest that one major impediment for many people is the total reliance by classical statisticians on a large set of very narrowly focused techniques.  While they serve admirably in many situations, it is worth noting the disadvantages of classical statistical techniques:&lt;br /&gt;&lt;br /&gt;1. Being so highly specialized, there are many of these techniques to remember.&lt;br /&gt;&lt;br /&gt;2. It is also necessary to remember the appropriate applications of these techniques.&lt;br /&gt;&lt;br /&gt;3. Broadly, classical statistics involves many assumptions.  Violation of said assumptions may invalidate the results of these techniques.&lt;br /&gt;&lt;br /&gt;Classical techniques were developed largely during a time without the benefit of rapid, inexpensive computation, which is very different from the environment we enjoy today.&lt;br /&gt;&lt;br /&gt;The above were major motivations for me to embrace newer analytical methods (data mining, bootstrapping, etc.) 
in my professional life.  Admittedly, newer methods have disadvantages of their own (not the least of which is their hunger for data), but it's been my experience that newer methods tend to be easier to understand, more broadly applicable and, consequently, simpler to apply.&lt;br /&gt;&lt;br /&gt;I think the broader educational question is: Would students be better served by one or more years of torture, imperfectly or incorrectly learning myriad methods which will soon be forgotten, or the provision of a few widely useful tools and an elemental-level of understanding?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8574816269515361363?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8574816269515361363/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8574816269515361363' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8574816269515361363'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8574816269515361363'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/10/statistics-why-do-so-many-hate-it.html' title='Statistics: Why Do So Many Hate It?'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' 
src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3103612391461036442</id><published>2007-10-16T18:15:00.000-07:00</published><updated>2007-10-16T18:30:16.198-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>See The World</title><content type='html'>I recently had the pleasure of attending the Insightful Impact 2007 conference, where I especially enjoyed a presentation on ensemble methods by two young, up-and-coming, aspiring data miners: Brian Siegel and his side-kick... Deke Abbott, or Dean Abner, or some such.&lt;br /&gt;&lt;br /&gt;I am frequently asked what is the best way to learn about data mining (or machine learning, statistics, etc.).  I get a great deal of information from reading, either books or white papers and reports which are available for free, on-line.  Another great learning experience involves attendance of conferences and trade shows.  I don't travel a great deal and find it convenient to attend whatever free or cheap events happen to be within close distance.  I also try to get to KDD when it's on the east coast of the United States.  Aside from the presentations, events like these are an opportunity to get away from the muggles and spend some time with other data miners.  
I highly recommend it.&lt;br /&gt;&lt;br /&gt;Nice job, Dean and Brian.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3103612391461036442?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3103612391461036442/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3103612391461036442' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3103612391461036442'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3103612391461036442'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/10/see-world.html' title='See The World'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3705080185716286771</id><published>2007-08-31T14:26:00.000-07:00</published><updated>2007-08-31T14:34:40.533-07:00</updated><title type='text'>Interesting real-world example of Simpson's Paradox</title><content type='html'>At the MineThatData blog there is a very interesting post on &lt;a href="http://minethatdata.blogspot.com/2007/08/e-mail-productivity-is-waning-or-is-it.html"&gt;email marketing productivity&lt;/a&gt;, which is a good example of &lt;a href="http://en.wikipedia.org/wiki/Simpson%27s_paradox"&gt;Simpson's Paradox&lt;/a&gt; (as I posted in the comments). 
The key (as always) is that there are disproportionate population sizes with quite disparate results. As Kevin points out in the post, there is a huge difference between the profit due to engaged customers vs. those who aren't engaged, but the number of non-engaged customers dwarfs the engaged.&lt;br /&gt;&lt;br /&gt;The problem we all have in analytics is finding these effects--unless you create the right features, you never see them. To create good features, you usually need to have moderate to considerable expertise in the domain area to know what might be interesting. And yes, neural networks can find these effects automatically, but you still have to back out the relationships between the features found by the NNets and the original inputs in order to interpret the results.&lt;br /&gt;&lt;br /&gt;Nevertheless, this is a very important post, if for no other reason than to alert practitioners that the relative sizes of groups of customers (or other natural groupings in the data) matter tremendously.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3705080185716286771?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3705080185716286771/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3705080185716286771' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3705080185716286771'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3705080185716286771'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/08/interesting-real-world-example-of.html' title='Interesting real-world example of Simpson&apos;s Paradox'/><author><name>Dean 
Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-8477580876335305059</id><published>2007-08-21T18:00:00.001-07:00</published><updated>2007-08-21T18:17:51.954-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='decisions'/><category scheme='http://www.blogger.com/atom/ns#' term='baseball'/><title type='text'>Little League pitch counts -- data vs decisions revisited</title><content type='html'>I posted recently on the &lt;a href="http://abbottanalytics.blogspot.com/2007/07/when-data-and-decisions-dont-match.html"&gt;new rules in pitch counts for Little League&lt;/a&gt;. I've had to defend my comments recently (in a polite way on this blog, and a bit more strenuously in person with friends who have sons pitching in LL), but was struck again about this issue while watching the &lt;a href="http://sports.espn.go.com/sports/llws07/index"&gt;LL World Series&lt;/a&gt; on ESPN. &lt;br /&gt;&lt;br /&gt;On the ESPN web site I read &lt;a href="http://sports.espn.go.com/sports/llws07/columns/story?columnist=kreidler_mark&amp;id=2984633"&gt;this article&lt;/a&gt; on pitch counts, and found this comment on point:&lt;br /&gt;&lt;blockquote&gt;What's interesting here is that the 20-pitch specialist is the residue of a change that did not, strictly speaking, emanate from problems within Little League itself. Around the coaching community, it is widely understood that the advent of nearly year-round travel (or "competitive") ball is one of the primary reasons for the rise in young arm problems. 
In some ways, Little League has made a pitch-count adjustment in reaction to forces that are beyond its control.&lt;br /&gt;&lt;br /&gt;Travel ball has become an almost de facto part of a competitive player's baseball life -- just as it has in soccer, basketball and several other youth sports. An alphabet soup of sponsoring organizations, from AAU to USSSA, BPA and well beyond, offers the opportunity to play baseball at levels -- and sheer numbers of games -- that a previous generation of players would have found mind-boggling.&lt;br /&gt;&lt;br /&gt;But travel ball is here to stay -- and so too, apparently, is a new approach by Little League to containing the potential damage to young arms. So get used to the 20-pitch kid. He's a closer on the shortest leash imaginable.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;In other words, coaches know that it isn't pitch counts &lt;span style="font-style:italic;"&gt;per se&lt;/span&gt; that cause the problems, but rather the number of months of the year the kids are pitching. &lt;br /&gt;&lt;br /&gt;Interestingly, there is no ban on breaking pitches, though when I talk to coaches, there is speculation that these cause arm problems. In fact, on the Little League web site, they state:&lt;br /&gt;&lt;blockquote&gt;While there is no medical evidence to support a ban on breaking pitches, it is widely speculated by medical professionals that it is ill-advised for players under 14 years old to throw breaking pitches,” Mr. Keener said. “Breaking pitches for these ages continues to be strongly discouraged by Little League, and that is an issue we are looking at as well. 
As with our stance on pitch counts, we will act if and when there is medical evidence to support a change.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I'm glad they are studying it, but the decision not to ban breaking pitches due to a lack of data is interesting, since there is also a lack of data on pitch counts, and that didn't stop the officials from making rules there! Hopefully, with the new pitch count rules and the new data collected, we can see if the data bears out this hypothesis.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-8477580876335305059?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/8477580876335305059/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=8477580876335305059' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8477580876335305059'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/8477580876335305059'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/08/little-league-pitch-counts-data-vs.html' title='Little League pitch counts -- data vs decisions revisited'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7371719241455916989</id><published>2007-08-15T06:42:00.000-07:00</published><updated>2007-08-15T06:48:13.035-07:00</updated><title type='text'>KDNuggets Poll on "Data Mining" as a 
term</title><content type='html'>KDNuggets has a new poll on whether or not "data mining" should still be used to describe the kind of analysis we all know and love. It is still barely winning, but interestingly, Knowledge Discovery is almost beating it out as the better term.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7371719241455916989?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://vote.sparklit.com/poll.spark?pollID=203792' title='KDNuggets Poll on &quot;Data Mining&quot; as a term'/><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7371719241455916989/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7371719241455916989' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7371719241455916989'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7371719241455916989'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/08/kdnuggets-poll-on-data-mining-as-term.html' title='KDNuggets Poll on &quot;Data Mining&quot; as a term'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-6180037966622864815</id><published>2007-08-15T06:12:00.002-07:00</published><updated>2007-08-15T06:40:19.680-07:00</updated><title type='text'>The latest Y2K bug--and why mean values don't tell the whole story</title><content type='html'>I was interested in the recent hubbub 
over NASA's surface temperature data, as &lt;a href="http://www.dailytech.com/article.aspx?newsid=8383"&gt;first written up in Daily Tech&lt;/a&gt; and picked up by other news sources. (Note: the article doesn't render well for me in Firefox, but IE is fine).&lt;br /&gt;&lt;br /&gt;However, I found &lt;a href="http://209.218.29.87/?p=1885"&gt;this article&lt;/a&gt; describing the data even more interesting, from the &lt;a href="http://209.218.29.87/"&gt;Climate Audit Blog&lt;/a&gt;. From a data mining / statistics perspective, it was the distribution of the errors that was interesting. I had read in the media (sorry, I don't remember where) that there was an average error of 0.15 deg. C due to the Y2K error in the data--that didn't seem too bad. But at the blog, the author describes the errors as (1) &lt;a href="http://www.climateaudit.org/wp-content/uploads/2007/08/hansen40.gif"&gt;bimodal&lt;/a&gt;, (2) positively skewed (hence the positive average error), and (3) typically much larger than 0.15 deg. So while on average it doesn't seem bad, the surface temperature errors are indeed significant. &lt;br /&gt;&lt;br /&gt;Once again, averages can mask data issues. 
Better to augment averages with other metrics, or better yet, visualize!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-6180037966622864815?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/6180037966622864815/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=6180037966622864815' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6180037966622864815'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/6180037966622864815'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/08/latest-y2k-bug-and-why-mean-values-dont.html' title='The latest Y2K bug--and why mean values don&apos;t tell the whole story'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-9012734055841544707</id><published>2007-08-11T05:31:00.000-07:00</published><updated>2007-08-11T05:57:30.565-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='survey'/><category scheme='http://www.blogger.com/atom/ns#' term='Rexer Analytics'/><title type='text'>Rexer Analytics Data Miner Survey, Aug-2007</title><content type='html'>Rexer Analytics recently distributed a report summarizing the findings of their survey of data miners (observation count=214, after removal of tool vendor employees).&lt;br /&gt;&lt;br /&gt;Not surprisingly, the top two &lt;b&gt;types of 
analysis&lt;/b&gt; were: 1. &lt;i&gt;predictive modeling&lt;/i&gt; (89%) and 2. &lt;i&gt;segmentation/clustering&lt;/i&gt; (77%).  Other methods trail off sharply from there.&lt;br /&gt;&lt;br /&gt;The top three &lt;b&gt;types of algorithms&lt;/b&gt; used were: 1. &lt;i&gt;decision trees&lt;/i&gt; (79%), 2. &lt;i&gt;regression&lt;/i&gt; (77%) and 3. &lt;i&gt;cluster analysis&lt;/i&gt; (72%).  It would be interesting to know more about the specifics (which tree-induction algorithms, for instance), but I'd be especially interested in what forms of "regression" are being used since that term covers a lot of ground.&lt;br /&gt;&lt;br /&gt;Responses regarding &lt;b&gt;tool usage&lt;/b&gt; were divided into &lt;i&gt;never&lt;/i&gt;, &lt;i&gt;occasionally&lt;/i&gt; and &lt;i&gt;frequently&lt;/i&gt;.  The authors of the report sorted tools in decreasing order of popularity (&lt;i&gt;occasionally&lt;/i&gt; plus &lt;i&gt;frequently&lt;/i&gt; used).  Interestingly, &lt;i&gt;your own code&lt;/i&gt; took second place with 45%, which makes me wonder what languages are being used.  (If you must know, SPSS came in first, with 48%.)&lt;br /&gt;&lt;br /&gt;When asked about &lt;b&gt;challenges faced by data miners&lt;/b&gt;, the top three answers were: 1. &lt;i&gt;dirty data&lt;/i&gt; (76%), 2. &lt;i&gt;unavailability of/difficult access to data&lt;/i&gt; (51%) and 3. &lt;i&gt;explaining data mining to others&lt;/i&gt; (51%).  
So much for quitting my job in search of something better!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-9012734055841544707?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/9012734055841544707/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=9012734055841544707' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9012734055841544707'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/9012734055841544707'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/08/rexer-analytics-data-miner-survey-aug.html' title='Rexer Analytics Data Miner Survey, Aug-2007'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-3122661761521999832</id><published>2007-07-28T22:48:00.000-07:00</published><updated>2007-07-28T23:00:35.078-07:00</updated><title type='text'>NY Times Defines Data Mining</title><content type='html'>In their article &lt;a href="http://www.nytimes.com/2007/07/29/washington/29nsa.html?ei=5065&amp;en=f770108f5d23c8b4&amp;ex=1186286400&amp;partner=MYWAY&amp;pagewanted=print"&gt;here&lt;/a&gt;, the NY Times defines data mining in this way:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;It is not known precisely why searching the databases, or data mining, raised such a 
furious legal debate. But such databases contain records of the phone calls and e-mail messages of millions of Americans, and their examination by the government would raise privacy issues.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;While I recognize that the NYT is not a technical body, and reporters often get the gist of technology wrong, this particular kind of definition has swept the media to such a degree that the term "data mining" may never recover. &lt;br /&gt;&lt;br /&gt;The definition itself has problems, such as&lt;br /&gt;1) searching databases per se I'm sure is not what they mean by data mining; almost certainly the problem they have in mind is programs that automatically search the databases to find interesting patterns (and presumably horribly overfit in the process, registering many false positives). After all, a Nexis search searches a database and no one raises an eyebrow at that.&lt;br /&gt; &lt;br /&gt;2) the problem with the searching is not the searching (or the data mining in their terminology), but the data that is being searched. Therefore the headline of the story, "Mining of Data Prompted Fight Over Spying" should probably more accurately read something like "Data allowed to be Mined Prompted Fight Over Spying"&lt;br /&gt;&lt;br /&gt;It is this second point that I have argued over with others who are concerned about privacy, and therefore have become anti-data-mining. It is the data that is the problem, not the mining (regardless of the definition of mining). But I think the term "data mining" resonates well and generates a clear mental image of what is going on, which is why it gained popularity in the first place.&lt;br /&gt;&lt;br /&gt;So I predict that within 5 years, few data miners (and I consider myself one of them) will refer to themselves as data miners, nor will we describe what we do as data mining. 
Predictive Analytics anyone?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-3122661761521999832?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/3122661761521999832/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=3122661761521999832' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3122661761521999832'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/3122661761521999832'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/07/ny-times-defines-data-mining.html' title='NY Times Defines Data Mining'/><author><name>Dean Abbott</name><uri>http://www.blogger.com/profile/16818000233889520746</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-7938267963958479063</id><published>2007-07-21T05:10:00.000-07:00</published><updated>2007-07-21T05:19:54.953-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='idempotnence'/><category scheme='http://www.blogger.com/atom/ns#' term='idempotent'/><category scheme='http://www.blogger.com/atom/ns#' term='logistic regression'/><title type='text'>Idempotent Capable Modeling Algorithms</title><content type='html'>In &lt;a href="http://hunch.net/?p=280"&gt;Idempotent-capable Predictors&lt;/a&gt;, the Jul-06-2007 posting to &lt;a href="http://hunch.net/"&gt;Machine Learning (Theory)&lt;/a&gt; Web log, the author suggests the importance of empirical models being idempotent 
(in this case, meaning that they can use one of the input variables as the model output).&lt;br /&gt;&lt;br /&gt;This is of interest since: 1. One would like to believe that the modeling process could generate the right answer, once it had actually been given the right answer, and 2. It is not uncommon for analysts to design inputs to models which give "hints" (which are partial solutions of the problem).  In the article mentioned above, it is noted that some typical modeling algorithms, such as logistic regression, are &lt;b&gt;not&lt;/b&gt; idempotent capable.  The author wonders how important this property is, and I do, too.  Thoughts?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-7938267963958479063?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/7938267963958479063/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=7938267963958479063' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7938267963958479063'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/7938267963958479063'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/07/idempotent-capable-modeling-algorithms.html' title='Idempotent Capable Modeling Algorithms'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' 
src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5652924.post-4892494825480154828</id><published>2007-07-17T08:43:00.000-07:00</published><updated>2007-07-17T08:50:46.128-07:00</updated><title type='text'>More Statistics Humor</title><content type='html'>In February of this year, Dean posted a witty comment regarding statistics which ignited an amusing exchange of comments (&lt;a href="http://abbottanalytics.blogspot.com/2007/02/quote-of-day.html"&gt;Quote of the day&lt;/a&gt;).  Readers who found that item entertaining may also appreciate the quotes listed at the bottom of &lt;a href="http://web.ecs.baylor.edu/faculty/marks/Research/EILab/Resources/Tomb/index.html"&gt;The Jesus Tomb Math&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5652924-4892494825480154828?l=abbottanalytics.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://abbottanalytics.blogspot.com/feeds/4892494825480154828/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5652924&amp;postID=4892494825480154828' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4892494825480154828'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5652924/posts/default/4892494825480154828'/><link rel='alternate' type='text/html' href='http://abbottanalytics.blogspot.com/2007/07/more-statistics-humor.html' title='More Statistics Humor'/><author><name>Will Dwinnell</name><uri>http://www.blogger.com/profile/03379859054257561952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='28' 
src='http://2.bp.blogspot.com/_aTiM0lwqgJ4/TQPgGn46JMI/AAAAAAAAAC4/X2lS2gskiUw/S220/Will%2Bportrait%2BMay-09-2010.jpg'/></author><thr:total>0</thr:total></entry></feed>
