I just read a fascinating book review in the Wall Street Journal, Physics Envy: Models Behaving Badly. The author of the book, Emanuel Derman (former head of Quantitative Analysis at Goldman Sachs), argues that the financial models involved human beings and were therefore inherently brittle: as human behavior changed, the models failed. "In physics you're playing against God, and He doesn't change His laws very often. In finance, you're playing against God's creatures."
I'll agree with Derman that whenever human beings are in the loop, data suffers. People change their minds based on information not available to the models.
I also agree that human behavioral modeling is not the same as physical modeling. We can use the latter to provide motivation and even mathematics for human behavioral modeling, but we should not take this too far. A simple example is this: purchase decisions sometimes depend not on the person's propensity to purchase alone, but also on whether or not they had an argument that morning, or if they just watched a great movie. There is an emotional component that data cannot reflect. People therefore behave in ways that on the surface are contradictory, seemingly "random", which is why response rates of 1% can be "good".
However, I bristle a bit at the emphasis on the physics analogy. In closed systems, models can explain everything. But once one opens up the world, even physical models are imperfect because they often do not incorporate all the information available. For example, missile guidance is based on pure physics: move a surface on a wing and one can change the trajectory of the missile. There are equations of motion that describe exactly where the missile will go. There is no mystery here.
However, all operational missile guidance systems are "closed loop"; the guidance command sequence is not completely scheduled but is updated throughout the flight. Why? To compensate for unexpected effects of the guidance commands, often due to ballistic winds, thermal gradients, or other effects on the physical system. It is the closed-loop corrections that make missile guidance work. The exact same principle applies to your car's cruise control, chasing down a fly ball in baseball, or even just walking down the street.
For a predictive model to be useful long-term, it needs updating to correct for changes in the population the models are applied to, whether the models be for customer acquisition, churn, fraud detection, or any model. The "closed-loop" typical in data mining is called "model updating" and is critical for long-term modeling success.
The question then becomes this: can the models be updated quickly enough to compensate for changes in the population? If a missile can only be updated at 10 Hz (10 times per second) but uncertainties affect the trajectory significantly in milliseconds, the closed-loop actions may be insufficient to compensate. If your predictive models can only be updated monthly, but your customer behavior changes significantly on a weekly basis, your models will be perpetually behind. Measuring the effectiveness of model predictions is therefore critical in determining the frequency of model updating necessary in your organization.
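One way to put a number on "quickly enough" is to track a performance metric for each scoring period and note when it first drifts below the level seen when the model was built. The sketch below is only an illustration of that idea: the function name, the weekly accuracy figures, and the 5% tolerance are all assumptions invented for the example.

```python
def first_period_below(baseline, period_scores, tolerance=0.05):
    """Return the 1-based index of the first period whose score falls more
    than `tolerance` (relative) below `baseline`, or None if none does."""
    floor = baseline * (1.0 - tolerance)
    for i, score in enumerate(period_scores, start=1):
        if score < floor:
            return i
    return None


# Hypothetical weekly accuracy of a deployed model (made-up numbers).
weekly_accuracy = [0.81, 0.80, 0.79, 0.77, 0.74, 0.71]

first_bad_week = first_period_below(baseline=0.80, period_scores=weekly_accuracy)
print("first week below tolerance:", first_bad_week)
```

If the first flagged period typically arrives sooner than the next scheduled rebuild, the update cycle is too slow for that population.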
To be fair, until I read the book I can have no quibble with its arguments. My comments here are based solely on the book review and some ideas it prompted in my mind. I'd welcome comments from anyone who has read the book already.
The book can be found on Amazon here.
UPDATE: Aaron Lai wrote an article for CFA Magazine on the same topic, also quoting Derman. I commend the article to all (note: this is a PDF file download).
Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Wednesday, December 28, 2011
Friday, November 04, 2011
Statistical Rules Of Thumb, part III: Always Visualize the Data
As I perused Statistical Rules of Thumb again, as I do from time to time, I came across this gem. (Note: I live in CA, so I get no money from these Amazon links.)
Van Belle uses the term "Graph" rather than "Visualize", but it is the same idea. The point is to visualize in addition to computing summary statistics. Summaries are useful, but can be deceiving; any time you summarize data you will lose some information unless the distributions are well behaved. The scatterplot, histogram, box and whiskers plot, etc. can reveal ways the summaries can fool you. I've seen these as well, especially variables with outliers or that are bi- or tri-modal.
One of the most famous examples of this effect is Anscombe's Quartet. I'm including the Wikipedia image of the plots here:
All four datasets have essentially the same mean of x, mean of y, standard deviation of x, standard deviation of y, x-y Pearson correlation coefficient, and regression line of y on x, so the summaries don't reveal the differences in the data.
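For anyone who wants to check this without the picture, here is a small sketch that computes those summaries for the four datasets. The values are Anscombe's published numbers as I have them; the helper names are my own, and only the standard library is used.

```python
import math


def mean(v):
    return sum(v) / len(v)


def sample_sd(v):
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    r = pearson(x, y)
    slope = r * sample_sd(y) / sample_sd(x)   # least-squares slope of y on x
    intercept = mean(y) - slope * mean(x)
    summary[name] = {"mean_x": mean(x), "mean_y": mean(y), "sd_x": sample_sd(x),
                     "sd_y": sample_sd(y), "r": r, "slope": slope, "intercept": intercept}
    print(name, {k: round(val, 2) for k, val in summary[name].items()})
```

The four printed rows agree to about two decimal places, even though the scatterplots look nothing alike.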
I use correlations a lot to get the gist of the relationships in the data, and I've seen how correlations can deceive. In one project, we had 30K data points with a correlation of 0.9+. When we removed just 100 of these data points (the largest magnitudes of x and y), the correlation shrank to 0.23.
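The effect is easy to reproduce. The sketch below does not use the project data (which I can't share); it simulates 1,000 unrelated points plus 10 extreme ones, so every number in it is an assumption chosen for illustration.

```python
import math
import random


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n_bulk, n_extreme = 1000, 10

# Bulk of the data: x and y are independent, so the true correlation is 0.
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_bulk)]
# A handful of extreme records, large in both x and y.
points += [(rng.gauss(50, 2), rng.gauss(50, 2)) for _ in range(n_extreme)]

r_all = pearson(*zip(*points))

# Remove the records with the largest magnitudes of x and y.
points.sort(key=lambda p: abs(p[0]) + abs(p[1]))
kept = points[:-n_extreme]
r_trimmed = pearson(*zip(*kept))

print(f"with extremes: {r_all:.2f}   without: {r_trimmed:.2f}")
```

About 1% of the records is enough to push the correlation above 0.9; a scatterplot would show the problem immediately.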
Most data mining software has ways to visualize data easily now. Avail yourself of them to avoid subsequent surprises in your data.
Friday, July 29, 2011
Yet another "Wisdom of Crowds" success
I was at the Federal Building in downtown San Diego for a consulting job, and met some representatives of a life and disability insurance company who were giving away a big-screen HD TV to the individual who came closest to guessing the number of M&Ms (chocolate and peanut butter filled) in a container. Because they do this often, I won't show the specific container they use.
I offered to make a guess of the total, but only if I could see all of the guesses so far. I was drawing from the Wisdom of Crowds example from Chapter 1 of the book where a set of independent guesses tends to outperform even an expert's best guess. I've done the same experiment many times in data mining courses I've taught, and have found the same phenomenon.
I collected data from 77 individuals (including myself) shown here (sorted for convenience, but this makes no difference in the analysis):
37
625
772
784
875
888
903
929
983
987
1001
1015
1040
1080
1080
1124
1245
1250
1450
1500
1536
1596
1600
1774
1875
1929
1972
1976
1995
2000
2012
2033
2143
2150
2200
2221
2235
2251
2321
2331
2412
2500
2500
2550
2571
2599
2672
2714
2735
2777
2777
2803
2832
2873
2931
3001
3101
3250
3333
3362
3500
3500
3501
3501
3583
3661
3670
3697
3832
3872
4280
4700
4797
5205
5225
5257
9886
10000
187952
Note there are a few flaky ones in the lot. The last two were easy to spot (so I put them at the bottom of my list). The idea of course is to just take the average of the guesses.
Average all: 4932
Average all without 37 and 187952: 2626
Then I looked at the histogram and decided that the guesses close to 10000 were also too flaky to include:
So I removed all data points greater than 8000, which took away 2 samples, leaving this histogram and a mean of 2436.
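The trimming steps above can be sketched in a few lines of Python. The list is the 79 guesses exactly as printed above; the variable names are my own.

```python
guesses = [
    37, 625, 772, 784, 875, 888, 903, 929, 983, 987,
    1001, 1015, 1040, 1080, 1080, 1124, 1245, 1250, 1450, 1500,
    1536, 1596, 1600, 1774, 1875, 1929, 1972, 1976, 1995, 2000,
    2012, 2033, 2143, 2150, 2200, 2221, 2235, 2251, 2321, 2331,
    2412, 2500, 2500, 2550, 2571, 2599, 2672, 2714, 2735, 2777,
    2777, 2803, 2832, 2873, 2931, 3001, 3101, 3250, 3333, 3362,
    3500, 3500, 3501, 3501, 3583, 3661, 3670, 3697, 3832, 3872,
    4280, 4700, 4797, 5205, 5225, 5257, 9886, 10000, 187952,
]


def mean(values):
    return sum(values) / len(values)


# Step 1: drop the two obviously flaky guesses.
no_extremes = [g for g in guesses if g not in (37, 187952)]
# Step 2: also drop everything above 8000.
trimmed = [g for g in no_extremes if g <= 8000]
# The PS mentions two further guesses (2550 and 3250) that were added later.
with_ps_guesses = trimmed + [2550, 3250]

for label, values in [("all", guesses), ("no extremes", no_extremes),
                      ("trimmed", trimmed), ("trimmed + PS guesses", with_ps_guesses)]:
    print(f"{label:<22} n={len(values):<3} mean={mean(values):.1f}")
```

On the list as printed, the overall mean comes out at about 4932 and the trimmed mean at about 2423, which is the figure mentioned in the PS. Adding the two guesses described there (2550 and 3250) brings the trimmed mean to about 2436, so the 2436 and 2626 figures quoted in the text appear to include those two guesses.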
So now for the outcome:
Actual Count: 2464
Average of trimmed sample: 2436 (error 28)
Best individual guess: 2500 (error 36)
So amazingly, the average won, though I wouldn't have been disappointed at all if it finished 3rd or 4th because it still would have been a great guess.
Wisdom of Crowds wins again!
PS I reported to the insurance agents a guess of 2423 because I had omitted my original guess (provided before looking at any other guesses--2550 if you must know) and my co-worker's guess of 3250, so these helped bring up the mean a bit. The Average would have lost (barely) if I had not included them.
PPS So how will they split the winnings since two guessed the same value? I won't recommend the saw approach. I hope they ask each of the two guessers to modify their guess, and require that they change it by at least one.
PPPS Note: the charts were done using JMP Pro 9 for the Macintosh
Monday, June 13, 2011
What do Data Miners Need to Learn?
I've been asked by several folks recently what they need to learn to succeed in data mining and predictive analytics. This is a different twist on the question I also get, namely what degree should one get to be a good (albeit "green") data miner. Usually, the latter question gets the answer "it doesn't matter" because I know so many great data miners without a statistics or mathematics degree. Understandably, there are many non-stats/math degrees that have a very strong statistics or mathematics component, such as psychology, political science, and engineering to name a few. But then again, you don't necessarily have to load up on the stats/math courses in these disciplines either.
So the question of "what to learn" applies across majors whether undergraduate or graduate. Of course statistics and machine learning courses are directly applicable. However, the answers I've been giving recently to the question of what new data miners need to learn (assuming they will learn algorithms) have centered around two other topics: databases and business.
I had no specific coursework or experience in either when I began my career. In the 80s, databases were not as commonplace in the DoD world where I began my career; we usually worked with flat files provided to us by a customer, even if these files were quite large. Now, most customers I work with have their data stored in databases or data marts, and as a result, we data miners often must lean on DBAs or an IT layer of people to get at the data. This would be fine except that (1) the data that is provided to data miners is often not the complete data we need or at least would like to have before building models, (2) we sometimes won't know how valuable data is until we look at it, and (3) communication with IT is often slow and laden with political issues inherent in many organizations.
On the other hand, IT is often reluctant to give analysts significant freedom to query databases because of the harm they can do (wise!): data miners in general have a poor understanding of how databases work and of which queries are dangerous or computationally expensive.
Therefore, I am increasingly of the opinion that a master's program in data mining, or a data mining certificate program, should contain at least one course on databases. That course should include some database design component, but for the most part should emphasize a user's perspective. It is probably more realistic to require this for a degree than a certificate, but it could be included in both. I know that, in considering new hires, a candidate with SQL or SAS experience would have an advantage with me.
For the second issue, business experience, some might be concerned that "experience" is too narrow for a degree program. After all, if someone has experience in building response models, what good would that do for PayPal if they are looking to build fraud models? My reply is "a lot"! Building models on real data (meaning messy) to solve a real problem (meaning identifying a target variable that conveys the business decision to be improved) requires a thought process that isn't related to knowing algorithms or data.
Building "real-world" models requires a translation of business objectives to data mining objectives (as described in the Business Understanding section of CRISP-DM, pdf here). When I have interviewed young data miners in the past, it is those who have had to go through this process that are better prepared to begin the job right away, and it is those who recognize the value here who do better at solving problems in a way that impacts decisions rather than finding cool, innovative solutions that never see the light of day. (UPDATE: the crisp-dm.org site is no longer up--see comments section. The CRISP-DM 1.0 document however can still be downloaded here, with higher resolution graphics, by the way!)
My challenge to the universities who are adding degree programs in data mining and predictive analytics, or are offering Certificate programs is then to include courses on how to access data (databases), and how to solve problems (business objectives, perhaps by offering a practicum with a local company).
Thursday, May 05, 2011
Number of Hidden Layer Neurons to Use
In the linkedin.com Artificial Neural Networks group, a question arose about how many hidden neurons one should choose. I've never found a fully satisfactory answer to this, but there are quite a lot of guesses and rules of thumb out there.
I've always liked Warren Sarle's neural network FAQ that includes a discussion on this topic.
There is another reference on the web that I agree with only about 50%, but the references are excellent: http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html.
My personal preference is to use software that experiments with multiple architectures and selects the one that performs best on held-out data. Better still are the algorithms that also select (i.e. prune) inputs as well. As I teach in my courses, I've spent far too many hours in my life selecting neural network architectures and re-training, so I'd much rather let the software do it for me.
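To make the idea concrete, here is a toy sketch of what such software does: train one network per candidate hidden-layer size and keep the size with the lowest error on held-out data. Everything here is an assumption for illustration (the tiny pure-Python network, the simulated sine data, the candidate sizes, the learning rate); real tools also try several random starts per architecture, which this sketch does not.

```python
import math
import random


def train_mlp(train, n_hidden, epochs=200, lr=0.05, seed=0):
    """Fit a 1-input, 1-output net with one tanh hidden layer by plain SGD."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in train:
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(n_hidden)]
            err = sum(w2[j] * h[j] for j in range(n_hidden)) + b2 - y
            for j in range(n_hidden):
                # Gradient at the hidden unit's input (uses w2 before its update).
                grad_z = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_z * x
                b1[j] -= lr * grad_z
            b2 -= lr * err

    def predict(x):
        return sum(w2[j] * math.tanh(w1[j] * x + b1[j]) for j in range(n_hidden)) + b2

    return predict


def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Simulated data: a noisy sine curve, split into training and held-out sets.
rng = random.Random(1)
data = [(x, math.sin(2 * x) + rng.gauss(0, 0.1))
        for x in (rng.uniform(-2, 2) for _ in range(150))]
train, held_out = data[:100], data[100:]

candidates = [1, 2, 4, 8]  # hidden-layer sizes to try
scores = {k: mse(train_mlp(train, k), held_out) for k in candidates}
best = min(scores, key=scores.get)

for k in candidates:
    print(f"hidden={k}: held-out MSE={scores[k]:.4f}")
print("selected:", best)
```

The point is the loop at the bottom, not the network: the choice is made by held-out error rather than by a rule of thumb.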
Monday, April 25, 2011
Statistical Rules of Thumb, part II
A while back, Will Dwinnell posted on two books, one of which is one of my favorites as well:
Will mentioned a few general topics covered in the book, but I thought I would mention two specific ones that I agree with wholeheartedly.
7.3: Always Graph the Data
In this section he quotes E.R. Tufte as follows (Abbott quoting van Belle quoting Tufte):
Graphical Excellence is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the shortest space.
While I'm not so sure I agree with the superlatives, I certainly agree with the gist that excellence in graphics is parsimonious, clear, insightful, and informationally rich. Contrast this to another rule of thumb:
7.4: Never use a Pie Chart
Well, that's not exactly rocket science; pie charts have lots of detractors... The only thing worse than a pie chart is a 3-D pie chart!
7.6: Stacked Barcharts are Worse than Bargraphs.
Perhaps the biggest problem with stacked bar graphs (such as the one here) is that you cannot see clearly the comparison between the colored values in the bins.
(A good summary of why they are problematic is in Stephen Few's Newsletter, which you can download here.)
I have found that data shown in a chart like this can be shown better in a table, perhaps with some conditional formatting (in Excel) or other color coding to push the eye toward the key differences in values. For continuous data, this often means binning a variable (akin to the histogram) and creating a cross-tab. The key is clarity--make the table so that the key information is obvious.
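As a minimal sketch of the binning and cross-tab step, using only the Python standard library: the records, the 20-year bin width, and the field names below are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (age, responded) pairs, made up for this example.
records = [(23, "yes"), (37, "no"), (45, "no"), (52, "yes"), (29, "no"),
           (61, "yes"), (34, "yes"), (48, "no"), (57, "yes"), (41, "no")]


def age_bin(age, width=20, start=20):
    """Label the fixed-width bin an age falls into, e.g. '20-39'."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


# Cross-tab: count of records for each (age bin, response) combination.
counts = Counter((age_bin(age), resp) for age, resp in records)
bins = sorted({b for b, _ in counts})

print(f"{'age':<8}{'yes':>5}{'no':>5}")
for b in bins:
    print(f"{b:<8}{counts[(b, 'yes')]:>5}{counts[(b, 'no')]:>5}")
```

The resulting table puts the values to be compared side by side in one row, which is exactly the comparison a stacked bar makes hard.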
Tuesday, April 19, 2011
Rexer Analytics data mining survey
Rexer Analytics, a data mining consulting firm, is conducting their 5th annual survey of the analytic behaviors, views and preferences of data mining professionals. I urge all of you to respond to the survey and help us all understand better the nature of the data mining and predictive analytics industry. The following text contains their instructions and overview.
If you want to skip the verbiage and just get on with the survey, use code RL3X1 and go here.
Your responses are completely confidential: no information you provide on the survey will be shared with anyone outside of Rexer Analytics. All reporting of the survey findings will be done in the aggregate, and no findings will be written in such a way as to identify any of the participants. This research is not being conducted for any third party, but is solely for the purpose of Rexer Analytics to disseminate the findings throughout the data mining community via publication, conference presentations, and personal contact.
To participate, please click on the link below and enter the access code in the space provided. The survey should take approximately 20 minutes to complete. Anyone who has had this email forwarded to them should use the access code in the forwarded email.
Survey Link: www.RexerAnalytics.com/Data-Miner-Survey-2011-Intro2.html
Access Code: RL3X1
If you would like a summary of last year’s or this year’s findings emailed to you, there will be a place at the end of the survey to leave your email address. You can also email us directly (DataMinerSurvey@RexerAnalytics.com) if you have any questions about this research or to request research summaries. Here are links to the highlights of the previous years’ surveys. Contact us if you want summary reports from any of these years.
-- 2010 survey highlights: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html
-- 2009 survey highlights: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2009.html
-- 2008 survey highlights: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2008.html
-- 2007 survey highlights: http://www.rexeranalytics.com/Data-Miner-Survey-Results.html
Thank you for your time. We hope this research program continues to provide useful information to the data mining community.
Sincerely,
Karl Rexer, PhD
Monday, April 11, 2011
Predictive Models are not Statistical Models — JT on EDM
This post was first posted on Predictive Models are not Statistical Models — JT on EDM
My friend and colleague James Taylor asked me last week to comment on a question regarding statistics vs. predictive analytics. The bulk of my reply is on James' blog; my full reply is here, re-worked from my initial response to clarify some points further.
I have always loved reading the green "Sage" books, such as Understanding Regression Assumptions (Quantitative Applications in the Social Sciences) or Missing Data (Quantitative Applications in the Social Sciences), because they are brief, cover a single topic, and are well-written. As a data miner though, I am also somewhat amused reading them because they are obviously written by statisticians with the mindset that the model is king. This means that we either pre-specify a model (the hypothesis) or require the model be fully interpretable, fully representing the process we are modeling. When the model is king, it's as if there is a model in the ether that we as modelers must find, and if we get coefficients in the model "wrong", or if the model errors are "wrong", we have to rebuild the data and then the model to get it all right.
In data mining and predictive analytics, the data is king. The algorithms often induce the model structure from the data (decision trees do this), and even when they only fit coefficients (as neural networks do), it's the accuracy that matters rather than the coefficients. Often, in the data mining world, we won't have to explain precisely why individuals behave as they do so long as we can explain generally how they will behave. Model interpretation is often related to describing trends (sensitivity or importance of variables).
I have always found David Hand's summaries of the two disciplines very useful, such as this one here; I found that he had a healthy respect for both disciplines.
Tuesday, March 29, 2011
Analyzing the Results of Analysis
Sometimes, the output of analytical tools can be voluminous and complicated. Making sense of it sometimes requires, well, analysis. Following are two examples of applying our tools to their own output.
Model Deployment Verification
From time to time, I have deployed predictive models on a vertical application in the finance industry which is not exactly "user friendly". I have virtually no access to the actual deployment and execution processes, and am largely limited to examining the production-mode output, as implemented on the system in question.
As sometimes happens, the model output does not match my original specification. While the actual deployment is not my individual responsibility, it very much helps if I can indicate where the likely problem is. As these models are straightforward linear or generalized linear models (with perhaps a few input data transformations), I have found it useful to calculate the correlation between each of the input variables and the difference between the deployed model output and my own calculated model output. The logic is that input variables with a higher correlation with the deployment error are more likely to be calculated incorrectly. While this trick is not a cure-all, it quickly identifies the culprit data elements in 80% or more of cases.
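Here is a small sketch of that check on simulated data. The variable names, coefficients, and the particular bug (one input arriving in the wrong units) are all assumptions made up for the example; the real models and data are not shown here.

```python
import math
import random


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
names = ["income", "balance", "tenure"]                     # hypothetical inputs
coefs = {"income": 0.5, "balance": -1.2, "tenure": 0.8}     # hypothetical linear model
rows = [{n: rng.gauss(0, 1) for n in names} for _ in range(500)]


def score(row):
    """The model as specified."""
    return sum(coefs[n] * row[n] for n in names)


def deployed_score(row):
    """The model as deployed, with a simulated bug: balance is off by a factor of 100."""
    bad_row = dict(row, balance=row["balance"] / 100.0)
    return score(bad_row)


# Deployment error, then its correlation with each input.
error = [deployed_score(r) - score(r) for r in rows]
corr = {n: pearson([r[n] for r in rows], error) for n in names}
suspect = max(names, key=lambda n: abs(corr[n]))

for n in names:
    print(f"{n:<8} corr with error = {corr[n]:+.2f}")
print("most likely culprit:", suspect)
```

With a single mis-calculated input in a linear model, the error is a multiple of that input, so its correlation stands out sharply. With several interacting problems the picture is murkier, which is why I don't call it a cure-all.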
Model Stability Over Time
A bedrock premise of all analytical work is that the future will resemble the past. After all, if the rules of the game keep changing, then there's little point in learning them. Specifically in predictive modeling, this premise requires that the relationship between input and output variables must remain sufficiently stable for discovered models to continue to be useful in the future.
In a recent analysis, I discovered that models universally exhibited a substantial drop in test performance, when comparing out-of-time to (in-time) out-of-sample. The relationships between at least some of my candidate input variables and the target variable are presumably changing over time. In an effort to minimize this issue, I attempted to determine which variables were most susceptible. I calculated the correlation between each candidate predictor and the target, both for an early time-frame and for a later one.
My thinking was that variables whose correlation changed the most across time were the least stable and should be avoided. Note that I was looking for changes in correlation, and not whether correlations were strong or weak. Also, I regarded strengthening correlations just as suspect as weakening ones: The idea is for the model to perform consistently over time.
In the end, avoiding the use of variables which exhibited "correlation slide" did weaken model performance, but did ensure that performance did not deteriorate so drastically out-of-time.
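The "correlation slide" screen can be sketched as follows, again with only the standard library. The data, the variable names and the layout (one dictionary of columns per time-frame) are hypothetical; the sketch only illustrates comparing each predictor's correlation with the target across an early and a later period, and ranking by the absolute change in either direction.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_slide(early, late, target_name):
    """Return (name, early_r, late_r, |change|) per predictor, largest change first."""
    rows = []
    for name in early:
        if name == target_name:
            continue
        r_early = pearson(early[name], early[target_name])
        r_late = pearson(late[name], late[target_name])
        # Strengthening and weakening are treated alike: only the size of the change counts.
        rows.append((name, r_early, r_late, abs(r_late - r_early)))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Hypothetical early and late samples with the same columns.
early = {
    "target": [0, 0, 1, 1, 0, 1],
    "stable": [1, 2, 5, 6, 2, 5],
    "drifting": [1, 1, 4, 5, 2, 6],
}
late = {
    "target": [1, 0, 1, 0, 0, 1],
    "stable": [5, 1, 6, 2, 2, 5],
    "drifting": [1, 5, 2, 6, 4, 1],
}

rows = correlation_slide(early, late, "target")
for name, r_early, r_late, change in rows:
    print(f"{name}: early r = {r_early:+.2f}, late r = {r_late:+.2f}, change = {change:.2f}")
```

How large a change is "too large" is a judgment call that depends on sample size; the sketch deliberately ranks the variables rather than applying a fixed cutoff.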
Final Thought
It is interesting to see how useful analytical tools can be when applied to the analytical process itself. I note that solutions like the ones described here need not use fancy tools: often, simple calculations of means, standard deviations and correlations are sufficient.
Sunday, March 06, 2011
Statistics: The Need for Integration
I'd like to revisit an issue we covered here, way back in 2007: Statistics: Why Do So Many Hate It?. Recent comments made to me, both in private conversation ("Statistics? I hated that class in college!") and in print, prompt me to reconsider this issue.
One thing which occurs to me is that many people have a tendency to think of statistics in an isolated way. This world view keeps statistics at bay, as something which is done separately from other business activities, and, importantly, which is done and understood only by the statisticians. This is very far from the ideal which I suggest, in which statistics (including data mining) are much more integrated with the business processes of which they are a part.
In my opinion, this is a strange way to frame statistics. As an analogy, imagine if, when asked to produce a report, a business team turned to their "English guy", with the expectation that he would do all the writing. I am not suggesting that everyone needs to do the heavy lifting that data miners do, but that people should accept some responsibility for data mining's contribution to the business process. Managers, for example, who throw up their hands with the excuse that "they are not numbers people" forfeit control over an important part of their business function. It is healthier for everyone involved, I submit, if statistics moves away from being a black art, and statisticians become less of an arcane priesthood.
Wednesday, February 23, 2011
The Power of Prescience: Achieving Lift with Predictive Analytics
I'll be participating in the DM Radio broadcast tomorrow, The Power of Prescience: Achieving Lift with Predictive Analytics, Thursday, Feb 23 at 3pm ET. The best practices that we will be discussing include:
1) properly define the problem to be solved (don’t shoot in the dark); 2) identify a key target variable to predict (must be a good decision-making metric in the company); 3) determine what “good” means, success-wise (what is the baseline for success?); 4) identify the appropriate data that can aid in prediction. There’s also: 5) finding the right algorithms, but this doesn’t matter unless 1-4 are nailed.
I also plan on talking about the importance of proper perspective in building models. We want predictive models to be good, even excellent, but in the end, we need the models to improve decision-making over what is done currently. I'm not advocating low expectations, just reasonable expectations.
Wednesday, February 16, 2011
The Judgement of Watson: Mathematics Wins!
Tom Davenport argues in this HBR article, "Why I'm Pulling for Watson", that
I want Watson to win. Why? It's elementary: my dear Watson is a triumph of human ingenuity. In other words, there is no way humans can lose this competition. Watson also illustrates that the knowledge, judgment, and insights of the smartest humans can be embedded into automated systems. I suspect that those automated systems will ultimately be used to make better decisions in many domains, and interact with humans in a much more intelligent way. If computers can persuade Alex Trebek that they're very smart—and that's what he said about Watson—they'll be able to interact effectively with almost any human with a problem to solve.
While this is true, I don't agree that Watson itself is using "judgement" or "making decisions". It appears to me that it is a very nice search engine that incorporates NLP to make these searches more relevant. It isn't giving opinions, synthesizing information to create innovative ideas, or making inferences through extrapolation, all things humans do on a regular basis. This has long been one of my complaints about the way neural networks were described: they "learn", they "think", they "make inferences". No, they are nonlinear functions whose weights are found via gradient descent searches. They no more "learn" than logistic regression "learns".
A lot of the hype gets back to the old "hard AI" vs. "soft AI" debates that have been going on for decades. I appreciated very much the book by Roger Penrose on this subject, Shadows of the Mind: A Search for the Missing Science of Consciousness.
This isn't to minimize the incredible feat IBM has accomplished with Watson, or on a simpler level, the feats of decision-making that can be performed with nonlinear mathematics in neural networks or support vector machines. These are phenomenal accomplishments that are awe-inspiring mathematically, and on a more practical level will assist us all in the future with improved ability to automate decision-making. Of course, these kinds of decisions are those that do not require innovation or judgement, but can be codified mathematically. Every time I check out at an automatic teller at Home Depot, deposit checks at an ATM, or even make an Amazon purchase, I'm reminded of the depth of technology that makes these complex transactions simple for the user. Watson is the beginning of the next leap in this ongoing technological march forward, all created by enterprising humans who have been able to break down complex behavior into repeatable, reliable, and flexible algorithmic steps.
In the end, I agree with Mr. Davenport, "So whether the humans or Watson win, it means that humans have come out on top."
Tuesday, February 08, 2011
Predictive Analytics Innovation
The Predictive Analytics Summit, a relative newcomer to the Predictive Analytics conference circuit, will be held in San Diego on Feb 24-25. At the first Summit in San Francisco last Fall, I enjoyed several of the talks and the networking. This time I will be presenting a fraud detection case study.
Monday, February 07, 2011
Webinar with James Taylor -- 10 Best Practices in Operational Analytics
I'll be presenting a webinar with James Taylor this Wednesday at 10AM PST entitled "10 best practices in operational analytics".
One of the most powerful ways to apply advanced analytics is by putting them to work in operational systems. Using analytics to improve the way every transaction, every customer, every website visitor is handled is tremendously effective. The multiplicative effect means that even small analytic improvements add up to real business benefit.
In this session, James Taylor, CEO of Decision Management Solutions, and Dean Abbott of Abbott Analytics will provide you with 10 best practices to make sure you can effectively build and deploy analytic models into your operational systems.
Friday, January 28, 2011
Predictive Analytics World Early-bird ends Monday
The early-bird special for Predictive Analytics World / San Francisco ends January 31, 2011. It saves you $200 on the conference rate and $100 on any workshop, including my Hands-On Predictive Analytics using SAS Enterprise Miner on March 17th.
More details on the 7 workshops can be found here.
Hope to see you there!
Thursday, January 27, 2011
Do analytics books sell?
Kevin Hillstrom has a fascinating post on brief, technical ebooks (Amazon singles) sold on Amazon here: Kevin Hillstrom: MineThatData: Amazon Singles. His points: interesting content is what sells. Length doesn't matter, but these ebooks are typically less than 50 pages. Price doesn't matter.
Should I jump in? Should you?
Saturday, January 22, 2011
Doing Data Mining Out of Order
I like the CRISP-DM process model for data mining, teach from it, and use it on my projects. I commend it to practitioners and managers routinely as an aid during any data mining project. However, while the process sequence is generally the one I use, I don't always follow it. Data mining often requires more creativity and "art" to re-work the data than we would like; it would be very nice if we could create a checklist and just run through the list on every project! Unfortunately, data doesn't always cooperate in this way, and we therefore need to adapt to the specific data problems so that the data is better prepared.
For example, on a current financial risk project I am working on, the customer is building data for predictive analytics for the first time. The customer is data savvy, but new to predictive analytics, so we've had to iterate several times on how the data is pulled and rolled up out of the database. In particular, the target variable has had to be cleaned up because of historic coding anomalies.
One primary question to resolve for this project is an all-too-common debate over what is the right level of aggregation: do we use transactional data even though some customers have many transactions and some have few, or do we roll data up to the customer level to build customer risk models? (A transaction-based model will score each transaction for risk, whereas a customer-based model will score, daily, the risk associated with each customer given the new transactions that have been added.) There are advantages and disadvantages to both, but in this case, we are building a customer-centric risk model for reasons that make sense in this particular business context.
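To make the aggregation choice concrete, here is a small sketch of rolling transactions up to one row per customer. The field names and the particular summaries (count, total, largest amount) are hypothetical; which roll-ups are useful depends on the business context.

```python
from collections import defaultdict

def roll_up_to_customer(transactions):
    """Aggregate (customer_id, amount) transactions into one summary row per customer."""
    summary = defaultdict(
        lambda: {"n_transactions": 0, "total_amount": 0.0, "max_amount": float("-inf")}
    )
    for customer_id, amount in transactions:
        row = summary[customer_id]
        row["n_transactions"] += 1
        row["total_amount"] += amount
        row["max_amount"] = max(row["max_amount"], amount)
    return dict(summary)

# Hypothetical transactions: customer A has two, customer B has one.
transactions = [("A", 10.0), ("A", 30.0), ("B", 5.0)]
customers = roll_up_to_customer(transactions)
for customer_id, row in sorted(customers.items()):
    print(customer_id, row)
```

A customer with many transactions and a customer with few each end up as a single row, which is the trade-off described above: the uneven transaction counts stop dominating the data, at the cost of losing transaction-level detail.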
Back to the CRISP-DM process and why it is advantageous to deviate from CRISP-DM. In this project, we jumped from Business Understanding and the beginnings of Data Understanding straight to Modeling. I think in this case, I would call it "modeling" (small 'm') because we weren't building models to predict risk, but rather to understand the target variable better. We were not sure exactly how clean the data was to begin with, especially the definition of the target variable, because no one had ever looked at the data in aggregate before, only on a single customer-by-customer basis. By building models, and seeing some fields that predict the target variable "too well", we have been able to identify historic data inconsistencies and miscoding.
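One simple way to look for fields that predict the target "too well" is sketched below. It is a stand-in for, not a reproduction of, the models built on the project: for each categorical field it measures how accurately the target can be predicted from that field alone (the majority class within each distinct value) and flags fields above a threshold. The field names, data and the 0.95 threshold are hypothetical.

```python
from collections import Counter, defaultdict

def single_field_accuracy(values, target):
    """In-sample accuracy of predicting the majority target class for each field value."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, target):
        by_value[value][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(target)

def flag_suspicious_fields(fields, target, threshold=0.95):
    """Return the names of fields that, alone, predict the target at or above the threshold."""
    return sorted(
        name
        for name, values in fields.items()
        if single_field_accuracy(values, target) >= threshold
    )

# Hypothetical data: "closed_reason" is only filled in after the outcome is known.
target = [0, 0, 1, 1, 0, 1]
fields = {
    "closed_reason": ["", "", "chargeoff", "chargeoff", "", "chargeoff"],
    "region": ["N", "S", "N", "S", "N", "S"],
}

flagged = flag_suspicious_fields(fields, target)
print(flagged)
```

Fields with nearly one distinct value per record (IDs, timestamps) will score perfectly on this in-sample measure for trivial reasons, so a flagged field is a prompt to ask how and when it was populated, not proof of miscoding.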
Now that we have the target variable better defined, I'm going back to the data understanding and data prep stages to complete those stages properly, and this is changing how the data will be prepped in addition to modifying the definition of the target variable. It's also much more enjoyable to build models than do data prep, so for me this was a "win-win" anyway!