Larger quantities of data permit greater precision, greater certainty and more detail in analysis. As observation counts increase, standard errors decrease and the opportunity for more detailed (perhaps more segmented) analysis rises. These things are obvious to even junior analysts: the standard error of the mean is calculated as the standard deviation divided by the square root of the observation count.
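To make the square-root relationship concrete, here is a minimal sketch in Python. The function name `standard_error` and the standard deviation of 50 are illustrative assumptions, not figures from any real data set. The point it shows is that quadrupling the observation count only halves the standard error.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean: standard deviation / sqrt(observation count)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)


# Hypothetical standard deviation of 50 (say, dollars of customer spending).
for n in (100, 400, 1600):
    print(n, standard_error(50.0, n))
# Each fourfold increase in n cuts the standard error in half: 5.0, 2.5, 1.25
```

Run in the other direction, the same arithmetic is the uncomfortable part: cutting a data set into quarters doubles the standard error within each piece.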
This general idea may seem obvious when spoken aloud, but it is something to which many non-technical people seem to give little thought. Ask any non-technical client whether more data will provide a better answer, and the response will be in the affirmative. It is a simple trend to understand.
However, people who do not analyze data for a living do not necessarily think about such things in precise terms. On too many occasions, I have listened to managers or other customers indicate that they wanted to examine data set X and test Y things. Without performing any calculations, I had strong suspicions that it would not be feasible to test Y things, given the meager size of data set X. Attempts to explain this have been met with various responses. To be fair, some of them were constructive acknowledgments of this unfortunate reality, and new expectations were established. In other cases, I was forced to be the insistent bearer of bad news.
In one such situation, a data set with fewer than twenty thousand observations was to be divided among about a dozen direct mail treatments. Expected response rates were typically in the single-digit percents, meaning that only a few hundred observations would be available for analysis. Treatments were to be compared based on various business metrics (customer spending, etc.). Given the small number of respondents and the high variability of this data, I realized that this was unlikely to be productive. I eventually gave up trying to explain the futility of this exercise, and resigned myself to listening to biweekly explanations of the noisy graphs and summaries. One day, though, I noticed that one of the cells contained a single observation! Yes, much energy and attention were devoted to tracking this "cell" of one individual, which of course would have no predictive value whatsoever.
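A back-of-the-envelope version of that suspicion can be sketched in a few lines of Python. Every number here is an assumption chosen for illustration: 20,000 pieces mailed (the upper bound of the data set described), 12 equally sized cells, a 3% response rate, and a coefficient of variation of 1.5 for spending among responders. The function names are hypothetical as well.

```python
import math


def responders_per_cell(total_mailed: int, n_cells: int, response_rate: float) -> float:
    """Expected responders in each cell, assuming the mailing is split evenly."""
    return total_mailed / n_cells * response_rate


def relative_standard_error(coef_of_variation: float, n: float) -> float:
    """Standard error of a cell mean, expressed as a fraction of that mean."""
    return coef_of_variation / math.sqrt(n)


# Assumed inputs: 20,000 mailed, 12 cells, 3% response rate.
n_per_cell = responders_per_cell(20_000, 12, 0.03)

# Assumed spending variability: standard deviation is 1.5 times the mean.
rel_se = relative_standard_error(1.5, n_per_cell)

print(f"Expected responders per cell: {n_per_cell:.0f}")
print(f"Relative standard error of mean spending: {rel_se:.0%}")
```

Under these assumptions each cell holds roughly 50 responders, and mean spending per cell carries a standard error of about a fifth of its own value, so differences between treatments would have to be very large to stand out from the noise.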
It is important for data analysts to make clear the limitations of our craft. One such limitation is the necessity of sufficient data from which to draw reasonable and useful conclusions. It may be helpful to frame this requirement as a data budget: "Given the quality and volume of our historical data, we only have the data budget to answer questions about 3 segments, not 12." Simply saying "We don't have enough data" is not effective (so I have learned through painful experience). Referring to this issue in terms which others can appreciate may help.
Great post, Will, and a very funny treatment of the subject. Moreover, this phenomenon gets far worse when decision-makers want to segment the data further, say by geographic region.
But some modelers make a bad situation even worse by stratifying the sample (throwing away non-responders), so that the 10K samples with 200 responders become 200 non-responders and 200 responders! That post is here: http://abbottanalytics.blogspot.com/2009_11_01_archive.html