Tuesday, March 13, 2007

Missing Values and Special Values: The Plague of Data Analysis

Every so often, an article is published on data mining which includes a statistic like "Amount of data mining time spent preparing the data: 70%", or something similar , expressed as a pie chart. It is certainly worth the investment of time and effort at the beginning of a data mining project, to get the data cleaned up, to maximize model performance and avoid problems later on.

Two related issues is data preparation are missing values and special values. Note that some "missing values" are truly "missing values" (items for which there is a true value which is not present in the data), while others are actually special values or undefined (or at least poorly defined) values. Much has already been written about truly missing values, especially in the statistical literature. See, for instance:

Dealing with Missing Data, by Judi Scheffer

Missing data, by Thomas Lumley

Working With Missing Values, by Alan C. Acock

How can I deal with missing data in my study?, by Derrick A. Bennett

Advanced Quantitative Research Methodology, G2001, Lecture Notes: Missing Data, by Gary King

Important topics to understand and keywords to search on, if one wishes to study missing data and its treatment are: MAR ("missing at random"), MCAR ("missing completely at random"), NMAR ("not missing at random"), non-response and imputation (single and multiple).

Special values, which are not quite the same as missing values, also require careful treatment. An example I encountered recently in my work with bank account data was a collection of variables which were defined over lagged time windows, such as "maximum balance over the last 6 months" or "worst delinquency in the last 12 months".

The first issue was that the special values were not database nulls ("missing values"), but were recorded as flag values, such as -999.

The second issue was that the flag values, while consistent within individual variables, varied across this set of variables. Some variables used -999 as the flag value, others used -999.99. Still others used -99999.

The first and second issues, taken together, meant that actually detecting the special values was, ultimately, a tedious process. Even though this was eventually semi-automated, the results needed to be carefully checked by the analyst.

The third issue was the phenomenon driving the creation of special values in the first place: many accounts had not been on the system long enough to have complete lagged windows. For instance, an account which is only 4 months old has not been around long enough to accumulate 12 months worth of delinquency data. In this particular system, such accounts received the flag value. Such cases are not quite the same as data which has an actual value which is simply unrecorded, and methods for "filling-in" such holes probably would provide spurious results.

A similar issue surrounds a collection of variables which relies on some benchmark event- which may or may not have happened, such as "days since purchase" or "months since delinquency". Some accounts had never purchased anything, and others had never been delinquent. One supposes that, theoretically, such situations should have infinity recorded. In the actual data, though, they had flag values, like -999.

Simply leaving the flag values makes no sense. There are a variety of ways of dealing with such circumstances, and solutions need to be carefully chosen given the context of the problem. One possibility is to convert the original variable to one which represents, for instance, the probability of the target class (in a classification problem). A simple binning or curve-fitting procedure would act as a single-variable model of the target, and the special value would be assigned whatever probability was observed in the training data for those cases.

Many important, real circumstances will give rise to these special values. Be vigilant, and treat them with care to extract the most information from them and avoid data mining pitfalls.

3 comments:

  1. Ahhh. I'm dealing with missing values on a project this week, so this is exactly up my alley!

    One book I really like too on missing values is by Paul Allison,

    2001 Missing Data. Thousand Oaks, CA: Sage Publications. It's from a statistical perspective (so it spends time on maximum likelihood techniques as one example, something that data miners rarely do in my experience), but it is still excellent.

    I also like Dorian Pyle's book on Data Preparation -- he persuaded me in the book to consider imputing to a distribution as an improvement over mean/median imputation, and I've actually used it on many occasions.

    As one other extreme, something I haven't seen written up anywhere, but one I've used as well (and is related to Pyle's technique), rather than randomly imputing from a distribution, like a normal, Poisson, or other distribution, us one matches your data. Just grab a random value from your actual training data--this is particularly good when you don't have a clean distribution.

    But when possible, I prefer something smarter like building models to impute missing values--even when the models stink, they are almost always better than random or constant-value imputation.

    ReplyDelete
  2. I agree. Filling in missing values with global constants is almost never a good idea, unless perhaps the proportion of missings is extremely small.

    All imputation methods are, essentially, models. I've found that dividing up the data by deciles of even one (but preferably multiple) other variable, and filling in locally with the local mean or median is relatively inexpensive and much better than filling in with global constants.

    ReplyDelete
  3. Anonymous12:29 PM

    I handle missing and special values in a very simple manner : I leave them as they are. Just becaus I do not have the time to impute them. By using decision trees (bagging) they present no problem and they are taken into account.
    De resulting models are "Good enough" to be useful in marketing and also very cheap in data mining time.

    ReplyDelete