Every so often, an article is published on data mining which includes a statistic like "Amount of data mining time spent preparing the data: 70%", or something similar, expressed as a pie chart. It is certainly worth investing time and effort at the beginning of a data mining project to get the data cleaned up, both to maximize model performance and to avoid problems later on.
Two related issues in data preparation are missing values and special values. Note that some "missing values" are truly missing (items for which a true value exists but is not present in the data), while others are actually special values, or undefined (or at least poorly defined) values. Much has already been written about truly missing values, especially in the statistical literature. See, for instance:
Dealing with Missing Data, by Judi Scheffer
Missing data, by Thomas Lumley
Working With Missing Values, by Alan C. Acock
How can I deal with missing data in my study?, by Derrick A. Bennett
Advanced Quantitative Research Methodology, G2001, Lecture Notes: Missing Data, by Gary King
Important topics to understand, and keywords to search on, for anyone wishing to study missing data and its treatment, are: MAR ("missing at random"), MCAR ("missing completely at random"), NMAR ("not missing at random"), non-response, and imputation (single and multiple).
Special values, which are not quite the same as missing values, also require careful treatment. An example I encountered recently in my work with bank account data was a collection of variables defined over lagged time windows, such as "maximum balance over the last 6 months" or "worst delinquency in the last 12 months".
The first issue was that the special values were not database nulls ("missing values"), but were recorded as flag values, such as -999.
The second issue was that the flag values, while consistent within individual variables, varied across this set of variables. Some variables used -999 as the flag value, others used -999.99. Still others used -99999.
The first and second issues, taken together, meant that actually detecting the special values was a tedious process. Even though detection was eventually semi-automated, the results needed to be carefully checked by the analyst.
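The semi-automated detection step described above might be sketched roughly as follows. This is only an illustration, not the actual process used: the candidate flag list, the function name, and the 1% frequency threshold are all assumptions, and (as noted) anything such a scan reports still needs to be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical list of flag values to scan for; real systems may use others.
COMMON_FLAGS = [-999, -999.99, -9999, -99999]

def find_flag_candidates(df, flags=COMMON_FLAGS, min_share=0.01):
    """Report candidate flag values that appear suspiciously often in each
    numeric column (at least min_share of all rows)."""
    report = {}
    n = len(df)
    for col in df.select_dtypes(include=[np.number]).columns:
        hits = {f: (df[col] == f).sum() / n for f in flags}
        hits = {f: share for f, share in hits.items() if share >= min_share}
        if hits:
            report[col] = hits
    return report

# Example: two columns carrying different flag values, one clean column.
accounts = pd.DataFrame({
    "bal_max_6m": [100.0, -999.0, 250.0, -999.0, 75.0],
    "delinq_12m": [0.0, 1.0, -99999.0, 2.0, 0.0],
    "age_months": [4.0, 18.0, 36.0, 7.0, 60.0],
})
print(find_flag_candidates(accounts))
```

A frequency threshold matters here: a legitimate balance could genuinely equal one of the candidate values once in a while, so only values occurring at an implausible rate are flagged for the analyst's review.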
The third issue was the phenomenon driving the creation of special values in the first place: many accounts had not been on the system long enough to have complete lagged windows. For instance, an account which is only 4 months old has not been around long enough to accumulate 12 months' worth of delinquency data. In this particular system, such accounts received the flag value. Such cases are not quite the same as data which has an actual value that simply went unrecorded, and methods for "filling in" such holes would probably produce spurious results.
A similar issue surrounds a collection of variables which rely on some benchmark event that may or may not have happened, such as "days since purchase" or "months since delinquency". Some accounts had never purchased anything, and others had never been delinquent. One supposes that, theoretically, such situations should have had infinity recorded. In the actual data, though, they had flag values, like -999.
Simply leaving the flag values in place makes no sense. There are a variety of ways of dealing with such circumstances, and solutions need to be chosen carefully given the context of the problem. One possibility is to convert the original variable into one which represents, for instance, the probability of the target class (in a classification problem). A simple binning or curve-fitting procedure would act as a single-variable model of the target, and the special value would be assigned whatever probability was observed in the training data for those cases.
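The binning variant of this idea can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the quantile-binning choice, and the five-bin default are mine, not from the original system, and in practice the bin-to-probability mapping must be fit on training data only.

```python
import numpy as np
import pandas as pd

def target_rate_encode(x, y, flag=-999, n_bins=5):
    """Replace a predictor with the observed probability of the target class:
    ordinary values are quantile-binned and mapped to each bin's training
    target rate; flag values get their own dedicated rate."""
    x, y = pd.Series(x, dtype=float), pd.Series(y)
    special = (x == flag)
    encoded = pd.Series(np.nan, index=x.index)
    # Ordinary values: quantile bins, each mapped to its mean target rate.
    bins = pd.qcut(x[~special], q=n_bins, duplicates="drop")
    encoded[~special] = y[~special].groupby(bins, observed=True).transform("mean")
    # Special (flagged) values: the target rate observed among flagged cases.
    if special.any():
        encoded[special] = y[special].mean()
    return encoded

# Example: flagged rows form their own group with their own observed rate.
x = pd.Series(list(range(100)) + [-999.0] * 20)
y = pd.Series([0] * 50 + [1] * 50 + [1] * 20)
encoded = target_rate_encode(x, y)
```

The appeal of this treatment is that it makes no pretense of recovering a "true" value for the flagged cases; it simply lets them carry whatever relationship to the target they actually exhibited in the training data.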
Many important, real circumstances will give rise to these special values. Be vigilant, and treat them with care to extract the most information from them and avoid data mining pitfalls.