Wednesday, August 21, 2013

Beware Phantom Data

One of the perennial challenges facing the data analyst is missing values. A great deal has been written about the importance of identifying the source of missing values, the danger of overly simplistic solutions and, of course, the many and varied mechanisms for "filling them in" with synthetic data ("imputation").

Of the tremendous volume of material written on this subject, nearly all assumes that the analyst knows precisely which items are missing from the data. In reality, this is sometimes not the case. Relational databases and statistical software files, as a rule, have a special value to indicate "missing", though that does not mean that it is always used. Some file formats offer only indirect provision for missings, if any at all, and how software reacts to such missings varies.

Consider, too, the popular practice of using special values (such as -9999) to represent missing values. What could possibly go wrong? For one thing, the person writing the data may not consider whether the flag value might represent a legitimate value. Is it possible, for instance, to have an account balance of -9999 dollars (euros, etc.)? In my career, I have seen databases which used different flag values for each field (-99, -9999, -99999, etc.), making the writing of code against such data extremely tedious (and error-prone). I have also seen -9999 used to indicate one type of missing value, and -9998 to indicate another type of missing. When the hand-off of information from one person (system, process, etc.) to another is confused or incomplete, interpretation of the data becomes incorrect.

Another aspect of this problem is the precise definition given to fields, and their possible misinterpretation by data consumers (such as data miners). Imagine that a particular integer field is being used to record the number of times each customer has made a payment on their loan, within the past 6 months. As customers begin their tenure, this variable starts with a value of zero. Suppose our model included this field as an independent variable. Presumably low risk customers have higher values, while higher risk customers have lower values. Without missing any payments, early lifecycle customer are penalized arbitrarily by the model. One could make the argument that this variable should be undefined (recorded as a database missing value flag) until a customer has a full 6-month track record, but this is exactly the sort of conversation which very often fails to materialize in real organizations.

These are all instances of "phantom data": Items in the database which are missing values, but mistaken for real data. It shouldn't take much imagination on the reader's part to conjure similar problematic situations in his or her own field. The lesson is to look beyond the known missings for more subtle gaps in the data. Time spent investigating the nature of database systems, company procedures and so forth which generate data is insurance against being burned by serious misunderstanding of the data.