Tuesday, January 04, 2005

Beware of Outliers in Computing Correlations

Outliers can produce misleading summary statistics in a variety of ways. One is in computing correlations: it doesn’t take many outliers to change a correlation coefficient significantly. Take, for example, 1,000 random samples of a variable X, uniformly distributed over the range [0,1]. A second variable, Y, is a combination of X and a second uniform random sample, constructed so that X and Y have a correlation coefficient of 0.99. Now suppose two outliers are introduced among the 1,000 data points. Their values of X are unchanged, but their values of Y are magnified by a factor of 10, which places them well away from the trend of the original data points. The left figure below shows the strong correlation of X and Y, and the figure to the right shows the same data with the two outliers.

The summary statistics for the four variables are shown in Table 1 below. The correlations (Table 2), however, tell a different story: with only two outliers among the 1,000 points, the correlation coefficient drops from 0.99 to 0.49.
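For readers who want to try this themselves, the experiment can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the procedure used to produce the tables: the function name simulate, the random seed, the way Y is mixed from X and the noise, and the choice of which two points to turn into outliers (the two whose X is nearest 0.25) are all hypothetical. Because the size of the drop depends heavily on where the outliers sit, this sketch should not be expected to reproduce the 0.49 figure exactly.

```python
import numpy as np

def simulate(n=1000, target_r=0.99, seed=0):
    """Return (r_clean, r_outliers) for a simulated X, Y data set."""
    rng = np.random.default_rng(seed)

    # X and an independent noise term, both uniform on [0, 1].
    x = rng.uniform(0.0, 1.0, n)
    noise = rng.uniform(0.0, 1.0, n)

    # Assumed mixing: Y = X + a * noise. Since X and the noise have equal
    # variance, corr(X, Y) = 1 / sqrt(1 + a^2); solve that for a.
    a = np.sqrt(1.0 / target_r**2 - 1.0)
    y = x + a * noise

    # Assumed outlier placement: take the two points whose X is closest
    # to 0.25, keep X unchanged, and magnify Y by a factor of 10.
    y_out = y.copy()
    idx = np.argsort(np.abs(x - 0.25))[:2]
    y_out[idx] *= 10.0

    # Pearson correlation without and with the two outliers.
    r_clean = np.corrcoef(x, y)[0, 1]
    r_outliers = np.corrcoef(x, y_out)[0, 1]
    return r_clean, r_outliers

r_clean, r_outliers = simulate()
print(f"correlation without outliers: {r_clean:.3f}")
print(f"correlation with 2 outliers:  {r_outliers:.3f}")
```

Moving the two outliers further from the trend, or magnifying Y by a larger factor, pulls the second number down further.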

Therefore, test your data for outliers that could be influencing summary statistics. More on how to do that in a future issue of Abbott Insights™.