Applied Data Science and Machine Learning: 01/01/2005

Tuesday, January 04, 2005

Beware of Outliers in Computing Correlations

Outliers can distort summary statistics in a variety of ways. One such way is in computing correlations, where it doesn’t take many outliers to change a correlation coefficient significantly. Take, for example, 1,000 random samples of a variable X, uniformly distributed over the range [0,1]. A second variable, Y, is a combination of X and a second uniform random sample, constructed so that X and Y have a correlation coefficient of 0.99. Now suppose two outliers are introduced among the 1,000 data points: their X values are unchanged, but their Y values are magnified by a factor of 10, placing them far from the trend of the original data points. The left figure below shows the strong correlation of X and Y, and the figure to the right shows the same data with the two outliers.

The statistics for the four variables are shown in Table 1 below. The correlations (Table 2), however, tell a different story: with just two outliers among the 1,000 points, the correlation between X and Y drops from 0.99 to 0.49.
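The experiment is easy to approximate in a few lines of Python. What follows is a minimal sketch under stated assumptions, not the original procedure: the post does not say how X and the second uniform sample were combined or exactly where the outliers were placed. Here Y is taken to be X plus a scaled uniform noise term (the weight `a` is chosen so the theoretical correlation is 0.99), and the two outliers are the points with the smallest X, with Y set to 10.0, roughly ten times the normal scale of Y. The seed and variable names are arbitrary, and the resulting coefficient will be in the neighborhood of the figure above rather than matching it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n = 1000

# X and an independent noise term, both uniform on [0, 1]
x = rng.uniform(0.0, 1.0, n)
noise = rng.uniform(0.0, 1.0, n)

# Assumption: Y = X + a * noise. For independent terms with equal variance,
# corr(X, Y) = 1 / sqrt(1 + a**2), so this a targets a correlation of 0.99.
a = np.sqrt(1.0 / 0.99**2 - 1.0)
y = x + a * noise

# Assumption: the two outliers sit at the smallest X values, with Y pushed
# to 10.0 (about ten times the usual scale of Y), far off the trend.
y_out = y.copy()
outlier_idx = np.argsort(x)[:2]
y_out[outlier_idx] = 10.0

# Pearson correlation with and without the two outliers
r_clean = np.corrcoef(x, y)[0, 1]
r_outliers = np.corrcoef(x, y_out)[0, 1]

print(f"correlation without outliers: {r_clean:.2f}")
print(f"correlation with 2 outliers:  {r_outliers:.2f}")
```

Placing the outliers at low X and high Y works against the positive trend, which is why so few points can pull the coefficient down so far. Outliers placed elsewhere would change it by a different amount.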

Therefore, test your data for outliers that could be influencing summary statistics. More on how to do that in a future issue of Abbott Insights™.
