## Monday, June 25, 2007

### To Graph Or Not To Graph

Recently, I had an interesting conversation with an associate regarding graphs. My colleague had worked with someone who held the opinion that graphs were worthless, since anything you might decide based on a graph should really be decided based on a statistic. My initial response was to reject this idea. I have used graphs a number of times in my work, and believed them to be useful, although I readily admit that in many cases, a simple numeric measure or test could have been substituted, and may have added precision to the analysis. Data visualization and related technologies are all the rage at the moment, but I wonder (despite having a nerd's appetite for computer eye-candy) whether data mining should perhaps be moving away from these human-centric tools.

Thoughts?

Dean Abbott said...

I'm not sure which statistical test your colleague has in mind. A good example of the limitations of statistical tests in conveying information is the famous Anscombe's Quartet data set. In his excellent book The Visual Display of Quantitative Information, Edward Tufte states the following:

"Graphics reveal data. Indeed, graphics can be more precise and revealing than conventional statistical computations. Consider Anscombe's Quartet: all four of these data sets are described by exactly the same linear model (at least until the residuals are examined)."

Since I don't know how to put graphics in the comments section of these posts, I will repeat this and place the graphic in a new post...

A place to see Anscombe's quartet data is here.

The point is this: a statistical test always summarizes data, thus removing some of the information that is in the data. If the data is well behaved, the summary is an excellent representation of the information in the data (very little is removed or smoothed away), and the test can be used effectively. In that case, I would agree with your colleague. But often the data is not well behaved, and in those cases, no statistical test will tell you the full story.
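For anyone who wants to check this directly, here is a small Python sketch using Anscombe's published 1973 values; all four sets share essentially the same mean, correlation, and linear fit:

```python
import numpy as np

# Anscombe's Quartet (Anscombe, 1973). Sets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares linear fit
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    print(f"mean_y={y.mean():.2f}  corr={r:.3f}  "
          f"fit: y = {intercept:.2f} + {slope:.3f}x")
```

All four lines of output read mean_y=7.50, corr=0.816, y = 3.00 + 0.500x (to the printed precision), yet the graphs of the four sets look nothing alike.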

In the context of data mining, what I often do is build a non-parametric model, look at which variables are key predictors in the model, and then examine those variables graphically to understand why the model picked up on those variables.

Will Dwinnell said...

I think that the idea wasn't so much to replace a graph with a single statistical test, but rather to draw conclusions based on numeric measures (which may or may not, strictly speaking, be statistical "tests"), rather than on visual interpretation of graphs.

I agree with Tufte that a table which merely dumps the raw data of Anscombe's quartet is not terribly meaningful. The question, though, is: What conclusion is being drawn via a graph, and would it be better served by one or more quantitative measurements? In the case of Anscombe's quartet, for instance, the page you refer to mentions that the mean and standard deviation of the Y values are identical for all four groups. The graphs, however, depict very different relationships between the Y values and the X values. To make an apples-to-apples comparison, though, the data summaries should include the X values; after all, the graphs do, and someone obviously had to have selected X and Y specifically to examine before making the graphs.

Dean Abbott said...

I need to post the original data and the Anscombe quartet....

But Tufte's point is this:
All four data sets in the quartet have identical summary statistics--same y vs. x correlation, same R^2 in the linear fit, same MSE, etc. These are the kinds of statistics you might use to examine different variables in your data. Moreover, you could view this as data having a common "X" and four different "Y" values (or vice versa). So on the surface, the summary statistics make it appear that the data sets all have similar (if not identical) properties. In a modeling context, you could try to predict four different Ys from a common X and conclude that they are all the same, if you only look at the summary statistics.

But when you see the data graphically, it is obvious that they are all quite different. Summary statistics can be quite deceiving, and you need more summary statistics to test if your interpretation of one statistic is actually valid.

For example, when I was doing some work with the IRS, we were looking at line items on tax returns (about 30K returns), and two line items had a correlation of about 0.92 or 0.95, something quite big. So it looked like they were quite related. But after removing a few dozen data points, the correlation was somewhere in the 0.25 range. Why? Because there were huge outliers in the data.

Now of course you could have diagnosed that there were big outliers by looking at quartiles, skew, error distributions in models, etc., but it was a lot easier, and I would argue more informative, to just graph the data.
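I can't share the IRS data, of course, but the effect is easy to reproduce with made-up numbers (everything below is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bulk of the data: two essentially unrelated "line items"
x = rng.normal(100, 15, 3000)
y = rng.normal(50, 10, 3000)

# A handful of huge joint outliers (think: a few very large returns)
x_out = rng.normal(5000, 500, 30)
y_out = rng.normal(2500, 250, 30)

x_all = np.concatenate([x, x_out])
y_all = np.concatenate([y, y_out])

print(f"with outliers:    r = {np.corrcoef(x_all, y_all)[0, 1]:.2f}")  # near 1
print(f"without outliers: r = {np.corrcoef(x, y)[0, 1]:.2f}")          # near 0
```

Thirty points out of three thousand are enough to push the correlation from roughly zero to roughly one, which is exactly the kind of thing a scatterplot shows at a glance.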

Anonymous said...

Hi I just found your site the other day, and here's my two cents on graphs:

A la Tufte, graphs do reveal data in a very concrete way. They allow you to present a lot of data at the level of the observation, as opposed to category or entire data set. Boxplots, histograms, or scatterplots, of course, could show outliers in the IRS data.

I think graphs are especially important when presenting results to management or clients. They may not appreciate r-squared, but a line of best fit could be useful.

Van Scott

Will Dwinnell said...

anonymous, I think you make a good point, but it's worth noting the difference between presenting data, and making a decision based on it. I absolutely agree that graphs are useful when dealing with managers and other non-technical folks.

However, I think the fellow I mentioned makes a good point, too, regarding the use of measurements instead of graphs as a basis for decision-making. I can't say that I agree with him completely, because I do find graphs useful for diagnostic purposes on occasion.

Consider your example of finding outliers. Visual inspection will likely not be effective unless the number of dimensions is very small (perhaps 2 or 3). For my part, I'd rather consider something like Mahalanobis distance.
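As a sketch of the kind of thing I mean (the data is hypothetical, and the helper function is my own, not from any particular package):

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))       # 5 dimensions: awkward to eyeball
X[0] = [6.0, -6.0, 6.0, -6.0, 6.0]  # plant one multivariate outlier
d2 = mahalanobis_sq(X)
print(int(d2.argmax()))             # index of the most extreme point: 0
```

The planted point dominates the distance ranking even though no single pairwise scatterplot is guaranteed to make it obvious, and the same calculation works unchanged in 50 dimensions.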

Anonymous said...

Good points, especially about the number of dimensions that can be sensibly graphed. Graphing multivariate data sets is a visual challenge.

The work that I'm doing now (in a business environment) is a combination of exploration (like EDA) and statistical tests. When I find something that appears germane to the business problem, I usually provide the graph and the statistic.

Will Dwinnell said...

All four data sets in the quartet have the identical summary statistics--same y vs. x correlation, same R^2 in the linear fit, same MSE, etc.

First, we must admit the extremely artificial nature of this data set, given its small exemplar count and very small variable count.

If the objective is to distinguish the four data sets, then I'd say this selection of summary statistics wouldn't have been my first choice. While I would have looked at the means and maybe standard deviations of the variable Y, I would also have considered its median and IQR. From the bivariate perspective, more robust estimates of correlation should be considered (Spearman's rank-order or Kendall's tau).
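As a quick illustration of why the robust estimates matter (synthetic data, and a minimal no-ties Spearman of my own rather than a library call):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (assumes no ties): Pearson on the ranks."""
    rank = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(rank(x), rank(y))[0, 1]

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = rng.normal(size=200)   # unrelated to x
x[0], y[0] = 50.0, 50.0    # one huge joint outlier

print(f"Pearson:  {np.corrcoef(x, y)[0, 1]:.2f}")  # inflated by the outlier
print(f"Spearman: {spearman(x, y):.2f}")           # barely moved
```

A single wild point drags the Pearson coefficient above 0.8 on otherwise unrelated variables, while the rank-based estimate stays near zero, because ranks cap how much leverage any one observation can have.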

If the objective was to model the data, then I don't understand, for instance, why the analyst would choose a linear model for the data set with the obviously quadratic relationship (note that this would have been obvious from regression diagnostics).

I think the real question is: Under what circumstances are graphs preferable, where real data is being used to make a real decision?

Dean Abbott said...

"If the objective was to model the data, then I don't understand, for instance, why the analyst would choose a linear model for the data set with the obviously quadratic relationship (note that this would have been obvious from regression diagnostics)."

That is exactly what I'm driving at. Before you model, you have no idea what form of model is appropriate. Afterward, yes, examining the residuals is a good idea. I don't know about you, but the most common technique I use for examining residuals is to graph them. Getting a report on the average error or maximum error will help, but won't tell me enough to understand how to fix the problem. Do you have a favorite test here?

Anonymous said...

Hi,

On the question of "to draw conclusions based on numeric measures (which may or may not, strictly speaking, be statistical "tests"), rather than on visual interpretation of graphs": I would avoid possibly subjective interpretations wherever they can be replaced with a more precise one.

I don't think it's really about whether graphics are useful, but rather about the nature of these two types of "display" (a summary displayed as a single number versus a pictorial representation of a series of numbers) and what it does, or could, mean to interpret each, given what is actually being shown.

Yes, graphics are a great way to present things. But this is sometimes because people simply like pictures, rather than because of the actual ability of each type of display to support the correct interpretation.

Yes, graphics are a quick way for an analyst to "glance" at a relationship, for example.

Some intricacies of interpreting and applying "numerical" summaries have been mentioned above. We should not forget that there are a number of ins and outs to graphics as well: the type of display, the scales used, the dimensions of the graphic itself, etc. On top of this, there is the interpreting party's ability to properly conceptualize each type of "display" and interpret it correctly.

My 2-cents...

Jay

Will Dwinnell said...

I don't know about you, but the most common technique I use for examining residuals is to graph them. Getting a report on the average error or maximum error will help, but won't tell me enough to understand how to fix the problem. Do you have a favorite test here?

Let me clarify: this idea about graphs was suggested to me. I do not wholly endorse it, though I think it has some merit and am willing to play Devil's advocate. I have found graphs useful for exploratory and diagnostic purposes and will continue to do so.

I find the Anscombe data to be highly artificial and not typical of data mining scenarios. For one thing, I believe that any reasonable data mining effort involves a search over model complexity. Even restricting the analyst to polynomial models, Anscombe's quadratic data set would have quickly been found out in such a search.
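To make that concrete, here is a naive degree search in Python over Anscombe's second (quadratic) data set, using the published values; the jump in fit quality at degree 2 gives it away without any graph:

```python
import numpy as np

# Anscombe's second data set -- the one with the visibly quadratic shape
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13,
              3.10, 9.13, 7.26, 4.74])

for degree in (1, 2, 3):
    coefs = np.polyfit(x, y, degree)                       # fit polynomial
    sse = float(np.sum((np.polyval(coefs, x) - y) ** 2))   # residual SS
    print(f"degree {degree}: SSE = {sse:.4f}")
```

The sum of squared errors collapses from roughly 13.8 at degree 1 to essentially zero at degree 2, so even a purely numeric model-complexity search flags the linear fit as the wrong choice.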

My thoughts regarding your immediate question:

1. Residuals certainly can be analyzed via graphs. I have used them for that myself.

2. Graphing residuals clearly also has limitations, especially when:

2.a. The number of observations is so high that point clouds are difficult to interpret.

2.b. The number of observations is so small that statistical significance is in question.

2.c. (If plotting residuals versus predictors...) The count of predictor variables is more than a very small number, since interactions become obscured.

3. Interpretation of residuals plots cannot be automated.

Naturally, the question of numeric alternatives presents itself. Beyond basic performance summaries such as you mention (average error, etc.), I suggest examining something like the mean absolute error or mean error within each decile of the model output. Such measures can be checked for simple relationships like trends, etc.
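A minimal sketch of that decile idea (the data is a toy stand-in, and the helper function is mine, not from any package):

```python
import numpy as np

def mae_by_decile(y_true, y_pred):
    """Mean absolute error within each decile of the model output."""
    order = np.argsort(y_pred)
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))[order]
    return [float(chunk.mean()) for chunk in np.array_split(abs_err, 10)]

# Toy model whose error grows with its own output -- a pattern a single
# overall MAE would hide, but a per-decile table exposes as a trend
rng = np.random.default_rng(4)
y_pred = rng.uniform(0, 100, size=1000)
y_true = y_pred + rng.normal(scale=y_pred / 10 + 0.1)

print([round(m, 1) for m in mae_by_decile(y_true, y_pred)])
```

The printed list rises steadily from the first decile to the last, which is exactly the sort of simple trend that can be checked programmatically, with no plot and no human in the loop.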