Wednesday, January 10, 2007

Data Visualization: the good, the bad, and the complex

I have found that data visualization for the purpose of explaining results is often done poorly. I am not a fan of the pie chart, for example, and am nearly always against the use of 3-D charts when shown on paper or a computer screen (where they appear as 2-D entities anyway). That said, charts and graphs need not be boring. If you would like to see some interesting examples of obtuse charts and figures, go to Stephen Few's web site and look at the examples--they are very interesting.

I like in particular this one, which also contains a good example of humility on the part of the chart designer, along with their improvement on the original.

However, even well-designed charts are not always winners if they don't communicate the ideas effectively to the intended audience. One of my favorite charts from my work, created for a health club, is on my web site and is reproduced here:

The question here was this: based on a survey given to members of the clubs, which characteristics expressed in the survey were most related to the highest-value members? I have always liked this chart because it combines simplicity (it is easy to see the balls and understand that higher is better for each one, showing which characteristics of the club are better than the peer average) with rich information: there are at least four dimensions of information, arguably six. The figure of merit for judging 'good' is a combination of questions on the club survey related to overall satisfaction, likelihood to recommend the club to a friend, and the individual's interest in renewing membership--this combination was called the 'Index of Excellence'.

  • The seven most significant survey questions are plotted in order from right to left (the rightmost is the most important). Significance was determined by a combination of factor analysis and linear regression models.
  • The relative performance of each club compared to the others in its peer group is shown on the y-axis, relative to the average of the clubs.
  • The relative difference between the 2003 and 2002 results is shown in two ways: first with the color of the ball (green for better, yellow for about the same, and red for worse), and also by comparing the big ball to the small dot at the same position along the importance axis.
  • Finally, the size of the ball indicates the relative importance of the survey question for that club--bigger means more important.
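The encodings in these bullets can be sketched as a small mapping function. This is purely illustrative (the original chart was not built in code, and the names, threshold, and scales below are my own assumptions):

```python
def encode_question(importance, this_year, last_year, same_band=0.02):
    """Map one survey question's data to the chart's visual encodings.

    importance -- relative importance of the question for this club
    this_year  -- club score relative to peer average, current year
    last_year  -- the same measure for the prior year
    same_band  -- illustrative threshold for "about the same"
    """
    delta = this_year - last_year
    if delta > same_band:
        color = "green"   # better than last year
    elif delta < -same_band:
        color = "red"     # worse than last year
    else:
        color = "yellow"  # about the same
    return {
        "y": this_year,      # ball height: performance vs. peer average
        "size": importance,  # ball size: importance of the question
        "color": color,      # ball color: year-over-year change
        "dot_y": last_year,  # small dot: last year's position
    }

# A question that improved year over year maps to a green ball
print(encode_question(0.8, 0.15, 0.05))
```

The point of the sketch is that each visual channel (position, size, color, companion dot) carries exactly one dimension of the data.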

Each bullet is a dimension represented in the plot, but note that bullets 2 and 3 are relative values and really represent two dimensions. Regardless of how many dimensions you count, I think the chart is visually appealing and information rich. One could simplify it by removing the small dots, but that's about all I would do to it. My web site also has this picture, but recolored to fit the color scheme of the site, and I think it loses some of its intuitive visual feel as a result.

However, much to my dismay, the end customer found it too complex, and we (Seer Analytics, LLC and I) created another rule-based solution that turned out to be more appealing.

Opinions on the graphic are appreciated as well--maybe Seer and I just missed something here :) But at this point it is all academic anyway, since the time for modifying this solution has long passed.


Will Dwinnell said...

Nerd that I am, I must admit to being a lover of computer eye-candy. I even bought a copy of "Johnny Mnemonic" (from the "cheap bin" at Wal-Mart) just to watch the cyberspace sequence! I have written on visualization and use it frequently in my work.

Nevertheless, I am reminded of two occasions on which people commented on the desirability of statistical tests over graphics (once by Gregory Piatetsky-Shapiro, and once by a colleague who felt that all graphs were worthless). Statistical tests help take people "out of the loop", increasing speed, testability, and objectivity.

I still believe that visualization has utility, especially as a fast way for the analyst to interrogate the data, but it is worth keeping this other, more test-oriented perspective in mind.

Dean Abbott said...

Will--thanks once again for a very interesting, challenging, and provocative comment!

That said, statistical tests over graphics? Can you elaborate? It seems to me that the two are complementary. If a graphic were created merely to show the results of a statistical test, then I would agree, but I rarely see graphics used that way.

For example, let's say you have done PCA and want to assess the components. You could display a table of eigenvalues, but I can reach a conclusion more quickly with a graphic such as a scree plot. You may find it interesting that my favorite way to view the loadings is a hybrid: I load the eigenvectors into Excel and use conditional formatting to color-code the cells whose values exceed some threshold--it lets me put in three different cutoffs, so I have four colors (three that meet the conditions and one default white). This way I see the numbers but get visual cues about what to focus on. This is particularly helpful when I have 100+ fields and a dozen or more components to look at.
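The conditional-formatting idea can be mimicked in code. Here is a minimal sketch in Python--the three cutoffs and color names are invented for illustration, not the actual values used in Excel:

```python
def color_loading(loading, cutoffs=(0.3, 0.5, 0.7)):
    """Assign a highlight color to a loading by absolute magnitude.

    The three cutoffs mirror the three conditional-formatting rules;
    anything below the lowest cutoff stays 'white' (unhighlighted).
    """
    a = abs(loading)
    lo, mid, hi = cutoffs
    if a >= hi:
        return "red"      # strongest loadings: look here first
    if a >= mid:
        return "orange"
    if a >= lo:
        return "yellow"
    return "white"        # default: not worth attention

# One column of an eigenvector, color-coded
loadings = [0.82, -0.55, 0.31, 0.12]
print([color_loading(x) for x in loadings])  # ['red', 'orange', 'yellow', 'white']
```

With 100+ fields, scanning the colors is much faster than scanning the raw numbers, which is exactly the visual-cue effect described above.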

I find that good visualizations provide more information (denser and clearer) than tables of data, at least in some circumstances. There is no concise way to present all the information in the graphic shown in this post except graphically. So I find the comment that "all graphics are worthless" surprising at best and ignorant at worst.

I think Ed Tufte's work in this regard is quite helpful, such as The Visual Display of Quantitative Information.

Will Dwinnell said...

Just to play devil's advocate... One might argue that whatever decision-making process you used to select a cutoff for the first so-many principal components--based on variances, number of components, and so on--could be formalized into a set of rules or a formula that could be applied without needing graphs.

Personally, I wouldn't place myself at the "graphs are useless" end of the spectrum, but I can share that I've been working very hard to formalize as much of my modeling process as possible, for the practical reason that I want to encapsulate it with k-fold cross-validation (or bootstrapping, etc.). My work often involves smaller data sets, or large data sets with substantial class imbalance (with the target class representing 0.5% to 4% of the population). More thorough testing than a single train-and-test split is necessary.
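The kind of resampling this calls for can be sketched as a simple stratified k-fold split. This is a plain-Python illustration (in practice one would use a library routine; the function name and round-robin scheme are my own):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield k (train_idx, test_idx) splits that preserve class balance.

    With a rare target class (say 0.5%-4% positives), plain random folds
    can easily end up with no positives at all; dealing each class's
    indices round-robin keeps every fold's class mix roughly equal.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal indices round-robin per class
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f in range(k) if f != t for i in folds[f])
        yield train, test

# Toy data: 4% positive class; every test fold gets exactly one positive
labels = [1] * 4 + [0] * 96
for train, test in stratified_kfold(labels, k=4):
    print(sum(labels[i] for i in test), len(test))
```

Each of the four folds tests on 25 examples containing exactly one positive, so every round of the cross-validation actually exercises the rare class.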

Dean Abbott said...

I think it is difficult to formalize the cutoff for principal components--maybe that should be a new thread all by itself. I've seen people select all components with eigenvalues > 1, or enough components to comprise 90%+ of the variance...those kinds of things.

But part of the problem is that the measure of goodness of these components is not necessarily related to the predictive model. For PCA, the figure of merit is the sum of squared errors in the projections. This may be (and often is) completely unrelated to the figure of merit for a classification model, so the fields selected by PCA may be suboptimal.

So in this case, we did use a subjective measure--we wanted components with SS loadings > 1 (the factor analysis equivalent of eigenvalues, I believe--I understand PCA better than FA), but we also wanted to limit the number of factors to a small number to make the final model more understandable. There were 10 factors with SS loadings greater than 1, but we ended up using only 4 of them, plus 3 original questions from the survey, to get our 7 fields for the model. I'm not very good in this limited Blogger HTML environment, so pardon the messiness here, but this is the way the factors came out. The columns are:
1) factor
2) SS Loadings
3) proportional variance
4) cumulative variance

The analysis was done in S-Plus.

Factor     SS Loadings   Prop. Var.   Cum. Var.
Factor1    4.39          12.2%        12.2%
Factor2    2.85           7.9%        20.1%
Factor3    1.87           5.2%        25.3%
Factor4    1.58           4.4%        29.7%
Factor5    1.57           4.4%        34.1%
Factor6    1.49           4.1%        38.2%
Factor7    1.45           4.0%        42.2%
Factor8    1.35           3.7%        46.0%
Factor9    1.13           3.1%        49.1%
Factor10   1.07           3.0%        52.1%

So how did we do it? What was the formal methodology?

1) Factors with SS loadings > 1
2) No more than 7 factors; fewer than 7 if some factors don't contribute sufficiently, and fewer still if we retain some original fields along with the factors (as we did in this case)
3) All factors had to make sense; that is, the loadings and groupings of loadings had to tell a coherent story.
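Rules 1 and 2 are mechanical, so they can be checked in code against the SS loadings table above; rule 3 ("tells a sensible story") remains a human judgment. A sketch in Python using the numbers from the table:

```python
# SS loadings from the factor analysis table above
ss_loadings = {
    "Factor1": 4.39, "Factor2": 2.85, "Factor3": 1.87, "Factor4": 1.58,
    "Factor5": 1.57, "Factor6": 1.49, "Factor7": 1.45, "Factor8": 1.35,
    "Factor9": 1.13, "Factor10": 1.07,
}

# Rule 1: keep factors with SS loadings > 1 (all ten pass here)
candidates = [f for f, ss in ss_loadings.items() if ss > 1]

# Rule 2: at most 7 fields total in the model; keeping 3 original
# survey questions leaves room for at most 4 factors
n_original_fields = 3
max_fields = 7
chosen = candidates[: max_fields - n_original_fields]
print(chosen)  # the top four factors by SS loading
```

This reproduces the 4 factors + 3 survey questions = 7 model fields described above, with rule 3 applied afterward by inspecting the loadings.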

We had plenty of data here, so it wasn't a problem to just partition into train/test/validate sets.