I have been reading a book by Don Knuth called Things a Computer Scientist Rarely Talks About (Center for the Study of Language and Information - Lecture Notes)--a very good read for those of you interested in theology as well as analytics. This post is not about the theology of the book (as interesting as that is to me), but rather about the reason he gives in it for writing another book, 3:16, a study of all the 3:16 verses in the Bible. In his chapter on randomized testing (I like to think of model ensembles here), he describes how random sampling is a good way to get an idea of the content of "stuff", whether computer science assignments (he actually does this--randomly take page X of a project and examine it in depth) or books (like the Bible). His 3:16 book takes that verse from every book in the Bible to get a sense of the Bible's overall message. He admittedly chose 3:16 because of John 3:16, so that he would get at least one great verse, but this was a concession to making the book marketable.
At first I wasn't a big fan of this idea. After all, it is a small sample. But he describes how he then studied these verses in depth. Whereas his prior understanding of the Bible was vague and general (which has its positive points), this exercise also led to a deeper (albeit narrow) understanding. I recommend this approach very much.
What does this have to do with analytics? Data mining is often viewed as a way to get the gist of your data: see the big picture and understand patterns through summarized views. But just as important is the deep view, looking at a few examples (prototypes) in depth. In the text mining project I'm working on right now, while we extract "concepts", much of our time is also spent tracing a few text blocks through the processing to understand in detail why the analytics works the way it does. I'm a "both / and" kind of guy, so this suits me well: big-picture analytics as well as deep dives into record-level descriptions.
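The sampling idea above is easy to sketch in code. Here is a minimal, hypothetical Python example (the `records` list and `deep_dive_sample` helper are my own illustrations, not from any particular project): draw a small random sample of records, then inspect each one in depth rather than only looking at aggregate summaries.

```python
import random

# Hypothetical stand-in for a real dataset of text blocks.
records = [f"text block {i}" for i in range(1000)]

def deep_dive_sample(records, k=5, seed=316):
    """Pick k records at random for detailed, record-level inspection."""
    rng = random.Random(seed)  # fixed seed keeps the deep dive reproducible
    return rng.sample(records, k)

sample = deep_dive_sample(records, k=5)
for rec in sample:
    # In a real project, trace each sampled record through the full
    # processing pipeline here and study the intermediate results.
    print(rec)
```

The fixed seed is a deliberate choice: like Knuth picking verse 3:16 everywhere, a repeatable sample lets you return to the same few records as the analytics evolves.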