Monday, October 20, 2008

What topics would you like to see covered at a KDD conference?

This is your chance to voice your opinion!

What topics, sessions, or tutorials would be most useful for you at a conference like KDD? Would a full industrial track be of interest, or are industries so diverse that we really need tracks to be narrowed to specific industries?

Please--practitioners only. I'm defining practitioners as those who get paid to develop models that are actually used in industry.

I'll kick it off with one idea:

Tutorials (1/2 day) geared toward the practitioner. This means that if techniques are described (such as social networking), there must be implementations of the algorithmic ideas available in competitive commercial software. As great as R and Matlab are, for example, relatively few practitioners are programmers who can take advantage of these kinds of frameworks.

I know there are tutorials at KDD every year. This year I didn't go because they were all on Sunday, when I wasn't able to attend, but I would have wanted to go to the Text Mining tutorial, as that topic has become a significant part of my business over the past couple of years.

One last thought: I think one thing that may happen (understandably) is that topics that have been covered in years past are not revisited. For those of us who live in the data mining world, it is far more interesting to continue to explore new ideas, especially those that build on ideas we have already explored in depth. However, as data mining increases in its use, we are bringing folks in who have not had that same benefit. For many, a tutorial on decision trees would be very useful and interesting (like the KDD 2001 tutorial--trees, to my knowledge, have not been revisited since, except in the framework of ensembles in 2007).

Thursday, October 09, 2008

Two Books of Interest

Recently, I have been reading two books which may be of interest to data miners: Statistical Rules of Thumb by Gerald Van Belle (ISBN-13: 978-0471402275) and Common Errors in Statistics (and How to Avoid Them) by Phillip I. Good and James W. Hardin (ISBN-13: 978-0471794318). Both impart practical advice based on extensive experience and statistical rigor, yet avoid becoming hung up on academic issues.

While both are written from the point of view of traditional statisticians, they do suggest the use of some less traditional techniques, such as the bootstrap and robust regression. A wide range of topics is covered, such as sample size determination, hypothesis testing and treatment of missing values. Both books also include some material written for audiences working in specific fields, such as environmental science and epidemiology. Material in these two books will vary in applicability to data mining, given the traditional statistical focus on smaller data sets and parametric modeling.

I highly recommend both of them. Tables of contents can easily be found on-line, and an entire chapter of Statistical Rules of Thumb is available at: Chapter 2: Sample Size.