There is a new predictive analytics conference coming up Feb 18-19 in San Francisco called Predictive Analytics World. I'm very much looking forward to it in the hopes that it will appeal to the data mining / predictive analytics practitioner.
I'll be presenting a case study I worked on with TN Marketing using ensembles of logistic regression models. Also, I'll be on a panel discussion on Cross-Industry Challenges and Solutions in Predictive Analytics.
Hope to see some of you there!
Tips, tricks, and comments related to topics in data science and machine learning. Used to be called "data mining and predictive analytics" but updated the title to reflect the language of the day!
Hosted by Dean Abbott, Abbott Analytics
Saturday, January 31, 2009
Text Mining and Regular Expressions
I've been spending quite a lot of time in the bowels of a text mining project recently, mostly in the text/concept extraction phase. We're using the SPSS Text Mining tool for the work so far. (As a quick aside, the text mining book I've enjoyed reading the most in recent months is the Weiss, Indurkhya, Zhang, and Damerau)
The most difficult part of the project has been that all of the text is really customized lingo--a language of its own as presented in the notes sections of the documents we are reading. Therefore, we can't use the typical linguistic extraction techinques, and rather are relying heavily on regular expressions. That certainly takes me back a few years! I used to use regular expressions mostly in shell programming (Bourne, CShell, Korn Shell and later BASH).
I must say it has been very productive, though it also makes me appreciate language rules that don't exist in any consistent way with our notes. As I am able, I'll post on more specifics on this project.
Regarding books on regular expressions, I found the unix books weren't quite so good on this topic. However, the O'Reilly Mastering Regular Expressions book is quite good.
The most difficult part of the project has been that all of the text is really customized lingo--a language of its own as presented in the notes sections of the documents we are reading. Therefore, we can't use the typical linguistic extraction techinques, and rather are relying heavily on regular expressions. That certainly takes me back a few years! I used to use regular expressions mostly in shell programming (Bourne, CShell, Korn Shell and later BASH).
I must say it has been very productive, though it also makes me appreciate language rules that don't exist in any consistent way with our notes. As I am able, I'll post on more specifics on this project.
Regarding books on regular expressions, I found the unix books weren't quite so good on this topic. However, the O'Reilly Mastering Regular Expressions book is quite good.