Saturday, January 31, 2009

Text Mining and Regular Expressions

I've been spending quite a lot of time in the bowels of a text mining project recently, mostly in the text/concept extraction phase. We're using the SPSS Text Mining tool for the work so far. (As a quick aside, the text mining book I've enjoyed reading the most in recent months is the one by Weiss, Indurkhya, Zhang, and Damerau.)

The most difficult part of the project has been that all of the text is really customized lingo--a language of its own as presented in the notes sections of the documents we are reading. Therefore, we can't use the typical linguistic extraction techniques, and are instead relying heavily on regular expressions. That certainly takes me back a few years! I used to use regular expressions mostly in shell programming (Bourne, C shell, Korn shell and later Bash).

I must say it has been very productive, though it also makes me appreciate the language rules that don't exist in any consistent way in our notes. As I am able, I'll post more specifics on this project.

Regarding books on regular expressions, I found the Unix books weren't quite so good on this topic. However, the O'Reilly book Mastering Regular Expressions is quite good.


Angelo_Arts said...

There is a great book that deals with text mining from a regular expression approach. It is Practical Text Mining with Perl.
I am also trying to improve the quality of text content extraction from web pages, in particular for use in clustering later. The job is easier with dynamic web pages, where you can go directly to the database server to extract the relevant content, but it becomes harder with static pages, where everything is mixed together. The best I could do so far is use regex to eliminate HTML tags and what seem to be irrelevant elements, such as the table tags and what lives inside them. I am not loving the results so far! The problem with regular expressions (regex) is that you must have a deep knowledge of both the structure and the content of every web page in order to extract quality content from a huge pool of irrelevant elements.
Have you ever done text extraction on static web pages? Do you have any good ideas or techniques other than regex?
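
The regex approach described in this comment can be sketched roughly as below. This is only an illustration, not the commenter's actual code: the choice of which elements to drop (script, style, table) and the helper name strip_html are assumptions made for the example.

```python
import re
from html import unescape

# Assumption: script, style and table elements carry nothing worth keeping.
# The non-greedy match stops at the first closing tag, so nested tables
# are not handled correctly; that is one of the weaknesses of this approach.
DROP_BLOCKS = re.compile(
    r"<(script|style|table)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
ANY_TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def strip_html(page):
    """Remove dropped elements, then all remaining tags, then tidy whitespace."""
    text = DROP_BLOCKS.sub(" ", page)
    text = ANY_TAG.sub(" ", text)
    text = unescape(text)  # turn entities such as &amp; back into characters
    return WHITESPACE.sub(" ", text).strip()


sample = (
    "<html><body><table><tr><td>Home | About</td></tr></table>"
    "<p>Text mining &amp; regular expressions</p>"
    "<script>var x = 1;</script></body></html>"
)
print(strip_html(sample))  # Text mining & regular expressions
```

Note that the sketch throws away every table, including any that hold real content, which is exactly the kind of page-specific knowledge the comment says regex demands.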

Dean Abbott said...

I have used the text mining tool in Clementine and Polyanalyst. I've had very positive experiences with both tools.

In Clementine, you can read in data from a web site and it will automatically grab all the pages from the web site and strip out the HTML tags--very convenient for what you are doing. An example is shown here, and the Clementine tool information is seen here.

I haven't used Polyanalyst on web pages (only for notes fields stored in a more structured format), so I don't recall offhand if it reads in web pages (Sergei, if you are reading this, please comment)--however, I strongly believe it does. If I find more information I'll include it in a separate comment.

Of course there are many other tools that I'm sure handle web data more automatically, such as Enterprise Miner, but I haven't personally used them.

Dean Abbott said...

By the way, thanks for the recommendation on the Perl book.

Angelo_Arts said...

Actually, tools that incorporate tag removal techniques to extract textual content from static web pages are widely available. TextPipe Pro is one of the best commercially available tools that include web text mining. This is a demo to show the capabilities of TextPipe Pro in web text mining. I can also implement my own tag cleaners in many languages, including Perl (using HTML::Parser and other modules), Python (the BeautifulSoup parser), and many others.

Again, most of these tools rely on regular expressions to extract useful content. There is no way, for example, to determine which TABLE tag really contains useful content and which TABLE tag is used only for page layout. Badly written HTML that does not adhere to web standards is also a serious challenge for regular expressions. Some web programmers close every single tag they write and some do not. Some of them validate their HTML before publishing and some do not. Regular expressions can be very useful, I agree, but from my personal experience they do not have the final word in data cleaning and preparation when we deal with static web pages.
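
For comparison, here is a minimal sketch of a parser-based tag cleaner of the kind mentioned above. It uses Python's standard-library html.parser as a stand-in for HTML::Parser or BeautifulSoup so that it runs without extra packages; the class name TextExtractor, the helper extract_text and the set of skipped elements are assumptions for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style."""

    SKIP = {"script", "style"}  # assumption: these never hold useful text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.chunks)


# The <p> and <br> tags are never closed, which the parser tolerates.
messy = "<p>First paragraph<p>Second one<br>with a break<style>p {color: red}</style>"
print(extract_text(messy))  # First paragraph Second one with a break
```

A parser copes with unclosed tags more gracefully than a pattern does, but it does nothing to tell a layout TABLE from a content TABLE; that decision still needs knowledge of the page.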

Dean Abbott said...

And the problem of variable human-generated content doesn't end with HTML. Part of the problem with my current text mining project is parsing out a keyword from a list of nouns. Sometimes a comma means "new idea" (it is a stop character) and sometimes a comma is just a delimiter of nouns (and not a stop character). Determining what role it plays requires some careful thinking to switch on and off which mode the text block is in.
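
The mode switching described here can be sketched as a small state machine. The actual rules of the project's notes are not given, so the convention below is purely hypothetical: a colon opens a list in which commas only separate nouns, a semicolon or period closes the list, and anywhere else a comma ends the current idea. The function name split_ideas and the sample note are likewise made up for the example.

```python
import re

# Split a note into punctuation marks and the text runs between them.
TOKEN = re.compile(r"[,;.:]|[^,;.:]+")


def split_ideas(note):
    """Group text into ideas, treating commas differently inside a list."""
    ideas = []
    current = []
    in_list = False  # hypothetical "list mode", switched on by a colon
    for tok in TOKEN.findall(note):
        if tok == ":":
            in_list = True
        elif tok == ",":
            # Outside a list the comma is a stop character.
            if not in_list and current:
                ideas.append(current)
                current = []
        elif tok in (";", "."):
            # A semicolon or period always ends the idea and the list.
            in_list = False
            if current:
                ideas.append(current)
                current = []
        else:
            word = tok.strip()
            if word:
                current.append(word)
    if current:
        ideas.append(current)
    return ideas


note = "engine noisy, replaced: belt, pulley, hose; test ok."
print(split_ideas(note))
# [['engine noisy'], ['replaced', 'belt', 'pulley', 'hose'], ['test ok']]
```

In real notes the signal that switches modes is rarely as clean as a colon, which is where the careful thinking comes in.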

Thanks for the URL for TextPipe Pro. I hope to be able to take a look within the next few weeks.

Dean Abbott said...

I saw a reference to Text Processing in Python on Keith McCormick's site. I can't vouch for it yet, but it is free, though only for personal use. You can also buy the book on Amazon, which gives the author, David Mertz, some royalties. If you like what you read, please support him.

Anonymous said...

Another good book is Tapping Into Unstructured Data. It's by Inmon and Anthony Nesavich.
