Tuesday, November 14, 2006

Free And Inexpensive Data Mining Software

I recently came across an article online which claimed that data mining was "hugely expensive". I disagree. Given reasonably capable desktop hardware and a qualified data miner (who is the most expensive component of data mining!), a variety of capable data mining software packages are available for free or for relatively little money (meaning less than US$100). I cannot vouch that any one of these tools is a good fit for your specific application, but they are certainly worth a look:

DMSK
KNIME
SNNS (Stuttgart Neural Network Simulator)
YALE (Yet Another Learning Environment)
Weka

There are some commercial tools which sell for less than US$250, such as:

BrainMaker

Of course, there is always the "roll-your-own" approach, in which the data miner constructs or gathers his or her own tools. The Internet houses a wealth of source code in a variety of languages. Aside from searching general-purpose search engines for things like:

"quadratic discriminant" "source code"

or

backpropagation Java

...there are also source code repositories, such as:

Google Code
LiteratePrograms
MATLAB Central
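
To give a flavor of the "roll-your-own" approach, here is a minimal sketch of one of the techniques mentioned above, a quadratic discriminant classifier, written in Python with NumPy (my choice of language and library here, not anything prescribed by the sources above). It is a bare-bones illustration, not production code: it assumes each class has enough examples for an invertible covariance matrix and does no error checking.

```python
import numpy as np

def qda_fit(X, y):
    """Fit per-class mean, inverse covariance, log-determinant, and log-prior."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)        # sample covariance of class c
        model[c] = (mean,
                    np.linalg.inv(cov),
                    np.log(np.linalg.det(cov)),
                    np.log(len(Xc) / len(X))) # class prior
    return model

def qda_predict(model, x):
    """Assign x to the class with the highest quadratic discriminant score."""
    best, best_score = None, -np.inf
    for c, (mean, inv_cov, log_det, log_prior) in model.items():
        d = x - mean
        # Log-likelihood of x under a Gaussian for class c, plus log-prior
        score = -0.5 * (d @ inv_cov @ d) - 0.5 * log_det + log_prior
        if score > best_score:
            best, best_score = c, score
    return best
```

Fitting this on two well-separated clusters and scoring a new point is a few lines: build `X` and `y` arrays, call `qda_fit`, then `qda_predict` on each new observation.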

10 comments:

Dean Abbott said...

Will:
You and I agree on most things data mining, but here I have to take exception. While I like some of the free tools out there, and your list is a good one (I usually recommend WEKA, have tried SNNS and YALE in the past,and would add "R" to the list). However, the problem I have with most of these tools is how much they put on the user to either be a proficient programmer, to preprocess the data outside the environment, or to accept a strange interface.

I worked with a researcher at the IRS several years ago, and he was a Clementine user (because that's what he had to use). He eventually discovered Weka and began using it, much to my surprise. Now don't get me wrong: Weka has an almost unbelievable breadth of algorithms available to the user, and I like using it to play with different approaches, but I was surprised because I have always found Weka difficult to use.

In the end, this past spring, I asked him how it was going with Weka, and he told me he had finally given up on it. Why? Because getting data into the software was too painful. That's not surprising: these free tools, typically developed by academics, excel in algorithms but are weak in usability for non-technical users, data I/O, and back-end reporting.

So as a researcher, I love Weka and strongly recommend it. I haven't used SNNS for years, but when I did, I loved the number of neural network architectures it had available. I'm just getting started with YALE, but it too is not especially easy out of the box.

But when I need to develop a solution, I'd rather use a commercial tool. You're right to point out that many tools exist without the big price tag; even SPSS for Windows or S-Plus are under $1000 (I think!) and contain most of what a data miner would need to build good models.

Will Dwinnell said...

You make some good points, and I'm not suggesting that very inexpensive tools are ideal solutions; obviously there are trade-offs, as you mention. Personally, I prefer to work in the US$2,000 (give or take) software range. I will say, though, that I have yet to see a tool costing north of $100,000 (and there have been a surprising number of them) that I thought was even close to being worth the money.

Will Dwinnell said...

My other comment on this was prepared in haste. I also wanted to mention that cost, for many would-be data miners, is a significant issue. As I imagine is the case with you, Dean, I have had the luxury of having an employer or client pay for software to make this process easier. This is not true for everyone, though. Students and others with less financial means may usefully trade off their own work against software cost.

Dean Abbott said...

True enough on the cost issue, particularly for students. And I agree that for students, there are free options that would benefit them.

I agree as well that for many practitioners, under $2K can buy an excellent piece of data mining software: not only the software already mentioned, but also tools like Statistica, WizWhy, NeuralWare Predict, and XLMiner.

The biggest advantage of the "big boys", like Clementine, Affinium Model, and Enterprise Miner, is their ability to handle wide ranges of data types, sizes, and formats on both the front end and the back end. I have also found that data prep is easier in Clementine and Model (though to a lesser degree in Model). This deserves another post, in which I'll summarize which tools different customers use and why.

Will Dwinnell said...

Judging from the summary added by the editor here:

KDnuggets May-2005 Data Mining Software Survey

...I suppose I'm firmly in the "Department-level" camp. I think that tools in the "Personal-level" and "Free" categories may suit students, organizations with little budget, and novice data miners looking to cut their teeth. I'd be very interested in your thoughts on the "Enterprise-level" tools, Dean. I could be persuaded that some tools at the low end of the "Enterprise" range are worth it, but I must say that I've yet to encounter a tool I thought was really worth six figures.

Dean Abbott said...

The enterprise tools I have used extensively are so good that I recommend them for big organizations with larger data mining projects (multiple people) and large amounts of data. I'm thinking in particular of Clementine, Insightful Miner, Affinium Model, and Generation5 (I have not been able to get hold of Enterprise Miner, though I would love to test-drive it to see how much it has progressed since I last used it extensively, six years or so ago). So if I have a lot of data prep (joining tables, feature creation, more "clever" missing-value imputation, and variable selection), I find I can turn models around faster and try more approaches with these tools than with tools that require programming.

But maybe the better question is this: which tool would I buy if I had a project that needed a solid data mining tool? If I could afford Clementine, Insightful Miner, or Affinium Model, and if the project had data issues where preprocessing was important, I would get one of these over the under-$2K tools. If I couldn't afford the big tool, I would move to the CART family, Matlab, or one of the stats tools (S-Plus, SPSS, or SAS, though probably JMP rather than base SAS, as I'm not a SAS programmer). There are a bunch of other tools I have used that I would be happy to use as well, such as MineSet, Statistica, and Megaputer, to name a few. Tools like Raptor (SOMs) and WizWhy fill a nice niche, complementing modeling and data understanding, and I would love to have them in my toolbox. I've also had great success building neural networks in NeuralWare Predict, which operates in an Excel spreadsheet.

So the bottom line is that I do think the enterprise tools offer much more than just big price tags, but the decision to buy one depends on budget, data complexity, and data mining project size.

Next on my list to evaluate are: Oracle Data Miner, Megaputer (again), and XLMiner.

Anonymous said...

Dean said to Will:

While I like some of the free tools out there, [...] the problem I have with most of these tools is how much they put on the user to either be a proficient programmer, to preprocess the data outside the environment, or to accept a strange interface.

Some of the freely available open-source tools have matured significantly over the last couple of years. YALE, for example, comes with a graphical user interface, an online tutorial, and more than 400 operators, with input support covering many text file formats, Excel sheets, ARFF files, databases (including SQL access for selects, joins, etc.), text document collections, PDF files, and music files (MP3), as well as many pre-processing, data mining, evaluation, and visualization operators.
The interface is not that strange, the preprocessing can be done within the tool, and programming proficiency is not necessary.

In the end, this past spring, I asked him how it was going with Weka, and he finally gave up using it. Why? Because getting data into the software was too painful. It's not surprising because these free tools are typically developed by academics, excel in algorithms, but are weak in usability for non-technical users, data I/O, and back-end reporting.

While this is true for some of the free tools, it does not apply to all of them. As stated above, YALE comes with connectors for many input formats and source types, many evaluation and visualization features for both data and results, a comfortable GUI, and support for flexible setups and rapid prototyping.

I'm just getting started with Yale, but it too is not real easy out-of-the-box.

Feedback on how we can improve YALE is always welcome. Just post your questions, suggestions, and feedback to the YALE discussion forum at SourceForge.net, and we will try to support you and make YALE easier to use.

For getting started, the YALE online tutorial and the YALE GUI manual give a quick introduction and provide example applications. For deeper insight, the YALE tutorial, with its more than 400 pages, is a good resource.

Dean said:

The biggest advantage of the "big boys", like Clementine, Affinium Model, Enterprise Miner are their ability to handle wide ranges of data types, sizes, and formats on the front end and the back end. I have found too that data prep is easier in Clementine and Model (though to a lesser degree in Model).

Is it really true that Clementine and Affinium Model can handle more data formats and types than, for example, YALE with its more than 400 operators?

From their web sites, it is hard to verify such claims. Is there a publicly available list of all the data types and formats they can handle and of the data mining and visualization operators they provide?

I would really like to see a fair and fact-based comparison of open-source tools like YALE with commercial tools like Clementine and Affinium Model, rather than vague claims.

Dean said:

So if I have a lot of data prep (joining tables, feature creation, missing value imputation that is more "clever", and variable selection), I find I can turn models around faster and try more approaches with these tools than with tools that require programming.

YALE also allows database table joins and other SQL queries; offers many operators for manually and automatically constructing, generating, and selecting features/variables and for filling missing values; and supports rapid prototyping without the need for programming. See, for example, our KDD-2006 paper on rapid prototyping with YALE and some complex example applications:

Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz, and Timm Euler, YALE: Rapid Prototyping for Complex Data Mining Tasks, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), ACM Press, 2006.

I agree that the preference for a particular tool depends on the type of project and the task at hand, but also a lot on personal preferences and on what people already know or are used to. I think, though, that it matters less and less whether you go for a commercial or an open-source tool. The latter have gained a lot of momentum in the last few years and have caught up considerably. I still have not seen any commercial tool with more than 400 operators and more flexibility than YALE, but maybe I am just too inexperienced, and you can provide me with examples of such tools and publicly available lists of their operators and full capabilities that I can look at without buying them.

Best regards,
Ralf

Dean Abbott said...

Ralf:
Thanks for your comments. I have tried YALE in the past few months, but found its interface difficult to work through, and as a result gave up. However, based on your comments, I revisited the software, and while I'm still struggling with the interface, I see now the richness of options.

I'll post again once I have more information. That said, I truly believe that YALE, Weka, and other tools like them are rich technically. What I am looking for (and I will re-examine YALE and be glad to revise my earlier comments!) is the ability for a business user to use the software efficiently and effectively.

Dean

Will Dwinnell said...

I keep running into these tools! Here are two more:

Orange
The Statistical Lab


-Will

Anonymous said...

Dean:
Thanks for your reply to my comment. I agree, the focus of YALE is on flexibility and providing many techniques, interfaces, and options and hence is more technical. While we added the graphical user interfaces, the online tutorial with application examples, and an easy to use programm installer to improve the usability of YALE, I agree, that there is still quite some room for improvements as far as the usability is concerned, especially for non-experts and more business-oriented people. One of the things we are thinking about in this direction is providing a simple to use data mining wizard to help unexperienced users in setting up their data mining applications and support them in the design choices. This would result in an alternative user interface for YALE with a focus on simplicity and quick results, even for non-experts. That way, users could choose their prefered YALE interface depending on their expertise, the time they would like to invest to solve a task at hand, and the flexibility they need. Well, currently these are only ideas and such a wizard is not available yet. Maybe we start implementing this sometime in 2007. Please send us your comments on the usability of YALE and your ideas for improving it.

Best regards,
Ralf