Friday, July 17, 2015

Data Mining's Forgotten Step-Children

Depending on whose definition one reads, the list of activities which comprise data mining will vary, but the first two items are always the same...

Number 1: Prediction

The most common data mining function, by far, is prediction (or, more esoterically, supervised learning), which is sometimes listed twice, depending on the type of variable being predicted: classification (when the target is categorical) vs. regression (when the target is numerical). Predictive models learned by machines from historical examples easily claim the largest share of data mining by almost any measure: time, money, technical papers published, software packages, etc. The hyperbole of marketers and the fears of data mining critics are also most often associated with prediction.

Number 2: Clustering

The second most common data mining function in practice is clustering (sometimes known by the alias unsupervised learning). Gathering things into "natural" groupings has a long history in some fields (cladistics in biology, for instance), though clustering's "no right or wrong answer" quality will likely cement its continuing spot in second place. Despite playing second banana to prediction, clustering enjoys widespread application and is well understood even in non-technical circles. What marketer doesn't like a good segmentation?

"... and all the rest!"

What else is in the data mining toolbox? Definitions vary, but the next two most commonly mentioned tasks are anomaly detection and association rule discovery. Other tasks have been included, such as data visualization, though that field dates back well over a hundred years and clearly enjoys a healthy existence outside of the data mining field.

Anomaly detection (a superset of statistical outlier detection) searches for observations which violate patterns in data. Generally, these patterns are discovered (explicitly or not) using prediction or clustering. Because a wide array of prediction or clustering techniques might be applied, the patterns found in a single data set will vary with the technique chosen, and so will the observations flagged as anomalous. This leaves anomaly detection somewhat in the company of clustering in the sense of having "no right or wrong answers". Still, anomaly detection can be immensely useful, with two common applications being fraud detection and data cleansing. This author has used a simple anomaly detection process to help find errors in predictive model implementation code.
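
As a rough illustration of the idea (not the process this author used), here is a minimal Python sketch in which the "pattern" is a straight line fit by least squares, and an observation is flagged when its residual sits far from the typical residual. The function names, the made-up data, and the 3.5 cutoff on a median-absolute-deviation scale are all assumptions chosen for the example; a different model or cutoff would flag different observations, which is exactly the point made above.

```python
from statistics import mean, median

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def flag_anomalies(xs, ys, threshold=3.5):
    """Return indices of observations whose residual is unusually large.

    Residual spread is measured with the median absolute deviation (MAD),
    which is less distorted by the anomalies themselves than a standard
    deviation would be. The threshold is an illustrative choice.
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    center = median(residuals)
    deviations = [abs(r - center) for r in residuals]
    # 1.4826 rescales the MAD to match a standard deviation for normal data.
    scale = 1.4826 * median(deviations)
    return [i for i, d in enumerate(deviations) if d > threshold * scale]

# Made-up data: roughly y = 2x + 1, with one corrupted value at index 6.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 40.0, 15.1, 16.9, 19.0]

flagged = flag_anomalies(xs, ys)
print(flagged)
```

Note that the corrupted value also drags the fitted line toward itself, which is one reason the sketch leans on a median-based scale rather than the mean and standard deviation of the residuals.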

Association rule discovery attempts to identify patterns among data items which exhibit associations with one another. The classic example is individual items of merchandise in a retail setting (market basket analysis): Each purchase represents an association of a variety of distinct items with one another. After enough purchases, relationships among items can be inferred, such as the frequent purchase of coffee with sugar. Relationships among people, as evidenced by instances of telephone or electronic contact, have also been explored, both for marketing purposes and in law enforcement.
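
To make the market basket example concrete, the following Python sketch counts how often pairs of items appear together and reports rules of the form "buyers of A also buy B". It is a simplified pairwise count, not a full algorithm such as Apriori, and the baskets, function name, and thresholds are invented for illustration. Support is the fraction of baskets containing both items; confidence is the fraction of baskets with A that also contain B.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Find simple one-item-to-one-item association rules.

    Returns (lhs, rhs, support, confidence) tuples, strongest first.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate, fix the pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

# Hypothetical purchases, one set of items per basket.
baskets = [
    {"coffee", "sugar", "milk"},
    {"coffee", "sugar"},
    {"coffee", "bread"},
    {"sugar", "coffee", "bread"},
    {"milk", "bread"},
]

rules = pair_rules(baskets, min_support=0.5, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets, only the coffee and sugar pair clears the support threshold, and the two directions of the rule carry different confidence because coffee is bought more often than sugar.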

Further Reading

Neither anomaly detection nor association rule discovery receives nearly the press that the first two members of the data mining club do, but it is worth learning something about them. Some problems fall more naturally into their purview. To get started with these techniques, the standard references will do, such as Witten and Frank, or Han and Kamber. Also consider material on outliers in the traditional statistical literature.

