Thursday, January 16, 2014

Data Science and Big Data Search Trends

These charts are from Google Trends.

Searches for "data science" are growing, but still trail more traditional terms for our field such as data mining, predictive analytics, and machine learning. "Big data," on the other hand, is growing rapidly and outpacing all of them.

This page for now is just an FYI that I may refer to in today's DM Radio show, "Blinded by Data Science" (http://bit.ly/KfK74l). I hope to turn it into a more cogent blog post soon (but no promises!).


Trend charts shown (Google Trends screenshots, not reproduced here):

  • “data science”
  • “data science” vs. “data mining”
  • “data science” vs. “predictive analytics”
  • “data science” vs. “machine learning”
  • “data science”, by city
  • “big data”
  • “big data”, by city
  • “big data” vs. “data science”
  • “big data” vs. “data mining”
  • “big data” vs. “predictive analytics”
  • “big data” vs. “machine learning”
  • “predictive analytics”
  • “big data” vs. “data science”, by city
  • “big data” vs. “data mining”, by city
  • “predictive analytics”, by city




Thursday, January 09, 2014

Speaking Engagements First Quarter 2014

I'll be speaking at several events this quarter:

1) EITA Global Webinar:  Key Steps in Starting Your First Predictive Analytics Project
Tuesday, January 14, 2014, 1:00 PM EST

This 90-minute webinar will walk through a predictive analytics project from start to finish using the CRISP-DM process model, including:

  • Data Needed for Predictive Modeling
  • Data Preparation
  • Top Modeling Algorithms: Decision Trees, Neural Networks, Regression, Clustering
  • How to Assess Models
  • Model Deployment

Webinar cost: $239 for viewing the webinar live; volume discounts are available.


2) Quebit Webinar: IBM SPSS Modeler Seminar Series: Techniques for Unbalanced Data
Dates Offered: January 22, 1-4 PM EST (lecture)
January 23, 1-2 PM EST (Q&A)

This webinar is divided into two days. On the first day, Keith McCormick and I will discuss different strategies for handling unbalanced data. On the second day, we will cover questions from Day 1.

We will be using examples from our recent book, The IBM SPSS Modeler Cookbook.

Use the code SPSSMODDA100 to get a $100 discount on the webinar. It will also get you $300 off the annual subscription (which I strongly recommend--Keith and the Quebit team are the best Modeler trainers in the business).


3) UC / Irvine Predictive Analytics Certificate Program:  Algorithms, Modeling Methods, Verification & Validation
January 27, 2014 to March 16, 2014 (7 weeks, online)

Learn how to use the basics of predictive analytics and modeling data to determine which algorithms to use. Understand the similarities and differences and which options affect the models most. Discover how to verify and validate your model. Topics covered include predictive analytics algorithms for supervised learning, including decision trees, linear and logistic regression, neural networks, k-nearest neighbor, support vector machines, and model ensembles. Gain a deeper understanding of how algorithms work qualitatively by reviewing best practices and the influence of various options on predictive models.

Cost: $695

This is the algorithms course in the UC Irvine Predictive Analytics Certificate program. The course has prerequisites; if you haven't completed them yet, take those first and take the algorithms course in a future quarter.

4) 7th KNIME User Group Meeting (UGM) and Workshops
Feb 12-14, Zurich, Switzerland

Wed Feb 12: Real-Time Customer Intelligence, Dean Abbott (Abbott Analytics, Inc.)
Friday, Feb 14: Strategies for Building Predictive Models in KNIME, Dean Abbott (Abbott Analytics)

The KNIME User Group Meeting and workshops, February 12-14, have a registration fee of 200 €.

5) The first San Diego KNIME Meetup

Date: Wednesday, February 26, 2014, 6:00 PM to 8:30 PM (free)
Location: DART Neuroscience, 12278 Scripps Summit Drive, San Diego, CA

6) Predictive Analytics World, San Francisco

Talk, March 18: Data Preparation from the Trenches: 4 Approaches to Deriving Attributes
My talk at PAW/SF this year is on data preparation, focusing on strategies for building derived attributes manually and automatically.

Workshop, March 19: Supercharging Prediction: Hands-On with Ensemble Models
This workshop covers building model ensembles using a state-of-the-art software package, working a use case from beginning to end (Business Understanding through Deployment). You will have a license for the software for use during the workshop and for a period of time afterward.







Monday, November 18, 2013

A Good Business Objective Beats a Good Algorithm

Predictive modeling competitions, once the arena of a few data mining conferences, have now become big business. Kaggle (kaggle.com) is perhaps the best-known forum for modeling competitions, applying a crowd-sourcing mentality: if more people try to solve a problem, the likelihood that someone will create an excellent solution to that problem increases.
The participants, tens of thousands of them since Kaggle's start in 2011, range from people with no predictive modeling background to those with extensive data science backgrounds. Some very clever algorithms and solutions have been developed, on some occasions with ground-breaking results.
One conclusion to draw from these competitions is that what we need in the predictive analytics space is more data scientists with different, innovative ideas for solving problems, and perhaps more in-depth training of data scientists so they can create these innovative solutions. After all, the Netflix Prize winner created a solution that was an ensemble of model ensembles, comprising hundreds of models (not a Kaggle competition, but one created by and for Netflix).
This idea of the importance of machine learning expertise was the topic of a Strata conference debate in 2012, tackling the question, “which is more important, domain expertise or machine learning expertise”, or the way it was phrased for the debate, “who should your first hire be: a domain expert or data scientist?”
The conclusion of the majority at the Strata conference was that machine learning is more important, but even the moderator, Mike Driscoll, concluded the following:
“Could you currently prepare your data for a Kaggle competition?  If so, then hire a machine learner.  If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.” (http://medriscoll.com/post/18784448854/the-data-science-debate-domain-expertise-or-machine)
The point is that defining the competition objectives and the data needed to solve the problem is critically important. Non-domain experts, the data scientists, cannot ever hope to understand the domain well enough to determine what the most effective question to answer would be, where to find the data to build a modeling data set, what the target variable should be, and how to assess which model is best. These are specific to the business domain.
Even companies building the same kinds of models, let's say customer retention or churn, will approach them differently depending on the kind of business, the lead time needed to act on potential churners, and the metrics for churn that relate to ROI for that company. I've built models for companies in the same domain that took very different approaches; even though I had some domain experience from the first customer, that didn't translate into developing good business objectives for the second.
It’s the partnership that matters. I often think of these partnerships within an organization as the three-legged stool, all of which are needed for the modeling project to succeed: a business stakeholder who understands what business objectives matter to the company and how to articulate them, IT staff who know where the data is, what it means, and how to access it, and the analysts who know how to take the data and the business objectives and translate them into modeling objectives that address the business problem. Without all three, projects fail. We modelers could build the best models in the world that solve the wrong problem exceedingly well!
(first posted at http://www.predictiveanalyticsworld.com/patimes/a-good-business-objective-beats-a-good-algorithm/)

Saturday, September 07, 2013

On Data Mining Contests

Data mining contests have grown in popularity over the years, from the annual competitions at technical conferences to the continuous stream of events at sites like Kaggle. This has yielded several benefits: allowing many experts to work on difficult problems, giving novices a chance to work on real data, and showcasing successful solutions. These competitions have even garnered the attention of the mainstream press. While this author believes the spread of these technical contests has been largely positive, it is worth noting their limitations.

Despite using real data, the problems, as formulated, are somewhat artificial. Questions of sampling and initial variable selection have already been decided, as have the evaluation function and the model's part in the ultimate solution. To some extent, these are necessary constraints, but they are constraints nonetheless. In real world data mining, all of these questions are the responsibility of the data miner and his or her clients, and they are not trivial considerations. In most larger organizations, the data is large enough that there is always "one more" table in the database which could be tapped for candidate predictors. Likewise, how the model might best be positioned as part of the total solution is not always obvious, especially in more complex problems. A minority of contests permit the use of outside data, but even this is somewhat unrealistic since real organizations have budgets for the purchase of outside data, such as demographic data to be appended to a customer population. I've yet to learn of anyone paying for outside variables to append to competition data, though.

Another issue is the large number of competitors these contests attract. Though it is good to have many analysts take a crack at a problem, one must wonder about the statistical significance of having hundreds of statisticians test God-only-knows how many hypotheses against the same data. Further, the number of competitors and the similarity of the top contestants' performance figures make selection of a single "winner" a dubious proposition.
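A rough back-of-the-envelope simulation (entirely hypothetical numbers, not drawn from any real contest) illustrates the concern: when many competitors of identical true skill are scored on the same holdout set, the leaderboard's "best" score exceeds everyone's true skill purely by chance, and the top scores differ only by noise-sized margins.

    import numpy as np

    # Hypothetical contest: 500 competitors, all with identical true accuracy of 0.80,
    # each scored on the same 10,000-row holdout set.
    rng = np.random.default_rng(42)
    n_competitors, n_holdout, true_accuracy = 500, 10_000, 0.80

    # Each observed leaderboard score is the fraction of holdout rows scored correctly;
    # with equal skill, all differences between competitors are sampling noise.
    observed = rng.binomial(n_holdout, true_accuracy, size=n_competitors) / n_holdout

    print(f"true accuracy:       {true_accuracy:.4f}")
    print(f"mean observed score: {observed.mean():.4f}")
    print(f"best observed score: {observed.max():.4f}")   # the 'winner' looks better than anyone truly is
    print(f"spread of top 10:    {np.ptp(np.sort(observed)[-10:]):.4f}")   # top finishers separated by noise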

Finally, it has become rather common for contest winners to construct solutions of vast proportions, typically ensembles of enormous numbers of base models. While such models may be feasible to deploy in some circumstances, they are far too computationally demanding to execute on many real databases quickly enough to be practical.

Some of these criticisms are probably unavoidable, especially the ones regarding the pre-selected, pre-digested contest data. Still, it would be interesting to see future data mining competitions address at least some of these issues. For one thing, it might be interesting to see solution sizes (lines of SQL, C++, or something similar) limited to something an ordinary IT department would be capable of executing during a typical overnight run. Averaging across a larger number of tasks might also begin to improve the significance of differences among contestants' performances.

Wednesday, August 21, 2013

Beware Phantom Data

One of the perennial challenges facing the data analyst is missing values. A great deal has been written about the importance of identifying the source of missing values, the danger of overly simplistic solutions and, of course, the many and varied mechanisms for "filling them in" with synthetic data ("imputation").

Of the tremendous volume of material written on this subject, nearly all of it assumes that the analyst knows precisely which items are missing from the data. In reality, this is sometimes not the case. Relational databases and statistical software files, as a rule, have a special value to indicate "missing," though that does not mean it is always used. Some file formats offer only indirect provision for missing values, if any at all, and how software reacts to such missings varies.

Consider, too, the popular practice of using special values (such as -9999) to represent missing values. What could possibly go wrong? For one thing, the person writing the data may not have considered whether the flag value might also be a legitimate value. Is it possible, for instance, to have an account balance of -9999 dollars (euros, etc.)? In my career, I have seen databases which used a different flag value for each field (-99, -9999, -99999, etc.), making the writing of code against such data extremely tedious (and error-prone). I have also seen -9999 used to indicate one type of missing value, and -9998 to indicate another. When the hand-off of information from one person (system, process, etc.) to another is confused or incomplete, interpretation of the data becomes incorrect.
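As a small illustration (the column names and sentinel codes below are hypothetical), one defensive step is to convert known flag values into explicit missing values before any analysis, so they cannot masquerade as legitimate balances:

    import numpy as np
    import pandas as pd

    # Hypothetical extract: -9999/-9998 are missing-value flags, not real balances,
    # and a different field uses a different flag entirely.
    df = pd.DataFrame({
        "account_balance": [1523.10, -9999, 87.25, -9998, 410.00],
        "months_on_books": [12, 3, -99, 27, 5],
    })

    # The field-to-flag mapping has to come from whoever wrote the data --
    # exactly the hand-off that so often goes wrong.
    sentinels = {
        "account_balance": [-9999, -9998],
        "months_on_books": [-99],
    }

    for col, codes in sentinels.items():
        df[col] = df[col].replace(codes, np.nan)   # turn flags into explicit missings

    print(df.isna().sum())   # missingness is now visible instead of hiding inside the values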

Another aspect of this problem is the precise definition given to fields, and their possible misinterpretation by data consumers (such as data miners). Imagine that a particular integer field records the number of times each customer has made a payment on their loan within the past 6 months. As customers begin their tenure, this variable starts at zero. Suppose our model includes this field as an independent variable, with low-risk customers presumably having higher values and higher-risk customers lower ones. Then, without missing any payments, early-lifecycle customers are penalized arbitrarily by the model. One could argue that this variable should be left undefined (recorded with the database's missing-value flag) until a customer has a full 6-month track record, but this is exactly the sort of conversation which very often fails to materialize in real organizations.
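A minimal sketch of the fix argued for above (the table and field names are hypothetical): record the payment count as missing until the customer has a full six months of history, so structural zeros are not mistaken for real behavior.

    import numpy as np
    import pandas as pd

    # Hypothetical customer table: tenure in months and payments made in the last 6 months.
    customers = pd.DataFrame({
        "customer_id":      [101, 102, 103, 104],
        "tenure_months":    [2, 9, 5, 24],
        "payments_last_6m": [0, 6, 3, 5],
    })

    # A customer with under 6 months of tenure cannot have a full 6-month track record,
    # so the count is set to missing rather than left as a misleadingly low number.
    customers["payments_last_6m"] = customers["payments_last_6m"].where(
        customers["tenure_months"] >= 6, np.nan
    )

    print(customers)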

These are all instances of "phantom data": Items in the database which are missing values, but mistaken for real data. It shouldn't take much imagination on the reader's part to conjure similar problematic situations in his or her own field. The lesson is to look beyond the known missings for more subtle gaps in the data. Time spent investigating the nature of database systems, company procedures and so forth which generate data is insurance against being burned by serious misunderstanding of the data.

Tuesday, July 23, 2013

The NSA, Link Analysis and Fraud Detection

The recent leaks about the NSA's use of data mining and predictive analytics have certainly raised awareness of our field and have resulted in hours of discussions with family, relatives, friends, and reporters about what predictive analytics can (and can't) do with phone records, emails, chat messages, and other structured and unstructured data. Eric Siegel and I have been interviewed on multiple occasions to address this issue from a predictive analytics perspective, in one case in the same article: “What the NSA can’t do with your data (probably)”. Part of my goal in these conversations has been to bring back to reality many of the inflated expectations of what can be done with predictive analytics: it is a powerful approach to finding patterns in data, but it isn't magic, nor is it fool-proof.
First, let me be clear: I have no direct knowledge of the analytics the NSA is doing. I have worked on many fraud detection projects for the U.S. Government and private sector, some including what I would describe as a “social networking” component to them where the connections between parties is an important part of the risk factors.
The phone call metadata shows simple information about each call: origination, destination, date of the call, duration, and perhaps some geographic information about the origination and destination. One of the valuable aspects of the data is that connections can be made between origination and destination numbers, and as a result, one can build a social network for every originating phone number in the data. The U.S. had more than 326.4 million cell phone subscriptions as of December 2012, according to CTIA, and a Pew Research survey found that individual cell phone users had on average 664 social connections (not all of which are cell connections). The number of links needed to build a U.S.-wide social map of phone call connections easily outstrips any possible visualization method, and therefore, without filtering connections and networks, these social maps would be useless. One of the factors working in our favor, if we are concerned with privacy issues related to this metadata, is therefore the sheer size of the network.
The networks of phone calls, I believe, are particularly useful in connecting high-risk individuals with others whom the NSA may not know beforehand are connected to the person of interest. In other words, a starting point is needed first and the social network is built from this starting point. If one has multiple starting points, one can also find linkages between networks even if the networks themselves don’t overlap significantly.
The strength of a link can include information such as the number of calls, duration of calls, regularity of calls, most recent call, oldest call, and more. Think of these as a cell-phone version of RFM analysis. The networks can be pruned easily based on thresholds for these key features, simplifying them considerably.
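A rough sketch of that kind of pruning (the phone numbers, thresholds, and libraries here are my own illustration, not a description of any actual system): aggregate call metadata into link-strength features, build the graph, and drop the weak links.

    import networkx as nx
    import pandas as pd

    # Hypothetical call metadata: one row per call.
    calls = pd.DataFrame({
        "origin":      ["555-0101", "555-0101", "555-0102", "555-0103", "555-0101"],
        "destination": ["555-0102", "555-0102", "555-0103", "555-0104", "555-0105"],
        "duration_s":  [60, 300, 45, 600, 5],
    })

    # RFM-like link features: number of calls and total duration per origin/destination pair.
    links = (calls.groupby(["origin", "destination"])
                  .agg(n_calls=("duration_s", "size"), total_s=("duration_s", "sum"))
                  .reset_index())

    # Build the social graph, then prune links that fall below arbitrary strength thresholds.
    G = nx.Graph()
    for row in links.itertuples(index=False):
        G.add_edge(row.origin, row.destination, n_calls=row.n_calls, total_s=row.total_s)

    weak = [(u, v) for u, v, d in G.edges(data=True)
            if d["n_calls"] < 2 and d["total_s"] < 120]
    G.remove_edges_from(weak)

    print(G.edges(data=True))   # only the stronger connections remain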
But even if the connections are made, this data is woefully incomplete on its own. First, there is no connection to the person who actually made the call, only to the phone number and whoever it is registered to; finding who made the calls requires more investigation. Second, it doesn't necessarily connect all the phones an individual might use: if a person uses 5 or 6 cell phones, one doesn't know that the same person is behind those numbers. Third, one certainly doesn't know the intent or content of the call.
Given these limitations, what value is there in the network of calls? These networks are usually best used as lead-generation engines. Which other phone numbers in a network are connected to multiple high-risk individuals but weren't heretofore considered high risk? Is the timeline of calls temporally correlated with other known events?
Analytics, and link analysis in particular, provide tremendously powerful techniques to identify new leads and remove unfruitful leads by finding connections unlikely to occur randomly.
NOTE: this post first appeared as an article in the PA Times: http://www.predictiveanalyticsworld.com/patimes/the-nsa-link-analysis-and-fraud-detection/

Tuesday, June 18, 2013

Big Data is Not Enough

Big data is the big buzzword in the world of analytics today. According to Google Trends, searches for "big data" have been growing exponentially since 2010, though the growth is perhaps beginning to level off. Or take a look on amazon.com for books with "big data" in the title sometime: the publication dates, for the most part, are in 2012 or 2013.


But what's the key to unlocking the big data door? In his interview with Eric Siegel on April 12, Ned Smith of Business News Daily (http://www.businessnewsdaily.com/4326-predictive-analytics-unlocks-big-data.html) starts with this apt insight: "Predictive Analytics is the 'Open Sesame' for the world of Big Data." Big data is what we have; predictive analytics (PA) is what we do with it.

Why is the data so big? Where does it come from? Those of us who do PA usually think of building predictive models on structured data pulled from a database, probably flattened into a single modeling table by a query so that the data can be loaded into a software tool. We then clean the data, create features, and away we go with predictive modeling.

But according to a 2012 IBM study, "Analytics: The real-world use of big data," 88% of big data comes from transactions, 73% from log data, and significant proportions come from audio and video (still and motion). These are not structured data. Log files are often unstructured, containing nothing more than notes, sometimes freehand, sometimes machine-created, and therefore cannot be used without first preprocessing the data with text mining techniques. Those of us who have built models augmented with log files or other text data know how much work is involved in transforming text into useful attributes that can then be used in predictive models.
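As a deliberately simplified sketch of what that preprocessing involves (the log notes below are invented), a bag-of-words transform turns free text into numeric columns a model can actually use:

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical free-text log notes, one per event.
    notes = [
        "customer called re: late fee, waived fee",
        "login failure x3, password reset sent",
        "late payment reminder emailed",
    ]

    # Convert text to term counts; real projects add stemming, stop words, weighting, and more.
    vectorizer = CountVectorizer(lowercase=True, token_pattern=r"[a-z]+")
    X = vectorizer.fit_transform(notes)

    print(vectorizer.get_feature_names_out())
    print(X.toarray())   # each note is now a fixed-length numeric vector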

Even the most structured of the big data sources, transactional data, often are nothing more than dates, IDs and very simple information about the nature of the transaction (an amount, time period, and perhaps a label about the nature of the transaction).

Transactional data is rarely used directly; it is usually transformed into a form more useful for predictive modeling. For example, rather than building models where each row is a web page transaction, we transform the data so that each row is a person (the ID) and the fields are aggregations of that person’s history for as long as their cookie has persisted; the individual transactions have to be linked together and aggregated to be useful.
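A small sketch of that roll-up (columns and aggregates are illustrative only): transaction-level rows become one row per visitor, with features summarizing that visitor's history.

    import pandas as pd

    # Hypothetical web transaction log: one row per page view.
    tx = pd.DataFrame({
        "visitor_id": ["A", "A", "B", "A", "B"],
        "timestamp":  pd.to_datetime(["2013-06-01", "2013-06-03", "2013-06-02",
                                      "2013-06-10", "2013-06-11"]),
        "page":       ["home", "pricing", "home", "checkout", "home"],
        "amount":     [0.0, 0.0, 0.0, 49.0, 0.0],
    })

    # Aggregate to one row per visitor; these aggregations become the modeling features.
    profile = tx.groupby("visitor_id").agg(
        n_visits=("page", "size"),
        n_unique_pages=("page", "nunique"),
        total_spend=("amount", "sum"),
        last_seen=("timestamp", "max"),
    ).reset_index()

    print(profile)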

The big data wave we are experiencing is therefore not directly helpful for improving predictive models; we first need to determine the level of analysis needed to build useful models, i.e., what a record in the model represents. The unit of analysis is determined by the question the model is intended to answer, or put another way, the decision the model is intended to improve within the organization. This is determined by defining the business objectives of the models, normally by a program manager or other domain expert in the organization, not by the modeler.

The second step in building data for predictive modeling is creating the features to include as predictors in the models. How do we determine the features? I see three ways:
  1. the analyst can define the features based on his or her experience in the field, or do research to find what others have done in the field through Google searches and academic articles. This assumes the analyst is, to some degree, a domain expert.
  2. the key features can be determined by other domain experts, either handed down to the analyst or gathered through interviews of domain experts by the analyst. This is better than a Google search because the answers are focused on the organization’s perspective on solving the problem.
  3. the analyst can rely on algorithm-based feature creation. In this approach, the analyst merely provides the raw input fields and allows the algorithms to find the appropriate transformations of individual fields (easy) or multivariate combinations (more complex). Some algorithms, and some software implementations of them, can do this quite effectively. This third approach I see advocated implicitly by data scientists in particular (a sketch of it appears after the next paragraph).
In reality, a combination of all three is usually used, and I recommend all three. But features based on domain expertise almost always provide larger gains in model performance than algorithm-based (automatic) feature creation.
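A minimal sketch of the third, algorithm-driven approach (the field names are hypothetical): hand the raw numeric inputs to a transformer that generates candidate squared and interaction terms automatically, and let the modeling step decide which ones earn their keep.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    # Hypothetical raw inputs: two numeric fields per customer.
    X = np.array([
        [3.0, 120.0],
        [1.0,  40.0],
        [5.0, 300.0],
    ])

    # Generate squared terms and pairwise products as candidate features.
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_expanded = poly.fit_transform(X)

    print(poly.get_feature_names_out(["n_purchases", "total_spend"]))
    print(X_expanded)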

This is the new three-legged stool of predictive modeling: big data provides the information, augmenting what we have used in the past; domain experts provide the structure for how to set up the data for modeling, including what a record means and which key attributes are expected to help solve the problem; and predictive analytics provides the muscle to open the doors to what is hidden in the data. Those who take advantage of all three will be the winners in operationalizing analytics.

First posted at The Predictive Analytics Times

Friday, June 07, 2013

Dean Abbott Featured in "Popular Mechanics" On-Line Article

Our own Dean Abbott has been consulted for an on-line Popular Mechanics article, "Why the NSA Wants All That Verizon Metadata" (Jun-06-2013), by Glenn Derene. Since the initial report connecting the NSA with Verizon, details have emerged suggesting similar large-scale information-gathering by the American government from other telecommunication and Internet companies.

Some applications of data mining to law enforcement and anti-terrorism problems have clearly been fruitful (for detection of money laundering, for instance, which is one source of funding for criminal and terrorist organizations). On the other hand, direct application of these techniques to plucking out the bad guys from large numbers of innocents strikes this author as dubious, and has long been criticized by experts, such as Bruce Schneier. What's plain is that people in democratic societies must remain vigilant of the balance of information and power granted to their governments, lest the medicine become worse than the disease.

Friday, April 26, 2013

Math and Predictive Analytics - A Personal Account

Last week I taught a workshop at Predictive Analytics World entitled Supercharging Prediction: Hands-On with Ensemble Models. The workshop was intended to introduce predictive modelers to the concept of ensembles through a combination of lecture, providing an overview of model ensembles, and hands-on exercises building ensembles using Salford Systems SPM v7.0 (Salford Systems sponsored the workshop).

This morning, Heather Hinman, a Marketing Communications Manager at Salford Systems, posted comments about attending that workshop on the Salford Systems blog. Two comments were particularly interesting, especially their implications vis-à-vis my last blog post on math and predictive analytics:

I will admit I was intimidated at first to be participating in a predictive modeling workshop as I do not have a background in statistics, and only have basic training on decision tree tools by Salford Systems' team of in-house experts. Despite my basic knowledge of decision trees, I was thrilled that I was able to follow along with ease and understanding when learning about tree ensembles and modern hybrid modeling approaches. Marketing folk building predictive models? Yes, we can!
and
Now back at the office in San Diego, along with my usual responsibilities, I feel confident in my ability to build predictive models and gain insights into the data at hand to achieve the email marketing and online campaign goals for our communication efforts!  
In the post, Heather also outlines some of the principles she learned and how she used them to build the predictive models in the workshop.

The point is this: if one uses good software built on solid modeling principles, and one understands the key principles of building predictive models, then someone without a mathematics background can build good, profitable models.



Monday, April 01, 2013

Do Predictive Modelers Need to Know Math?

(Note: this post was first published in the March 2013 Edition of the Predictive Analytics Times)
Predictive analytics is just a bunch of math, isn’t it? After all, algorithms in the form of matrix algebra, summations, integrals, multiplies and adds are the core of what predictive modeling algorithms do. Even rule-based approaches need math to compute how good the if-then-else rules are.

I was participating in a predictive analytics course recently and the question a participant asked at the end of two days of instruction was this: “it’s been a long time since I’ve had to do this kind of math and I’m a bit rusty. Is there a book that would help me learn the techniques without the math?”

The question about math was interesting. But do we need to know the math to build models well? Anyone can build a bad model, but to build a good model, don’t we need to know what the algorithms are doing? The answer, of course, depends on the role of the analyst. I contend, however, that for most predictive analytics projects, the answer is “no”.

Let’s consider building decision tree models. What options does one need to set to build good trees? Here is a short list of common knobs that can be set in most predictive analytics software packages:
  1. Splitting metric (CART-style trees, C5-style trees, CHAID-style trees, etc.)
  2. Terminal node minimum size
  3. Parent node minimum size
  4. Maximum tree depth
  5. Pruning options (standard error, chi-square test p-value threshold, etc.)
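Purely as an illustration (the tools discussed in this post are commercial packages, not scikit-learn), these knobs map roughly onto the parameters of scikit-learn's DecisionTreeClassifier:

    from sklearn.tree import DecisionTreeClassifier

    # Illustrative settings only; sensible values depend entirely on the data.
    tree = DecisionTreeClassifier(
        criterion="gini",       # 1. splitting metric ("entropy" gives C5-style information gain)
        min_samples_leaf=50,    # 2. terminal node minimum size
        min_samples_split=100,  # 3. parent node minimum size
        max_depth=6,            # 4. maximum tree depth
        ccp_alpha=0.001,        # 5. pruning strength (cost-complexity pruning)
    )
    # tree.fit(X_train, y_train) would then grow and prune the tree on your data.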

The most mathematical of these knobs is the splitting metric. CART-style trees use the Gini index, C5 trees use entropy (information gain), and CHAID-style trees use the chi-square test as the splitting criterion. A book I consider the best technical book on data mining and statistical learning methods, “The Elements of Statistical Learning”, has this description of the splitting criteria for decision trees, including the Gini index and entropy:
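(The excerpt itself appears as an image in the original post; the standard definitions, for a node $m$ with class-$k$ proportions $\hat{p}_{mk}$, are:)

    \text{Misclassification error:}\quad \frac{1}{N_m}\sum_{i \in R_m} I\big(y_i \neq k(m)\big) \;=\; 1 - \hat{p}_{m\,k(m)}

    \text{Gini index:}\quad \sum_{k \neq k'} \hat{p}_{mk}\,\hat{p}_{mk'} \;=\; \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})

    \text{Cross-entropy (deviance):}\quad -\sum_{k=1}^{K} \hat{p}_{mk}\,\log \hat{p}_{mk}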



To a mathematician, these make sense. But without a mathematics background, these equations will be at best opaque and at worst incomprehensible. (And these are not very complicated; technical textbooks and papers describing machine learning algorithms can be quite difficult even for seasoned but out-of-practice mathematicians to understand.)

As someone with a mathematics background and a predictive modeler, I must say that the actual splitting equations almost never matter to me. Gini and entropy often produce the same splits, or at least similar splits. CHAID differs more, especially in how it creates multi-way splits. But even here, the biggest difference for me is not the math, but simply that they use different tests for determining "good" splits.

There are, however, very important reasons for someone on the team to understand the mathematics or at least the way these algorithms work qualitatively. First and foremost, understanding the algorithms helps us uncover why models go wrong. Models can be biased toward splitting on particular variables or even particular records. In some cases, it may appear that the models are performing well but in actuality they are brittle. Understanding the math can help remind us that this may happen and why.

The fact that linear regression uses a quadratic cost function tells us that outliers affect overall error disproportionately. Understanding how decision trees measure differences between the parent population and sub-populations tells us why a high-cardinality variable may show up at the top of our tree, and why additional penalties may be in order to reduce this bias. Seeing the computation of information gain (derived from entropy) tells us why binary classification with a small target proportion (such as having 5% 1s) often won't generate any splits at all.
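Two of those effects are easy to see with a few lines of arithmetic (a toy illustration with made-up numbers):

    import numpy as np

    # 1) Quadratic cost: one outlier dominates the total squared error.
    residuals = np.array([1.0, -2.0, 1.5, -1.0, 50.0])   # the last residual is an outlier
    print("share of squared error from the outlier:",
          (residuals[-1] ** 2) / np.sum(residuals ** 2))   # roughly 0.997

    # 2) Entropy with a rare target: a 5% positive rate means the parent node starts
    #    with very low impurity, leaving little room for splits to show measurable gain.
    def entropy(p):
        probs = np.clip(np.array([p, 1 - p]), 1e-12, 1.0)
        return float(-np.sum(probs * np.log2(probs)))

    print("entropy at 50% positives:", entropy(0.50))   # 1.0 bit
    print("entropy at  5% positives:", entropy(0.05))   # about 0.29 bits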

The answer to the question of whether predictive modelers need to know math is this: no, they don't need to understand the mathematical notation, but neither should they ignore the mathematics. Instead, we all need to understand the effects of the mathematics on the algorithms we use. “Those who ignore statistics are condemned to reinvent it,” warns Bradley Efron of Stanford University. The same applies to mathematics.