Monday, April 01, 2013

Do Predictive Modelers Need to Know Math?

(Note: this post was first published in the March 2013 Edition of the Predictive Analytics Times)
Predictive analytics is just a bunch of math, isn’t it? After all, algorithms in the form of matrix algebra, summations, integrals, multiplies and adds are the core of what predictive modeling algorithms do. Even rule-based approaches need math to compute how good the if-then-else rules are.

I was participating in a predictive analytics course recently, and the question a participant asked at the end of two days of instruction was this: “It’s been a long time since I’ve had to do this kind of math and I’m a bit rusty. Is there a book that would help me learn the techniques without the math?”

The question about math was interesting. But do we need to know the math to build models well? Anyone can build a bad model, but to build a good model, don’t we need to know what the algorithms are doing? The answer, of course, depends on the role of the analyst. I contend, however, that for most predictive analytics projects, the answer is “no”.

Let’s consider building decision tree models. What options does one need to set to build good trees? Here is a short list of common knobs that can be set by most predictive analytics software packages:

1. Splitting metric (CART style trees, C5 style trees, CHAID style trees, etc.)
2. Terminal node minimum size
3. Parent node minimum size
4. Maximum tree depth
5. Pruning options (standard error, Chi-square test p-value threshold, etc.)

The most mathematical of these knobs is the splitting metric. CART style trees use the Gini Index, C5 style trees use Entropy (information gain), and CHAID style trees use the chi-square test as the splitting criterion. A book I consider the best technical book on data mining and statistical learning methods, “The Elements of Statistical Learning”, has this description of the splitting criteria for decision trees, including the Gini Index and Entropy:
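The excerpt shown in the original post is not reproduced in this text. As a stand-in, here are the standard textbook definitions of the node impurity measures for a node m containing N_m records in region R_m, where \hat{p}_{mk} is the proportion of class k in the node and k(m) is the majority class. The notation is an assumption and may differ in detail from the excerpt the post displayed.

```latex
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)

\text{Misclassification error:} \quad
  \frac{1}{N_m} \sum_{x_i \in R_m} I\bigl(y_i \neq k(m)\bigr) = 1 - \hat{p}_{mk(m)}

\text{Gini index:} \quad
  \sum_{k \neq k'} \hat{p}_{mk}\,\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})

\text{Cross-entropy (deviance):} \quad
  -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
```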



To a mathematician, these make sense. But without a mathematics background, these equations will be at best opaque and at worst incomprehensible. (And these are not very complicated. Technical textbooks and papers describing machine learning algorithms can be quite difficult to understand, even for seasoned but out-of-practice mathematicians.)

As a predictive modeler with a mathematics background, I must say that the actual splitting equations almost never matter to me. Gini and Entropy often produce the same splits or at least similar splits. CHAID differs more, especially in how it creates multi-way splits. But even here, the biggest difference for me is not the math, but just that they use different tests for determining "good" splits.
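The claim that Gini and Entropy often agree can be checked on a small example. The sketch below is illustrative only: the class counts and split names are made up, and the helper functions are hypothetical rather than taken from any particular software package. It ranks three candidate binary splits by weighted child impurity under each metric.

```python
import math

def gini(counts):
    """Gini index of a node, given a tuple of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node, given a tuple of class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def weighted_impurity(children, impurity):
    """Average child impurity, weighted by the share of records in each child."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)

def rank_splits(splits, impurity):
    """Order split names from best (lowest weighted impurity) to worst."""
    return sorted(splits, key=lambda name: weighted_impurity(splits[name], impurity))

# Made-up candidate splits of a parent node with 50 positives and 50 negatives.
# Each child is written as (positives, negatives).
candidate_splits = {
    "split_a": [(40, 10), (10, 40)],
    "split_b": [(30, 20), (20, 30)],
    "split_c": [(45, 25), (5, 25)],
}

gini_ranking = rank_splits(candidate_splits, gini)
entropy_ranking = rank_splits(candidate_splits, entropy)

print("Gini ranking:   ", gini_ranking)
print("Entropy ranking:", entropy_ranking)
```

On these particular counts the two metrics put the splits in the same order. That is not guaranteed in general, which is consistent with "often" rather than "always".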

There are, however, very important reasons for someone on the team to understand the mathematics or at least the way these algorithms work qualitatively. First and foremost, understanding the algorithms helps us uncover why models go wrong. Models can be biased toward splitting on particular variables or even particular records. In some cases, it may appear that the models are performing well but in actuality they are brittle. Understanding the math can help remind us that this may happen and why.

The fact that linear regression uses a quadratic cost function tells us that outliers affect overall error disproportionately. Understanding how decision trees measure differences between the parent population and sub-populations informs us why a high-cardinality variable may be showing up at the top of our tree, and why additional penalties may be in order to reduce this bias. Seeing the computation of information gain (derived from Entropy) tells us that binary classification with a small target value proportion (such as having 5% 1s) often won't generate any splits at all.
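Two of these effects can be seen with a few lines of arithmetic. The sketch below uses made-up numbers and hypothetical helper names. The first part compares how much of the total error one large residual accounts for under a squared cost versus an absolute cost. The second part computes information gain for a parent node with 5% 1s, and compares the parent entropy (which is the most information gain any split can achieve) with that of a balanced node. How a given package turns small gains into "no split" depends on its stopping rules, which are not modeled here.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary target with proportion p of 1s."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(parent, children):
    """Information gain of a split.

    parent and each child are (n_records, n_ones) tuples.
    """
    n_parent, ones_parent = parent
    child_entropy = sum(
        n / n_parent * binary_entropy(ones / n) for n, ones in children
    )
    return binary_entropy(ones_parent / n_parent) - child_entropy

# Part 1: nine residuals of size 1 and one outlier residual of size 10.
residuals = [1.0] * 9 + [10.0]
squared_share = residuals[-1] ** 2 / sum(r ** 2 for r in residuals)   # 100 / 109
absolute_share = abs(residuals[-1]) / sum(abs(r) for r in residuals)  # 10 / 19

# Part 2: a parent node with 5% 1s, and a made-up split that doubles the
# rate of 1s in one child (10%) and leaves under 2% in the other.
rare_parent = (1000, 50)
rare_children = [(400, 40), (600, 10)]
rare_gain = information_gain(rare_parent, rare_children)

rare_ceiling = binary_entropy(0.05)     # upper bound on gain with 5% 1s
balanced_ceiling = binary_entropy(0.5)  # upper bound on gain with 50% 1s

print(f"Outlier share of squared error:  {squared_share:.3f}")
print(f"Outlier share of absolute error: {absolute_share:.3f}")
print(f"Information gain of the split:   {rare_gain:.4f} bits")
print(f"Maximum possible gain at 5% 1s:  {rare_ceiling:.4f} bits")
print(f"Maximum possible gain at 50% 1s: {balanced_ceiling:.4f} bits")
```

The single outlier accounts for most of the squared error but only about half of the absolute error. With 5% 1s the parent entropy is well under a third of a bit, so even a split that looks useful produces a gain that a fixed minimum-gain threshold could easily reject.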

The answer to the question of whether predictive modelers need to know math is this: no, they don’t need to understand the mathematical notation, but neither should they ignore the mathematics. Instead, we all need to understand the effects of the mathematics on the algorithms we use. “Those who ignore statistics are condemned to reinvent it,” warns Bradley Efron of Stanford University. The same applies to mathematics.

11 comments:

John Elder said...

I agree! The level of Math needed is not high. And Statistics - far more important and useful than Calculus! - could be much better understood and used if the math parts were de-emphasized. They are only there because geniuses a century ago figured out shortcuts to get to answers (under strict assumed conditions) that they never would have needed to figure out if they had computers and could experiment directly!

Dean Abbott said...

John: great to have you visit, and better still, to agree! I was struck years ago by your description of economist Julian Simon and resampling stats (juliansimon.com). This was maybe in the mid-90s while he was still alive. I found this lecture by him and it had a profound impact on my thinking about teaching technical ideas without requiring math.

Thanks!

Sandro Saitta said...

Nice post Dean! Let me answer in three points.

First, I agree on the fact that understanding equations is not needed for data mining "users". As long as concepts are understood, for example, using English words, everything is fine. I agree with John: statistics is more important than calculus. If one understands the concept of Gini's diversity index, there is no need to understand the equation in order to apply it. However, concepts such as probabilities, class distribution and overfitting are crucial to correctly apply predictive analytics.

Second, if we want to better understand why one choice is better than another in predictive analytics, equations will help a lot. Let me take an example with forecasting: an expert can tell you that Root Mean Square Error (RMSE) is a bad choice to compare accuracy between different time series (one should rather use Mean Absolute Scaled Error, for example). However, it's only when looking at the RMSE equation that you will understand why (it is scale dependent).

Third, as soon as one needs to improve/combine existing techniques or develop new ones (let's call this the machine learning field), I think equations are necessary (or at least a plus). In conclusion, although not needed, it is clearly a plus in most situations.

Will Dwinnell said...

I only partially agree with the three of you. I think it should be plain that the more math one has at one's disposal, the better a job can be done. I do concede that the math threshold for getting started in this field is probably lower than most people might think.

On the other hand, some of the things you describe as "understood using English words" (to use Sandro's phrase) still *are* math.

Sandro Saitta said...

You're right, Will: the concepts, even explained using English words, are still math. What I wanted to say is that equations are not needed if you explain concepts in English. I guess math is needed anyway.

Could a concept such as overfitting be explained without Math? You need to understand signal, noise and fitting (i.e. Math) to understand overfitting, right? So you need to know Math. The question is: to which level?

Anonymous said...

I do not agree completely with you, as in 90% of cases an analyst can build models without even understanding any of the mathematical concepts behind them (as most model applications are at a crude level). But if you need a really good model, then you need to thoroughly understand the pros and cons of all the techniques being weighed, without which you are bound to make mistakes.
So knowing the fundamentals is what makes the difference between a good and a bad analyst.

Anonymous said...

In response to the previous post, is there a difference between the fundamentals and the mathematical background? Math explains why certain techniques are better in certain situations, or the fundamentals as defined in the comment. One can identify general rules for techniques, such as this algorithm is better on this type of data, but to really understand a model, I think that one has to understand why and when a model works, which can only be truly done by understanding the math behind the model. This seems to me to be the real fundamentals of modeling. That being said, I do agree that the amount of math needed for beginning modelers is lower than most people would expect. It doesn't take much background to implement a model. But the more one understands the mathematical fundamentals of the models, the better job they will do at modeling.

Jared said...

I think that this very issue is being actively debated and decided. As a mathematician, when I hear this question my response is “Of course you need math!” Really though, this question goes a step deeper - what we are really asking is “What is a predictive modeller?” It doesn’t help that data mining, machine learning, big data, etc. are buzz words that people are trying to latch onto as the skill set is seeing increased demand.

Mathematics certainly is not required for someone to clean data and adjust knobs to spit out an answer. As data mining and predictive modelling penetrate business, software will be developed that will allow someone without training to produce useful results efficiently. In many cases, you don’t need to reinvent the wheel or create something new.

That said, a business of any size can be confronted by a pressing data problem that, until it is solved, will cost them dearly. It could be that they start collecting a new kind of data, or that a competitor hired someone that knew math and statistics to do something really cool and they are being out-innovated.

By its very nature, software used in business is not cutting edge, and building a new decision tree simply won’t cut it in some cases. I feel that in these cases, you are going to want someone that knows math to come in and take care of the new bottleneck. You’ll need to combine that with the best people on hand that understand the problem, and possibly create something new.

You probably wouldn’t need that person full time, but you will need them eventually to stay relevant. In general, you can’t expect people to create something new without a fundamental understanding of what they are working on. In general, predictive modellers cannot gain that insight without first understanding the mathematics.

Philip said...

You pointed out that understanding the mathematics of an algorithm can help you understand the nuances of the algorithm and the results it gives you. However, you imply that this is an extra bonus; someone should know how this works, but not necessarily the one using the model. I am definitely biased because I am trained as a mathematician, but I would argue that this insight is not just a bonus but is vital. I agree that math and statistics courses may not be the best taught classes, but to be truly good at using something you must understand how it works, at least to some degree. If a predictive modeler’s job is just to be a data preparer and processor, then maybe mathematical understanding is not required, but they will not be very good at their job.

Dean Abbott said...

Philip: Thanks for your comments, and I take them seriously coming from a fellow mathematician (and I wasn't even a "proper" mathematician--I was an Applied Mathematician!).

However, I still disagree that one needs to know how algorithms work mathematically and have seen it in action. I've had great data miners working for me who don't have a math or stats degree, nor did they understand the math behind trees, regression or neural networks. In fact, I worked with one guy who really only had an algebra background, but he did very well building predictive models (including neural networks).

The reason they could be so successful without knowing the math itself is that they understood *how* the algorithms behaved. You don't have to know the math to understand that outliers affect linear regression. You don't have to know the math to understand that trees are biased towards splitting on high-cardinality categorical variables. And you don't need to know math to see which records have the largest errors and understand how the most important inputs to the model could result in such large errors.

The math helps us understand *why* the algorithms behave the way they do. I also believe that a mathematics background is critical for algorithm development, if you want to modify algorithms in some way.

So is it better to understand the math? Sure. In fact, most of the best "untrained" modelers I've worked with go on to get further education just so they can understand the *why* of algorithms. But it isn't necessary for them to build good models.