What Programming Do Predictive Modelers Need to Know?
Dean Abbott
Note: this post, almost 7 years after posting (5/22/2017), has been flagged for copyright or some kind of digital rights issue. Of course, the specific issue wasn't identified, so I'm left guessing. I'm guessing it was a screen capture of one of the software products (??). So to be safe I took them all out. The irony is that the images are all from the vendor websites. I don't think any of them are current (I certainly hope not!).
I hope the article continues to provide the value it was intended to provide.
-- Dean, 1/13/2024
In most lists of the most popular software for doing data
analysis, statistics, and predictive modeling, the top software tools are
Python and R—command line languages rather than GUI-based modeling packages.
There are several reasons for this, perhaps most importantly that they are
free, they are robust programming languages supported by a very broad user community,
and they have extensive sets of algorithms.
For those interested in getting a job in analytics, the
article provides very useful rankings of software by the number of job
postings, led by Java, SAS, Python, C/C++/C#, R, SPSS, and Matlab. They also
provide a few examples of trends in the job postings for these tools over
the past 7 years, which are important to consider as well. These reveal, for
example, a nearly identical increase in Python and R compared with a decrease in
SAS over the past few years (SAS is still #2 overall in job postings, though,
because of the huge SAS install base).
The user interface that appears to have won the day in commercial
software for predictive modeling is the workflow-style interface where a user
connects icons that represent functions or tasks into a flow of functions. This
kind of interface has been in use for decades, and it is one that I was first
introduced to in the software package Khoros / Cantata in the early 90s
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.9854&rep=rep1&type=pdf).
Clementine was an early commercial tool using this paradigm (now IBM SPSS Modeler),
and now most tools, including those that have historically used drop-down
“Windows-style” menus, are embracing a workflow interface. And it’s not just
commercial tools that are built from the ground up using a workflow-style
interface: many open source tools like KNIME and RapidMiner embraced this
style from their beginnings.
Even though workflow interfaces have won the day, there are
several excellent software tools that still use a typical drop-down file menu
interface for a variety of reasons: some legacy and some functional. I still use
several of them myself.
There are several reasons I like the workflow interface.
First, it is self-documenting, much like command-line interfaces. You see
exactly what you did in the analysis, at least at a high level. To be fair,
some of these nodes have considerable critical customization options set inside
the nodes, but the function is self-evident nevertheless. Command line
functions have the same issue: there are often numerous options one has to
specify to call a function successfully.
Second, you can reuse the workflow easily. For example, if
you want to run your validation data through the exact same data preparation
steps you used in building your models, you merely connect a new data source to
the workflow. Third, you can explain what you did to your manager very easily
and visually without the manager needing to understand code.
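The reuse point has a direct analogue in code. Below is a minimal sketch, in plain Python with only the standard library, of data preparation settings learned once on the modeling data and then applied unchanged to validation data. The function names, the `income` column, and the two preparation steps (mean imputation and a log transform) are assumptions made for illustration, not taken from any particular tool.

```python
import math
import statistics

def fit_preparation(train_rows, impute_cols, log_cols):
    """Learn the preparation settings (column means) from the training data."""
    means = {
        c: statistics.mean(r[c] for r in train_rows if r[c] is not None)
        for c in impute_cols
    }

    def prepare(rows):
        """Apply the identical steps to any data set with the same columns."""
        prepared = []
        for r in rows:
            r = dict(r)  # copy so the caller's data is left untouched
            for c in impute_cols:
                if r[c] is None:
                    r[c] = means[c]  # always the training mean
            for c in log_cols:
                r[c] = math.log1p(r[c])
            prepared.append(r)
        return prepared

    return prepare

# Hypothetical data: None marks a missing value.
train = [{"income": 20000.0}, {"income": None}, {"income": 60000.0}]
validation = [{"income": None}, {"income": 35000.0}]

prepare = fit_preparation(train, impute_cols=["income"], log_cols=["income"])
train_ready = prepare(train)
validation_ready = prepare(validation)  # "connecting a new data source"
```

Calling `prepare` on a new data set plays the same role as connecting a new data source to an existing workflow: the steps and their settings stay fixed, and only the input changes.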
Another way to think of the workflow interface is as a visual programming interface. You string
together functional blocks from a list of functions (nodes) made available to
you by the software. So whether you build an analysis in a visual workflow or a
command line programming language, you still do the same thing: string together
a sequence of commands to manipulate and model the data. For example, you may
want to load a CSV file, replace missing values with the mean, transform your
positively-skewed variables with a log transform, split your data into training
and testing subsets, then build a decision tree. Each of these steps can be
done with a node (visual programming) or a function (programming).
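To make the comparison concrete, here is a minimal sketch of that sequence written as functions, one per step, much as a workflow would have one node per step. It uses only the Python standard library so every step is visible. The inline data, the column names, and all function names are hypothetical, and the decision tree is reduced to a single split (a "stump") to keep the example short. In practice these steps would more likely be done with a data analysis library and a full decision tree implementation.

```python
import csv
import io
import math
import random
import statistics

# Hypothetical stand-in for a CSV file on disk; blank cells are missing values.
RAW_CSV = """income,age,bought
20000,25,0
,32,0
45000,41,0
52000,,1
250000,38,1
61000,45,1
30000,29,0
90000,52,1
"""

def load_csv(text):
    """Step 1: read rows as dicts, turning blank cells into None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (float(v) if v != "" else None) for k, v in row.items()})
    return rows

def impute_mean(rows, column):
    """Step 2: replace missing values in one column with the column mean."""
    mean = statistics.mean(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = mean
    return rows

def log_transform(rows, column):
    """Step 3: compress a positively skewed column with log(1 + x)."""
    for r in rows:
        r[column] = math.log1p(r[column])
    return rows

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 4: shuffle a copy and split into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_stump(rows, features, target):
    """Step 5: a one-split decision tree, chosen to minimize training errors."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            for above in (0.0, 1.0):  # class predicted when value > threshold
                errors = sum(
                    (above if r[f] > threshold else 1.0 - above) != r[target]
                    for r in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above)
    _, f, threshold, above = best
    return lambda r: above if r[f] > threshold else 1.0 - above

# The "workflow": each call corresponds to one node in a visual tool.
rows = load_csv(RAW_CSV)
for col in ("income", "age"):
    rows = impute_mean(rows, col)
rows = log_transform(rows, "income")
train, test = train_test_split(rows)
model = fit_stump(train, ["income", "age"], "bought")
accuracy = sum(model(r) == r["bought"] for r in test) / len(test)
print(f"train={len(train)} test={len(test)} test accuracy={accuracy:.2f}")
```

The last block of calls reads much like a workflow diagram turned on its side: the order of the steps is the analysis, and the options passed to each function correspond to the settings inside each node.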
From this perspective, the biggest difference between visual
programming and command line programming is that R and Python have a larger set
of functions available to you. From an algorithm standpoint, this difference is
primarily manifested in obscure or new, cutting-edge algorithms. This is one
important reason why most visual programming interface tools have added R and
Python integration into their software, typically through a node that will run
the external code within the workflow itself. The intent isn’t to replace the
software, but to enhance it with functions not yet added to the software
itself. This is especially the case with leading-edge algorithms that have
support in R or Python already because of their ties with the academic
community.
Personally, I used to create code regularly for building
models, primarily in C and FORTRAN (3rd generation languages), though
also in scripting (4th generation) languages like Unix shell
programming (sh, csh, ksh, bash), Matlab, Mathematica, and others. But
eventually I used commercial software tools because my consulting clients used
them, and they contained most of what I needed to do to solve data mining
problems. Since they all have limited sets of functions, I sometimes had to
make creative use of the existing functions to accomplish what I
needed to do, but it didn’t stop me from being successful with them. And I
didn’t have to write and re-write code for each of these clients.
Each of these tools has its own way of performing an
analysis. Command line tools also have their own way of performing an analysis.
Much of what separates novices and experts in a software tool is not an
awareness of the particular functions or building blocks, but an understanding
of how best to use the existing building blocks. This is why I recommend
analysts learn a tool and learn it well, becoming an expert in the tool so that
the tool is used to its fullest potential.
Examples of workflows in some of the most popular and
acclaimed advanced analytics software packages are shown below. Note that the style that has dominated these top tools is the visual programming interface, and they are very similar in how the user builds these workflows.
Figure 2: Statistica workflow from my Predictive Analytics World workshop, Advanced Methods Hands-On
Figure 3: IBM SPSS Modeler Stream, from https://34f2c.https.cdn.softlayer.net/8034F2C/dal05/v1/AUTH_db1cfc7b-a055-460b-9274-1fd3f11fe689/5b0fd91ef0e1a6f21f6e983ccc775a37/offering_3d4b6acc-2e09-4451-b4d0-3385f4385aa8.png