Applied Data Science and Machine Learning

Friday, November 24, 2023

How Confident Are We of Machine Learning Model Predictions?

When we build binary classification models using algorithms like Neural Networks, XGBoost, Random Forests, etc., we get as an output of the models a prediction that ranges from 0 to 1. But how sure are we that this is a stable prediction? Does a score of 0.8 really mean 0.8? There is a difference between 0.8 +/- 0.05 and 0.8 +/- 0.4 after all!

One reason we love models grounded in statistics is that, because of their strong assumptions, we can compute many metrics that provide insight into how sure we are that the coefficients are correct and what confidence intervals exist for model predictions. For example, see "Calculating Confidence Intervals for Logistic Regression" (https://stats.stackexchange.com/questions/354098/calculating-confidence-intervals-for-a-logistic-regression) or books like "Applied Linear Statistical Models" (https://a.co/d/a7BR3pa).

However, for other model types (non-parametric for example), we don't have the benefit of these kinds of measures. Over the past decade or more, when I've needed this kind of information, I've used bootstrap sampling this way:

1. Build my model. This is the baseline. For the testing data set, each record gets a score.
2. Create 100 bootstrap samples of the training data.
3. Build 100 models (one for each bootstrap sample) using the same protocol as the baseline model.
4. Run each model through the testing set. We now have 100 scores for every record in the test set: a distribution of scores.
5. Compute the 90% confidence interval equivalent by identifying the probabilities (or model scores) at the 5th and 95th percentiles (ranks 5 and 95 of the 100 scores). For the 95% confidence interval, one would need to interpolate between the 2nd and 3rd, and also the 97th and 98th, ranked scores.
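The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code I have used on projects: the data is synthetic, the "model" is a tiny one-feature nearest-neighbor scorer standing in for whatever algorithm you actually use, and the helper names (knn_score, value_at_rank, bootstrap_scores) are made up for this example.

```python
import math
import random


def knn_score(train, x, k=15):
    """Stand-in model: share of positive labels among the k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(label for _, label in nearest) / len(nearest)


def value_at_rank(sorted_scores, rank):
    """Score at a 1-based rank; fractional ranks are linearly interpolated."""
    lower = math.floor(rank)
    frac = rank - lower
    value = sorted_scores[lower - 1]
    if frac > 0:
        value += frac * (sorted_scores[lower] - value)
    return value


def bootstrap_scores(train, test_x, n_boot=100, seed=0):
    """Return one sorted list of n_boot scores per test record."""
    rng = random.Random(seed)
    scores = [[] for _ in test_x]
    for _ in range(n_boot):
        # Bootstrap sample: draw training rows with replacement.
        sample = rng.choices(train, k=len(train))
        # "Build" a model on the sample and score every test record.
        for i, x in enumerate(test_x):
            scores[i].append(knn_score(sample, x))
    return [sorted(s) for s in scores]


# Synthetic data (an assumption): one feature in [0, 1], P(label = 1) = feature value.
data_rng = random.Random(42)
train = []
for _ in range(300):
    x = data_rng.random()
    train.append((x, 1 if data_rng.random() < x else 0))
test_x = [0.1, 0.5, 0.9]

baseline = [knn_score(train, x) for x in test_x]   # step 1: baseline scores
all_scores = bootstrap_scores(train, test_x)       # steps 2-4: 100 scores per record

for x, base, s in zip(test_x, baseline, all_scores):
    lo90, hi90 = value_at_rank(s, 5), value_at_rank(s, 95)      # ranks 5 and 95
    lo95, hi95 = value_at_rank(s, 2.5), value_at_rank(s, 97.5)  # interpolated ranks
    print(f"x={x}: baseline={base:.2f} 90%=[{lo90:.2f}, {hi90:.2f}] 95%=[{lo95:.2f}, {hi95:.2f}]")
```

To use a different algorithm, only knn_score needs to change; the resampling and the rank lookups stay the same.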

This works fine for any algorithm. However, I'd never seen a formal treatment of this topic; this is really to my discredit, as I had never done a significant search to find any theory related to it.

At this year's Machine Learning Week Europe (https://machinelearningweek.eu/), there was a talk on this subject given by Dr. Michael (Naatz) Allgöwer (https://www.linkedin.com/in/allgoewer/) entitled "CONFORMAL PREDICTION: A UNIVERSAL METHOD FOR UNCERTAINTY QUANTIFICATION" that introduced another way of accomplishing this objective. A Wikipedia summary of the approach is here (https://en.wikipedia.org/wiki/Conformal_prediction). I like what I've heard from Dr. Allgöwer at the conference and would like to experiment with this approach to learn how it works and what limitations it might have.

I hope to compare the approaches, with pros and cons, in the coming weeks. Stay tuned!

Thursday, November 02, 2023

What if Generative AI Turns out to be a Dud?

I follow posts on Twitter from different sides of the generative AI debates, including Yann LeCun (whom I've followed for decades) and Gary Marcus (whom I discovered just in the past few years). I'll post at some other time about my views, but I found this post by Marcus to be intriguing. I first published my comments here on LinkedIn.

Key quotes from the end of the article:

"Everybody in industry would probably like you to believe that AGI is imminent. It stokes their narrative of inevitability, and it drives their stock prices and startup valuations. Dario Amodei, CEO of Anthropic, recently projected that we will have AGI in 2-3 years. Demis Hassabis, CEO of Google DeepMind has also made projections of near-term AGI.

I seriously doubt it. We have not one, but many, serious, unsolved problems at the core of generative AI — ranging from their tendency to confabulate (hallucinate) false information, to their inability to reliably interface with external tools like Wolfram Alpha, to the instability from month to month (which makes them poor candidates for engineering use in larger systems)."

This is exactly how it comes across to me and is consistent with what I've experienced myself and what my closest colleagues who have used generative AI have also experienced.

The Marcus article: https://garymarcus.substack.com/p/what-if-generative-ai-turned-out

Monday, May 22, 2017

What Programming do Predictive Modelers Need to Know?

Dean Abbott

SmarterHQ and Abbott Analytics

Note: this post, almost 7 years after posting (5/22/2017), has been flagged for copyright or some kind of digital rights issue. Of course, the specific issue wasn't identified, so I'm left guessing. I'm guessing it was a screen capture of one of the software products (??). So to be safe I took them all out. The irony is that the images are all from the vendor websites. I don't think any of them are current (I certainly hope not!).

I hope the article continues to provide the value it was intended to provide.

-- Dean, 1/13/2024

(First published in Predictive Analytics Times, http://www.predictiveanalyticsworld.com/patimes/what-programming-do-predictive-modelers-need-to-know-0408152-2/5129/, 2014. Updated and edited here.)

In most lists of the most popular software for doing data analysis, statistics, and predictive modeling, the top software tools are Python and R—command line languages rather than GUI-based modeling packages. There are several reasons for this, perhaps most importantly that they are free, they are robust programming languages supported by a very broad user community, and they have extensive sets of algorithms.

One recent survey of software was published at r4stats.com (http://r4stats.com/articles/popularity/). It contains additional metrics not usually found in software comparisons, including scholarly articles that mention the software, Google Scholar hits, and job trends. These come in addition to the more typical summaries: user-reported use of software, such as the Rexer Analytics surveys (http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html) and polls on kdnuggets.com, and reviews of software by technology research companies such as Gartner (http://pages.alteryx.com/GartnerMQAdvancedAnalyticsNowAvailable-T.html), Forrester (http://global.sap.com/campaign/na/usa/CRM-XU13-BIP-PATDWS/index.html?urlid=CRM-XU13-BIP-PATDWS), and Hurwitz & Associates (http://www.sas.com/content/dam/SAS/en_us/doc/analystreport/hurwitz-advanced-analytics-107212.pdf). Thanks to r4stats for providing these links in their article.

For those interested in getting a job in analytics, the article provides very useful rankings of software by the number of job postings, led by Java, SAS, Python, C/C++/C#, R, SPSS, and Matlab. They also provide a few examples of the trends in job postings for the tools over the past 7 years, which are important to consider as well. These reveal, for example, a nearly identical increase in Python and R compared with a decrease in SAS over the past few years (SAS is still #2 overall in job postings, though, because of the huge SAS install base).

The user interface that appears to have won the day in commercial software for predictive modeling is the workflow-style interface, where a user connects icons that represent functions or tasks into a flow of functions. This kind of interface has been in use for decades; I was first introduced to it in the software package Khoros / Cantata in the early 90s (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.9854&rep=rep1&type=pdf). Clementine was an early commercial tool using this paradigm (now IBM SPSS Modeler), and now most tools, including those that have historically used drop-down “Windows-style” menus, are embracing a workflow interface. And it’s not just commercial tools that are built from the ground up using a workflow-style interface: many open source tools like KNIME and RapidMiner embraced this style from their beginnings.

Even though workflow interfaces have won the day, there are several excellent software tools that still use a typical drop-down file menu interface for a variety of reasons: some legacy and some functional. I still use several of them myself.

There are several reasons I like the workflow interface. First, it is self-documenting, much like command-line interfaces. You see exactly what you did in the analysis, at least at a high level. To be fair, some of these nodes have considerable critical customization options set inside the nodes, but the function is self-evident nevertheless. Command line functions have the same issue: there are often numerous options one has to specify to call a function successfully.

Second, you can reuse the workflow easily. For example, if you want to run your validation data through the exact same data preparation steps you used in building your models, you merely connect a new data source to the workflow. Third, you can explain what you did to your manager very easily and visually without the manager needing to understand code.

Another way to think of the workflow interface is as a visual programming interface. You string together functional blocks from a list of functions (nodes) made available to you by the software. So whether you build an analysis in a visual workflow or a command line programming language, you still do the same thing: string together a sequence of commands to manipulate and model the data. For example, you may want to load a csv file, replace missing values with the mean, transform your positively-skewed variables with a log transform, split your data into training and testing subsets, then build a decision tree. Each of these steps can be done with a node (visual programming) or a function (programming).
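To make the "function" side of that comparison concrete, here is roughly what the example sequence (load a csv file, impute with the mean, log transform, split, build a decision tree) might look like in Python. This is a minimal sketch, assuming pandas and scikit-learn are installed; the column names (income, age, target) and the synthetic data standing in for a real csv file are hypothetical.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data standing in for a csv file on disk.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=200)    # positively skewed
age = rng.integers(18, 80, size=200).astype(float)
age[::10] = np.nan                                    # inject some missing values
target = (income > np.median(income)).astype(int)
csv_text = pd.DataFrame(
    {"income": income, "age": age, "target": target}
).to_csv(index=False)

# 1. Load the csv file.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Replace missing values with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# 3. Log-transform the positively skewed variable.
df["income"] = np.log1p(df["income"])

# 4. Split into training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    df[["income", "age"]], df["target"], test_size=0.3, random_state=0
)

# 5. Build a decision tree and check it on the testing subset.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Each numbered comment corresponds to what would be a single node in a workflow tool. The steps follow the order given in the text; on a real project one would usually compute the imputation mean from the training subset only.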

From this perspective, the biggest difference between visual programming and command line programming is that R and Python have a larger set of functions available to you. From an algorithm standpoint, this difference is primarily manifested in obscure or new, cutting-edge algorithms. This is one important reason why most visual programming interface tools have added R and Python integration into their software, typically through a node that will run the external code within the workflow itself. The intent isn’t to replace the software, but to enhance it with functions not yet added to the software itself. This is especially the case with leading-edge algorithms that have support in R or Python already because of their ties with the academic community.

Personally, I used to create code regularly for building models, primarily in C and FORTRAN (3rd generation languages) though also in other scripting (4th generation) languages like Unix shell programming (sh, csh, ksh, bash), Matlab, Mathematica, and others. But eventually I used commercial software tools because my consulting clients used them, and they contained most of what I needed to do to solve data mining problems. Since they all have limited sets of functions, I would have to sometimes make creative use of the existing functions to accomplish what I needed to do, but it didn’t stop me from being successful with them. And I didn’t have to write and re-write code for each of these clients.

Each of these tools has its own way of performing an analysis. Command line tools also have their own way of performing an analysis. Much of what separates novices and experts in a software tool is not an awareness of the particular functions or building blocks, but an understanding of how best to use the existing building blocks. This is why I recommend analysts learn a tool and learn it well, becoming an expert in the tool so that the tool is used to its fullest potential.

Examples of workflows in some of the most popular and acclaimed advanced analytics software packages are shown below. Note that the style that has dominated these top tools is the visual programming interface, and they are very similar in how the user builds these workflows.

Figure 2: Statistica workflow from my Predictive Analytics World workshop, Advanced Methods Hands-On

Figure 3: IBM SPSS Modeler Stream, from https://34f2c.https.cdn.softlayer.net/8034F2C/dal05/v1/AUTH_db1cfc7b-a055-460b-9274-1fd3f11fe689/5b0fd91ef0e1a6f21f6e983ccc775a37/offering_3d4b6acc-2e09-4451-b4d0-3385f4385aa8.png