Thursday, April 05, 2012

Why Defining the Target Variable in Predictive Analytics is Critical

Every data mining project begins with defining what problem will be solved. I won't describe the CRISP-DM process here, but I use that general framework often when working with customers so they have an idea of the process.

Part of the problem definition is defining the target variable. I argue that this is the most critical step in the process that relates to the data, and more important than data preparation, missing value imputation, and the algorithm that is used to build models, as important as they all are.

The target variable carries with it all the information that summarizes the outcome we would like to predict from the perspective of the algorithms we use to build the predictive models. Yet this can be misleading in many ways. I'm addressing one way we can be fooled by the target variable here, so please indulge me as I lead you down the path.

Let's say we are building fraud models in our organization. Let's assume that in our organization, the process for determining fraud is first to identify possible fraud cases (by tips or predictive models), then assign the case to a manager who determines which investigator will get the case (assuming the manager believes there is value in investigating the case), then assign the case to an investigator, and if fraud is found, the case is tried in court, and ultimately a conviction is made or the party is found not guilty.

Our organization would like to prioritize which cases should be sent to investigators using predictive modeling. It is decided that we will use as a target variable all cases that were found to be fraudulent, that is, all cases that had been tried and a conviction achieved. Let's assume here that all individuals involved are good at their jobs and do not make arbitrary or poor decisions (which of course is also a problem!).

Let's also put aside for a moment the time lag involved here (a problem itself) and just consider the conviction as a target variable. What does the target variable actually convey to us? Of course our desire is that this target variable conveys fraud risk. Certainly when the conviction has occurred, we have high confidence that the case was indeed fraudulent, so the "1"s are strong and clear labels for fraud.

But, what about the "0"s? Which cases do they include?
--cases never investigated (i.e., we suspect they are not fraud, but don't know)
--cases assigned to a manager who never assigned the case to an investigator (he/she didn't think it was worth investigating)
--cases assigned to an investigator where the investigation has not yet been completed, was never completed, or determined that the case did not contain fraud
--cases that went to court but were found "not guilty"

Remember, all of these are given the identical label: "0"

That means that any cases that look fraudulent on the surface, but that there were insufficient resources to investigate, are called "not fraudulent". That means cases that were investigated, but whose investigator was taken off the case to investigate other cases, are called "not fraudulent". It means too that court cases that were thrown out due to a technicality unrelated to the fraud itself are called "not fraudulent".
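
To make the collapse concrete, here is a minimal sketch in Python. The case records and status names (never_investigated, manager_declined, and so on) are hypothetical, invented only to mirror the stages described above; they do not come from any real system.

```python
from collections import Counter

# Hypothetical case outcomes, one per stage where a case can stop.
# The status names are illustrative assumptions, not a real schema.
CASES = [
    {"id": 1, "status": "never_investigated"},
    {"id": 2, "status": "manager_declined"},
    {"id": 3, "status": "investigation_incomplete"},
    {"id": 4, "status": "investigated_no_fraud"},
    {"id": 5, "status": "acquitted"},
    {"id": 6, "status": "convicted"},
    {"id": 7, "status": "never_investigated"},
]


def conviction_label(case):
    """Target defined as 'final conviction': 1 only for convicted cases."""
    return 1 if case["status"] == "convicted" else 0


def statuses_by_label(cases):
    """Count which underlying statuses end up behind each label."""
    groups = {0: Counter(), 1: Counter()}
    for case in cases:
        groups[conviction_label(case)][case["status"]] += 1
    return groups


groups = statuses_by_label(CASES)
for label, counts in groups.items():
    print(label, dict(counts))
```

In this toy data, five different situations sit behind the single label "0", and only one of them (investigated, no fraud found) says anything direct about the absence of fraud.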

In other words, the target variable defined as only the "final conviction" represents not only the risk of fraud for a case, but also the investigation and legal system. Perhaps complex cases that are high risk are thrown out because they aren't (at this particular time, with these particular investigators) worth the time. Is this what we want to predict? I would argue "no". We want our target variable to represent the risk, not the system.

This is why when I work on fraud detection problems, the definition of the target variable takes time: we have to find measures that represent risk and are informative and consistent, but don't measure the system itself. For different customers this means different trade-offs, but usually it means using a measure from earlier in the process.
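
As one illustration of "a measure from earlier in the process", the sketch below labels cases by the investigator's finding rather than by the court outcome. This is only one possible choice, not a recommendation for every customer: the status names are hypothetical, and treating never-investigated or unfinished cases as "unknown" (and leaving them out of training) is an assumption made here for illustration.

```python
from typing import Optional

# Hypothetical status names. In the process described above, a case only
# reaches court if the investigator found fraud, so court outcomes
# (convicted or acquitted) imply a positive investigator finding.
FRAUD_FOUND = {"investigator_found_fraud", "convicted", "acquitted"}
NO_FRAUD_FOUND = {"investigated_no_fraud"}


def earlier_stage_label(status: str) -> Optional[int]:
    """Label by investigator finding; None means the outcome is unknown."""
    if status in FRAUD_FOUND:
        return 1
    if status in NO_FRAUD_FOUND:
        return 0
    # Never investigated, declined by a manager, or unfinished:
    # we do not know whether fraud occurred (assumption: treat as unlabeled).
    return None


statuses = [
    "never_investigated",
    "manager_declined",
    "investigation_incomplete",
    "investigated_no_fraud",
    "acquitted",
    "convicted",
]

labeled = [(s, earlier_stage_label(s)) for s in statuses]
training = [(s, y) for s, y in labeled if y is not None]
unknown = [s for s, y in labeled if y is None]

print("training:", training)
print("unknown:", unknown)
```

With this definition an acquittal on a technicality no longer counts as "not fraud", and cases that were never looked at are kept apart instead of being mixed into the "0"s. The trade-off is fewer labeled cases and a label that now depends on investigator judgment.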

So in summary, think carefully about the target variable you are defining, and don't be surprised when your predictive models predict exactly what you told them to!

34 comments:

James Taylor said...

Hey Dean
Nice post. Blogged a "and another thing" response on my blog
James
