Sunday, January 13, 2008

Data Mining: Interesting Ethical Questions

Data mining permits useful extrapolation from sometimes obscure clues. Information which human experts have ignored as irrelevant has been eagerly snapped up by data mining software. This leads to interesting ethical questions.

Consider the risk of selling an individual automobile insurance for one year. Many factors are related to this risk. Some are obvious, such as incidence of previous accidents, traffic violations or average number of miles driven per year. Other risk factors may not be so obvious, but are nonetheless real. Suppose that it could be shown statistically that, when added to information already in use, late payment of utility bills incrementally improved prediction.

One might take the perspective that this is a business of prediction, not explanation, so- whatever the connection- this information should be added to the insurance risk model. This perspective reasons: if the connection is statistically significant, however strange it may seem, we should conclude that it is real and it should be exploited for business purposes.
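To make this concrete, here is a small sketch (all numbers invented for illustration) in which a hidden "carelessness" trait drives prior accidents, late utility-bill payments, and future claims alike. In such a world, late payments add real predictive value on top of accident history, even though they have nothing directly to do with driving:

```python
import random

random.seed(0)

# Invented numbers: a hidden "carelessness" trait raises the odds of
# prior accidents, late utility-bill payments, and future claims alike.
def simulate(n=20000):
    rows = []
    for _ in range(n):
        careless = random.random() < 0.3
        acc = random.random() < (0.40 if careless else 0.10)    # prior accident
        late = random.random() < (0.50 if careless else 0.10)   # late bills
        claim = random.random() < (0.35 if careless else 0.05)  # future claim
        rows.append((acc, late, claim))
    return rows

rows = simulate()

def claim_rate(select):
    """Claim rate among customers matching the given predicate."""
    chosen = [claim for acc, late, claim in rows if select(acc, late)]
    return sum(chosen) / len(chosen)

overall = sum(claim for _, _, claim in rows) / len(rows)
print(f"overall claim rate:            {overall:.3f}")
print(f"given a prior accident:        {claim_rate(lambda a, l: a):.3f}")
print(f"given accident AND late bills: {claim_rate(lambda a, l: a and l):.3f}")
```

The conditional claim rates climb as the late-payment information is added, which is exactly the "incremental improvement" described above.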

Obviously, there is a countervailing perspective which has the customer asking, "What the... ? What do my utility bills have to do with my car insurance?" Even extremely laissez-faire governments may intervene in markets and forsake economic efficiency in favor of other priorities. In the United States, for example, certain types of discrimination in lending are illegal.

Another thing to consider (again, granting that the utility bill-automobile risk connection is real) is that prohibiting the use of utility bill payments in auto insurance risk prediction implies that less risky customers will be paying for riskier customers.



Sandro Saitta said...

I think the opinion of the customer is useless. I mean useless in the sense that the customer will be biased by the fact that he wants to pay less.

Knowing this, each customer will only agree to base the risk on whichever personal data happens to favor him, and that data will certainly be different for each customer. Since the models are far too complex to be intuitive, the customer will never know which of his personal data will make his risk increase or, on the contrary, decrease.

On the other side, insurance companies want to earn more (or lose less) money, of course. That's why they use data mining. So, they are biased as well by the money factor, but the models they create are built (or at least should be built) automatically. If this is the case, and some obscure parameter influences the risk factor, I think it is fine.

Anonymous said...

Will's post seems to be mixing two issues here, which he does acknowledge at least in part. But I think it is worth pointing out. The connection alluded to between utility bill payments and auto insurance suggests that there is *correlation* between these two variables. As statisticians constantly remind us data miners, correlation is not the same as *causation*, and just because two variables are correlated does not mean that one implies the other. In other words, utility bill payments may be correlated with auto insurance risk but not have any meaningful causal relationship with it. One should always be careful not to mix the two concepts.

Once this is understood and causation is established, rather than mere correlation (and, by the way, the only reliable method to establish causation is through controlled trials), the second issue is indeed an ethical one, as suggested by Will. And often (too often) these tend to be regulated by governments or other outside forces. It seems that some causal relationships do not sit well with many of us. Ian Ayres' book "Super Crunchers" offers a number of excellent examples of causal relationships that defy intuition or traditional "wisdom" (he uses controlled trials; the book is nicely written and most enjoyable), and these do indeed raise some interesting ethical questions. I think Will may be thinking along these lines...

Anonymous said...

I don't think that insurance companies, or any other business that would use data mining, would or necessarily should care about the difference between correlation and causation in factors over which they have no control (with exceptions, of course, for anything medical or legal).
If they can determine that people with freckles have fewer car accidents, why shouldn't they offer people with freckles lower rates? If mullets correlate with speeding tickets, chop it off or pay more.

When money is the driver, causation is irrelevant, and money is usually the driver.

Will Dwinnell said...

Jamie makes a good point. The question of correlation versus causation will be of only philosophical interest to a data mining practitioner, assuming that the underlying behavior being modeled does not change (and this will often be a safe bet).

An illustration should make this subtlety clear. Suppose that insurance data indicates that people who play the board game Monopoly are better life insurance risks than people who do not. An insurance company might very well like to take advantage of such knowledge. Is there necessarily a causal arrow between these two items? No, of course not. Monopoly might not "make" someone live longer, and living longer may not "make" someone play Monopoly. Might there exist another characteristic which gives rise to both of these items (such as being a home-body who avoids death by automobile)? Yes, quite possibly. The insurance company does not care, as long as the relationship continues to hold.

The point is that while A may not cause B, both A and B may be caused by unknown factor C, with conditional probabilities such that the presence of A can be used as a statistical predictor of B.
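This can be made exact with a toy calculation (numbers invented for illustration). Let C be the hidden "home-body" trait, A "plays Monopoly", and B "long-lived", with A and B conditionally independent given C, so that neither causes the other:

```python
# Invented numbers: hidden trait C ("home-body") makes both Monopoly-playing
# (A) and longevity (B) more likely; A and B are conditionally independent
# given C, so there is no causal arrow between them.
p_c = 0.20                        # P(C)
p_a = {True: 0.60, False: 0.20}   # P(A | C), P(A | not C)
p_b = {True: 0.90, False: 0.60}   # P(B | C), P(B | not C)

def marginal(p_x):
    """Marginalize a conditional table over C."""
    return p_c * p_x[True] + (1 - p_c) * p_x[False]

p_a_marg = marginal(p_a)                                            # P(A)
p_b_marg = marginal(p_b)                                            # P(B)
p_ab = p_c * p_a[True] * p_b[True] + (1 - p_c) * p_a[False] * p_b[False]
p_b_given_a = p_ab / p_a_marg                                       # P(B | A)

print(f"P(B)     = {p_b_marg:.3f}")     # -> 0.660
print(f"P(B | A) = {p_b_given_a:.3f}")  # -> 0.729
```

Observing A raises the probability of B from 0.660 to about 0.729 purely through the common cause C, which is all the insurance company needs.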

Importantly: this relationship might well collapse if the popular media report that "Monopoly makes you live longer". Suddenly, every uncritical boob begins playing Monopoly in the hope of living an extra few years. The relationship is wrecked (at least until the fad passes), and the utility of this indirect information is decreased.

Christophe Giraud-Carrier said...

I think we can reconcile the two points of view here. It seems to me that the statements "assuming that the underlying behavior being modeled does not change" and "as long as the relationship continues to hold" can be viewed (in some way, see below) as effectively equivalent to what statisticians (I use the term loosely; I am not a statistician myself) regard as "controlling for variables". By taking this kind of dynamic approach, where the relationship (or behavior) is "continuously" monitored for validity and the action is taken only as long as that relationship holds, the user is, I agree, relieved of the problem of lurking variables.

Using Will's example: statisticians would indeed argue that there may be a confounding variable that explains the finding, one that has nothing to do with playing Monopoly. Will proposed one: "being a home-body". I'll continue the argument with that one. In this case, it may be that there are more home-body Monopoly players than not, and it is the "home-bodyness" (if such a word exists) that explains the lower risk for life insurance, not the Monopoly-playing. Now, a statistician would be right in this case, and if one had to come up with the "correct" answer and build a model that remains accurate for now AND the future, you would have to accept the statistician's approach and build your model using home-bodyness rather than Monopoly-playing. There is little arguing here.

I think what Will (and probably Jamie) may be getting at is that there is a way to, in some sense, side-step this issue; namely: monitor the relationship. Indeed, if I keep on looking and checking that the correlation continues to hold, then I don't care about any confounding effect. If there is none, then the correlation also manifests a causation and I am safe; if there are confounding effects, they will become manifest over time as the observed correlation weakens. Hence, I can choose at that time to invalidate my model. But in the meantime, it served me well, was accurate, and I did not worry about controlling for anything.

Going back to the example: as long as the correlation is strong, I am OK. If it turns out that it is home-bodyness that causes the lower risk, I may eventually see more and more low-risk customers who are home-bodies but do not play Monopoly. In this case the originally observed correlation will decrease, telling me that I may wish to discontinue the use of my model.
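This monitoring strategy can be sketched as a toy drift check (all numbers invented): track the Monopoly/longevity rate gap in successive windows of data, and flag the model for retirement when the gap collapses, here because a hypothetical fad makes most people play regardless of the hidden trait:

```python
import random

random.seed(1)

# Invented numbers: hidden home-body trait drives both Monopoly-playing and
# longevity; during the hypothetical "fad", most people play regardless.
def draw(fad):
    home_body = random.random() < 0.3
    p_plays = 0.80 if fad else (0.70 if home_body else 0.20)
    plays = random.random() < p_plays
    long_lived = random.random() < (0.90 if home_body else 0.50)
    return plays, long_lived

def rate_gap(window):
    """P(long-lived | plays) minus P(long-lived | does not play)."""
    players = [b for a, b in window if a]
    others = [b for a, b in window if not a]
    if not players or not others:
        return 0.0
    return sum(players) / len(players) - sum(others) / len(others)

before_fad = [draw(fad=False) for _ in range(5000)]
during_fad = [draw(fad=True) for _ in range(5000)]

gap_before = rate_gap(before_fad)
gap_during = rate_gap(during_fad)
print(f"rate gap before the fad: {gap_before:.3f}")
print(f"rate gap during the fad: {gap_during:.3f}")
if gap_during < gap_before / 2:
    print("association has weakened -> time to revisit the model")
```

The gap is substantial before the fad and collapses during it, which is exactly the signal that the model should be invalidated.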

Another way to look at all this is as follows: the statistician seeks the true cause, the one that remains valid through time (and which indeed may be viewed as only of philosophical interest, at least in the context of business; in medicine, one may have a different perspective, as also pointed out by Jamie and Will). On the other hand, the (business) practitioner seeks mainly utility or applicability, which may become invalid over time. The drawback, of course, is that when the model is no longer valid, the practitioner has no idea what the cause may be or where to go next. But maybe, as suggested by Jamie and Will, he/she does not care. From a strictly business standpoint, he/she was able to quickly build a model with high utility (maybe over a shorter period of time) instead of having to expend a lot of resources (and maybe not even being able) to build a "causation" model.

Will Dwinnell said...

Yes, Christophe, I believe I would agree with everything in your second comment. Thank you!

Anonymous said...

Looking at this on moral grounds: is it fair to change rates based on statistical analysis of factors, even if the factors are things a person cannot change? Should your race, religion, or sex (well, that one is accepted) impact your rating? If blondes were more prone to accidents, could they dye their hair to lower their car insurance? There has to be regulation of factors outside your control, or pertaining to protected rights like religion.

James said...

This question is particularly relevant these days. With increased automation in predictive analytics (which insurance companies use to predict and mitigate risk), companies will be able to find correlations a lot quicker, though I am not sure with more accuracy. That is my concern: how accurate are predictive analytics solutions, and does the increased speed make it harder to keep a moral check on companies?
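James's worry about speed versus accuracy has a concrete statistical face: the more candidate factors an automated system scans, the more likely it is to surface a correlation that is pure noise. A toy sketch (all data random here, so any apparent "signal" is spurious by construction):

```python
import random

random.seed(2)

# All random by construction: the outcome and every candidate "factor" are
# independent fair coin flips, so any apparent association is pure noise.
n_people, n_factors = 500, 200
outcome = [random.random() < 0.5 for _ in range(n_people)]
factors = [[random.random() < 0.5 for _ in range(n_people)]
           for _ in range(n_factors)]

def rate_gap(factor):
    """Difference in outcome rate between people with and without the factor."""
    with_f = [o for f, o in zip(factor, outcome) if f]
    without_f = [o for f, o in zip(factor, outcome) if not f]
    return abs(sum(with_f) / len(with_f) - sum(without_f) / len(without_f))

best = max(rate_gap(f) for f in factors)
print(f"largest rate gap among {n_factors} pure-noise factors: {best:.3f}")
```

Even though every factor is unrelated to the outcome, the best-looking one shows a sizable gap, which is why automated scanning at speed demands stronger validation, not less.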