Thursday, November 08, 2007

Random things...

I was just looking at my favorite economics blog, The Skeptical Optimist, and saw a post on randomness based on two books the blog author, Steve Conover, is reading: The Black Swan and Fooled by Randomness. This caught my eye--a quote from one of the two books (it was unclear to me which one):
Here's an example of his point about randomness: How many times have you heard about mutual fund X's "superlative performance over the last five years"? Our typical reaction to that message is that mutual fund X must have better managers than other funds. Reason: Our minds are built to assign cause-and-effect whenever possible, in spite of the strong possibility that random chance played a big role in the outcome.


He then gives an example of two stock pickers, one of whom gets it "right" about 1/2 the time, and a second who gets it right 12 consecutive times. The punch line is this:
Taleb's point: Randomness plays a much larger role in social outcomes than we are willing to admit—to ourselves, or in our textbooks. Our minds, uncomfortable with randomness, are programmed to employ hindsight bias to provide retroactive explanations for just about everything. Nonetheless, randomness is frequently the only "reason" for many events.


I personally don't agree philosophically with the role given to randomness (I would prefer to say that many outcomes are unexplained rather than say randomness is the "reason" or "cause"--randomness does nothing itself; it is our way of saying "I don't know why" or "it is too hard to figure out why").

But that said, this is an extremely important principle for data miners. We have all seen predictive models that apparently do well on one data set and then do poorly on another. Usually this is attributed to overfitting, but it doesn't have to be solely an overfitting problem. David Jensen of UMass described the phenomenon of oversearching for models in the paper Multiple Comparisons in Induction Algorithms: you can happen upon a model that appears to work well but is just a happenstance find.
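
To make the oversearching point concrete, here is a minimal sketch (purely illustrative, not from Jensen's paper): score enough random "models" against a random target and the best of them looks impressive on the data it was searched over, then falls back to chance on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_models = 100, 100, 1000

# A random binary target with no real structure to find
y_train = rng.integers(0, 2, n_train)
y_test = rng.integers(0, 2, n_test)

# Each "model" is just a random prediction rule (a coin flip per case)
preds_train = rng.integers(0, 2, (n_models, n_train))
preds_test = rng.integers(0, 2, (n_models, n_test))

# Pick the model with the best accuracy on the data we searched over
train_acc = (preds_train == y_train).mean(axis=1)
best = train_acc.argmax()

print(f"best of {n_models} random models, searched data: {train_acc[best]:.2f}")
print(f"same model on fresh data: {(preds_test[best] == y_test).mean():.2f}")
```

With these settings the "winner" typically scores in the mid-0.60s on the data it was searched over and about 0.50 on the fresh sample.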

The solution? One great help in overcoming these problems is sampling: the train/test/validate subset method, or resampling methods (like the bootstrap). But a mindset of skepticism about models helps tremendously in digging to ensure the models truly are predictive and not just a random match to the patterns of interest.
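
As a concrete sketch of that sampling idea (synthetic data; scikit-learn is used purely for convenience, and none of the names or numbers come from the post), split the data three ways and then bootstrap the held-out accuracy to see how much of it could be luck:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

# Three-way split: fit on train, tune/select on test, report on validate
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)
# (X_test/y_test would be used for tuning and model selection; omitted here)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
val_correct = model.predict(X_val) == y_val
print(f"held-out accuracy: {val_correct.mean():.2f}")

# Bootstrap the held-out accuracy: resample the per-case results with
# replacement to get a rough interval around the point estimate
boot = [rng.choice(val_correct, size=val_correct.size, replace=True).mean()
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap interval: [{lo:.2f}, {hi:.2f}]")
```

A model whose bootstrap interval comfortably excludes chance-level accuracy is at least harder to dismiss as a random match, though, as the comments below point out, no held-out sample can contain events that have never happened yet.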

9 comments:

  1. Anonymous 6:52 AM

    interesting observation ... i specifically agree with your last few statements :)
    - A

  2. Anonymous 12:24 AM

    I believe you can say that many outcomes are random iff you also say how you define randomness.

    Rewrite your data as a finite binary string and relate randomness with compressibility as follows: a string is (Kolmogorov) random iff it is not compressible. Then it is easy to show that most strings of a fixed length have to be random (since only a few strings have enough structure in order to be compressed). More information can be found in section 2 of my technical report and references therein.
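
    A rough, purely illustrative version of that counting point: there are 2^n binary strings of length n but fewer than 2^(n-k) descriptions shorter than n-k bits, so at most a fraction 2^(-k) of strings can be compressed by k or more bits. The snippet below uses zlib as a crude stand-in for description length (true Kolmogorov complexity is uncomputable):

    ```python
    import os
    import zlib

    # Random bytes have no structure for the compressor to exploit;
    # a highly regular string of the same length compresses dramatically.
    random_bytes = os.urandom(10_000)
    structured = b"0123456789" * 1_000

    for name, s in [("random", random_bytes), ("structured", structured)]:
        ratio = len(zlib.compress(s, 9)) / len(s)
        print(f"{name:>10}: compressed to {ratio:.2%} of original length")
    ```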

  3. I personally don't agree philosophically with the role given to randomness (I would prefer to say that many outcomes are unexplained rather than say randomness is the "reason" or "cause"--randomness does nothing itself; it is our way of saying "I don't know why" or "it is too hard to figure out why").

    There is an interesting comment about exactly this sort of thing in
    "Mathematical Methods for Artificial Intelligence and Autonomous Systems", by Dougherty and Giardina:

    "Given the same input, an intelligent being might appear to act in varying ways. Of course, we might argue that the observer is just lacking sufficient information to pin down the stimulus-response relationship. Be that as it may, scientifically it is the observational model which is paramount, not speculation regarding this or that occult cause."

    I suppose one's perspective defines this issue. A seeker of "ultimate knowledge" may well reject the idea of randomness in the description of behaviors, whereas an analyst seeking to describe, simulate or predict a given behavior, known only from historical examples, might be satisfied with a probabilistic ("partially random") description.

  4. Stijn: thanks for your post. I took a quick look at Chapter 2 of your Technical Report (which can be found here) and found it very interesting.

    One statement that struck me, related to your comment, is "a string is (Kolmogorov) random iff it is not compressible". In the technical report, it says that

    "we need to show for the second se-
    quence that its description has length not significantly different from the length of the sequence itself" (emphasis mine).

    Significance is quite interesting here because, as Will implies in his comment, if our objective is a model that predicts well, then given enough examples, additional complexity will be welcomed if the predictive accuracy is better, even if the model isn't readily understood (wording here compatible with Will's post), or if the compression of the information is not "significant" (wording of your post)--at least it seems to me that this is the case.

    But that said, has the Kolmogorov test been used (say, in commercial software or in some current papers) as a test of model accuracy? It would be quite interesting to see to what degree a model's predictions are not compressible (i.e., appear random).

  5. We have all seen predictive models that apparently do well on one data set and then do poorly on another. Usually this is attributed to overfitting, but it doesn't have to be solely an overfitting problem.

    I definitely agree with this, and I think it is also worth mentioning that the problem is also present in feature selection, for example. A selected subset of features may give very good classification accuracy purely by chance. Even using boosting, we have no guarantee that the selected features are in general the best (since the solution space is generally enormous).
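
    To see how easily this happens, here is a small sketch (synthetic data; scikit-learn names are used purely for convenience and nothing here comes from the post): with 500 pure-noise features and random labels, selecting the "best" 10 features on all of the data makes a later cross-validation look far better than chance, while doing the selection inside each training fold removes the illusion.

    ```python
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500))   # pure noise features
    y = rng.integers(0, 2, 100)       # random labels, nothing to learn

    # Leaky: select features on all the data, then cross-validate the classifier
    X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
    leaky = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

    # Honest: the selection happens inside each training fold
    pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
    honest = cross_val_score(pipe, X, y, cv=5).mean()

    print(f"selection outside CV: {leaky:.2f}")   # typically well above 0.5
    print(f"selection inside CV:  {honest:.2f}")  # close to chance
    ```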

    By the way, your discussion for this post is very interesting. Thanks!

  6. A little Devil's Advocate:

    In relation to testing with validation sets, Taleb would claim that no matter what sort of validation set you use, you will not have enough data to incorporate all unpredictable events.

    For example, most will not use enough data to incorporate Black Monday of 1987. Even for those who do, lots of modelers exclude that as an outlier. Or if you were making models in 1986 and you were thorough enough to test with every single tick of historical data, you wouldn't have considered 1987 a possibility.

    That's OK, and you can still be successful as a trader. But big moves can wipe out years of steady gains if you are not positioned properly to limit risk.

    Mandelbrot too claims that there just isn't enough historical data to backtest against every possibility. That's why he produced artificial time series using fractal simulations.

    As for randomness being the cause of trading success, it's not a matter of "I can't explain it, so it's random." But suppose markets were efficient and couldn't be beaten, so that every trader has a 50/50 chance of beating the market in any given year. With 1,000 traders, you would still expect around 31 star traders who beat the market for 5 years straight...explained only by pure randomness.
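
    That expectation is just 1,000 x (1/2)^5 ≈ 31, and a quick simulation (illustrative only) tells the same story:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_traders, n_years = 1_000, 5

    # Expected number of "stars" under pure chance: 1000 / 2**5 = 31.25
    print("expected:", n_traders * 0.5 ** n_years)

    # Simulate: a trader is a "star" if they beat the market every single year
    beats = rng.random((n_traders, n_years)) < 0.5
    print("simulated:", beats.all(axis=1).sum())
    ```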

  7. Anonymous 5:42 PM

    Hi,

    A little more. For some, "Black Monday" was an outlier on the positive side of profit. Part of the (subjective) side of the trader's ability is to think of and test for what hasn't happened yet. It should be obvious that history will not contain an explicit representation of everything possible.

    It is not only the results that are important, but also how one achieved those results.

    Jay

  8. Anonymous 5:51 PM

    Further...

    One should also not assume that all of history comes from the same underlying process.

    The problem is far more complex than many imagine, but that need not mean the model(s) and resulting solution are complex.

    Further, everything is relative to what is "next" to it. A person may achieve "decent" results one way, and someone else may always achieve slightly (or far) better results in a far more complex fashion.

    Jay

  9. Anonymous 10:50 AM

    The randomness of markets is clearly the product of many different non-random processes that we do not understand but could, in theory, trace back.

    Also, all random number generators produce pseudo-random numbers, i.e., they are based on a deterministic algorithm.

    However, there is also fundamental randomness, namely in quantum mechanics. When you do a measurement, you can only measure the position plus some random noise, which it seems is irreducible.
