Friday, February 16, 2007

Another Perspective on Data Mining and Terrorism

Much has been written recently about data mining's likely usefulness as a defense against terrorism. This posting takes "data mining" to mean sophisticated and rigorous statistical analysis, and excludes data-gathering functions. Privacy issues aside, specific claims have been made regarding data mining's technical capabilities as a tool in combating terrorism.

Very specific technical assertions have been made by other experts in this field, to the effect that predictive modeling is unlikely to usefully identify individuals imminently carrying out physical attacks. The general reasoning has been that, despite the magnitude of their tragic handiwork, there have been too few positive instances to construct accurate models. As far as this specific assertion goes, I concur.

Unfortunately, this notion has somehow been expanded in the press, and in the on-line writings of authors who are not experts in this field, into the much broader claim that "data mining cannot help in the fight against terrorism because it does not work". Such overly general statements are demonstrably false. For example, a known significant component of international terrorism is its financing, notably through money laundering, tax evasion and simple fraud. These financial crimes have been under attack by data mining for over ten years.
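To make the financial-crime point concrete, here is a minimal sketch of one classic anti-money-laundering screen: flagging "structuring", i.e. many cash deposits kept just under the $10,000 currency-transaction-reporting threshold. The account data, the 90% "near-threshold" band, and the three-deposit cutoff are all illustrative assumptions, not a description of any real system.

```python
from collections import defaultdict

# Toy data: (account_id, deposit_amount) -- invented for illustration.
deposits = [
    ("A", 9500), ("A", 9800), ("A", 9700), ("A", 9900),
    ("B", 1200), ("B", 15000),
    ("C", 9900), ("C", 400),
]

REPORT_THRESHOLD = 10_000   # U.S. currency transaction report threshold
NEAR_FRACTION = 0.9         # assumed: "just under" means 90-100% of threshold
MIN_NEAR_DEPOSITS = 3       # assumed cutoff for raising a flag

# Count near-threshold deposits per account.
near_counts = defaultdict(int)
for account, amount in deposits:
    if REPORT_THRESHOLD * NEAR_FRACTION <= amount < REPORT_THRESHOLD:
        near_counts[account] += 1

# Flag accounts with a suspicious pattern of near-threshold deposits.
flagged = [a for a, n in near_counts.items() if n >= MIN_NEAR_DEPOSITS]
print(flagged)  # -> ['A']
```

Real systems combine many such indicators with statistical models, but even this toy rule shows why financial crime, with its abundance of historical cases, is far more tractable for data mining than attack prediction.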

Further, terrorist organizations, like other human organizations, depend on human infrastructure. Behind the man actually conducting an attack stands a network of support personnel: handlers, trainers, planners and the like. I submit that data mining might be useful in identifying these individuals, given their much larger numbers. Whether or not this would work in practice could only be known by actually trying.

Lastly, the issues surrounding data mining's ability to tackle terrorism have frequently been dressed up in technical language by reference to "false positives" and "false negatives", which I believe to be a straw-man argument. Solutions to classification problems frequently involve the assessment of probabilities, rather than simple "terrorist" / "non-terrorist" outputs. The output of data mining in this case should not be used as a replacement for the judicial branch, but as a guide: estimated probabilities can be used to prioritize, rather than condemn, individuals under scrutiny.


Dean Abbott said...

I think there are two particular issues that should be separated here.

First, if by "terrorism" we mean the big-attack variety, like 9/11 or the London subway attacks, data mining algorithms are not well-suited to providing insight into these problems because such events are so rare. There will still be value in exploratory data analysis, but I believe it would be difficult to identify (with sufficiently high confidence) particular events actually taking place.

However, if we are looking at other aspects of the terrorist enterprise, I think there is much that can be done with data mining, as Will points out in the post. A big reason for this is the presence of far more cases that can be used as representative examples during modeling. In addition to discovering financial networks, I've seen several DoD SBIR solicitations for the prediction of where IEDs will be placed, such as a Navy SBIR for 2007, and I think there is great potential here.

All of these kinds of solutions, though, are most likely to be useful by helping prioritize what human analysts look at, rather than by automating a complete solution. For the IEDs, for example, a model can indicate where IEDs might be placed so that greater care is taken or countermeasures are put into place. For financial networks, modeling could tell an analyst, "Person X has some suspicious-looking transactions and links to other individuals already considered suspicious--look more closely at him/her." But I doubt these systems will become fully automatic any time soon.

Oracep said...

At Oracep Technologies, we mine and rate thinking in text relative to Bloom's taxonomy of cognition, using natural language processing. We call the process Coning.

For starters, 'Another Perspective on Data Mining and Terrorism' Cones at 79% - good tight thinking.

Coning Index
0-49%: high-level background / minimal analysis
50-59%: mid-level background and low-level analysis/judgement
60-69%: mid-level background / mid-level analysis / low-level judgement
70-79%: low-level background / high-level analysis / mid-level judgement
80-100%: low-level background / high-level analysis / high-level judgement

We process on two levels: paragraph and whole document. Graphs allow users to see the degree of thinking in each paragraph, and XML format colour-codes the text.

Perhaps more importantly, users can compare thinking across text. Here's how some eminent bloggers Coned on average over 4 substantial posts each.
Michelle Malkin 65.5%
Arianna Huffington 82%
Markos Moulitsas 70.75%
John Battelle 74.25%

Now there's food for thought.