Tuesday, December 27, 2005

Math Problem

Via Real Clear Politics, Matt Yglesias notes that a data-mining surveillance program with a ten percent error rate will misidentify a heck of a lot more innocent people as terrorists than it will identify actual terrorists as terrorists:

Suppose we had a group of 1,000 people we were interested in monitoring and 900 of them are terrorists. The program will correctly identify 810 terrorists as terrorists. 90 terrorists will evade its clutches. Out of the 100 non-terrorists, 90 will be correctly identified as innocent, and 10 will be wrongly labeled as terrorists. That seems pretty useful.

But say we have a group of 1,000 suspects and only 100 of them are terrorists. A ten percent shot that a given person is a terrorist doesn't reach the "probable cause" standard, but seeing as how thousands of lives could easily be on the line, maybe we want to relax the burden of proof and run the 1,000 through the program. Well, we'll catch 90 terrorists out of the 100, which is good. But out of the 900 non-terrorists, 90 innocent people are going to get labeled terrorists. In other words, out of the 180 people the program will say are terrorists, we can expect half to actually be innocent. Thus, even though the algorithm only has a very small 10 percent error rate, the overall surveillance program makes a lot of mistakes.

Indeed, if we assume a population of 100,000,000 innocent people, 100 actual terrorists, and a 10% error rate, it is really bad. We'd identify 90,000,000 of the 100,000,000 innocent people as actually innocent and miss 10 of the actual terrorists by identifying them in error as innocent.

On the other side we'd identify 10,000,000 of the innocent as terrorists in error and accurately identify 90 terrorists. If we rounded up these suspects, we'd have darned few terrorists in a pool of 10,000,090 suspects.
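For anyone who wants to check the arithmetic, here is a minimal Python sketch of a single pass. The function name `screen` and the scenario labels are mine, not anything from Yglesias or the actual program, and it assumes the same 10 percent error rate applies to the innocent and the guilty alike. It uses exact fractions so the counts come out as whole numbers.

```python
from fractions import Fraction

def screen(innocents, terrorists, error_rate=Fraction(1, 10)):
    """Expected results of one pass of a classifier that mislabels
    each person with probability error_rate (assumed equal for both groups)."""
    caught = terrorists * (1 - error_rate)      # terrorists correctly flagged
    missed = terrorists * error_rate            # terrorists wrongly cleared
    false_alarms = innocents * error_rate       # innocents wrongly flagged
    cleared = innocents * (1 - error_rate)      # innocents correctly cleared
    flagged = caught + false_alarms             # everyone the program calls a terrorist
    return {
        "caught": caught,
        "missed": missed,
        "false_alarms": false_alarms,
        "cleared": cleared,
        "flagged": flagged,
        "share_guilty": caught / flagged,       # fraction of the flagged who are guilty
    }

# (innocents, terrorists) for the three cases discussed in the post
scenarios = {
    "900 terrorists in 1,000": (100, 900),
    "100 terrorists in 1,000": (900, 100),
    "100 terrorists among 100,000,000 innocents": (100_000_000, 100),
}

for label, (innocents, terrorists) in scenarios.items():
    r = screen(innocents, terrorists)
    print(f"{label}: flagged {int(r['flagged']):,}, "
          f"guilty share {float(r['share_guilty']):.4%}")
```

Run as written, the three cases flag 820, 180, and 10,000,090 people, and the share of the flagged who are actually guilty falls from about 98.8 percent to 50 percent to about 0.0009 percent.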

This is disturbing, and if this is how the program works, it should be halted, with people fired for stupidity at the least, even if the program is entirely legal. We'd create hostility by going after innocents and possibly radicalize some percentage of the innocents and therefore recruit more bad guys than we'd catch.

But if we don't arrest these people and instead keep their communications under data-mining attention, the next run will clear 9,000,000 of the 10,000,000 flagged innocents and again identify 81 of the 90 flagged guilty. Then we have a suspected terrorist population of 1,000,081 where only 81 are real terrorists. This is still too many to identify terrorists by picking up the suspects for questioning.

If we keep running this dwindling population through a data-mining program four more times (and I have no idea what counts as one run through the system--assume there is some definition but that we don't need to know what it is for this exercise), we get down to a suspected terrorist population of about 154, of whom about 54 are actual terrorists. Now we're talking. Though of course, our data miners don't know how many terrorists are out there in the population or what the error rate is, so these numbers of terrorists and innocent people are purely illustrative.
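The repeated runs can be sketched the same way. This is only an illustration under a big assumption: that each run makes its mistakes independently of the last, so an innocent person flagged once is no more likely than anyone else to be flagged again. The function name `repeated_passes` is made up for this example.

```python
from fractions import Fraction

def repeated_passes(innocents, terrorists, passes, error_rate=Fraction(1, 10)):
    """Expected suspects still flagged when only the flagged pool is re-screened.

    Assumes errors are independent from pass to pass (an assumption for
    illustration, not something known about any real program).
    """
    history = []
    for n in range(1, passes + 1):
        innocents = innocents * error_rate          # innocents flagged yet again
        terrorists = terrorists * (1 - error_rate)  # terrorists still flagged
        history.append((n, innocents, terrorists))
    return history

for n, inn, ter in repeated_passes(100_000_000, 100, passes=6):
    print(f"pass {n}: {float(inn):>12,.0f} innocents, "
          f"{float(ter):6.2f} terrorists still flagged")
```

Carried out as straight expected values, six passes leave 100 innocents and about 53 terrorists in the pool. The figure of 54 in the text is consistent with rounding up to a whole person at each pass.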

Also, higher error rates would require more repeats to get a reasonable population for closer scrutiny, but I assume that unless there is outside intelligence to indicate somebody on the guilty list (or somebody on the innocent list for that matter) is planning to bomb something soon, it is possible to keep the data-mining going as long as you like to narrow the target population.
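To put rough numbers on how many repeats a higher error rate would need, here is a sketch that counts passes until the expected number of flagged innocents falls to 100 or fewer. The cutoff of 100, the sample error rates of 20 and 30 percent, and the function name `passes_needed` are all my own choices for illustration, and it again assumes the errors on each pass are independent.

```python
from fractions import Fraction

def passes_needed(innocents, terrorists, error_rate, max_innocents=100):
    """Count passes until expected flagged innocents are <= max_innocents.

    Returns (passes, expected terrorists still flagged). Assumes independent
    errors on every pass and the same error rate for both groups.
    """
    error_rate = Fraction(error_rate)
    if not 0 <= error_rate < 1:
        raise ValueError("error_rate must be at least 0 and less than 1")
    passes = 0
    while innocents > max_innocents:
        innocents *= error_rate          # innocents wrongly flagged again
        terrorists *= 1 - error_rate     # terrorists correctly flagged again
        passes += 1
    return passes, float(terrorists)

for pct in (10, 20, 30):
    n, left = passes_needed(100_000_000, 100, Fraction(pct, 100))
    print(f"{pct}% error rate: {n} passes, about {left:.1f} terrorists still flagged")
```

Under these assumptions the 10 percent case takes 6 passes, 20 percent takes 9, and 30 percent takes 12. The same arithmetic also shows a cost: every extra pass drops some real terrorists from the pool, so by the time the innocents are whittled down at the higher error rates, far fewer of the original 100 terrorists are still on the list.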

And we have to ask whether identifying 54 terrorists gets us all the cells or only some. If they operate in 5-man cells, did we roll up 10 of the 20 cells, get part of an eleventh, and miss 9 completely? Or did we get at least one terrorist from each cell, meaning we might roll up all of them?

And if we are really focusing on just overseas calls, our starting population is much smaller than my assumption. I still don't see a problem with overseas call monitoring. If one end is outside the country it is not domestic spying in my book, and the law seems clearly on the side of the president. Even the NY Times article saying the net was wider than the president admits (gee, at one leak he doesn't just spill it all?) does not claim anything other than international calls were scrutinized.

I do have problems with domestic monitoring outside of the courts. Perhaps a short-term program after 9-11 is excusable (and I would say yes), but if it goes on for years, that is wrong. Congress should have been brought in to make a process for protecting civil liberties with proper oversight. Even a well-run program run without oversight could degenerate into real domestic spying over time. But I still see nothing that indicates this was actually domestic spying no matter how many times NPR calls it domestic spying.

The problem as Yglesias describes it is surely a program killer. But I have to assume that more was done. This just doesn't add up if it is done the way Yglesias describes the math. If it did, I think we'd have read about the 10,000,000 Americans languishing in prison camps in the New York Times by now. Book release or no book release.