Lesson 2: The Made-up Probability
My prior discussion concerned improperly multiplying probabilities when statistics are used in court. But what about when someone simply makes up the probabilities? Surely that wouldn’t happen in a court of law, would it?
Before we get to court cases, let’s consider another investigation: that of the 1986 Challenger Shuttle disaster. An engineering failure caused the Challenger to explode shortly after takeoff, killing everyone on board. When physicist Richard Feynman (a member of the commission that investigated the disaster) analyzed what went wrong, he found one claim repeated within NASA about the probability of mission failure: one in 100,000. Although one NASA employee said, “If a guy tells me the probability of failure is 1 in 100,000, I know he's full of crap” (see http://scilib-physics.narod.ru/Feynman/WDYC/en/What_Do_You_Care.html#coldfacts), this idea of a tiny failure rate nonetheless persisted, with disastrous consequences.
And the data did not come close to supporting this tiny probability. For example, Feynman pointed out that the failure rate of a single component (the solid rocket boosters) was about 1 in 25 based on past data. NASA officials responded that “these figures are for unmanned rockets but since the Shuttle is a manned vehicle ‘the probability of mission success is necessarily very close to 1.0.’” (see https://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt). But no evidence was provided showing that the manned missions had better solid rocket boosters than prior unmanned missions, and certainly not that there was the 4,000-fold improvement needed to get from 1 in 25 to 1 in 100,000!
Shuttle missions continued after the Challenger disaster, and given the Columbia disaster in 2003, it is not clear that NASA learned the lessons about the overconfidence that plagued it before Challenger. The space shuttle program ended in 2011 with an observed failure rate of about 1 in 70 (two failures in 135 missions).
My lesson from this: very smart people (it’s NASA, right?) make up probabilities based on their hopes and beliefs, ignoring all evidence against them. With that in mind, let’s turn to the legal case I want to discuss, where the probability claims of another government entity are put to the test.
After the Madrid train bombings in 2004, fingerprints were found on a bag containing detonating devices. The FBI ran these fingerprints through its databases, and, based on a “100% identification” using those fingerprints, an Oregon lawyer, Brandon Mayfield, was arrested and held as a material witness to the bombings. [My discussion of this case is drawn from a Significance magazine article on fingerprint matches, found here https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2019.01251.x and authored by Robin Mejia, Maria Cuellar, Dana Delger, and Bill Eddy.]
The idea of a 100% match seems a bit crazy. Nothing is ever 100% except death and taxes, right? But even if it’s not 100%, you would think that behind that claim there would be some very high, proven probability that the fingerprint on the detonator bag and the matched print came from the same person. In fact, there is not.
Fingerprint evidence has been subject to very little statistical scrutiny. A working group at the AAAS (American Association for the Advancement of Science) concluded that fingerprints cannot be traced to a single source with 100% certainty (see https://www.aaas.org/news/fingerprint-source-identity-lacks-scientific-basis-legal-certainty ), and its report cites multiple other studies that call the claim of 100% certainty simply “indefensible.” The group also concluded that there is basically no data that could tell you how many people might have the same print.
How good are fingerprint matches? It appears that in the best-case scenario they are quite good, with a false match (called a “false positive”) in 1 of every 1,000 comparisons (false non-match rates are higher but not relevant here). That would mean you would expect about one false positive per 1,000 legal cases, IF you only tested one print from evidence against one potential matching print per case. However, that “IF” is critical. Suppose that for each case where you had a fingerprint in evidence, you pulled 10,000 prints from the database for comparison. Then you would expect about 10 false positives (10,000 x 1/1,000) and would have a nearly 100% chance of at least one false positive.
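The arithmetic in the paragraph above can be checked with a short Python sketch. The function names are made up for illustration, and the calculation assumes that every comparison is independent and shares the same 1 in 1,000 false positive rate, which real fingerprint comparisons may not satisfy.

```python
def expected_false_positives(rate: float, comparisons: int) -> float:
    """Average number of false matches across the given number of comparisons."""
    return rate * comparisons


def prob_at_least_one_false_positive(rate: float, comparisons: int) -> float:
    """Chance of one or more false matches, assuming independent comparisons."""
    return 1.0 - (1.0 - rate) ** comparisons


if __name__ == "__main__":
    rate = 1 / 1000  # best-case false positive rate quoted in the text
    for comparisons in (1, 10_000):
        expected = expected_false_positives(rate, comparisons)
        chance = prob_at_least_one_false_positive(rate, comparisons)
        print(f"{comparisons:>6} comparisons: expect {expected:.3f} false positives, "
              f"chance of at least one = {chance:.5f}")
```

With one comparison the chance of a false positive is the single-test rate itself; with 10,000 comparisons it is indistinguishable from certainty for practical purposes.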
Statisticians have typically limited false positive rates in studies to 5% or less, which seems like a very loose standard for someone who is facing jail time. Even by this loose standard, a rate of 1 in 1,000 is fine (it’s 0.1%). However, as discussed above, if you test against a large database rather than against a single suspect, you are almost sure to get a false positive. Mathematically, if you test against a database of N fingerprints and treat the comparisons as independent, the chance of at least one false positive is 1 - (0.999)^N. This means that with only 50 fingerprints in your database, the seemingly low false positive rate of 1 in 1,000 on a single fingerprint becomes about 5%, or 1 in 20. Biometrics researchers have recognized this issue and suggest that the false positive rate on a single test needs to be far lower than 1 in 1,000 if testing against a database (see, for example: http://biometrics.cse.msu.edu/Publications/Fingerprint/JainFpMatching_IEEEComp10.pdf ).
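To see how quickly 1 - (0.999)^N grows, here is a minimal Python sketch. It tabulates the formula for a few database sizes and then inverts it to find the smallest database at which the overall rate reaches 5%. The function names are hypothetical, and the independence of comparisons is an assumption, not something established for real fingerprint databases.

```python
import math


def database_false_positive_rate(single_rate: float, n_prints: int) -> float:
    """1 - (1 - single_rate)^N: chance of at least one false match in N comparisons."""
    return 1.0 - (1.0 - single_rate) ** n_prints


def smallest_database_reaching(single_rate: float, target: float) -> int:
    """Smallest N for which the database-wide rate is at least `target`."""
    # Solve (1 - single_rate)^N <= 1 - target for N; both logs are negative.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - single_rate))


if __name__ == "__main__":
    single_rate = 1 / 1000
    for n_prints in (1, 10, 50, 100, 1000):
        rate = database_false_positive_rate(single_rate, n_prints)
        print(f"N = {n_prints:>5}: {rate:.2%}")
    print("Smallest N reaching 5%:", smallest_database_reaching(single_rate, 0.05))
```

The exact crossover lands at a database of 52 prints, which is consistent with the text's rounded figure of about 5% at 50 prints.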
Now let’s turn back to the Madrid bombings and the FBI. It turns out that the FBI tested the fingerprint found on the detonator bag against a huge database of approximately 2 million fingerprints! Despite the claim of a 100% identification, they actually found 20 people with matching prints. All 20 were investigated, but Mayfield, the Oregon lawyer, became a primary target. In a 300-plus page report examining the false identification, the DOJ concluded that “his Muslim religion...likely contributed to the examiners’ failure to sufficiently reconsider the identification after legitimate questions about it were raised” (p. 12 of this report: https://oig.justice.gov/special/s0601/final.pdf).
While of course one lesson here is that religious prejudice should not influence scientific evaluation, the statistical problem would exist even without it. With 20 matches, there was bound to be someone with characteristics that looked suspicious: perhaps the hair color and build matched someone seen near the crime, or the kind of car they drove matched a car seen near the bombings. The statistical problem lies in identifying suspects on nothing more than a fingerprint match. Unless the accuracy of these matches improves by several orders of magnitude, database matches are not appropriate for deciding who should be suspected. Instead, they are statistically reasonable only when a very small number of suspects have already been identified by other means. Even in those cases the false positive rate needs to be considered, but with far fewer comparisons the chance of a false match will tend to be much smaller.
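To put a rough number on "several orders of magnitude," the Python sketch below inverts the formula 1 - (1 - p)^N to ask what single-comparison false positive rate p would keep the overall rate at 5% across a database of 2 million prints. The 5% target and the database size come from the discussion above; the function name and the assumption that comparisons are independent are mine, so treat the result as an illustration rather than a forensic standard.

```python
import math


def required_single_rate(target: float, n_prints: int) -> float:
    """Per-comparison false positive rate p such that 1 - (1 - p)^N equals `target`."""
    # p = 1 - (1 - target)^(1/N), written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(math.log1p(-target) / n_prints)


if __name__ == "__main__":
    current_rate = 1 / 1000   # best-case single-comparison rate from the text
    n_prints = 2_000_000      # approximate database size from the text
    needed = required_single_rate(0.05, n_prints)
    improvement = math.log10(current_rate / needed)
    print(f"Required single-comparison rate: {needed:.2e}")
    print(f"Improvement needed: about {improvement:.1f} orders of magnitude")
```

Under these assumptions the single-comparison rate would have to fall to a few in 100 million, between four and five orders of magnitude better than 1 in 1,000.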