Recently I wrote about the fundamental divide between Frequentist and Bayesian statistics that lies at the heart of many interpretations of the p-value debate. Perhaps the Bayesian camp's biggest weapon in this intellectual dispute is how surprisingly often you can be wrong even when you have a p-value smaller than 0.05. A rather extreme example is put forward in this article.

Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a p = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high.

(Note how the use of a prior probability betrays the Bayesian approach.) Unfortunately, the problem persists even in less extreme examples.

If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent.
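Both numbers can be reproduced with a short Bayes'-rule calculation in odds form. As a minimal sketch: I assume here, following Colquhoun-style calculations, that a 'just significant' p-value near 0.05 corresponds to a likelihood ratio of roughly 2.8 in favour of a real effect; the function name and that default value are mine, not from the article.

```python
def false_positive_risk(prior, likelihood_ratio=2.8):
    """Probability that a 'just significant' result is a false positive.

    prior: prior probability (prevalence) that the tested effect is real.
    likelihood_ratio: evidence a just-significant p-value provides for a
        real effect over no effect (~2.8 is an assumed value for p near
        0.05; see the lead-in above).
    """
    prior_odds = prior / (1 - prior)                 # odds the drug works
    posterior_odds = prior_odds * likelihood_ratio   # Bayes' rule in odds form
    return 1 / (1 + posterior_odds)                  # P(no effect | significant)

print(round(false_positive_risk(0.10) * 100))  # 10% prevalence -> 76
print(round(false_positive_risk(0.50) * 100))  # 50% prevalence -> 26
```

With a 10 per cent prior the risk of a false claim is about 76 per cent, and with a 50 per cent prior it drops to about 26 per cent, matching the figures above.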

Initially, it might be disturbing to see that your precious statistically significant effect, which was so hard to obtain, has so little value.

In his excellent comparison of different approaches to statistical inference, Barnett (1999) resolves the paradox.

“A distinction can be drawn between quite different notions: of initial precision and of final precision as we have already noted. This is a crucial distinction.

On the one hand, the procedure of using the sample mean (or some other measure) to estimate μ could be assessed in terms of how well we expect it to behave; that is, in the light of different possible sets of data that might be encountered. It will have some average characteristics that express the precision we initially expect, i.e. before we take our data.

[…] The alternative concept of final precision aims to express the precision of an inference in the specific situation we are studying. Thus, if we actually take our sample and find that x̄ = 29.8, how are we to answer the question ‘how close is 29.8 to μ?’” (ch. 5.7.1)

And further:

“Classical statistics assesses initial precision; Bayesian inference, final precision” (p. 209)

Deborah Mayo argues in her book “Error and the Growth of Experimental Knowledge” that assessing initial precision is all that is needed to do good science. It is a measure of how well our tools of scientific inquiry (tests) perform in the long run when applied repeatedly. Science is a repeated and collective endeavor in which we want to employ the best tools possible to minimize sources of error in the process. This is exactly what initial precision tells us.
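This long-run reading of initial precision is easy to illustrate by simulation. A minimal sketch (the choice of a two-sided z-test and all names are mine): when the null hypothesis is true, a test at α = 0.05 rejects in roughly 5 per cent of repetitions, regardless of what any single experiment concludes.

```python
import random

random.seed(42)

# Long-run behaviour of a two-sided z-test at alpha = 0.05 when the
# null hypothesis is true: over many repetitions the rejection rate
# settles near 5%, which is what "initial precision" quantifies.
n_experiments = 200_000
rejections = 0
for _ in range(n_experiments):
    z = random.gauss(0.0, 1.0)   # test statistic under H0
    if abs(z) > 1.96:            # critical value for alpha = 0.05
        rejections += 1

print(rejections / n_experiments)  # close to 0.05
```

The simulated rejection rate says nothing about how close any particular estimate is to the truth; it only describes the tool's average performance, which is precisely Barnett's distinction.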

At this point, however, I should note that the math only checks out if we try to replicate scientific studies as often as possible. If instead we take a study with p < 0.05 for granted and never question its results again – and unfortunately, rewards for novelty in science encourage exactly that kind of behavior – we should not be surprised to be wrong fairly often.