The Problem with Statistical Significance

Throughout my career I have advocated rigorous evaluations to inform judgement about the benefits and risks of new medicines, and I have been gratified by the progression in the science of clinical evaluations. The most effective science begins with a clearly defined hypothesis that is fully justified by prior data, followed by a well-designed experiment with all the necessary controls. Once an experiment is complete, a scientist’s job is to thoroughly and dispassionately evaluate all the data to extract as much new knowledge as possible. That process begins with asking whether the data support the hypothesis, but that is just the beginning of the analysis: at least as much can be learned from results that fail to support the hypothesis as from those that confirm it, and additional analyses of the data are often critical. Then, a scientist must report his or her conclusions on what the experiment teaches about reality. A good scientist extrapolates results to broader perspectives very carefully, and reports all the data and the caveats related to the conclusions drawn.

Statistical analyses represent a very valuable tool in assessing the validity of an experiment’s results and the conclusions drawn. But they are just a tool, not a substitute for good judgement. Often questions boil down to whether there are differences between the control arm of an experiment and the active arm. A variety of statistical approaches can be used to address whether apparent differences are likely to be valid. These usually result in a probability statement, or p value. A p value of < 0.05 means, very simply, that if there were truly no difference between the arms and the experiment were repeated 100 times, a difference at least as large as the one observed would be expected by chance in fewer than 5 of those replicates. That is a powerful statement, but it can only address the statistical plausibility that a conclusion represents reality. Clinical trials are the largest, most complex biological experiments performed, but they are still experiments.
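
To make that interpretation concrete, the following minimal sketch in Python (not from the original article; the sample sizes, normally distributed endpoint and two-sample t-test are illustrative assumptions) simulates a trial in which the null hypothesis is true, i.e. the active arm is identical to the control arm, and counts how often chance alone produces p < 0.05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_per_arm = 100        # illustrative sample size per arm
n_replicates = 10_000  # number of simulated trials
alpha = 0.05

false_positives = 0
for _ in range(n_replicates):
    # Both arms are drawn from the same distribution: the null hypothesis is true.
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    active = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    _, p_value = stats.ttest_ind(control, active)
    if p_value < alpha:
        false_positives += 1

# With a true null, roughly 5% of replicates cross p < 0.05 by chance alone.
print(f"Fraction of chance 'significant' results: {false_positives / n_replicates:.3f}")

Run as written, this prints a fraction close to 0.05: the p value controls how often chance alone would mislead us, which is not the same as saying the conclusion itself would be reproduced 95 times out of 100.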

Because they are so large and potentially important to the health and well-being of humans, it is even more vital to learn everything possible from each of these complex experiments. Given the variability in human populations and the difficulty in designing appropriate controls, the design, conduct and analysis of the results of such experiments rely extensively on statistical approaches to separate conclusions that are truly supported by the data from those that merely seem to be. Of particular importance and complexity is the determination of whether a new medicine brings enough benefit compared to risks to be commercialized. Typically, the key question is: does a specific dose of a medicine delivered on a specific schedule to a group of patients with similar phenotypes have an acceptable benefit/risk profile? Statistical analyses play vital roles in supporting these enormously complex and important judgements.

In my view, the problem with the role of statistical analyses in clinical experiments has been driven by regulatory agencies that, on many occasions, substitute rigorous commitments to specific p values for effective judgement about the value of the therapeutic agent in the patient population studied. This attitude has been exacerbated by a variety of “statistical penalties” imposed to protect against the effects of multiple analyses. Secondly, regulators demand that analyses be prioritized as primary, secondary and exploratory. Because statistical penalties are imposed when multiple endpoints are analyzed, clinical experimentalists often must limit the number of primary endpoints. For example, if it makes sense to have two primary endpoints, a statistical penalty is often imposed so that p < 0.05 alone is no longer sufficient proof for either endpoint. Every additional pre-specified analysis tightens the significance threshold each endpoint must meet, narrowing the choices of primary endpoints. Moreover, regulators often deny approval when the primary endpoint does not meet its predefined significance threshold, even if multiple secondary endpoints are positive. Furthermore, post-hoc analyses are generally dismissed. While each of these statistical requirements is in itself reasonable, the overall impact is that effective therapeutic agents may be rejected and, arguably even worse, deeper learning from clinical trials is inhibited. These standards have, to a large extent, been adopted by the editorial boards of clinical scientific journals.
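
As one common example of such a penalty (the specific correction a regulator requires varies by trial and is not specified above), a Bonferroni adjustment divides the overall significance level across the pre-specified endpoints, so each individual endpoint must clear a stricter threshold. The sketch below assumes two primary endpoints with hypothetical p values.

# Illustrative Bonferroni adjustment for multiple primary endpoints.
# The endpoint names and p values are hypothetical.
overall_alpha = 0.05
p_values = {"endpoint_1": 0.030, "endpoint_2": 0.040}

# The overall alpha is split across endpoints: 0.025 each with two endpoints.
per_endpoint_alpha = overall_alpha / len(p_values)

for endpoint, p in p_values.items():
    verdict = "significant" if p < per_endpoint_alpha else "not significant"
    print(f"{endpoint}: p = {p:.3f} vs adjusted threshold {per_endpoint_alpha:.3f} -> {verdict}")

In this hypothetical case both endpoints come in below 0.05, yet neither clears the adjusted threshold. Bonferroni is only the simplest such correction; trials often use more elaborate schemes such as hierarchical or gatekeeping procedures, but the effect described above is the same: each added primary endpoint raises the bar for every endpoint.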

Thus, the recent Nature editorial “It’s time to talk about ditching statistical significance” strikes a note that resonates with me. The editors argue not that statistical analyses are inappropriate. Nor do they argue that p values shouldn’t be reported. What they argue, in my opinion, is that rigorous requirements for p values on primary endpoints cannot substitute for effective scientific judgement, nor should they provide a justification to reject agents that may be effective. Rather, each clinical experiment should be well designed, thoroughly evaluated and cautiously interpreted using all the tools available, including statistical analyses. I support their views.