In this post we will try to provide some tips on how a researcher can answer the fateful question: should I adjust the significance level for multiple testing or not? The problem of multiplicity concerns almost all scientific research and has not yet received a complete and definitive answer. Indeed, many scientific publications ignore the multiplicity of the hypotheses tested and the resulting inflation of the type I error. To give you a sense of the scale of the problem, a paper published in 2014 estimated that around 50% of multi-arm trials fail to account for multiplicity. About half: is it too much? Is it an acceptable percentage?
It is not easy to answer, because there is no consensus in the scientific community, nor even among the regulatory authorities. For example, the guidelines on this problem provided by the EMA (European Medicines Agency) leave room for a thousand interpretations, partly because the possible study designs of a clinical trial are potentially infinite.
In short: we must arm ourselves with a bit of common sense, a spirit of clarity and honesty, and a few decision criteria.
The Multiple Testing Problem
Although it is not possible to address the problem of multiplicity in depth here, before moving on let's briefly look at what happens when we ignore the presence of multiple testing in a study and do not adjust the statistical significance level.
Let's say you're conducting an experiment comparing two anti-asthmatic drugs and would like to measure the difference between them on two outcomes: an efficacy measure, for example FEV1, and the quality of life of patients (measured through a dedicated questionnaire).
You conduct your tests and set the statistical significance threshold at 5%. Now, using the 5% threshold for both tests, your chance of making a type I error in at least one of the two tests rises from 5% to 9.75%.
The formula to get 9.75% is very simple:
P = 1 − (1 − α)^k
Where:
P is the probability of at least one falsely significant test (at least one type I error);
α is the level of significance;
k is the number of tests conducted.
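To make this concrete, here is a minimal Python sketch that evaluates the formula above (the function name is mine, purely illustrative):

```python
# Probability of at least one type I error across k independent tests,
# each conducted at significance level alpha: P = 1 - (1 - alpha)^k
def familywise_error(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** k

print(familywise_error(0.05, 2))   # 0.0975 -> 9.75% with two tests
print(familywise_error(0.05, 10))  # ~0.40  -> about 40% with ten tests
```

Note that the formula assumes the tests are independent; with correlated tests the true inflation is somewhat smaller, but still present.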
How can we protect ourselves from these possible “false positives”? There are a number of techniques that allow us to distribute that fateful 5% significance level across the different tests conducted within the study. The most widely used is the Bonferroni correction.
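As an illustration, here is a minimal sketch of the Bonferroni correction in Python (the p-values are invented for the example); the same result can be obtained with statsmodels.stats.multitest.multipletests using method='bonferroni':

```python
# Bonferroni correction: compare each p-value to alpha / k instead of alpha,
# where k is the number of tests conducted within the study.
def bonferroni(p_values, alpha=0.05):
    threshold = alpha / len(p_values)
    return [(p, p <= threshold) for p in p_values]

# Hypothetical p-values from three separate tests (alpha/k = 0.05/3 ~ 0.0167)
for p, significant in bonferroni([0.010, 0.030, 0.049]):
    print(f"p = {p:.3f} -> {'significant' if significant else 'not significant'}")
```

Note how 0.030 and 0.049, which would be significant at the unadjusted 5% level, no longer pass the adjusted threshold.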
In this post we will not cover the variety of techniques used to manage multiplicity. The objective of this article is to answer one question:
how do you decide whether or not to adjust for multiplicity?
As mentioned above, there is still much debate about it. What you will read below is my personal opinion, but it is largely supported by the scientific literature.
Let’s start.
What to do?
The most solid strategy is to establish the nature of the study and of the tests performed: exploratory or confirmatory?
Generally, when your study has an exploratory nature (such as studies with no sample size estimation or formal power calculation), the strategy could be:
don’t correct for multiplicity;
declare clearly the exploratory nature of the study;
state in the methods section that no correction for multiplicity was made, given the exploratory nature of the study, and justify that decision in the discussion section.
When the study and the tests have a confirmatory nature, then the goal is “to be cautious” and to guard against the possibility of obtaining falsely significant results.
Below is a list of the most frequent scenarios.
When you have multiple endpoints, none of which is considered the most important, then it is necessary to correct for multiplicity. A very simple example: 3 cardiovascular outcomes (arterial pressure, heart rate, alterations of the electrocardiographic pattern); none of them has a higher clinical relevance than the others, and each of them is tested separately with its own ad hoc test.
An important exception: it is not necessary to correct for multiplicity when the study is declared successful only if all the endpoint tests are statistically significant (so-called co-primary endpoints), since requiring every test to pass does not inflate the type I error.
When your study has repeated measures over time and the test is performed at different timepoints (for example, to see the effect of a treatment after 2, 6 and 12 months), then here too the correction becomes necessary.
When you have a confirmatory multi-arm study (with more than two treatments) and several pairwise comparisons (for example: treatment 1 vs 2, 2 vs 3, 1 vs 3), then here too it is necessary to correct (a sketch of adjusted pairwise comparisons follows the next example). In this case some authors argue that we should consider whether the treatments are related, that is, whether they are “of the same nature”. For example: if the three arms of the study are two doses of a drug and a placebo, then the correction is appropriate. When instead the arms are of a completely unrelated nature, the correction can become superfluous.
For example: if in a trial we want to evaluate an educational and a pharmacological approach against a control group in the treatment of alcoholism, then the “educational arm” vs. control and “pharmacological arm” vs. control comparisons can be considered two different trials, and it could be conceptually wrong to correct for multiplicity in this case.
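For the related-arms case (two doses of a drug plus placebo), here is a minimal, purely illustrative sketch of adjusted pairwise comparisons in Python; the data are simulated, and I use Tukey's HSD from statsmodels rather than Bonferroni simply because it handles all pairwise comparisons in one call:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical three-arm trial: two doses of the same drug plus placebo
outcome = np.concatenate([
    rng.normal(1.0, 1.0, 50),  # low dose
    rng.normal(1.3, 1.0, 50),  # high dose
    rng.normal(0.8, 1.0, 50),  # placebo
])
arm = ["low"] * 50 + ["high"] * 50 + ["placebo"] * 50

# Tukey's HSD keeps the familywise error rate at 5% across all three
# pairwise comparisons (low vs high, low vs placebo, high vs placebo)
print(pairwise_tukeyhsd(outcome, arm, alpha=0.05))
```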
Subgroup analyses and all other post-hoc tests are almost always considered exploratory, so the correction becomes unnecessary. I wrote “almost always” because in some cases a subgroup analysis can have a confirmatory role for a hypothesis unrelated to the main hypothesis of the study.
For example: a study with two arms, drug treatment vs. educational program for treating an addiction. The main analysis evaluates the comparative effectiveness of the two treatments. A subgroup analysis may aim, for example, at identifying a link between the degree of alcoholism and the level of male fertility.
In this case, imho, it is pointless to correct for multiplicity (the analysis is secondary and uncorrelated with the main one), but at the same time we must also consider that these ancillary analyses very often have statistical power problems, and this should be carefully measured and discussed.
When you are using two data sets to perform the same analysis (for example, the “per protocol” and the “intention to treat” data sets), in general it is not necessary to correct for multiplicity.
When you conduct an interim analysis, you need to correct for multiplicity. In this case, a common strategy is to use a very strict significance level for the interim analysis (for example 1%) in order to leave the remaining 4% for the final test of the study.
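To see why this simple 1% + 4% split protects the overall 5% level, here is a minimal simulation sketch under the null hypothesis (the sample sizes, the seed and the use of a t-test are my own illustrative assumptions, not a prescribed design):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two-arm trial under the null hypothesis (no treatment effect), with one
# interim look at half the sample and a final look at the full sample.
# Alpha split: 1% spent at the interim, 4% at the final analysis.
n_sim, n_per_arm = 20_000, 200
alpha_interim, alpha_final = 0.01, 0.04

false_positives = 0
for _ in range(n_sim):
    a = rng.normal(size=n_per_arm)
    b = rng.normal(size=n_per_arm)
    p_interim = stats.ttest_ind(a[:n_per_arm // 2], b[:n_per_arm // 2]).pvalue
    p_final = stats.ttest_ind(a, b).pvalue
    if p_interim <= alpha_interim or p_final <= alpha_final:
        false_positives += 1

# By the Bonferroni bound the overall type I error is at most 1% + 4% = 5%;
# because the two looks are correlated, the empirical rate is a bit below
print(f"Empirical overall type I error: {false_positives / n_sim:.3f}")
```

The simple split is a Bonferroni-type bound; real trials often use more refined alpha-spending approaches (Pocock, O'Brien-Fleming) that exploit the correlation between looks to spend the alpha more efficiently.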
As for observational studies, again in my opinion, the situation does not change. It remains necessary to establish very clearly which tests of the study are confirmatory and which are exploratory. It must be said, in this regard, that in observational studies this effort of clarity is far less common than in clinical trials, and studies are often peppered with dozens of p-values that serve no purpose.
The judgment of Kenneth Rothman, one of the fathers of modern epidemiology, is interesting. According to Rothman, the correction for multiplicity should always be avoided, because it would make scientific research too conservative. It is better to obtain falsely significant results and explore the hypotheses further in subsequent studies than to “castrate” too much evidence by excessively lowering alpha levels.