Pascal Michaillat

Critical Values Robust to P-hacking


authors・Adam McCloskey, Pascal Michaillat
date・June 2022
repository・arXiv
identifier・arXiv:2005.04141v4
doi・https://doi.org/10.48550/arXiv.2005.04141

paper

abstract・P-hacking occurs when researchers engage in various behaviors that increase their chances of reporting statistically significant results. P-hacking is problematic because it reduces the informativeness of hypothesis tests—by making significant results much more common than they are supposed to be in the absence of true significance.
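The inflation of significant results can be illustrated with a small piece of arithmetic. The sketch below is not the model from the paper. It assumes a hypothetical researcher who runs a fixed number of independent tests of a true null hypothesis and reports a result if any one of them is significant. The function name `false_positive_rate` and the parameter `attempts` are illustrative.

```python
def false_positive_rate(alpha: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tests
    of a true null hypothesis is significant at level `alpha`.

    Assumption (not from the paper): the tests are independent and the
    researcher reports a significant result whenever one occurs.
    """
    return 1.0 - (1.0 - alpha) ** attempts


if __name__ == "__main__":
    # Show how the rate of significant results grows with the number of tries.
    for attempts in (1, 2, 5, 10):
        rate = false_positive_rate(0.05, attempts)
        print(f"attempts={attempts:2d}  rate={rate:.3f}")
```

Under these assumptions, five attempts at a 5% significance level produce a significant result about 23% of the time even though the null hypothesis is true.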

Despite its prevalence, p-hacking is not taken into account in hypothesis testing theory: the critical values used to determine significance assume no p-hacking. To address this problem, we build a model of p-hacking and use it to construct critical values such that, if these values are used to determine significance, and if researchers adjust their behavior to these new significance standards, then significant results occur with the desired frequency. Because such robust critical values allow for p-hacking, they are larger than classical critical values.

As an illustration, we calibrate the model with evidence from the social and medical sciences. We find that the robust critical value for any test is the classical critical value for the same test with one fifth of the significance level, a form of Bonferroni correction. For instance, for a z-test with a significance level of 5%, the robust critical value is 2.31 instead of 1.65 if the test is one-sided and 2.57 instead of 1.96 if the test is two-sided.
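The one-fifth rule can be checked roughly with the standard normal quantile function. The sketch below uses only Python's standard library. The helper names `z_critical` and `rule_of_thumb_critical` are illustrative, and dividing the significance level by exactly 5 is an assumption taken from the wording above, not the paper's calibrated computation.

```python
from statistics import NormalDist


def z_critical(alpha: float, two_sided: bool) -> float:
    """Classical critical value of a z-test at significance level `alpha`."""
    # A two-sided test splits the significance level across both tails.
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


def rule_of_thumb_critical(alpha: float, two_sided: bool,
                           divisor: float = 5.0) -> float:
    """Classical critical value at `alpha / divisor`.

    Assumption: `divisor=5.0` mirrors the "one fifth of the significance
    level" description; the paper's exact calibration may differ slightly.
    """
    return z_critical(alpha / divisor, two_sided)


if __name__ == "__main__":
    for two_sided in (False, True):
        label = "two-sided" if two_sided else "one-sided"
        classical = z_critical(0.05, two_sided)
        adjusted = rule_of_thumb_critical(0.05, two_sided)
        print(f"{label}: classical={classical:.2f}  one-fifth rule={adjusted:.2f}")
```

With an exact divisor of 5, the rule gives roughly 2.33 (one-sided) and 2.58 (two-sided), close to the 2.31 and 2.57 reported above. The small gap is presumably because "one fifth" is a rounded description of the calibrated result.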

figure 3・Critical value robust to p-hacking for z-test at 5% significance level.
ICCV for two-sided hypothesis tests with 5% significance level, in various situations and calibrations.