Statistical Significance and Experimental Replication – notes by Justin C. Fisher, SMU Philosophy

Many scientific fields, including medicine and social psychology, have recently entered what some have called a “replication crisis”: many key studies that researchers had long taken for granted turn out not to be replicable when scientists carefully try to run the experiments again. This has drawn attention to many questionable research practices, and to the great care that must be taken to ensure that the scientific method works as intended. Examining this crisis also provides an opportunity to better understand what it means when scientists claim to have found “statistically significant” results, and how persuaded we should be by such findings.

Statistical significance

Many scientific studies compare some interesting “alternative hypothesis” against a much more boring “null hypothesis”.  A “null hypothesis” typically holds that there is no relation between the things that are being studied, and that any apparent relation would just be due to chance. When we had you pick the horoscope that best fit your day, the null hypothesis would say that everyone had a 1-in-12 chance of choosing the horoscope published with their star sign, whereas the alternative hypothesis we were testing held that people would be more likely to choose the horoscope for their star sign than for others.

The p-value for an experimental outcome reflects how probable it would be for an outcome at least that extreme to occur if the “null hypothesis” were true. If an experimental outcome has p = .05, this means that, if the null hypothesis were true, there would be just a 5% chance of getting an outcome at least this extreme – i.e., if you ran the experiment 20 times, you’d expect about 1 of the 20 runs to yield an outcome this extreme, and the rest to be less extreme. (In the typical case where the null hypothesis says that two measures will be independent of each other, a “more extreme” outcome involves a stronger observed correlation between the two measures.)
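
If it helps to see this concretely, here is a minimal Python sketch of the horoscope example. The class size (240 students) and the number of “hits” (30) are invented purely for illustration; the point is just that a p-value can be estimated by asking how often purely random guessing would do at least that well.

```python
# A minimal sketch of what a p-value means, using the horoscope example.
# The class size (240) and the observed number of "hits" (30) are made up.
import numpy as np

rng = np.random.default_rng(0)
n_students = 240
observed_hits = 30                # students who picked their own sign's horoscope
expected_hits = n_students / 12   # null hypothesis: 1-in-12 chance, so 20 expected

# Simulate many "null worlds" in which every student really does pick at random.
simulated = rng.binomial(n=n_students, p=1/12, size=100_000)

# The (one-sided) p-value is roughly the fraction of null worlds at least as
# extreme as what we actually observed.
p_value = np.mean(simulated >= observed_hits)
print(f"Expected hits under the null: {expected_hits:.0f}, observed: {observed_hits}")
print(f"Estimated p-value: {p_value:.3f}")
```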

An outcome is said to be “statistically significant with threshold p = .05” if that outcome had a p-value of .05 or lower. Different sciences conventionally use different p-values as their threshold for statistical significance. Many “softer” sciences, like psychology, biology, and medicine, use .05 as their threshold. Even if scientists are studying pure random noise, about 1 out of every 20 studies should meet this threshold by sheer coincidence. “Harder” sciences like chemistry typically use 0.01 as their threshold – only 1 out of 100 studies of random noise would coincidentally meet this threshold. Some use lower thresholds still. E.g., the “five-sigma” standard employed in the search for the Higgs boson was equivalent to p = 0.0000003 – only about 1 in every 3 million studies of random noise would coincidentally meet this standard. Of course this search took thousands of scientists and many millions of dollars, so it’s understandable that we can’t hold most experiments to such a high standard!
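
If you’re curious how the “sigma” talk translates into p-values, here is a quick sketch using the tail of a standard normal distribution. This assumes the one-sided convention commonly used in particle physics, and the printed numbers are approximations:

```python
# Converting "sigma" thresholds into p-values (one-sided normal tail areas).
from scipy.stats import norm

for sigmas in (2, 3, 5):
    p = norm.sf(sigmas)   # chance of landing at least this many sigmas above the mean
    print(f"{sigmas}-sigma ~ p = {p:.1e}  (about 1 in {round(1 / p):,} studies of pure noise)")
```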

What’s so “significant” about .05? Nothing, really. Back in 1925, R. A. Fisher proposed this as a rule of thumb for detecting which correlations might be worth further investigation. Fisher didn’t intend it as a strict cutoff, nor did he think it should be used in lieu of more nuanced statistical measures. However, Fisher’s rule of thumb was picked up by the people who wrote early handbooks popularizing the use of statistical methods among scientists. The convention of using .05 as the threshold for “statistical significance” is now deeply ingrained in many fields, and it is now very difficult for scientists to publish results that don’t meet this threshold. But really there’s nothing special about .05. It’s like setting the drinking age at 21 -- it’s not like 21-year-olds are a whole lot better at handling alcohol than 20-year-olds, but it seemed like a good idea to draw a line somewhere, and that’s the line that stuck.

What statistical significance is not.

Statistical significance is not a measure of effect size – of how different the world actually is from the null hypothesis. Imagine that a study finds that a particular tonic increases bald men’s hair count by a statistically significant amount with p = .05. This tells us that the increase in hair count for the men who used the tonic was higher on average than for those who didn’t, and that this difference was big enough that only 5% of studies of random noise would be expected to yield a difference as big as the one observed. But this by itself doesn’t tell us how big that difference was. If a lot of men were included in the study, then even a very small increase like +2 hairs would count as “significantly” different from what you would expect from random chance. In many cases, we’re interested not just in whether something is better than chance, but in how much better than chance it is, and p-values don’t tell us that.
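
Here is a small simulation sketch of this point. All of the numbers (the +2-hair effect, the spread, the group sizes) are invented; what matters is that the very same tiny effect goes from “not significant” to “significant” as the sample grows:

```python
# A tiny true effect (+2 hairs on average) becomes "statistically significant"
# once the sample is large enough.  All numbers here are invented.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
for n in (50, 500, 20_000):                  # men per group
    placebo = rng.normal(0, 30, size=n)      # change in hair count without the tonic
    tonic   = rng.normal(2, 30, size=n)      # true average benefit: just +2 hairs
    stat, p = ttest_ind(tonic, placebo)
    diff = tonic.mean() - placebo.mean()
    print(f"n = {n:>6}: observed difference = {diff:+5.1f} hairs, p = {p:.4f}")
# In most runs only the largest sample clears p < .05, even though the underlying
# +2-hair effect is the same (and equally unimpressive) throughout.
```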

Furthermore, p-values can even lead us to invest our resources unwisely. Imagine that a second study reveals that a different tonic greatly increases men’s hair count on average, but that this study is small enough that its p-value is only .06 – you’d expect 6% of studies of random noise to yield an effect that big, not the 5% or less that it would take to count as “statistically significant”. Do you think that bald men would rather that researchers invest more time in the first tonic we described, which had demonstrated a “statistically significant” increase of +2 hairs, or would they rather that scientists instead study this second tonic, which showed a fairly good, but not quite “statistically significant”, chance of yielding a much greater improvement in hair count? For many purposes, effect size is more important than whether or not a study passes the arbitrarily set .05 threshold.

Statistical significance is not a measure of the statistical power of an experiment. The statistical power of an experiment to detect an effect of a given size is the probability that that sort of experiment would correctly reject the null hypothesis in the case where the actual effect size is as given.  In general, any given experiment has more power to detect large effects than it does small effects, and in general, the larger the sample size in an experiment, the greater its statistical power to detect a given effect. One red flag that a scientific paper can raise is reporting a series of under-powered experiments, experiments whose sample size is so small that they would have been unlikely to have turned up statistically significant results if the actual effect size were as observed.  Experiments with smaller sample sizes are cheaper to run and more prone to statistical flukes, so when you see under-powered experiments, you should worry that you may just be seeing statistical flukes rather than a true picture of what the world is actually like.
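
One way to get a feel for statistical power is to estimate it by brute force: simulate an experiment many times with a real effect built in, and count how often it clears the p < .05 bar. The effect size (0.3 standard deviations) and the sample sizes below are arbitrary choices for illustration:

```python
# Estimating statistical power by simulation: how often would an experiment of a
# given size detect a real effect at p < .05?  Effect and sample sizes are assumed.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
effect = 0.3                     # true group difference, in standard-deviation units
n_repeats = 2_000                # simulated repetitions of each experiment
for n in (20, 50, 200):          # participants per group
    rejections = 0
    for _ in range(n_repeats):
        control   = rng.normal(0, 1, size=n)
        treatment = rng.normal(effect, 1, size=n)
        if ttest_ind(treatment, control).pvalue < 0.05:
            rejections += 1
    print(f"n = {n:>3} per group: estimated power ~ {rejections / n_repeats:.2f}")
# Small samples are badly under-powered for modest effects; when they do hit
# p < .05 anyway, the result is more likely to be a fluke or an overestimate.
```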

Statistical significance is also not a measure of the strength of our evidence for a claim, nor of the probability that the claim is true. A common mistake is to think that if some effect is “statistically significant” with p = .05, then we can be 95% confident that the claim is true, and that there’s only a 5% chance that it is false. But that’s not at all what statistical significance means!

Consider the example described in the XKCD cartoon “Significant”, where scientists study 20 different colors of jelly beans, and find that just one color, green, is “statistically significantly” correlated with acne, with threshold p = .05. If you’re like me, you would have gone into this thinking that it’s quite unlikely that any color causes acne. When you know that 20 possible correlations have been tested for, this prior probability tells you to expect that about 1 of the 20 will turn out “statistically significant” and the rest won’t. Hence, the outcome of these experiments is pretty much exactly what you should have expected, and that means that observing this outcome shouldn’t change your probabilities much at all from what they were prior to the experiment. So your probability that green jelly beans cause acne should still be very low even after these experiments – it clearly shouldn’t be 95%! This shows that the naïve view that takes “statistically significant” to mean 95% confidence is just wrong!
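
You can watch this arithmetic play out in a quick simulation. Below, 20 “colors” of jelly beans are tested even though none of them has any real connection to acne; the group sizes and the number of simulated studies are arbitrary:

```python
# The jelly-bean problem in simulation: test 20 colors that truly have no link to
# acne, and see how often at least one color comes out "significant" anyway.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n_runs = 1_000
runs_with_a_hit = 0
for _ in range(n_runs):
    hits = 0
    for color in range(20):
        acne_with_beans    = rng.normal(0, 1, size=50)   # pure noise: beans do nothing
        acne_without_beans = rng.normal(0, 1, size=50)
        if ttest_ind(acne_with_beans, acne_without_beans).pvalue < 0.05:
            hits += 1
    if hits >= 1:
        runs_with_a_hit += 1

print(f"Runs where at least one color was 'significant': {runs_with_a_hit / n_runs:.0%}")
# Analytically this is 1 - 0.95**20, i.e. roughly 64% of the time some color
# "causes" acne by luck alone.
```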

In contrast, if the experiments had shown that 5 out of the 20 different colors of jelly beans were each “statistically significantly” related to acne, that would be quite a bit more than you would have antecedently expected, so it really should start shifting your probabilities. Or if a replication of just the green jelly bean study again showed the link, this too would be a surprising result, and should increase your confidence that green jelly beans really do cause acne.

Take “Statistical Significance” with a grain of salt.

Hopefully you’re already seeing that you should take claims of “statistical significance” with a grain of salt. In particular, you should usually ask, “How many other correlations did scientists look for, too?” If you think that scientists looked for 19 other correlations, and that you’re just seeing the one that turned out to be “statistically significant”, then you shouldn’t be moved by the study at all. The 19 other correlations could be other studies by the same scientist looking for similar correlations, or they could be studies by other scientists looking at similar issues. We know that every big collection of studies will have a top 5%, and if you think you’re just seeing that top 5%, then what you’re seeing tells you basically nothing you shouldn’t have expected already, and the actual details won’t matter any more than it matters that it was green jelly beans that coincidentally were correlated with acne.

There are two cases where these sorts of concerns are especially likely to arise. One is the case of data-mining, where scientists start with a huge data-set and look for correlations. Some coincidental correlations are bound to be present in virtually any large data-set – indeed, for pairs of variables that are truly unrelated, about 5% of those pairs will show up as correlated with a p-value of .05 or lower – so the fact that these “statistically significant” correlations are present is not surprising at all. At best, such data-mined correlations are suggestions for potential further research, but they provide little to no evidence of genuine relations.
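
Here is a small sketch of the data-mining worry: generate a data-set of pure noise and count how many pairs of variables come out “significantly” correlated. The data-set’s dimensions (100 rows, 30 variables) are arbitrary:

```python
# Data-mining pure noise: count "significant" pairwise correlations in a data-set
# whose variables are, by construction, completely unrelated.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
n_rows, n_vars = 100, 30
data = rng.normal(size=(n_rows, n_vars))

pairs = significant = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        pairs += 1
        r, p = pearsonr(data[:, i], data[:, j])
        if p < 0.05:
            significant += 1

print(f"{significant} of {pairs} pairs came out 'significant' (expect about {0.05 * pairs:.0f})")
```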

Second is the case where you know that interested parties are funding a lot of research in a particular area.  There’s strong evidence that scientists are biased in lots of subtle ways towards finding the results that their financial backers want them to find, e.g., being more likely to triple-check or to opt not to publish any results that will make their backers unhappy.  But even if individual scientists were perfectly responsible, and not swayed at all by where their funding is coming from, still the fact that lots of searching is done in a particular area means that enough studies will be done for some to turn up as “statistically significant” purely by chance, so when you see those studies reported, they shouldn’t sway your beliefs much at all.

Replication helps a lot.

Many of these concerns can be assuaged by replicating an experiment – by doing it again and seeing if we get the same result. The first iteration of an experiment often doesn’t mean much: given how many experiments are being run, we should expect lots of first iterations – about 1 in 20 – to turn out “statistically significant”, even if all the scientists are looking at is random noise. However, if all the scientists are looking at is random noise, then we would not expect the first attempt to replicate an experiment also to turn out statistically significant – or rather, we should expect this to happen just 1 out of 20 times.
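
A quick back-of-the-envelope calculation shows why replication helps so much. Assuming the original study and its replications are run independently, the chance that pure noise clears the p < .05 bar every single time shrinks multiplicatively:

```python
# Chance that a pure-noise result clears p < .05 in the original study and in
# each independent direct replication.
alpha = 0.05
print(f"Original study only:       {alpha:.4f}")       # 1 in 20
print(f"Original + 1 replication:  {alpha ** 2:.4f}")  # 1 in 400
print(f"Original + 2 replications: {alpha ** 3:.6f}")  # 1 in 8,000
```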

The same old concerns do arise somewhat, even in the case of replications. If a whole lot of attempted replications are being done in an area, then we shouldn’t be surprised to see that some come out “statistically significant”. So, especially in areas where you know lots of funding is being thrown at researchers, or in cases where you know researchers had quite easy access to additional datasets that could be used to data-mine “replications” of correlations initially seen in other datasets, you should wait for multiple replications before you start being really persuaded.

Many scientific papers present a series of related experiments, each yielding “statistically significant” results that are presented as congruent with the others. In some cases, the later experiments include a replication of the results of earlier ones, while also recording and controlling for additional factors. These sorts of replications are quite evidentially valuable, except in cases where you suspect that the scientist actually attempted quite a few such extensions and is reporting only the few that panned out.

In many cases, though, the later experiments in a series are instead just somewhat similar to the earlier experiments, not direct replications. The evidential value of this sort of series is typically weaker, especially in cases where you think the researcher looked at lots of different variations on a theme and is reporting only the few that panned out. For example, a famous paper by Cornell psychologist Daryl Bem presented a series of experiments, each showing a statistically significant correlation between people’s behavior at one time and later randomly determined events, which Bem took to be evidence of a sort of pre-cognition. One red flag (among many) that Bem’s paper raises is that his later experiments did not replicate his earlier ones, and that there is no sense of how many other such potential correlations he might have examined that didn’t pan out. Indeed, when other researchers attempted to replicate Bem’s experiments, their results did not match his.

How many reported scientific findings are false?

Unfortunately, systematic attempts to replicate scientific results have often turned out to be disappointing. For example, in “The Reproducibility Project” a large group of scientists attempted to replicate 100 well-known psychological findings. Only 36% of the replications met the standard for “statistical significance”, and their effect sizes were on average about half of what had originally been reported. Similar results have been found in medicine. For example, a study by Glenn Begley found that only 6 out of 53 landmark studies in cancer research held up under replication. Better hope you don’t get cancer! As a result of failed replication studies like these, a number of scientific fields, including social psychology, now face a “replication crisis”, trying to determine which of their findings really were good ones, and trying to keep people from completely losing confidence in the field.

Why would it be that such a large percentage of published scientific results would fail to hold up under replication?  Part of this may be due to nefarious behavior by scientists, who certainly have strong career incentives to churn out publishable research, and aren’t all that closely monitored.  But it turns out that, even without any nefariousness, we actually should expect a pretty large percentage of scientific studies to be false, for much the same reasons that we should expect tests for rare diseases to produce a large percentage of false positives.  Even when science is working perfectly well, you should expect a large percentage of initial findings just to be false positives that will be corrected later.

To see why this is, let’s talk through an example.  The following video presents a similar example. Depending on what sort of learner you are, it may help to watch the video first.  https://youtu.be/42QuXLucH3Q  

Suppose 1000 scientific studies are being done.  Scientists’ careers will be best served by exciting ground-breaking results, not just boring results that we already knew, so they tend to explore hypothesized connections that are unlikely to be true, say only a 10% chance of being true. So out of our 1000 studies, 100 will be studying factors that are genuinely connected, and the other 900 will just be looking at random noise.  Of the 100 that are studying genuine connections, some studies will be too small to detect the correlation as statistically significant, and others may get unlucky, though on the flip side, scientists may use “p-hacking” tricks to help make these results appear significant, even in cases where this didn’t quite happen on its own.  Let’s optimistically suppose that 90 out of these 100 studies of genuine correlations will end up (by means fair or foul) detecting that correlation, so there’ll be 90 publishable “true positive” results. What about the other 900 studies, the ones that were looking for connections that weren’t actually there? By definition, 5% of those cases should register as statistically significant with threshold p = .05.  So that’s 5% of 900, or 45 cases where scientists will get a “false positive”.  So out of our 1000 studies, we should expect 135 publishable results: 90 true positives and another 45 false positives.  Out of these 135 publishable results, two thirds (90/135) would be true positives reporting on genuine connections, and the other third (45/135) would be “false positives” reporting a correlation that appeared to be statistically significant, but was actually just random noise.
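
Here is the same bookkeeping written out explicitly. The inputs (1000 studies, a 10% base rate of true hypotheses, 90% detection of true connections, a .05 threshold) are just the assumptions made in the paragraph above, not facts about any real field:

```python
# Bookkeeping for the 1000-study example in the text.
studies   = 1000
base_rate = 0.10   # fraction of studied hypotheses that are actually true
power     = 0.90   # chance a genuine connection gets detected (by means fair or foul)
alpha     = 0.05   # false-positive rate for studies of pure noise

true_hypotheses  = studies * base_rate             # 100
noise_hypotheses = studies - true_hypotheses       # 900
true_positives   = true_hypotheses * power         # 90
false_positives  = noise_hypotheses * alpha        # 45
published        = true_positives + false_positives

print(f"Publishable results:       {published:.0f}")                    # 135
print(f"Fraction that are genuine: {true_positives / published:.2f}")   # ~0.67
print(f"Fraction that are flukes:  {false_positives / published:.2f}")  # ~0.33
```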

Note: This is much like the reasoning involved in determining the probability that someone has an illness, given information about the base rate of that illness in the population, and given a positive result from a test that has some chance of delivering false positives when administered to individuals who don’t have the illness. Here, studying a genuine connection is the analog of having the disease, and getting a statistically significant finding is the analog of testing positive for the disease.  Many people, including trained doctors, commit “base rate fallacies” in reasoning about cases like these, usually giving answers that stray much too far away from the base rate and that are much too confident that the positive result can be trusted. In general, positive test results for rare diseases are quite likely to be false positives, and we just saw that something similar is true for positive test results in science more generally. To avoid such base rate fallacies, it often helps to formulate these as frequencies in a population of 1000, as we did above.
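
Here is a worked version of the disease-test analogy, with made-up numbers: a disease with a 1% base rate, and a test that catches 90% of real cases but gives a false positive 5% of the time. Most people guess the answer is up near 90%; counting frequencies in a population of 10,000 shows it is nowhere close:

```python
# A base-rate calculation for a hypothetical disease test.
population     = 10_000
base_rate      = 0.01   # 1% of people actually have the disease
sensitivity    = 0.90   # chance the test catches a real case
false_pos_rate = 0.05   # chance the test comes back positive for a healthy person

sick            = population * base_rate      # 100 people
healthy         = population - sick           # 9,900 people
true_positives  = sick * sensitivity          # 90
false_positives = healthy * false_pos_rate    # 495

p_sick_given_positive = true_positives / (true_positives + false_positives)
print(f"Chance a positive result really means disease: {p_sick_given_positive:.0%}")  # ~15%
```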

Of course, it may be that a number of additional studies that weren’t quite “statistically significant” on their own could be fiddled with to make them appear statistically significant. In our example, 855 of the 900 studies of random noise would initially yield a negative result. However, the scientists who ran those 855 studies will understandably be reluctant to give up after having spent so much effort gathering data, so many of them will be tempted to find rationalizations to exclude some data, to keep gathering data until they cross the statistical-significance threshold, or to use other “p-hacking” tricks to turn the study into a publishable “statistically significant” result after all. If this happens in just 5% of those 855 studies, that would be enough to make the number of false positives roughly equal the number of true positives: only about half of the published studies would be true positives, a quarter would be the false positives we’d expect anyway from the fact that 5% of studies of noise come out “significant”, and a quarter would be false positives arrived at by p-hacking. So, if you’re looking for an estimate of how many scientific findings we should expect to be true, 50% is a quite plausible ballpark estimate. (Of course the details will depend on assumptions about the base rate, about false-negative rates, and about rates of p-hacking.)
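
And here is the bookkeeping extended to include p-hacking, again using the text’s assumption that 5% of the initially negative noise studies get massaged into apparent positives:

```python
# Extending the 1000-study bookkeeping to include p-hacking.
noise_studies      = 900
chance_false_pos   = noise_studies * 0.05                 # 45 flukes
initially_negative = noise_studies - chance_false_pos     # 855
hacked_false_pos   = initially_negative * 0.05            # ~43 hacked "positives"
true_positives     = 90

published = true_positives + chance_false_pos + hacked_false_pos
print(f"Published results:                {published:.0f}")                   # ~178
print(f"Fraction that are true positives: {true_positives / published:.2f}")  # ~0.51
```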

This 50% number might seem demoralizing. After all that work, it’s still just a coin-flip whether they got it right! What was the point? However, it’s worth emphasizing that, before the study, we should have assigned probability 10% to each hypothesis’s being true – that was the base rate we assumed of true hypotheses among the 1000 being studied. 50% is a lot more than 10%, so the study actually did help us change our probabilities quite a lot. Even if a single study shouldn’t make us totally confident that its conclusion is true, it should make us take that conclusion quite a bit more seriously than we did before: up from the 10% chance it had of being true beforehand to around 50%. Further replications can then help us learn whether we should become more confident still.

Remedies.

Science is self-correcting. Eventually we’ll discover when old results don’t pan out.  But that takes time, especially when “publication biases” that favor the printing of positive results hinder negative findings from becoming well known.

One important remedy is to provide venues for scientists to report failed attempts to replicate others’ experiments. This is improving, thanks in part to increasing awareness of this problem, and thanks in part to our now having internet technology that makes publication of such results quite cheap to do.

Another important remedy is to insist that scientists register in advance what they’re testing for and what methods they’ll use.  This can prevent many forms of data-mining and p-hacking, making it more likely that published results will reflect genuine correlations, and won’t just be hyped-up statistical noise.  Professional norms on this are slowly shifting in a good direction, in part because some journals now demand that the studies they publish have been pre-registered.

Another important remedy is for the media to be more responsible in their science reporting, and for consumers of media to better understand the real (lack of) significance of many results that get widely reported. The media have a strong incentive to publish things that are exciting and new. This means they’ll give special emphasis to initial studies, but are likely not to publicize successful or failed replications. This is unfortunate because, just as with the green jelly bean case above, we know that a bunch of scientists are doing studies, so we know that about 5% of them will come out “statistically significant” even if the scientists are all looking at pure random noise – so it shouldn’t be surprising at all that there is a steady stream of new and exciting results for the media to write about. What’s much more informative is replications, but sadly the media often don’t bother reporting those, because those are just “old news”. If the public becomes more aware of these issues, we may be able to demand better science reporting. In the meantime, it’s probably best to take initial studies you hear about with a big grain of salt, and say “I’ll believe it when I see it replicated!”