10  Statistical Power Analysis

The goal of experimental design is not to attain statistical significance with some (high) level of probability. The point of designing an experiment and conducting power analysis is to have a sense, both before and after data have been collected, of what can reasonably be learned from the statistical analysis.

Statistical Power

Statistical power is the probability, before a study is performed, that a particular comparison will achieve “statistical significance” at some predetermined level.

More precisely, statistical power is the probability that, when the effect under study is real, the study will not return a false negative result, known in statistics jargon as a “Type II error”.
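
In symbols, writing \(\beta\) for the probability of a false negative (Type II error), as we do again below:

\(\textrm{power} = Pr(\textrm{positive result} \mid \textrm{effect is real}) = 1 - \beta\)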

So, suppose we design an experiment to have 80% (0.8) power…

10.1 Describing the predicted power of an experiment

“The experiment has 80% (0.8) power.”

Stating the estimated power of an experiment just as a “percentage power” (i.e. the probability of detecting a real effect of the specified size) is not enough, by itself. We also need to declare two important values that could reasonably differ between experiments with the same statistical power:

  1. the “statistical significance” threshold (i.e. the P-value threshold, \(\alpha\)) that has been decided on for the experiment
  2. the effect size we are aiming to detect

If we had decided that \(P=0.05\) was a suitable threshold for statistical significance, and that a change in blood glucose concentration of 2 mmol/L in response to administration of some drug was meaningful, we could say instead:

“The experiment has 80% (0.8) power at \(\alpha = 0.05\)[^1], for an effect size of 2 mmol/L.”
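
As a concrete illustration, the sketch below (Python, using the statsmodels package) solves for the per-group sample size implied by this power statement. The standard deviation of 2.5 mmol/L is an assumed value for illustration only, since the calculation needs a standardised effect size (Cohen’s d); with it, a 2 mmol/L change corresponds to \(d = 0.8\).

```python
from statsmodels.stats.power import TTestIndPower

# Assumptions for illustration: we want to detect a 2 mmol/L change in
# blood glucose, and we assume a standard deviation of 2.5 mmol/L, giving
# a standardised effect size (Cohen's d) of 2.0 / 2.5 = 0.8.
effect_mmol = 2.0
assumed_sd = 2.5
cohens_d = effect_mmol / assumed_sd

# Solve for the per-group sample size of a two-sample t-test with
# 80% power at alpha = 0.05.
n_per_group = TTestIndPower().solve_power(
    effect_size=cohens_d, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"required sample size per group: {n_per_group:.1f}")  # about 25.5
```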

Note

The choice of what makes a result significant, and what degree of statistical power is acceptable, is under researcher control, but might also be required by a potential funder. Funding agencies might insist, for example, that a study has at least an 80% chance of delivering a statistically significant result.

Typical values you might see in the literature include “80% power at 5% significance”, but there is no gold standard; the choices should be made to suit the situation.

10.2 Interpreting the predicted power of an experiment

Suppose that we’re conducting an experiment, and administering a drug we hope will reduce blood glucose concentrations. We have designed an experiment such that the predicted power can be written as:

“The experiment has 80% (0.8) power at \(\alpha = 0.05\), for an effect size of 2 mmol/L.”

At this point, we don’t know whether the drug really works or not. We haven’t done the experiment. So let’s consider our options:

The drug doesn’t work[^2]

Assume that the drug has no effect. Our statistical test has \(\alpha = 0.05\)[^3], which implies that we expect to reject the null hypothesis (e.g. that the drug has “no effect”) incorrectly 5% of the time.

So, if we ran the experiment 100 times using a drug that really had no effect, we would expect the result to appear as though the drug was effective five times.

The drug works

Our experiment has predicted power of 0.8 (\(\beta = 1 - \textrm{power} = 1 - 0.8 = 0.2\)[^4]), which implies that we expect to (incorrectly) fail to reject the null hypothesis (e.g. that the drug has “no effect”) 20% of the time.

So, if we ran the experiment 100 times using a drug that really does work, we would expect the result to appear as though the drug was effective 80 times[^5].
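
A minimal simulation makes both expectations concrete. This is a sketch in Python; the per-group sample size of 26 and the 2.5 mmol/L standard deviation are illustrative assumptions, chosen so that the experiment has roughly 80% power for a 2 mmol/L effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, sd, n_trials = 26, 2.5, 100  # assumed: 26 per group, SD 2.5 mmol/L

def count_significant(true_effect):
    """Run n_trials experiments; count how many give P < 0.05."""
    hits = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, sd, n)
        treated = rng.normal(-true_effect, sd, n)  # drug lowers glucose
        if stats.ttest_ind(control, treated).pvalue < 0.05:
            hits += 1
    return hits

print("no effect:     ", count_significant(0.0), "of 100 significant (expect ~5)")
print("2 mmol/L effect:", count_significant(2.0), "of 100 significant (expect ~80)")
```

The exact counts will vary from run to run, but should land near 5 and 80 respectively.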

10.3 How likely is it that the drug works?

Our experiment does not actually tell us whether the drug works.

Our experiment provides a result (“the drug does/does not appear to work”), and our chosen significance threshold and power tell us how confident we should be in believing that outcome.

We have two possibilities (models/hypotheses):

  1. The drug works
  2. The drug does not work

And there are two experimental outcomes:

  1. The drug appears to work (positive outcome; null hypothesis rejected)
  2. The drug does not appear to work (negative outcome; null hypothesis not rejected)

Knowing these possibilities, we can calculate a Bayes factor to help us interpret how strongly the evidence supports an assertion that the drug is effective in reducing blood glucose level.

The experiment gave a positive result

For a positive experimental outcome (i.e. the result suggests the drug is effective), we can calculate:

  1. \(Pr(\textrm{positive outcome}|\textrm{drug works})\) - the probability that we see a positive result, if the drug works; this is our power: 0.8
  2. \(Pr(\textrm{positive outcome}|\textrm{drug does not work})\) - the probability that we see a positive result, if the drug does not work; this is our false positive rate: \(\alpha = 0.05\)

Calculating the Bayes factor, \(K\):

\(K = \frac{Pr(\textrm{positive outcome}|\textrm{drug works})}{Pr(\textrm{positive outcome}|\textrm{drug does not work})} = 0.8 / 0.05 = 16\)

The experiment gave a negative result

For a negative experimental outcome (i.e. the result suggests the drug is not effective), we can calculate:

  1. \(Pr(\textrm{negative outcome}|\textrm{drug works})\) - the probability that we see a negative result, if the drug works; this is our false negative rate: \(\beta = 1 - \textrm{power} = 0.2\)
  2. \(Pr(\textrm{negative outcome}|\textrm{drug does not work})\) - the probability that we see a negative result, if the drug does not work; this is \(1 - \alpha\): 0.95

Calculating the Bayes factor, \(K\):

\(K = \frac{Pr(\textrm{negative outcome}|\textrm{drug works})}{Pr(\textrm{negative outcome}|\textrm{drug does not work})} = 0.2 / 0.95 \approx 0.21\)
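
Both calculations can be wrapped in a short helper; this is a sketch, and the function name bayes_factor is ours:

```python
def bayes_factor(power: float, alpha: float, positive: bool) -> float:
    """Bayes factor K for 'drug works' vs 'drug does not work',
    given the experiment's power and alpha, and its observed outcome."""
    if positive:
        # Pr(positive | works) / Pr(positive | does not work)
        return power / alpha
    # Pr(negative | works) / Pr(negative | does not work)
    return (1 - power) / (1 - alpha)

print(bayes_factor(0.8, 0.05, positive=True))   # 16.0
print(bayes_factor(0.8, 0.05, positive=False))  # ~0.21
```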

10.3.1 Interpreting Bayes Factors

A value of \(K > 1\) here means that the hypothesis that the drug works is more strongly supported by the data; a value of \(K < 1\) means that the hypothesis that the drug doesn’t work is more strongly supported.

In general, a value of \(K > 10\) is considered to be strong evidence in favour of a hypothesis (Kass and Raftery 1995).

What our experiment can tell us:

  1. A positive outcome implies \(K = 16\), which is greater than 10, so the experiment strongly supports the conclusion that the drug is effective.
  2. A negative outcome implies \(K \approx 0.21\), which is less than 1, so the experiment supports the conclusion that the drug is not effective; note, though, that \(1/K \approx 4.8\), which falls short of the threshold of 10 for strong evidence.

Note

If an experiment is designed with 80% power and \(\alpha = 0.05\), a positive result delivers strong evidence to reject the null hypothesis.


[^1]: This is the same thing as the \(P = 0.05\) threshold chosen above.

[^2]: Ashcroft et al. (1997)

[^3]: \(\alpha\), aka the P-value threshold, is what we set as our acceptable false positive rate (Type I error).

[^4]: \(\beta\) is what we set as our acceptable false negative rate (Type II error).

[^5]: Though note that we would expect the result to appear as though the drug was ineffective 20 times.