MP968 Experimental Design Workshop

Leighton Pritchard

University of Strathclyde

2025-11-24

Why do we need experimental design?

We should not cause unnecessary suffering

We should always minimise suffering

This may mean not performing an experiment at all. Not all new knowledge or understanding is worth causing suffering to obtain it.

Where there is sufficient justification to perform an experiment, we are ethically obliged to minimise the amount of distress or suffering that is caused, by designing the experiment to achieve this.

Why we need statistics

It may be easy to tell whether an animal is well-treated, or whether an experiment is necessary.

But what is an acceptable (i.e. the least possible) amount of suffering necessary to obtain an informative result?

Challenge

Quiz question

Suppose you are running a necessary and useful experiment with animal subjects, where the use of animals is morally justified. You are comparing a treatment group to a control group. Which of the following choices will cause the least amount of suffering?

  • Use three subjects per group so a standard deviation can be calculated
  • Use just enough subjects to establish that the outcome is likely to be correct
  • Use just enough subjects to be certain that the outcome is correct
  • Use as many subjects as you have available, to avoid wastage

How many individuals?

The appropriate number of subjects

The appropriate number of animal subjects to use in an experiment is always the smallest number that - given reasonable assumptions - will satisfactorily give the correct result to the desired level of certainty.

  • What assumptions are reasonable?
  • What is an appropriate level of certainty?

By convention, the usual level of certainty for a hypothesis test is: “we have an 80% chance of getting the correct true/false answer for the hypothesis being tested”

Design experiments to minimise suffering

Experimental design and statistics are intertwined

Once a research hypothesis has been devised:

  • Experimental design is the process of devising a practical way of answering the question
  • Statistics informs the choices of variables, controls, numbers of individuals and groups, and the appropriate analysis of results

Design your experiment for…

  • your population or subject group (e.g. sex, age, prior history, etc.)
  • your intervention (e.g. drug treatment)
  • your contrast or comparison between groups (e.g. lung capacity, drug concentration, etc.)
  • your outcome (i.e. is there a measurable or clinically relevant effect)

The 2009 NC3Rs systematic survey

The importance of experimental design

“For scientific, ethical and economic reasons, experiments involving animals should be appropriately designed, correctly analysed and transparently reported. This increases the scientific validity of the results, and maximises the knowledge gained from each experiment. A minimum amount of relevant information must be included in scientific publications to ensure that the methods and results of a study can be reviewed, analysed and repeated. Omitting essential information can raise scientific and ethical concerns.” (Kilkenny et al. (2009))

We rely on the reporting of the experiment to know if it was appropriate

Causes for concern 1

“Detailed information was collected from 271 publications, about the objective or hypothesis of the study, the number, sex, age and/or weight of animals used, and experimental and statistical methods. Only 59% of the studies stated the hypothesis or objective of the study and the number and characteristics of the animals used. […] Most of the papers surveyed did not use randomisation (87%) or blinding (86%), to reduce bias in animal selection and outcome assessment. Only 70% of the publications that used statistical methods described their methods and presented the results with a measure of error or variability.” (Kilkenny et al. (2009))

We cannot rely on the literature for good examples of experimental design

Causes for concern 2

No publication explained their choice for the number of animals used

We cannot rely on the verbal authority of ‘published scientists’ or ‘experienced scientists’ for good experimental design

Very strong cause for concern

“Power analysis or other very simple calculations, which are widely used in human clinical trials and are often expected by regulatory authorities in some animal studies, can help to determine an appropriate number of animals to use in an experiment in order to detect a biologically important effect if there is one. This is a scientifically robust and efficient way of determining animal numbers and may ultimately help to prevent animals being used unnecessarily. Many of the studies that did report the number of animals used reported the numbers inconsistently between the methods and results sections. The reason for this is unclear, but this does pose a significant problem when analysing, interpreting and repeating the results.” (Kilkenny et al. (2009))

Important

As scientists, you - yourselves - need to understand the principles behind the statistical tests you use, in order to choose appropriate tests and methods, and to use appropriate measures to minimise animal suffering and obtain meaningful results.

You cannot simply rely on the word of “experienced scientists” for this.

The ARRIVE guidelines

The following year Kilkenny et al. (2010) proposed the ARRIVE guidelines: a checklist to help researchers report their animal research transparently and reproducibly.

  • Good reporting is essential for peer review and to inform future research
  • Reporting guidelines measurably improve reporting quality
  • Improved reporting maximises the output of published research

ARRIVE guidelines highlights

Many journals now routinely request information in the ARRIVE framework, often as electronic supplementary information. The framework covers 20 items including the following (Kilkenny et al. (2010)):

ARRIVE guidelines (highlights)

  • Objectives: primary and any secondary objectives of the study, or specific hypotheses being tested
  • Study design: brief details of the study design, including the number of experimental and control groups, any steps taken to minimise the effects of subjective bias, and the experimental unit
  • Sample size: the total number of animals used in each experiment and the number of animals in each experimental group; how the number of animals was decided
  • Statistical methods: details of the statistical methods used for each analysis; methods used to assess whether the data met the assumptions of the statistical approach
  • Outcomes and estimation: results for each analysis carried out, with a measure of precision (e.g., standard error or confidence interval).

A vital step

Warning

“A key step in tackling these issues is to ensure that the next generation of scientists are aware of what makes for good practice in experimental design and animal research, and that they are not led into poor or inappropriate practices by more senior scientists without a proper grasp of these issues.”

Recommended reading

Bate and Clark (2014)

Some Statistical Concepts

Random variables

Your experimental measurements are random variables

Important

This does not mean that your measurements are entirely random numbers

Caution

Random variables are values whose range is subject to some element of chance, e.g. variation between individuals

  • Tail length (e.g. timing of developmental signals, distribution of nutrients)
  • Blood concentrations (e.g. circulatory heterogeneity, transient measurement differences)
  • Survival time (e.g. determining point of death)

Probability distributions

The probability distribution of a random variable \(z\) (e.g. what you measure in an experiment) takes on some range of values

The mean of the distribution of \(z\)

  • The mean (aka expected value or expectation) is the average of all the values in \(z\)
    • Equivalently: the mean is the value that is obtained on average from a random sample from the distribution
  • Written as \(\mu_{z}\) or \(E(z)\)

The variance of a distribution of \(z\)

  • The variance of the distribution of \(z\) represents the expected mean squared difference from the mean \(\mu_z\) (or \(E(z)\)) of a random sample from the distribution.
    • \(\textrm{variance} = E((z - \mu_z)^2)\)

Understanding variance

A distribution where all values of \(z\) are the same

  • Every single value in the distribution (\(z\)) is also the mean value (\(\mu_z\)), therefore

\[z = \mu_z \implies z - \mu_z = 0 \implies (z - \mu_z)^2 = 0\] \[\textrm{variance} = E((z - \mu_z)^2) = E(0^2) = 0\]

All other distributions

In every other distribution, some values of \(z\) differ from the mean, so for at least some values of \(z\)

\[z \neq \mu_z \implies z - \mu_z \neq 0 \implies (z - \mu_z)^2 \gt 0 \] \[\implies \textrm{variance} = E((z - \mu_z)^2) \gt 0 \]

Standard deviation

Standard deviation is the square root of the variance

\[\textrm{standard deviation} = \sigma_z = \sqrt{\textrm{variance}} = \sqrt{E((z - \mu_z)^2)} \]

Advantages

  • The standard deviation (unlike variance) takes values on the same scale as the original distribution
    • Standard deviation is a more “natural-seeming” interpretation of variation

Note

We can calculate mean, variance, and standard deviation for any probability distribution.
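As a quick illustration, a minimal Python sketch (using invented tail-length values) computes the sample mean, variance, and standard deviation of a set of measurements directly:

```python
import numpy as np

# Hypothetical tail-length measurements (cm); the values are illustrative only
z = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.3])

mean = z.mean()            # estimate of mu_z
variance = z.var(ddof=1)   # sample estimate of E((z - mu_z)^2)
sd = np.sqrt(variance)     # standard deviation, on the same scale as z

print(f"mean = {mean:.2f} cm, variance = {variance:.3f} cm^2, sd = {sd:.2f} cm")
```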

Normal Distribution 1

\[ z \sim \textrm{normal}(\mu_z, \sigma_z) \]

Note

We only need to know the mean and standard deviation to define a unique normal distribution

Tip

Measurements of variables whose value is the sum of many small, independent, additive factors may follow a normal distribution

Important

There is no reason to expect that a random variable representing direct measurements in the world will be normally distributed!

Normal Distribution 2

Tip

  • For a normal distribution, the mean value is the value at the peak of the curve
  • The curve is symmetrical, so standard deviation describes variability equally well on both sides of the mean

(Non-)Normal Distribution 3

Tip

  • Here, the mean may not be the same value as the peak of the curve (i.e. the mode)
  • The curve is asymmetrical, so standard deviation does not describe variation equally well on either side of the mean

Binomial Distribution 1

Suppose you’re taking shots in basketball

  • how many shots?
  • how likely are you to score?
  • what is the distribution of the number of successful shots?

Tip

This kind of process generates a random variable approximating a probability distribution called a binomial distribution.

It is different from a normal distribution.

Binomial Distribution 2

\[ z \sim \textrm{binomial}(n, p) \]

Tip

  • number of shots, \(n = 20\), probability of scoring, \(p = 0.3\)

\[z \sim \textrm{binomial}(20, 0.3) \]

mean and sd

\[ \textrm{mean} = n \times p \] \[ \textrm{sd} = \sqrt{n \times p \times (1-p)}\]
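A minimal sketch in Python, using the basketball numbers above (\(n = 20\), \(p = 0.3\)): simulating many shooting sessions gives a mean and standard deviation close to these formulas.

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3                                  # 20 shots, 30% chance of scoring
rng = np.random.default_rng(1)
shots_made = rng.binomial(n, p, size=100_000)   # simulate many shooting sessions

print(shots_made.mean(), shots_made.std())      # close to the exact values below
print(n * p, np.sqrt(n * p * (1 - p)))          # mean = 6.0, sd ~ 2.05
print(binom(n, p).mean(), binom(n, p).std())    # scipy encodes the same distribution
```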

Design note

You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data. E.g., does \(p\) differ between two conditions?

Poisson distribution 1

In prior experiments the frequency of calcium events in WKY was 3.8 \(\pm\) 1.1 events/field/min compared to 18.9 \(\pm\) 7.1 in SHR

This is not normal (or binomial)

A count of how many times something happens in a fixed interval generates a Poisson distribution.

This is different from a normal or binomial distribution.

Poisson distribution 2

\[z \sim \textrm{poisson}(\lambda)\]

Poisson distribution

\[ \textrm{mean} = \lambda \] \[ \textrm{sd} = \sqrt{\lambda} \]

Expectation (\(\lambda\))

  • Only one parameter is provided, \(\lambda\): the rate with which the measured event happens

  • Suppose a county has a population of 100,000, and the average rate of cancer is 45.2 cases per million people each year

\[z \sim \textrm{poisson}(45.2 \times 100{,}000 / 1{,}000{,}000) = \textrm{poisson}(4.52) \]
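A minimal Python sketch of the county example: simulated yearly counts have mean \(\approx \lambda\) and standard deviation \(\approx \sqrt{\lambda}\).

```python
import numpy as np

rate_per_million = 45.2
population = 100_000
lam = rate_per_million * population / 1_000_000   # expected cases per year = 4.52

rng = np.random.default_rng(2)
yearly_cases = rng.poisson(lam, size=100_000)     # simulate many years of counts

print(lam, np.sqrt(lam))                          # 4.52 and ~2.13
print(yearly_cases.mean(), yearly_cases.std())    # close to the values above
```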

Design note

You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data

  • E.g., does \(\lambda\) differ between two conditions?

Binomial and Poisson distributions

Some important features

  • All measured values (and \(n\)) are whole numbers greater than or equal to zero; \(\lambda\) may be any non-negative real number, and \(p\) lies between 0 and 1
  • The distributions may not be unimodal
  • The mean is not always the peak value (mode)
  • The distributions are not always symmetrical (so sd may not describe variation equally either side of the mean)

Distributions in Practice

Distributions are starting points

  • Distributions arise from and represent distinct generation processes (relate this to your biological system)
    • Normal distributions are generated by sums, differences, and averages
    • Poisson distributions are generated by counts (per unit interval)
    • Binomial distributions are generated by success/failure outcomes
  • Design experiments with analyses that reflect these processes

Warning

  • All statistical distributions are idealisations that ignore many features of real data
  • No real world data should be expected to exactly match any statistical distribution
  • Poisson models tend to need adjustment for overdispersion

Normal Distribution Redux

Probability mass

  • approximately 50% of the distribution lies in the range \(\mu \pm 0.67\sigma\)
  • approximately 68% of the distribution lies in the range \(\mu \pm \sigma\)
  • approximately 95% of the distribution lies in the range \(\mu \pm 2\sigma\)
  • approximately 99.7% of the distribution lies in the range \(\mu \pm 3\sigma\)
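These figures can be checked directly from the normal cumulative distribution function (a short scipy sketch; the particular choice of \(\mu\) and \(\sigma\) does not matter):

```python
from scipy.stats import norm

mu, sigma = 0, 1   # the fractions are the same for any mean and sd
for k in (0.67, 1, 2, 3):
    mass = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"fraction within mu +/- {k} sigma: {mass:.3f}")
# approximately 0.50, 0.683, 0.954, 0.997
```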

Estimates, standard errors, and confidence intervals

Parameters

Parameters are unknown numbers that determine a statistical model

A linear regression

\[ y_i = a + b x_i \]

  • Parameters are:
    • \(a\) (the intercept)
    • \(b\) (the gradient)

A normal distribution representing your data

\[ z \sim \textrm{normal}(\mu_z, \sigma) \]

  • Parameters are: \(\mu_z\) and \(\sigma\)

Estimands

An estimand (or quantity of interest) is a value that we are interested in estimating

A linear regression

\[ y_i = a + b x_i\]

  • We want to estimate values for:
    • \(a\) (the intercept)
    • \(b\) (the gradient)
    • predicted outcomes at important values of \(x_i\)

These are all estimands, and estimates are represented using the “hat” symbol: \(\hat{a}\), \(\hat{b}\), etc.

A normal distribution representing your data

\[ z \sim \textrm{normal}(\mu_z, \sigma) \]

  • Estimands are: \(\mu_z\) and \(\sigma\)
    • Maybe you want to determine the 95% confidence interval - this is also an estimand

Standard Errors and Confidence Intervals

  • The standard error is the estimated standard deviation of an estimate
    • It is a measure of our uncertainty about the quantity of interest

Note

  • Standard error gets smaller as sample size gets larger
    • You know more about the most likely value, the more data/information you collect
    • Standard error tends towards zero as the sample size becomes very large
  • The confidence interval (or CI) represents a range of values of a parameter or estimand that are roughly consistent with the data

Important

  • In repeated applications, the 50% confidence interval will include the true value 50% of the time
    • A 95% confidence interval will include the true value 95% of the time

Tip

  • The usual 95% confidence interval rule of thumb for large samples (assuming a normal distribution) is to take the estimate \(\pm\) two standard errors
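A minimal sketch of this rule of thumb in Python, using an invented sample (the values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(loc=10.0, scale=2.0, size=50)   # hypothetical measurements

estimate = z.mean()                            # estimate of the mean
se = z.std(ddof=1) / np.sqrt(len(z))           # standard error of the mean
ci_low, ci_high = estimate - 2 * se, estimate + 2 * se

print(f"estimate = {estimate:.2f}, SE = {se:.2f}, "
      f"approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```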

Statistical significance and hypothesis testing

Statistical significance 1

  • Some scientists choose to consider a result to be “stable” or “real” if it is “statistically significant”
  • They may also consider “non-significant” results to be noisy or less reliable

Warning

I, and many other statisticians, do not recommend this approach (though we will not cover alternatives today)

However, the concept is widespread and we need to discuss it

Statistical significance 2

A common definition

  • Statistical significance is conventionally defined as a threshold (commonly, a \(p\)-value less than 0.05) relative to some null hypothesis or prespecified value that indicates no effect is present.

  • E.g., an estimate may be considered “statistically significant at \(P < 0.05\)” if it:

    • lies at least two standard errors from the null (prespecified) value
    • is a difference that lies at least two standard errors from zero
  • More generally, an estimate is “not statistically significant” if, e.g.

    • the observed value can reasonably be explained by chance variation consistent with the null hypothesis
    • it is a difference that lies less than two standard errors from zero

Most tests rely on probability distributions

  • We need to relate the measured values in the real world to an appropriate statistical distribution that approximates them

A simple example: The experiment

The experiment

  • Two drugs, \(C\) and \(T\), lower cholesterol, and we want to compare their effectiveness
  • We randomise assignment of \(C\) and \(T\) to members of a single cohort of comparable individuals, whose pre-treatment cholesterol level is assumed to be drawn from the same distribution (i.e. be approximately the same)
  • We measure the post-treatment cholesterol levels \(y_T\) and \(y_C\) for each individual in the two groups.
  • We calculate the average measured \(\bar{y}_T\) and \(\bar{y}_C\) for the treatment and control groups as estimates for the true post-treatment levels \(\theta_T\) and \(\theta_C\).
    • We also calculate standard deviation for the two groups, \(\sigma_T\) and \(\sigma_C\)

A simple example: The hypotheses

  • We want to know if the treatments have different sizes of effect
    • If they do, there should be a difference between the (average) post-treatment cholesterol level in each group
    • The true post-treatment levels are \(\theta_T\) and \(\theta_C\)
    • We have estimated means, \(\bar{y}_T\) and \(\bar{y}_C\) for post-treatment levels

The hypotheses

  • We are interested in \(\theta = \theta_T - \theta_C\), the expected true post-test difference in cholesterol between the two groups \(T\) and \(C\).
  • Our null hypothesis (\(H_0\)) is that \(\theta = 0\), i.e. there is no difference (\(\theta_C = \theta_T\))
  • Our alternative hypothesis (\(H_1\)) is that there is a difference, so \(\theta \neq 0\), (i.e. \(\theta_C \neq \theta_T\))

A simple example: The distribution 1

  • To perform a statistical test, we may assume a distribution and parameters for the null hypothesis
    • We can then test the observed estimate against that distribution to see how likely it is that the null hypothesis would have generated it

The distribution

  • We use a probability distribution reflecting data generated under the null hypothesis: \(\theta_C = \theta_T\)
    • This allows us to define, in advance, a critical value \(T_\textrm{crit}\) of the test statistic corresponding to our chosen threshold of “significance”
  • We then compare the estimate from the experiment (\(\bar{y}_T - \bar{y}_C\)) against this distribution to calculate a \(p\)-value for our estimate: \(p = \textrm{Pr}(T(y_{\textrm{null}}) \geq T(\bar{y}_T - \bar{y}_C))\)

A simple example: The null hypothesis

The null hypothesis

  • Assume that, under the null hypothesis, the estimated difference is normally distributed with \(\mu_\theta=0\), \(\sigma_\theta=1\)

A simple example: The estimated difference

Observed difference between post-treatment levels: \(\bar{y}_T - \bar{y}_C = -1.4\)

  • Is this an unlikely outcome given the null hypothesis?

A simple example: A significance threshold

We choose a significance threshold in advance

  • Suppose we set a threshold \(T\) corresponding to the 90% confidence interval (i.e. \(P<0.1\))
    • If the estimate is not in the central 90% of the distribution, we’ll say it’s “significant”

A simple example: Compare the estimate

Compare the estimate to the threshold

  • The estimate lies outwith the threshold, so we call the difference “significant”

A simple example: Another threshold

We choose a significance threshold in advance

  • Suppose we set the threshold \(T\) corresponding to the 95% confidence interval (i.e. \(P<0.05\)) instead?

A simple example: Another outcome

Compare the estimate to the threshold

  • The estimate lies within the threshold, so the difference is “not significant”

A simple example: What changed?

What did not change

  • The null hypothesis was the same
  • The observed estimate of difference was the same

What changed

  • Our choice of significance threshold changed

Significance threshold choice

  • Once the estimate is known, it is always possible to find a threshold that makes it “significant” or “not significant”
  • It is dishonest to select a threshold deliberately to make your result “significant” or “not significant”
  • Always choose and record (preregister) your threshold for significance ahead of the experiment

Tailed tests: two-tailed

Use two tails if direction of change doesn’t matter

  • With a two-tailed hypothesis test, we do not care which direction of change is significant

Tailed tests: one-tailed (left)

Use one-tailed tests when direction matters

  • If we’re testing specifically for a significant negative difference/reduction, use a left-tailed test
  • e.g. if we wanted to know if \(T\) reduced post-test levels with respect to \(C\) at a threshold of \(P < 0.05\)

Tailed tests: one-tailed (right)

Use one-tailed tests when direction matters

  • If we’re testing specifically for a positive difference/increase, use a right-tailed test
  • e.g. if we wanted to know if \(T\) increased post-test levels with respect to \(C\) at a threshold of \(P < 0.05\)

Problems with statistical significance 1

Warning

It is a common error to summarise comparisons by statistical significance into “significant” and “non-significant” results

Statistical significance is not the same as practical importance

  • Suppose a treatment increased earnings by £10 per year with a standard error of £2 (average salary £25,000).
    • This would be statistically, but not practically, significant
  • Suppose a different treatment increased earnings by £10,000 per year with a standard error of £10,000
    • This would not be statistically significant, but could be important in practice

Problems with statistical significance 2

Warning

It is a common error to summarise comparisons by statistical significance into “significant” and “non-significant” results

Non-significance is not the same as zero

  • Suppose an arterial stent treatment group outperforms the control
    • mean difference in treadmill time: 16.6s (standard error 9.8)
    • the 95% confidence interval for the effect includes zero, \(p \approx 0.20\)
  • It’s not clear whether the net treatment effect is positive or negative
    • but we can’t say that stents have no effect

Problems with statistical significance 3

The difference between ‘significant’ and ‘not significant’ is not statistically significant

  1. At a \(P<0.05\) threshold, only a small change is required to move from \(P = 0.051\) to \(P = 0.049\)
  2. Large changes in significance can correspond to non-significant differences in the underlying variables


Standard errors, sample size, and statistical significance

Standard errors

Important

We cannot make an infinite number of measurements of \(z\). We can only take a sample.

The mean and standard deviation we estimate in an experiment will not match those of the infinitely large population.

Standard Error (of the Mean)

The standard error of the mean reflects the uncertainty in our estimate of the mean.

When estimating the mean of an infinite population, given a simple random sample of size \(n\), the standard error is:

\[ \textrm{standard error} = \sqrt{\frac{\textrm{Variance}}{n}} = \frac{\textrm{standard deviation}}{\sqrt{n}} = \frac{\sigma}{\sqrt{n}} \]

Standard error and sample size

Tip

Uncertainty in the estimate of the mean \(\mu\) decreases in proportion to \(1/\sqrt{n}\), the inverse square root of the sample size \(n\)
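For example (a sketch with an arbitrary assumed \(\sigma = 1.25\)), quadrupling the sample size halves the standard error:

```python
import numpy as np

sigma = 1.25                      # assumed population standard deviation
for n in (4, 16, 64, 256):
    print(f"n = {n:>3}: standard error = {sigma / np.sqrt(n):.4f}")
# 0.6250, 0.3125, 0.1562, 0.0781 -> halves each time n is quadrupled
```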

Standard error and hypothesis testing 1

Hypothesis test statistics

  • The test statistic \(t\) is computed from the data and compared against a critical value on the null distribution that represents the significance threshold

\[ t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}} \]

  • \(Z\) is some function of the data (difference between estimate and true value); \(s\) is standard error of the mean

This is true for many hypothesis test methods

Standard error and hypothesis testing 2

One-sample \(t\)-test

\[ t = \frac{Z}{s} = \frac{\bar{X} - \mu}{\hat{\sigma}/{\sqrt{n}}} = \frac{\bar{X} - \mu}{s(\bar{X})} \]

  • \(\bar{X}\) is the sample mean; \(\mu\) is the hypothesised population mean (being tested)
  • \(\hat{\sigma}\) is the sample standard deviation; \(n\) is the sample size; \(s(\bar{X})\) is the standard error of the mean
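As a sketch, the same statistic can be computed by hand or with scipy.stats.ttest_1samp (the sample values here are invented purely for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([27.1, 26.3, 28.0, 25.9, 27.5, 26.8])   # hypothetical measurements
mu0 = 25.0                                            # hypothesised population mean

t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
result = stats.ttest_1samp(x, popmean=mu0)

print(t_manual, result.statistic, result.pvalue)      # the two t values agree
```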

Wald test

\[ \sqrt{W} = \frac{Z}{s} = \frac{\hat{\theta} - \theta_0}{s(\hat{\theta})} \]

  • \(\hat{\theta}\) is the maximum likelihood estimate (the value that maximises the likelihood function); \(\theta_0\) is the hypothesised value under test
  • \(s(\hat{\theta})\) is the standard error of \(\hat{\theta}\)

Standard error and hypothesis testing 3

Hypothesis test statistics

  • The test statistic \(t\) is computed from the data and compared against a critical value on the null distribution that represents the significance threshold

\[ t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}} \]

  • \(Z\) is some function of the data (difference between estimate and true value); \(s\) is standard error of the mean

What happens if we hold \(Z\) and \(\sigma\) constant and vary sample size?

Sample size and hypothesis testing 1

We reject the null hypothesis when \(t > t_\textrm{crit}\)

  • Suppose we set \(t_\textrm{crit} = 2\) (and \(\sigma=1\), \(n=3\))
  • We can calculate \(t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}}\) for any value of \(Z\)

Sample size and hypothesis testing 2

The difference (\(Z\)) we need to see to reject the null varies with sample size

  • Set \(t_\textrm{crit} = 2\)
  • \(n=3 \implies Z_\textrm{crit} \approx 1.15\); \(n=15 \implies Z_\textrm{crit} \approx 0.5\)
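These critical differences come straight from rearranging the test statistic (a sketch assuming \(t_\textrm{crit} = 2\) and \(\sigma = 1\), as above):

```python
import numpy as np

t_crit, sigma = 2.0, 1.0
for n in (3, 15, 60):
    z_crit = t_crit * sigma / np.sqrt(n)   # smallest difference that reaches t_crit
    print(f"n = {n:>2}: Z_crit = {z_crit:.2f}")
# n=3 -> 1.15, n=15 -> 0.52, n=60 -> 0.26
```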

Statistical significance and effect size

Statistical significance is not the point

  • We can meet any statistical significance threshold for a difference by sufficiently increasing the sample size

Statistical significance is not the same as practical importance

  • Suppose a treatment increased earnings by £10 per year with a standard error of £2 (average salary £25,000).
    • This would be statistically, but not practically, significant

What matters is effect size

In all our experiments we should be concerned not with “statistical significance”, but with how likely it is that our experiment will detect a meaningful effect of the treatment, if one exists.

  • We need to be concerned with statistical power

Statistical power

Statistical power

Important

Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve “statistical significance” at some predetermined level (e.g. \(P < 0.05\)) given an assumed true effect size.

The process

  1. Hypothesise an appropriate effect size (e.g. what effect will improve health?)
  2. Determine the \(p\)-value threshold you consider “statistically significant”
  3. Make reasoned assumptions about the variation in the data (e.g. what distribution? what variance?)
  4. Choose a sample size
  5. Use probability calculations to determine the chance that your observed \(p\)-value will be below the threshold (i.e. that you will reject the null hypothesis) for the hypothesised effect size
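Step 5 can be done analytically (as G*Power does) or by simulation. A minimal simulation sketch for a one-sample \(t\)-test, assuming an invented effect size, standard deviation, and sample size: the proportion of simulated experiments that reach significance estimates the power.

```python
import numpy as np
from scipy import stats

effect, sigma, n = 1.0, 1.0, 10   # assumed effect size, sd, and sample size
alpha = 0.05                      # significance threshold
rng = np.random.default_rng(4)

n_sims = 10_000
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=effect, scale=sigma, size=n)   # data under the assumed effect
    p = stats.ttest_1samp(sample, popmean=0.0).pvalue      # test against the null (no effect)
    rejections += p < alpha

print(f"estimated power = {rejections / n_sims:.2f}")      # close to 0.8 for these values
```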

Effect sizes

Power analysis depends on an assumed effect size

  • The true effect size is almost never known ahead of time
    • Determining the effect size is usually why we’re doing the study

How to choose effect sizes

  • Try a range of values consistent with relevant literature
  • Determine what value would be of practical interest (e.g. improvement in outcomes of 10%)

How not to choose effect size

  • DO NOT USE AN ESTIMATE FROM A SINGLE NOISY STUDY!
    • Noisy studies suffer from The Winner’s Curse

The Winner’s Curse 1

A low-powered pilot study

  • Suppose we ran a small pilot study with only a few individuals
  • The study, by design, has low statistical power
    • The variance of the data is relatively large, compared to the true effect size

The Winner’s Curse 2

You get a statistically significant result!

  • You think you won, but you lost! (The Winner’s Curse)
    • The estimate is either eight times too large (at least 16% instead of 2%) or
    • The estimate has the wrong sign (a negative change instead of positive)

The Winner’s Curse 3

The trap

Any apparent success of low-powered studies masks larger failure

When signal (effect size) is low and noise (standard error) is high, “statistically significant” results are likely to be wrong.
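A simulation sketch of this trap, assuming a true effect of 2 percentage points and a standard error of 8 (values chosen to be consistent with the “at least 16% instead of 2%” example above): only a few percent of replications reach significance, and those that do are badly exaggerated or have the wrong sign.

```python
import numpy as np

true_effect, se = 2.0, 8.0    # assumed: small true effect, large standard error
rng = np.random.default_rng(5)

estimates = rng.normal(true_effect, se, size=100_000)    # many replications of the study
significant = np.abs(estimates / se) > 1.96              # "significant" at p < 0.05

print(f"share of runs reaching significance (power): {significant.mean():.3f}")
print(f"mean |estimate| when significant: {np.abs(estimates[significant]).mean():.1f}")
print(f"share of significant estimates with the wrong sign: "
      f"{(estimates[significant] < 0).mean():.2f}")
```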

Low-power studies tend not to replicate well

Warning

Low-power studies have essentially no chance of providing useful information

We can say this even before data are collected

Published results tend to be overestimates

Statistical power and ethics

It is unethical to under-power animal studies

“Under-powered in vivo experiments waste time and resources, lead to unnecessary animal suffering and result in erroneous biological conclusions” (NC3Rs Experimental Design Assistant guide)

It is unethical to over-power animal studies

“Ethically, when working with animals we need to conduct a harm–benefit analysis to ensure the animal use is justified for the scientific gain. Experiments should be robust, not use more or fewer animals than necessary, and truly add to the knowledge base of science” (Karp and Fry (2021))

So how should we appropriately power animal studies?

Statistical power and error 1

We often refer to two kinds of statistical error

Type I Error (\(\alpha\))

  • Type I error is the probability of rejecting a null hypothesis, when the null hypothesis is true
    • Also known as a “false positive error”
  • Represented by the Greek letter \(\alpha\)

Type II Error (\(\beta\))

  • Type II error is the probability of accepting a null hypothesis, when the null hypothesis is false
    • Also known as a “false negative error”
  • Represented by the Greek letter \(\beta\)

Statistical power is \(1 - \beta\)

Statistical power and error 2

Statistical power needs context: the expected error rates of the experiment at a given effect size, e.g.

The experiment has 80% power at \(\alpha = 0.05\) for an effect size of 2 mmol/L

How to read this

  • “an effect size of 2 mmol/L”: we are aiming to detect an effect of at least 2 mmol/L (e.g. in blood glucose concentration)
  • “\(\alpha = 0.05\)”: we are using a significance test threshold (\(\alpha\), type I error rate) of \(P < 0.05\)
  • “80% power”: we expect the study to report a significant effect, where one truly exists, 80% of the time

Statistical power and error 3

The experiment has 80% power at \(\alpha = 0.05\) for an effect size of 2 mmol/L

If the drug truly has no effect

  • The test has \(\alpha = 0.05\), so we would expect to reject the null hypothesis incorrectly 5% of the time
  • If we ran the experiment 100 times, we would expect to see a result implying that the drug was effective five times

If the drug truly has an effect

  • The test has predicted power \(1 - \beta = 0.8\), so the type II error rate \(\beta = 0.2\) and we would expect to accept the null hypothesis incorrectly 20% of the time
  • If we ran the experiment 100 times, we would expect to see a result implying that the drug was effective eighty times

Statistical power and sample size 1

What we need, to calculate appropriate sample size

  • An acceptable false positive rate (type I error, \(\alpha\))
  • An acceptable false negative rate (type II error, \(\beta\))
    • This is equivalent to knowing the target statistical power (\(1 - \beta\))
  • The expected effect size and variance
  • The statistical test being performed

Important

We need this information to calculate an appropriate, ethical sample size

Statistical power and sample size 2

Typical funders’ requirements

  • False positive rate \(\alpha = 0.05\)

  • Power \(1 - \beta = 0.8\) (80% power)

  • These are only a starting point - other values may be more appropriate depending on circumstance

Under experimenter control

  • Effect size and variance
  • The appropriate statistical approach

G*Power Walkthrough

G*Power

  • G*Power is a powerful software tool that can compute several types of power calculations for a very wide range of statistical analyses (Faul et al. (2007))

Note

It is beyond the workshop scope to consider all the possible uses of G*Power.

We will only be introducing you to basic operation of the program.

The Experiment 1

Is the average weight of a group of mice statistically different from 25g?

Which statistical test?

  • Means of real-valued measurements such as height and weight tend to follow a normal distribution, so a \(t\)-test may be appropriate
  • We are testing the mean of a sample against a single hypothesised (null) value, so a one-sample \(t\)-test would be the appropriate variant
  • We want to know if the sample mean is different from the single hypothesised (null) value and are not concerned with direction, so we would use a two-tailed test

The Experiment 2

Is the average weight of a group of mice statistically different from 25g?

What do we want to know from the power calculation?

We want to determine an appropriate sample size for our experiment, given an expected effect size, variance, and choices of power and statistical significance threshold

The Experiment 3

Is the average weight of a group of mice statistically different from 25g?

What kind of power calculation should we perform?

We want to calculate an appropriate sample size for our experiment before we perform it, so we require an a priori power calculation

The Experiment 4

Is the average weight of a group of mice statistically different from 25g?

What statistical power should we use?

We will use standard funders’ requirements of 80% power at \(P < 0.05\)

\[\alpha=0.05, \beta=0.2, 1-\beta=0.8 \]

The Experiment 5

Is the average weight of a group of mice statistically different from 25g?

What is the expected effect size and variation in the sample?

There are no absolute values that constitute a meaningful difference in weight under all circumstances, but we will make the following assumptions

  • A 10% difference in weight from the expected 25g null would be meaningful, for an effect size of 2.5g
  • We assume that the weight of individuals in the sample has a standard deviation of 1.25g, so \(\sigma = 1.25\), on the basis of having previously weighed groups of mice.

Let’s start!

Note

We now have enough information to calculate an appropriate sample size

parameters

  • two-tailed, one sample \(t\)-test
  • a priori power calculation to compute required sample size
  • \(\alpha = 0.05\)
  • Power \(1 - \beta = 0.8\)
  • Effect size = 2.5
  • Standard deviation, \(\sigma = 1.25\)
  • Cohen’s \(d = \frac{\mu - \mu_0}{\sigma}\)
Figure 1: G*Power start screen
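The same a priori calculation can be reproduced outside G*Power. Here is a sketch using the statsmodels Python package (which provides equivalent noncentral-\(t\) power routines), with the parameters listed above; the result should be comparable to the sample size G*Power reports, and any fractional answer is rounded up.

```python
import math
from statsmodels.stats.power import TTestPower

effect, sigma = 2.5, 1.25
d = effect / sigma                      # Cohen's d = (mu - mu_0) / sigma = 2.0

analysis = TTestPower()                 # one-sample (or paired) t-test power
n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8,
                         alternative='two-sided')

print(f"required sample size: {n:.2f} -> round up to {math.ceil(n)} mice")
```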

Experimental Design Assistant

NC3Rs EDA

Figure 2: NC3Rs EDA

The Experiment 1

We will use NONcNZO10/LtJ mice (JAX) to represent a polygenic form of diabetes. Mice in the treatment group will receive a single subcutaneous injection of drug A (up to 30mg/kg). Mice in the control group will receive a single subcutaneous injection of vehicle. Mice will be randomly assigned to receive either drug A or vehicle only without regard to the sex of the animal. 48 hours after administration of drug or vehicle, the blood glucose level will be measured.

The Experiment 2

Our beliefs about the experiment

  • We are testing the effect of a new drug - drug A - on plasma glucose levels
    • There is a single experimental variable of interest: whether drug A is present (treatment) or absent (control)
    • The plasma glucose level is our outcome measure
  • Our experimental subjects are diabetic (NONcNZO10/LtJ) mice
  • We will divide individuals into two groups, by different pharmacological treatment
    • Group 1 will receive the vehicle with no active drug (the control)
    • Group 2 will receive the vehicle, containing active drug (the treatment)
  • Individuals will be allocated to each group randomly, by complete randomisation
  • We are not testing for the effect of sex on drug performance as an experimental variable of interest
    • We are only monitoring for the effect of sex on drug performance as a nuisance factor

The Experiment 3

Causal relationships in the experiment

Plasma glucose level is potentially dependent on three causal relationships

  1. The pharmacological effect of drug A
  2. The pharmacological effect of the vehicle
  3. Individual differences between experimental subjects (mice)

Non-causal effects

Plasma glucose level is potentially also dependent on other factors including

  • sex
  • time the drug is administered

Some of these may be systematic and important (nuisance variables), and others random and not so important (“noise”)

The Experiment 4

Treatment vs no treatment

The causal influence of administering drug A (or not) on plasma glucose levels of the experimental subjects is the central reason for performing the experiment

  • Presence or absence of drug A is our independent variable (aka explanatory variable)

Independent variables

The presence/absence of drug A is an independent variable as whether it is administered or not is entirely under our control, as experimenters.

The Experiment 5

Control vs treatment groups

We cannot administer the drug in isolation!

Like most drugs, drug A is carried in a vehicle - a substance expected to be inert in the context of the treatment.

Controlling for variables

It would be unwise to assume we could administer drug A alone to one group of mice and apply no intervention in the other group.

Physiological responses may be affected by:

  • the vehicle
  • the procedure (e.g. injection)

So we must follow the same protocol in the control and treatment groups, with the only variable being presence/absence of drug A.

The Experiment 6

Random assignment to groups

Individual mice may respond differently to drug A, the vehicle, and the procedure, due to underlying differences between them.

We therefore need to randomise subjects into experimental groups, assuming that all individuals are drawn from the same population (a single pool whose response varies according to some common distribution).

  • This means we expect the underlying plasma concentrations, and responses to treatment, in each group to be comparable.
  • Differences in measured outcome between groups are then expected to derive only from our experimental interventions and presence/absence of drug A

An EDA diagram

Figure 3: EDA diagrams are composed of nodes representing aspects of an experiment, and links representing relationships between the nodes

Our target diagram

Figure 4: The final EDA diagram describing our experiment

Summary

Power calculations

Power calculations describe what can be learned from statistical analysis of an experiment

  • We defined statistical power, and how statistical power tells us what conclusions can reasonably be drawn from statistical analysis of an experiment.
    • The goal of experimental design is not to attain statistical significance with some (high) level of probability. The point of designing an experiment and conducting power analysis is to have a sense, both before and after data have been collected, of what can reasonably be learned from the statistical analysis.

G*Power

  • Using G*Power we worked through a practical example of how to estimate sample size for a desired statistical power

Experimental design

The NC3Rs EDA can help us design better experiments

We worked through the complete design, critique, and analysis of an experiment using the NC3Rs EDA tool, including how to share the resulting design with others.

What next?

Experimental design is a large and diverse field, and to become competent in experimental design takes time, and requires practice (which also involves making mistakes, at times).

What no-one tells you

Even experienced scientists with many publications and long track records of research funding may never have received any formal training in experimental design. This can lead to entrenched and outdated ideas being propagated.

It is always advisable to consult with funding bodies and specialists in statistics, experimental design, and research ethics.

References

References

Bate, Simon T., and Robin A. Clark. 2014. The Design and Statistical Analysis of Animal Experiments. Cambridge University Press.
Faul, Franz, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. 2007. “G*Power 3: A Flexible Statistical Power Analysis Program for the Social, Behavioral, and Biomedical Sciences.” Behav. Res. Methods 39 (2): 175–91.
Karp, Natasha A, and Derek Fry. 2021. “What Is the Optimum Design for My Animal Experiment?” BMJ Open Sci. 5 (1): e100126.
Kilkenny, Carol, William J Browne, Innes C Cuthill, Michael Emerson, and Douglas G Altman. 2010. “Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research.” PLoS Biol. 8 (6): e1000412.
Kilkenny, Carol, Nick Parsons, Ed Kadyszewski, Michael F W Festing, Innes C Cuthill, Derek Fry, Jane Hutton, and Douglas G Altman. 2009. “Survey of the Quality of Experimental Design, Statistical Analysis and Reporting of Research Using Animals.” PLoS One 4 (11): e7824.