University of Strathclyde
2025-11-24
We should always minimise suffering
This may mean not performing an experiment at all. Not all new knowledge or understanding is worth causing suffering to obtain it.
Where there is sufficient justification to perform an experiment, we are ethically obliged to minimise the amount of distress or suffering that is caused, by designing the experiment to achieve this.
Why we need statistics
It may be easy to tell whether an animal is well-treated, or whether an experiment is necessary.
But what is an acceptable (i.e. the least possible) amount of suffering necessary to obtain an informative result?
Quiz question
Suppose you are running a necessary and useful experiment with animal subjects, where the use of animals is morally justified. You are comparing a treatment group to a control group. Which of the following choices will cause the least amount of suffering?
The appropriate number of subjects
The appropriate number of animal subjects to use in an experiment is always the smallest number that - given reasonable assumptions - will satisfactorily give the correct result to the desired level of certainty.
By convention, the usual level of certainty for a hypothesis test is: “we have an 80% chance of getting the correct true/false answer for the hypothesis being tested”
Experimental design and statistics are intertwined
Once a research hypothesis has been devised:
Design your experiment for…
“For scientific, ethical and economic reasons, experiments involving animals should be appropriately designed, correctly analysed and transparently reported. This increases the scientific validity of the results, and maximises the knowledge gained from each experiment. A minimum amount of relevant information must be included in scientific publications to ensure that the methods and results of a study can be reviewed, analysed and repeated. Omitting essential information can raise scientific and ethical concerns.” (Kilkenny et al. (2009))
We rely on the reporting of the experiment to know if it was appropriate
“Detailed information was collected from 271 publications, about the objective or hypothesis of the study, the number, sex, age and/or weight of animals used, and experimental and statistical methods. Only 59% of the studies stated the hypothesis or objective of the study and the number and characteristics of the animals used. […] Most of the papers surveyed did not use randomisation (87%) or blinding (86%), to reduce bias in animal selection and outcome assessment. Only 70% of the publications that used statistical methods described their methods and presented the results with a measure of error or variability.” (Kilkenny et al. (2009))
We cannot rely on the literature for good examples of experimental design
No publication explained their choice for the number of animals used
We cannot rely on the verbal authority of ‘published scientists’ or ‘experienced scientists’ for good experimental design
“Power analysis or other very simple calculations, which are widely used in human clinical trials and are often expected by regulatory authorities in some animal studies, can help to determine an appropriate number of animals to use in an experiment in order to detect a biologically important effect if there is one. This is a scientifically robust and efficient way of determining animal numbers and may ultimately help to prevent animals being used unnecessarily. Many of the studies that did report the number of animals used reported the numbers inconsistently between the methods and results sections. The reason for this is unclear, but this does pose a significant problem when analysing, interpreting and repeating the results.” (Kilkenny et al. (2009))
Important
As scientists, you - yourselves - need to understand the principles behind the statistical tests you use, in order to choose appropriate tests and methods, and to use appropriate measures to minimise animal suffering and obtain meaningful results.
You cannot simply rely on the word of “experienced scientists” for this.
The following year Kilkenny et al. (2010) proposed the ARRIVE guidelines: a checklist to help researchers report their animal research transparently and reproducibly.
Many journals now routinely request information in the ARRIVE framework, often as electronic supplementary information. The framework covers 20 items including the following (Kilkenny et al. (2010)):
ARRIVE guidelines (highlights)
Warning
“A key step in tackling these issues is to ensure that the next generation of scientists are aware of what makes for good practice in experimental design and animal research, and that they are not led into poor or inappropriate practices by more senior scientists without a proper grasp of these issues.”
Recommended reading
Bate and Clark (2014)
Your experimental measurements are random variables
Important
This does not mean that your measurements are entirely random numbers
Caution
Random variables are quantities whose values are subject to some element of chance, e.g. variation between individuals
The probability distribution of a random variable \(z\) (e.g. what you measure in an experiment) takes on some range of values
The mean of the distribution of \(z\)
The variance of a distribution of \(z\)
A distribution where all values of \(z\) are the same
\[z = \mu_z \implies z - \mu_z = 0 \implies (z - \mu_z)^2 = 0\] \[\textrm{variance} = E((z - \mu_z)^2) = E(0^2) = 0\]
All other distributions
In every other distribution, some values of \(z\) differ from the mean \(\mu_z\), so for at least some values of \(z\):
\[z \neq \mu_z \implies z - \mu_z \neq 0 \implies (z - \mu_z)^2 \gt 0 \] \[\implies \textrm{variance} = E((z - \mu_z)^2) \gt 0 \]
Standard deviation is the square root of the variance
\[\textrm{standard deviation} = \sigma_z = \sqrt{\textrm{variance}} = \sqrt{E((z - \mu_z)^2)} \]
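As a concrete illustration, here is a minimal Python sketch (assuming numpy is available; the simulated measurements are purely illustrative) estimating the mean, variance, and standard deviation from a sample:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
z = rng.normal(loc=10.0, scale=2.0, size=100_000)  # simulated measurements

mu_z = z.mean()                      # estimate of the mean, E(z)
variance = np.mean((z - mu_z) ** 2)  # estimate of E((z - mu_z)^2)
sd = np.sqrt(variance)               # standard deviation

print(mu_z, variance, sd)  # approximately 10, 4, and 2
```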
Advantages
Note
We can calculate mean, variance, and standard deviation for any probability distribution.
\[ z \sim \textrm{normal}(\mu_z, \sigma_z) \]
Note
We only need to know the mean and standard deviation to define a unique normal distribution
Tip
Measurements of variables whose value is the sum of many small, independent, additive factors may follow a normal distribution
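A quick simulation sketch of this idea (the 50 uniform “factors” are an illustrative assumption, not a physical model):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
# Each "measurement" is the sum of 50 small, independent, additive factors,
# each drawn uniformly from [-0.5, 0.5].
factors = rng.uniform(-0.5, 0.5, size=(100_000, 50))
z = factors.sum(axis=1)

# The distribution of z is approximately normal(0, sqrt(50/12) ~ 2.04),
# even though no individual factor is normally distributed.
print(z.mean(), z.std())
```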
Important
There is no reason to expect that a random variable representing direct measurements in the world will be normally distributed!
Tip
Suppose you’re taking shots in basketball
Tip
This kind of process generates a random variable approximating a probability distribution called a binomial distribution.
It is different from a normal distribution.
\[ z \sim \textrm{binomial}(n, p) \]
Tip
\[z \sim \textrm{binomial}(20, 0.3) \]
mean and sd
\[ \textrm{mean} = n \times p \] \[ \textrm{sd} = \sqrt{n \times p \times (1-p)}\]
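For example, a short Python sketch (illustrative, assuming numpy) confirming these formulas for \(z \sim \textrm{binomial}(20, 0.3)\):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n, p = 20, 0.3
z = rng.binomial(n, p, size=100_000)  # e.g. shots made out of 20 attempts

print(z.mean(), n * p)                    # both ~ 6.0
print(z.std(), np.sqrt(n * p * (1 - p)))  # both ~ 2.05
```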
Design note
You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data. E.g., does \(p\) differ between two conditions?
In prior experiments the frequency of calcium events in WKY was 3.8 \(\pm\) 1.1 events/field/min compared to 18.9 \(\pm\) 7.1 in SHR
This is not normal (or binomial)
Something that happens a certain number of times in a fixed interval generates a Poisson distribution.
This is different from a normal or binomial distribution.
\[z \sim \textrm{poisson}(\lambda)\]
Poisson distribution
\[ \textrm{mean} = \lambda \] \[ \textrm{sd} = \sqrt{\lambda} \]
Expectation (\(\lambda\))
Only one parameter is needed, \(\lambda\): the rate at which the measured event happens
Suppose a county has a population of 100,000, and the average rate of a cancer is 45.2 cases per million people each year
\[z \sim \textrm{poisson}(100{,}000 \times 45.2 / 1{,}000{,}000) = \textrm{poisson}(4.52) \]
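A short simulation sketch of this example (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(seed=4)
lam = 100_000 * 45.2 / 1_000_000    # expected cases per year = 4.52
z = rng.poisson(lam, size=100_000)  # simulated yearly case counts

print(z.mean(), lam)          # both ~ 4.52
print(z.std(), np.sqrt(lam))  # both ~ 2.13
```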
Design note
You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data
Some important features
Distributions are starting points
Warning
Probability mass
Parameters are unknown numbers that determine a statistical model
A linear regression
\[ y_i = a + b x_i \]
A normal distribution representing your data
\[ z \sim \textrm{normal}(\mu_z, \sigma) \]
An estimand (or quantity of interest) is a value that we are interested in estimating
A linear regression
\[ y_i = a + b x_i\]
These are all estimands, and estimates are represented using the “hat” symbol: \(\hat{a}\), \(\hat{b}\), etc.
A normal distribution representing your data
\[ z \sim \textrm{normal}(\mu_z, \sigma) \]
Warning
I, and many other statisticians, do not recommend this approach (though we will not cover alternatives today)
However, the concept is widespread and we need to discuss it
A common definition
Statistical significance is conventionally defined by a threshold (commonly, a \(p\)-value less than 0.05) computed relative to some null hypothesis or prespecified value that indicates no effect is present.
E.g., an estimate may be considered “statistically significant at \(P < 0.05\)” if it lies more than about two standard errors away from the null value.
More generally, an estimate is “not statistically significant” if, e.g., it lies within about two standard errors of the null value, so the observed difference could plausibly have arisen by chance.
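As a rough sketch of this rule in Python (the estimate and standard error values here are illustrative):

```python
from scipy import stats

# Illustrative numbers: an estimate, the null value, and a standard error.
estimate, null_value, se = -1.4, 0.0, 0.6

z = (estimate - null_value) / se  # roughly -2.33 standard errors
p = 2 * stats.norm.sf(abs(z))     # two-sided p-value, ~0.02

print(p, p < 0.05)  # "statistically significant at P < 0.05"
```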
Most tests rely on probability distributions
The experiment
The hypotheses
The distribution
The null hypothesis
Observed difference between post-treatment group means: \(\bar{y}_T - \bar{y}_C = -1.4\)
We choose a significance threshold in advance
Compare the estimate to the threshold
What did not change
What changed
Significance threshold choice
Use two tails if direction of change doesn’t matter
Use one-tailed tests when direction matters
Warning
It is a common error to summarise comparisons by statistical significance into “significant” and “non-significant” results
Statistical significance is not the same as practical importance
Non-significance is not the same as zero
The difference between ‘significant’ and ‘not significant’ is not statistically significant
Important
We cannot make an infinite number of measurements of \(z\). We can only take a sample.
The mean and standard deviation we estimate in an experiment will not match those of the infinitely large population.
Standard Error (of the Mean)
The standard error of the mean reflects the uncertainty in our estimate of the mean.
When estimating the mean of an infinite population, given a simple random sample of size \(n\), the standard error is:
\[ \textrm{standard error} = \sqrt{\frac{\textrm{Variance}}{n}} = \frac{\textrm{standard deviation}}{\sqrt{n}} = \frac{\sigma}{\sqrt{n}} \]
Tip
Uncertainty in the estimate of the mean \(\mu_z\) decreases in proportion to the inverse square root of the number of samples, \(1/\sqrt{n}\): to halve the standard error you must quadruple \(n\)
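A simulation sketch of this relationship (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
sigma = 2.0  # population standard deviation

for n in (10, 40, 160):
    # Repeat the "experiment" 10,000 times and measure the spread of the
    # sample means; it should match sigma / sqrt(n).
    means = rng.normal(0.0, sigma, size=(10_000, n)).mean(axis=1)
    print(n, means.std().round(3), (sigma / np.sqrt(n)).round(3))
```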
Hypothesis test statistics
\[ t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}} \]
This is true for many hypothesis test methods
One-sample \(t\)-test
\[ t = \frac{Z}{s} = \frac{\bar{X} - \mu}{\hat{\sigma}/{\sqrt{n}}} = \frac{\bar{X} - \mu}{s(\bar{X})} \]
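For example, a minimal Python sketch (the weights and reference value are made up for illustration) computing the one-sample \(t\) statistic by hand and via scipy:

```python
import numpy as np
from scipy import stats

# Illustrative data: eight mouse weights (g) and a reference mean of 25 g.
weights = np.array([24.1, 26.3, 23.8, 27.0, 25.9, 24.4, 26.8, 23.5])
mu = 25.0

se = weights.std(ddof=1) / np.sqrt(len(weights))  # standard error of the mean
t_by_hand = (weights.mean() - mu) / se
t_scipy, p_value = stats.ttest_1samp(weights, popmean=mu)

print(t_by_hand, t_scipy, p_value)  # the two t values agree
```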
Wald test
\[ \sqrt{W} = \frac{Z}{s} = \frac{\hat{\theta} - \theta_0}{s(\hat{\theta})} \]
Hypothesis test statistics
\[ t = \frac{Z}{s} = \frac{Z}{\sigma/\sqrt{n}} \]
What happens if we hold \(Z\) and \(\sigma\) constant and vary sample size?
We reject the null hypothesis when \(|t| > t_\textrm{crit}\)
The difference (\(Z\)) we need to see to reject the null varies with sample size
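A sketch of this effect (values are illustrative): holding \(\sigma\) constant, the smallest difference that reaches the critical value shrinks as \(n\) grows.

```python
import numpy as np
from scipy import stats

alpha, sigma = 0.05, 2.0  # illustrative values

for n in (5, 10, 20, 40, 80):
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-tailed critical value
    min_detectable = t_crit * sigma / np.sqrt(n)   # smallest Z that rejects
    print(n, round(t_crit, 2), round(min_detectable, 2))
```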
Statistical significance is not the point
Statistical significance is not the same as practical importance
What matters is effect size
In all our experiments we should be concerned not with “statistical significance,” but with this question: if there is a meaningful effect of the treatment, how likely is it that our experiment will detect it?
Important
Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve “statistical significance” at some predetermined level (e.g. \(P < 0.05\)) given an assumed true effect size.
The process
Power analysis depends on an assumed effect size
How to choose effect sizes
How not to choose effect size
A low-powered pilot study
You get a statistically significant result!
The trap
Any apparent success of low-powered studies masks larger failure
When signal (effect size) is low and noise (standard error) is high, “statistically significant” results are likely to be wrong.
Low-power studies tend not to replicate well
Warning
Low-power studies have essentially no chance of providing useful information
We can say this even before data are collected
Published results tend to be overestimates
It is unethical to under-power animal studies
Under-powered in vivo experiments waste time and resources, lead to unnecessary animal suffering and result in erroneous biological conclusions (NC3Rs Experimental Design Assistant guide)
It is unethical to over-power animal studies
Ethically, when working with animals we need to conduct a harm–benefit analysis to ensure the animal use is justified for the scientific gain. Experiments should be robust, not use more or fewer animals than necessary, and truly add to the knowledge base of science (Karp and Fry (2021))
So how should we appropriately power animal studies?
We often refer to two kinds of statistical error
Type I Error (\(\alpha\))
Type II Error (\(\beta\))
Statistical power is \(1 - \beta\)
Statistical power needs context: the expected error rates of the experiment at a given effect size, e.g.
The experiment has 80% power at \(\alpha = 0.05\) for an effect size of 2 mmol/L
How to read this
The experiment has 80% power at \(\alpha = 0.05\) for an effect size of 2 mmol/L
If the drug truly has no effect
If the drug truly has an effect
What we need, to calculate appropriate sample size
Important
We need this information to calculate an appropriate, ethical sample size
Typical funders’ requirements
False positive rate \(\alpha = 0.05\)
Power \(1 - \beta = 0.8\) (80% power)
These are only a starting point - other values may be more appropriate depending on circumstance
Under experimenter control
G*Power Walkthrough
G*Power is a powerful software tool that can compute several types of power calculations for a very wide range of statistical analyses (Faul et al. (2007))
Note
It is beyond the workshop scope to consider all the possible uses of G*Power.
We will only be introducing you to basic operation of the program.
Is the average weight of a group of mice statistically different from 25g?
Which statistical test?
Is the average weight of a group of mice statistically different from 25g?
What do we want to know from the power calculation?
We want to determine an appropriate sample size for our experiment, given an expected effect size, variance, and choices of power and statistical significance threshold
Is the average weight of a group of mice statistically different from 25g?
What kind of power calculation should we perform?
We want to calculate an appropriate sample size for our experiment before we perform it, so we require an a priori power calculation
Is the average weight of a group of mice statistically different from 25g?
What statistical power should we use?
We will use standard funders’ requirements of 80% power at \(P < 0.05\)
\[\alpha=0.05, \beta=0.2, 1-\beta=0.8 \]
Is the average weight of a group of mice statistically different from 25g?
What is the expected effect size and variation in the sample?
There are no absolute values that constitute a meaningful difference in weight under all circumstances, but we will make the following assumptions
Note
We now have enough information to calculate an appropriate sample size
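Before opening G*Power, here is a sketch of the same a priori calculation in Python (assuming statsmodels is installed) as a cross-check; the assumed mean difference (2 g from the 25 g reference) and standard deviation (2 g), giving a standardised effect size \(d = 1\), are illustrative stand-ins for the assumptions above:

```python
from statsmodels.stats.power import TTestPower

# Assumed (illustrative) values: a meaningful difference of 2 g and a
# standard deviation of 2 g, i.e. Cohen's d = 1.0.
d = 2.0 / 2.0

n = TTestPower().solve_power(effect_size=d, alpha=0.05, power=0.8,
                             alternative='two-sided')
print(n)  # ~10 mice, rounding up
```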
parameters
G*Power start screen
We will use NONcNZO10/LtJ mice (JAX) to represent a polygenic form of diabetes. Mice in the treatment group will receive a single subcutaneous injection of drug A (up to 30mg/kg). Mice in the control group will receive a single subcutaneous injection of vehicle. Mice will be randomly assigned to receive either drug A or vehicle only, without regard to the sex of the animal. 48 hours after administration of drug or vehicle, the blood glucose level will be measured.
Our beliefs about the experiment
Causal relationships in the experiment
Plasma glucose level is potentially dependent on three causal relationships
Non-causal effects
Plasma glucose level is potentially also dependent on other factors including
Some of these may be systematic and important (nuisance variables), and others random and not so important (“noise”)
Treatment vs no treatment
The causal influence of administering drug A (or not) on plasma glucose levels of the experimental subjects is the central reason for performing the experiment
Independent variables
The presence/absence of drug A is an independent variable as whether it is administered or not is entirely under our control, as experimenters.
Control vs treatment groups
We cannot administer the drug in isolation!
Like most drugs, drug A is carried in a vehicle - a substance expected to be inert in the context of the treatment.
Controlling for variables
It would be unwise to assume we could administer drug A alone to one group of mice and apply no intervention in the other group.
Physiological responses may be affected by:
- the vehicle
- the procedure (e.g. injection)
So we must follow the same protocol in the control and treatment groups, with the only variable being presence/absence of drug A.
Random assignment to groups
Individual mice may respond differently to drug A, the vehicle, and the procedure, due to underlying differences between them.
We therefore need to randomise subjects into experimental groups, assuming that all individuals are drawn from the same population (a single pool whose response varies according to some common distribution).
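For example, a minimal randomisation sketch in Python (the mouse IDs and group sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
mice = [f"mouse_{i:02d}" for i in range(1, 21)]  # 20 hypothetical subjects

# Shuffle the pool, then split it into two equal groups.
shuffled = rng.permutation(mice)
control, treatment = shuffled[:10], shuffled[10:]

print("control:  ", list(control))
print("treatment:", list(treatment))
```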
Figure 3: EDA diagrams are composed of nodes representing aspects of an experiment, and links representing relationships between the nodes
Figure 4: The final EDA diagram describing our experiment
Power calculations describe what can be learned from statistical analysis of an experiment
G*Power
We worked through a practical example of how to estimate sample size for a desired statistical power.
The NC3Rs EDA can help us design better experiments
We worked through the complete design, critique, and analysis of an experiment using the NC3Rs EDA tool, including how to share the resulting design with others.
What next?
Experimental design is a large and diverse field; becoming competent in it takes time and requires practice (which, at times, also involves making mistakes).
What no-one tells you
Even experienced scientists with many publications and long track records of research funding may never have received any formal training in experimental design. This can lead to entrenched and outdated ideas being propagated.
It is always advisable to consult with funding bodies and specialists in statistics, experimental design, and research ethics.
MP968 Experimental Design Workshop