I am worried that I have a pair of loaded dice: biased to roll one number more often than expected by chance alone.
My experiment
I roll my two dice; when the pre-named “biased” number shows on one die, I check whether both dice show the same number.
If both dice show the same number more often than chance alone would suggest, I will conclude that my dice are loaded.
Let’s set a P-value threshold for rejecting the null hypothesis
\(H_0\): both dice are fair and show that same number by chance alone.
Ooh! Risky!
Figure 1: How to bias dice by heating them in an oven at about 121 °C for 10 min. Don’t use a microwave, and don’t blame me for the consequences (or if you get caught).
If one die shows the pre-named “bias” number, what is the probability that both dice are fair and showed the same number, by chance alone?
A definition
The probability of an event occurring is: the proportion of all possible, equally likely outcomes that correspond to that event.
Tossing a coin
Outcomes: heads or tails (two outcomes, assuming a fair coin and toss)
Probability of showing heads: \(\frac{1}{2} = 0.5\), as it is one of two equally likely outcomes
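The same counting argument answers the dice question above: if one fair die already shows the pre-named number, the other fair die matches it in one of six equally likely outcomes. A quick sketch in R (assuming both dice are fair):

```r
# Exact probability: a second fair die matches a given face
p_match <- 1 / 6

# Check by simulation: roll two fair dice many times
set.seed(42)
n <- 100000
die1 <- sample(1:6, n, replace = TRUE)
die2 <- sample(1:6, n, replace = TRUE)
sim <- mean(die1 == die2)  # proportion of rolls where the dice match

round(p_match, 4)  # 0.1667
round(sim, 4)      # close to 0.1667
```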
Caution
Verbal and written experiment descriptions can influence or disguise expected effect sizes, statistical analysis and outcomes
Talk to a statistician (or other colleague)
Caution
Verbal and written experiment descriptions influence statistical analysis and outcomes
Experimenter understanding can influence use of language
Please share the EDA (Experimental Design Assistant) diagram/session with your statistician.
Figure 2: NC3Rs EDA forces clarification of concepts and is a focus for discussion.
Example data: a control group (ctrl) and two treatment groups (trt1, trt2)
t-tests assume that the data are Normally distributed
The only input the test gets:
| group | mean | sd |
|---|---|---|
| ctrl | 5.032 | 0.5830914 |
| trt1 | 4.661 | 0.7936757 |
| trt2 | 5.526 | 0.4425733 |
Tip
I’d recommend R for reproducible analyses.
| comparison | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|
| ctrl.vs.trt1 | 0.371 | -0.2875162 | 1.0295162 | 0.2503825 |
| ctrl.vs.trt2 | -0.494 | -0.9828721 | -0.0051279 | 0.0478993 |
| trt1.vs.trt2 | -0.865 | -1.4809144 | -0.2490856 | 0.0092984 |
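Each row of this table is a single (unadjusted) Welch t-test. The group summaries above match R’s built-in PlantGrowth dataset, which I assume here:

```r
# Pairwise Welch t-test between two groups of the (assumed) PlantGrowth data
welch_pair <- function(g1, g2) {
  d <- droplevels(subset(PlantGrowth, group %in% c(g1, g2)))
  t.test(weight ~ group, data = d)  # Welch's t-test is R's default
}

res <- welch_pair("ctrl", "trt2")
round(res$p.value, 4)   # ~0.0479, as in the table
round(res$conf.int, 4)  # ~(-0.9829, -0.0051)
```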
These p.values are not correct
For three groups, there are three pairwise comparisons.
But t-tests calculate probability for a single pairwise comparison!
Running multiple t-tests on the same data increases the Type I error rate (at P<0.05)
These p.values are not correct
For three groups, there are three pairwise comparisons.
But t-tests calculate probability for a single pairwise comparison!
One solution: multiple test correction
Bonferroni, Benjamini-Hochberg, etc.
We adjust our threshold for significance.
Which comparisons are significant at \(P=0.05\) for a single comparison, when Bonferroni corrected for three comparisons? (i.e. \(P = 0.05/3 \approx 0.0167\))
| comparison | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|
| ctrl.vs.trt1 | 0.371 | -0.2875162 | 1.0295162 | 0.2503825 |
| ctrl.vs.trt2 | -0.494 | -0.9828721 | -0.0051279 | 0.0478993 |
| trt1.vs.trt2 | -0.865 | -1.4809144 | -0.2490856 | 0.0092984 |
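Equivalently, instead of shrinking the threshold you can inflate the p-values with R’s `p.adjust` (inputs below are the table’s p-values, rounded):

```r
# Bonferroni adjustment multiplies each p-value by the number of tests
# (capped at 1), so the adjusted values are compared against 0.05 as usual
p <- c(ctrl.vs.trt1 = 0.2504, ctrl.vs.trt2 = 0.0479, trt1.vs.trt2 = 0.0093)
p_adj <- p.adjust(p, method = "bonferroni")

round(p_adj, 4)  # 0.7512 0.1437 0.0279
p_adj < 0.05     # only trt1.vs.trt2 survives correction
```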
Especially if you have data in three or more groups, use ANOVA
In R, this is no more difficult than applying a t-test
| estimate | conf.low | conf.high | p.value |
|---|---|---|---|
| -0.865 | -1.48091 | -0.24909 | 0.0093 |
| Characteristic | p-value |
|---|---|
| group | 0.008 |
Student’s t-test assumes equal variances
| estimate | conf.low | conf.high | p.value |
|---|---|---|---|
| -0.865 | -1.469 | -0.261 | 0.008 |
ANOVA on two groups is a pairwise Student’s t-test
| Characteristic | p-value |
|---|---|
| group | 0.008 |
| estimate | conf.low | conf.high | p.value |
|---|---|---|---|
| -0.865 | -1.48091 | -0.24909 | 0.0093 |
ANOVA with unequal variance is a pairwise Welch’s t-test
| p.value | method |
|---|---|
| 0.0093 | One-way analysis of means (not assuming equal variances) |
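Both equivalences above can be checked directly in R (again assuming the PlantGrowth data, restricted to trt1 and trt2):

```r
# Two groups only: drop the control
d <- droplevels(subset(PlantGrowth, group != "ctrl"))

# Student's t-test (equal variances) matches classic one-way ANOVA (~0.008)
p_student <- t.test(weight ~ group, data = d, var.equal = TRUE)$p.value
p_aov     <- summary(aov(weight ~ group, data = d))[[1]][["Pr(>F)"]][1]

# Welch's t-test matches Welch's ANOVA, oneway.test's default (~0.0093)
p_welch   <- t.test(weight ~ group, data = d)$p.value
p_oneway  <- oneway.test(weight ~ group, data = d)$p.value

round(c(p_student, p_aov, p_welch, p_oneway), 4)
```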
All pairwise comparisons with ANOVA
Use Tukey’s HSD (Honest Significant Difference)
| group1 | group2 | estimate | conf.low | conf.high | p.adj |
|---|---|---|---|---|---|
| ctrl | trt1 | -0.371 | -1.062 | 0.320 | 0.391 |
| ctrl | trt2 | 0.494 | -0.197 | 1.185 | 0.198 |
| trt1 | trt2 | 0.865 | 0.174 | 1.556 | 0.012 |
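In R, Tukey’s HSD is a single call on the fitted ANOVA (PlantGrowth assumed again). Note that it reports differences the other way round from the t-test tables above (e.g. trt1 minus ctrl), which is why the signs flip:

```r
# Fit the one-way ANOVA, then get all pairwise comparisons with
# family-wise-adjusted p-values
fit <- aov(weight ~ group, data = PlantGrowth)
tk  <- TukeyHSD(fit, conf.level = 0.95)

round(tk$group, 3)  # only trt2 - trt1 has p adj < 0.05
```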
This is important when using both sexes
But also if there are other batch effects to account for
Figure 3: MRC require that both sexes are used in experiments, unless there is strong justification not to.


Let’s look at penguins!
| species | sex | body_mass_g |
|---|---|---|
| Adelie | male | 3750 |
| Adelie | female | 3800 |
| Adelie | female | 3250 |
| Adelie | female | 3450 |
| Adelie | male | 3650 |
| Adelie | female | 3625 |
| Characteristic | p-value |
|---|---|
| species | <0.001 |
| group1 | group2 | estimate | conf.low | conf.high | p.adj |
|---|---|---|---|---|---|
| Adelie | Chinstrap | 26.924 | -132.353 | 186.201 | 0.916 |
| Adelie | Gentoo | 1386.273 | 1252.290 | 1520.255 | 0.000 |
| Chinstrap | Gentoo | 1359.349 | 1194.430 | 1524.267 | 0.000 |
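These tables look like output from the `palmerpenguins` data; assuming that package is installed, the one-way analysis is:

```r
# One-way ANOVA of body mass by species, then Tukey's HSD
# (assumes the palmerpenguins package is installed)
library(palmerpenguins)

fit <- aov(body_mass_g ~ species, data = penguins)  # NAs dropped automatically
summary(fit)   # species: p < 0.001
TukeyHSD(fit)  # Gentoo differs from the other two species
```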
| Characteristic | p-value |
|---|---|
| species | <0.001 |
| sex | <0.001 |
| group1 | group2 | estimate | conf.low | conf.high | p.adj |
|---|---|---|---|---|---|
| Adelie | Chinstrap | 26.924 | -82.515 | 136.363 | 0.831 |
| Adelie | Gentoo | 1386.273 | 1294.213 | 1478.332 | 0.000 |
| Chinstrap | Gentoo | 1359.349 | 1246.033 | 1472.664 | 0.000 |
| female | male | 667.458 | 599.193 | 735.722 | 0.000 |
Figure 6: Two-way ANOVA lets us see interactions between categories
| Characteristic | p-value |
|---|---|
| species | <0.001 |
| sex | <0.001 |
| species * sex | <0.001 |
There are significant effects due to species and sex
And also an interaction between species and sex
(i.e. the influence of sex varies from species to species)
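Fitting the two-way model with an interaction term is one formula change (again assuming the `palmerpenguins` package):

```r
# Two-way ANOVA: body mass explained by species, sex, and their interaction
# (species * sex expands to species + sex + species:sex)
library(palmerpenguins)

fit <- aov(body_mass_g ~ species * sex, data = penguins)
summary(fit)  # species, sex, and species:sex terms are all significant
```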
Can use R’s regression tools to extract more information
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| species | |||
| Adelie | — | — | |
| Chinstrap | 158 | 32, 285 | 0.014 |
| Gentoo | 1,311 | 1,204, 1,418 | <0.001 |
| sex | |||
| female | — | — | |
| male | 675 | 574, 775 | <0.001 |
| species * sex | |||
| Chinstrap * male | -263 | -442, -84 | 0.004 |
| Gentoo * male | 130 | -20, 281 | 0.089 |
| ¹ CI = Confidence Interval | | | |
Using interactions and regression can be more informative
Tukey’s HSD pairwise comparisons:
| group1 | group2 | estimate | conf.low | conf.high | p.adj |
|---|---|---|---|---|---|
| Adelie | Chinstrap | 26.924 | -82.515 | 136.363 | 0.831 |
| Adelie | Gentoo | 1386.273 | 1294.213 | 1478.332 | 0.000 |
| Chinstrap | Gentoo | 1359.349 | 1246.033 | 1472.664 | 0.000 |
| female | male | 667.458 | 599.193 | 735.722 | 0.000 |
Regression coefficients:
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| species | |||
| Adelie | — | — | |
| Chinstrap | 158 | 32, 285 | 0.014 |
| Gentoo | 1,311 | 1,204, 1,418 | <0.001 |
| sex | |||
| female | — | — | |
| male | 675 | 574, 775 | <0.001 |
| species * sex | |||
| Chinstrap * male | -263 | -442, -84 | 0.004 |
| Gentoo * male | 130 | -20, 281 | 0.089 |
| ¹ CI = Confidence Interval | | | |
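The regression table is the linear-model view of the same fit: each Beta is a difference from the reference levels (Adelie, female). In R, `lm` with default treatment contrasts reproduces it (`palmerpenguins` assumed; the formatted table itself looks like output from a helper package such as gtsummary):

```r
# Same model as a regression: coefficients are differences from the
# reference levels (Adelie, female)
library(palmerpenguins)

fit <- lm(body_mass_g ~ species * sex, data = penguins)
round(coef(fit))     # e.g. speciesGentoo ~1311, sexmale ~675
round(confint(fit))  # 95% confidence intervals, as in the table
```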
Figure 7: EDA may well recommend ANOVA
But NC3Rs EDA power calculations only cover pairwise t-tests!
NC3Rs EDA power calculations only cover pairwise t-tests
But other tools are available
R, G*Power, SPSS, Stata
G*Power supports ANOVA power calculation

Figure 8: G*Power on macOS
Statistical Power
Type II Error, \(\beta\): the probability of a false negative (missing a true positive result)
Power, \(1 - \beta\): the probability that you won’t miss a true positive result (assuming that there is one)
Statistical Threshold
Type I Error, \(\alpha\): the probability of a false positive (calling a positive result when the true result is negative)
This is the P-value threshold you set for your hypothesis tests
For the 3Rs we want to minimise the number of individuals used
You need to know:
Effect size definition
Effect size is an interpretable number that quantifies the difference between the data and a hypothesis.
Multiple measures for this
G*Power uses Cohen’s \(f\) for ANOVA
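For the simpler pairwise case, base R can do the a priori calculation itself. A sketch with `power.t.test`; the effect size here (a 1-SD difference, Cohen’s d = 1) is an illustrative assumption, not a value from the slides:

```r
# Sample size per group for a two-sample t-test:
# detect a 1-SD difference at alpha = 0.05 with 80% power
res <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)
ceiling(res$n)  # ~17 individuals per group
```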
Our experiment
A priori power calculation (ANOVA): main effects and interactions
G*Power settings:
- Test family: F tests
- Statistical test: ANOVA: Fixed effects, special, main effects and interactions
- Type of power analysis: A priori: Compute required sample size - given \(\alpha\), power, and effect size
- Inputs: effect size \(f\) = 0.4, \(\alpha\) = 0.05, power = 0.8, numerator df = 1, number of groups = 4
Effect size
A value of Cohen’s \(f\) that represents the effect size we want to be able to detect.
The contribution of the effect we want to detect to the overall variation in the dataset

Figure 9: G*Power input settings
Important
G*Power output: noncentrality parameter \(\lambda\) = 8.32, critical F = 4.0434852, actual power = 0.807
Caution
Figure 10: G*Power output
G*Power will let you plot how sample size trends with desired power
Figure 11: G*Power sample size vs power plot
Figure 12: G*Power sample size vs power plot
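G*Power’s result can be sanity-checked in base R from the noncentral F distribution. The inputs below (f = 0.4, α = 0.05, numerator df = 1, 4 groups) are my reading of the settings above, and the total sample size is derived from the quoted noncentrality parameter (λ = f²N):

```r
# Reproduce G*Power's "actual power" from first principles
f      <- 0.4           # Cohen's f (target effect size)
alpha  <- 0.05
df1    <- 1             # numerator df (assumed)
groups <- 4             # number of groups (assumed)
lambda <- 8.32          # noncentrality parameter quoted by G*Power
N      <- lambda / f^2  # implied total sample size: 52
df2    <- N - groups    # denominator df: 48

Fcrit <- qf(1 - alpha, df1, df2)                 # critical F, ~4.04
power <- 1 - pf(Fcrit, df1, df2, ncp = lambda)   # achieved power, ~0.81
round(c(N = N, Fcrit = Fcrit, power = power), 3)
```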
Use NC3Rs EDA to formalise your design
Use ANOVA (where appropriate)
If using ANOVA, G*Power can calculate required samples for desired power

AWERB 3Rs day 2023