I am worried that I have a pair of loaded dice: biased to roll one number more often than expected by chance alone.
My experiment
I roll my two dice; when the pre-named “biased” number shows on one die, I check whether both dice show that same number.
If both dice show the same number more often than chance alone would suggest, I will conclude that my dice are loaded.
Let’s set a P-value threshold for rejecting the null hypothesis
\(H_0\): both dice are fair and show that same number by chance alone.
Ooh! Risky!
Figure 1: How to bias dice by heating them in an oven at about 121°C for 10 min. Don’t use a microwave, and don’t blame me for the consequences if you get caught.
If one die shows the pre-named “bias” number, what is the probability that both dice are fair and showed the same number, by chance alone?
A definition
The probability of an event occurring is: the proportion of all equally likely outcomes that are that event.
Tossing a coin
Outcomes: heads or tails (two outcomes, assuming a fair coin and toss)
Probability of showing heads: \(\frac{1}{2} = 0.5\) as it is one of two outcomes
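The dice question above can be checked by simulation. A minimal sketch in R (the choice of 6 as the pre-named number is arbitrary):

```r
# Simulate the dice experiment: pre-name a "bias" number, roll two fair
# dice many times, and count how often both dice show that number.
set.seed(42)                    # for reproducibility
n_rolls <- 100000
biased_number <- 6              # the pre-named number (arbitrary choice)
die1 <- sample(1:6, n_rolls, replace = TRUE)
die2 <- sample(1:6, n_rolls, replace = TRUE)

# P(both fair dice show the pre-named number) = 1/36 ~ 0.028
mean(die1 == biased_number & die2 == biased_number)

# P(second die matches, given one die shows the pre-named number) = 1/6
mean(die2[die1 == biased_number] == biased_number)
```

The simulated proportions converge on the exact values 1/36 and 1/6 as the number of rolls grows.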
Caution
Verbal and written experiment descriptions can influence or disguise expected effect sizes, statistical analysis and outcomes
Experimenter understanding can influence their use of language
Talk to a statistician (or other colleague), and share your EDA diagram/session with them
Figure 2: NC3Rs EDA forces clarification of concepts and is a focus for discussion.
One control group (ctrl) and two treatment groups (trt1, trt2)
t-tests assume that datasets are Normally distributed 1
The only input the test gets:
group | mean | sd |
---|---|---|
ctrl | 5.032 | 0.5830914 |
trt1 | 4.661 | 0.7936757 |
trt2 | 5.526 | 0.4425733 |
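The group means and standard deviations above match R’s built-in PlantGrowth dataset; assuming that is the source, the table can be reproduced with:

```r
# Group means and standard deviations: the only input a t-test sees.
data(PlantGrowth)  # built-in R dataset; values match the table above
aggregate(weight ~ group, data = PlantGrowth,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```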
Tip
I’d recommend R for reproducible analyses. 1
comparison | estimate | conf.low | conf.high | p.value |
---|---|---|---|---|
ctrl.vs.trt1 | 0.371 | -0.2875162 | 1.0295162 | 0.2503825 |
ctrl.vs.trt2 | -0.494 | -0.9828721 | -0.0051279 | 0.0478993 |
trt1.vs.trt2 | -0.865 | -1.4809144 | -0.2490856 | 0.0092984 |
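A sketch of the three pairwise tests, assuming the PlantGrowth data (R’s `t.test` defaults to Welch’s test, which reproduces the p-values in the table):

```r
# Three pairwise t-tests on three groups (Welch's test, R's default).
data(PlantGrowth)
weights <- split(PlantGrowth$weight, PlantGrowth$group)
t.test(weights$ctrl, weights$trt1)$p.value  # ~0.250
t.test(weights$ctrl, weights$trt2)$p.value  # ~0.048
t.test(weights$trt1, weights$trt2)$p.value  # ~0.009
```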
These p.values are not correct
For three groups, there are three pairwise comparisons.
But t-tests calculate probability for a single pairwise comparison!
Multiple t-tests on your data increase Type I error rate (at P<0.05)
One solution: multiple test correction
Bonferroni, Benjamini-Hochberg, etc.
We adjust our threshold for significance.
Which comparisons are significant at \(P=0.05\), when Bonferroni corrected for three comparisons? (i.e. at a per-comparison threshold of \(P = 0.05/3 \approx 0.0167\))
comparison | estimate | conf.low | conf.high | p.value |
---|---|---|---|---|
ctrl.vs.trt1 | 0.371 | -0.2875162 | 1.0295162 | 0.2503825 |
ctrl.vs.trt2 | -0.494 | -0.9828721 | -0.0051279 | 0.0478993 |
trt1.vs.trt2 | -0.865 | -1.4809144 | -0.2490856 | 0.0092984 |
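Equivalently, rather than lowering the threshold, `p.adjust` scales the p-values up so they can still be compared against 0.05 (a sketch using the values from the table):

```r
# Bonferroni correction: either divide the threshold by the number of
# comparisons (0.05 / 3 ~ 0.0167), or equivalently multiply each
# p-value by the number of comparisons and keep the 0.05 threshold.
p_values <- c(ctrl.vs.trt1 = 0.2503825,
              ctrl.vs.trt2 = 0.0478993,
              trt1.vs.trt2 = 0.0092984)
p.adjust(p_values, method = "bonferroni")
# Only trt1.vs.trt2 stays below 0.05 after correction.
```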
Especially if you have data in three or more groups, use ANOVA
In R: no more difficult than applying a t-test
estimate | conf.low | conf.high | p.value |
---|---|---|---|
-0.865 | -1.48091 | -0.24909 | 0.0093 |
Characteristic | p-value |
---|---|
group | 0.008 |
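The ANOVA table above can be reproduced in R, a sketch assuming the PlantGrowth data restricted to the two treatment groups compared in the t-test:

```r
# ANOVA on two groups (trt1 vs trt2) via aov(); assumes the data are
# R's built-in PlantGrowth dataset, as the values above suggest.
data(PlantGrowth)
two_groups <- droplevels(subset(PlantGrowth, group %in% c("trt1", "trt2")))
fit <- aov(weight ~ group, data = two_groups)
summary(fit)  # group p-value ~0.008, as in the table above
```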
Student’s t-test assumes equal variances
estimate | conf.low | conf.high | p.value |
---|---|---|---|
-0.865 | -1.469 | -0.261 | 0.008 |
ANOVA on two groups is a pairwise Student’s t-test
Characteristic | p-value |
---|---|
group | 0.008 |
estimate | conf.low | conf.high | p.value |
---|---|---|---|
-0.865 | -1.48091 | -0.24909 | 0.0093 |
ANOVA with unequal variance is a pairwise Welch’s t-test
p.value | method |
---|---|
0.0093 | One-way analysis of means (not assuming equal variances) |
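The output above comes from R’s `oneway.test`, which drops the equal-variance assumption; on two groups it reproduces Welch’s t-test p-value (a sketch, again assuming the PlantGrowth treatment groups):

```r
# Welch-style ANOVA (not assuming equal variances) on two groups;
# equivalent to Welch's t-test for a single pair.
data(PlantGrowth)
two_groups <- droplevels(subset(PlantGrowth, group %in% c("trt1", "trt2")))
oneway.test(weight ~ group, data = two_groups)  # p ~0.0093
```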
All pairwise comparisons with ANOVA
Use Tukey’s HSD (Honest Significant Difference)
group1 | group2 | estimate | conf.low | conf.high | p.adj |
---|---|---|---|---|---|
ctrl | trt1 | -0.371 | -1.062 | 0.320 | 0.391 |
ctrl | trt2 | 0.494 | -0.197 | 1.185 | 0.198 |
trt1 | trt2 | 0.865 | 0.174 | 1.556 | 0.012 |
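Tukey’s HSD is a one-liner on a fitted ANOVA model (a sketch assuming the full three-group PlantGrowth data):

```r
# Tukey's HSD: all pairwise comparisons with family-wise adjusted
# p-values, computed from the fitted ANOVA model.
data(PlantGrowth)
fit <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(fit)  # trt2-trt1 adjusted p ~0.012, as in the table above
```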
This is important when using both sexes
But also if there are other batch effects to account for
Figure 3: MRC require that both sexes are used in experiments, unless there is strong justification not to.
Let’s look at penguins!
species | sex | body_mass_g |
---|---|---|
Adelie | male | 3750 |
Adelie | female | 3800 |
Adelie | female | 3250 |
Adelie | female | 3450 |
Adelie | male | 3650 |
Adelie | female | 3625 |
Characteristic | p-value |
---|---|
species | <0.001 |
group1 | group2 | estimate | conf.low | conf.high | p.adj |
---|---|---|---|---|---|
Adelie | Chinstrap | 26.924 | -132.353 | 186.201 | 0.916 |
Adelie | Gentoo | 1386.273 | 1252.290 | 1520.255 | 0.000 |
Chinstrap | Gentoo | 1359.349 | 1194.430 | 1524.267 | 0.000 |
Characteristic | p-value |
---|---|
species | <0.001 |
sex | <0.001 |
group1 | group2 | estimate | conf.low | conf.high | p.adj |
---|---|---|---|---|---|
Adelie | Chinstrap | 26.924 | -82.515 | 136.363 | 0.831 |
Adelie | Gentoo | 1386.273 | 1294.213 | 1478.332 | 0.000 |
Chinstrap | Gentoo | 1359.349 | 1246.033 | 1472.664 | 0.000 |
female | male | 667.458 | 599.193 | 735.722 | 0.000 |
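A sketch of the two-way ANOVA, assuming these tables come from the `penguins` data in the CRAN palmerpenguins package:

```r
# Two-way ANOVA: body mass explained by species and sex.
# Assumes the CRAN palmerpenguins package; rows with missing values
# are dropped automatically by aov().
library(palmerpenguins)
fit <- aov(body_mass_g ~ species + sex, data = penguins)
summary(fit)  # both terms significant, as in the table above
```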
Figure 6: Two-way ANOVA lets us see interactions between categories
Characteristic | p-value |
---|---|
species | <0.001 |
sex | <0.001 |
species * sex | <0.001 |
There are significant effects due to species and sex
And also an interaction between species and sex
(i.e. the influence of sex varies from species to species)
Can use R’s regression tools to extract more information
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
species | |||
Adelie | — | — | |
Chinstrap | 158 | 32, 285 | 0.014 |
Gentoo | 1,311 | 1,204, 1,418 | <0.001 |
sex | |||
female | — | — | |
male | 675 | 574, 775 | <0.001 |
species * sex | |||
Chinstrap * male | -263 | -442, -84 | 0.004 |
Gentoo * male | 130 | -20, 281 | 0.089 |
1 CI = Confidence Interval
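The regression table above corresponds to a linear model with an interaction term (a sketch, again assuming the palmerpenguins data):

```r
# Interaction model: does the effect of sex on body mass differ by
# species? Assumes the CRAN palmerpenguins package.
library(palmerpenguins)
fit <- lm(body_mass_g ~ species * sex, data = penguins)
summary(fit)  # coefficients correspond to the Beta column above
```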
Using interactions and regression can be more informative
Figure 7: EDA may well recommend ANOVA
But NC3Rs EDA power calculations only cover pairwise t-tests!
But other tools are available
R
G*Power
SPSS
Stata
G*Power supports ANOVA power calculation
Figure: G*Power on macOS
Statistical Power
Type II Error, \(\beta\): the probability of a false negative (missing a true positive result)
Power, \(1 - \beta\): the probability that you won’t miss a true positive result (assuming that there is one)
Statistical Threshold
Type I Error, \(\alpha\): the probability of a false positive (calling a positive result when the true result is negative)
This is the P-value threshold you set for your hypothesis tests
For the 3Rs we want to minimise the number of individuals used
You need to know 1
Effect size definition
Effect size is an interpretable number that quantifies the difference between the data and a hypothesis.
Multiple measures for this
G*Power uses Cohen’s \(f\) for ANOVA
Our experiment
A priori power calculation (ANOVA): main effects and interactions
Test family: F tests
Statistical test: ANOVA, Fixed effects, special, main effects and interactions
Type of power analysis: A priori (compute required sample size, given \(\alpha\), power, and effect size)
Input parameters:
- Effect size \(f\): 0.4
- \(\alpha\) err prob: 0.05
- Power (\(1 - \beta\) err prob): 0.8
- Numerator df: 1
- Number of groups: 4
Effect size
A value of Cohen’s \(f\) that represents the effect size we want to be able to detect: the contribution of the effect to the overall variation in the dataset
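G*Power is not the only option: the CRAN `pwr` package (an assumption here, not mentioned in the source) does the same kind of a priori calculation in R, taking Cohen’s \(f\) directly. Note this is a one-way ANOVA calculation, so its degrees of freedom (and hence required sample size) differ from G*Power’s main-effects-and-interactions test:

```r
# A priori power calculation for a one-way, four-group ANOVA,
# using the CRAN `pwr` package (an alternative to G*Power).
library(pwr)
pwr.anova.test(k = 4,             # number of groups
               f = 0.4,           # effect size, Cohen's f
               sig.level = 0.05,  # alpha, Type I error rate
               power = 0.8)       # desired power, 1 - beta
```

The result reports `n`, the required sample size per group.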
G*Power output
Important
- Noncentrality parameter \(\lambda\): 8.32
- Critical F: 4.043
- Denominator df: 48
- Total sample size: 52
- Actual power: 0.807
Caution
Check the G*Power output carefully
G*Power will let you plot how sample size trends with desired power
Figure 11: G*Power sample size vs power plot
Figure 12: G*Power sample size vs power plot
Use NC3Rs EDA to formalise your design
Use ANOVA (where appropriate)
If using ANOVA, G*Power can calculate required samples for desired power
AWERB 3Rs day 2023