I am worried that I have a pair of loaded dice: biased to roll one number more often than expected by chance alone.
My experiment
I roll my two dice; if one die shows the pre-named “biased” number, I check whether both dice show the same number.
If both dice show the same number more often than chance alone would suggest, I will conclude that my dice are loaded.
Let’s set a P-value threshold for rejecting the null hypothesis
\(H_0\): both dice are fair and show that same number by chance alone.
Ooh! Risky!
If one die shows the pre-named “bias” number, what is the probability that both dice are fair and showed the same number, by chance alone?
A definition
The probability of an event occurring is: the proportion of all equally likely outcomes that correspond to that event.
Tossing a coin
Outcomes: heads or tails (two outcomes, assuming a fair coin and toss)
Probability of showing heads: \(\frac{1}{2} = 0.5\) as it is one of two outcomes
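As a quick illustration of this definition (a hypothetical simulation, not part of the original slides): if one fair die already shows the pre-named number, the other fair die matches it in one of six equally likely outcomes, so the answer to the question above is \(\frac{1}{6} \approx 0.167\).

```r
# Simulation sketch: probability that a fair die shows a pre-named number.
# Exactly one of the six equally likely outcomes matches, so P = 1/6.
set.seed(42)                             # reproducible
rolls <- sample(1:6, size = 1e5, replace = TRUE)
mean(rolls == 3)                         # proportion matching a pre-named "3"
# close to 1/6 ~= 0.167
```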
Caution
Verbal and written experiment descriptions can influence or disguise expected effect sizes, statistical analysis and outcomes
Talk to a statistician (or other colleague)
Caution
Verbal and written experiment descriptions influence statistical analysis and outcomes
Experimenter understanding can influence use of language
Please share the EDA diagram/session with your statistician.
A control group (ctrl) and two treatments (trt1, trt2)
t-tests assume that datasets are Normal distributions
The only input the test gets:
group | mean | sd |
---|---|---|
ctrl | 5.032 | 0.5830914 |
trt1 | 4.661 | 0.7936757 |
trt2 | 5.526 | 0.4425733 |
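These summary statistics match the group means and standard deviations of R’s built-in PlantGrowth dataset (10 plants per group); a minimal sketch of how the table is produced:

```r
# Per-group mean and sd of plant weights (PlantGrowth ships with base R)
aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
aggregate(weight ~ group, data = PlantGrowth, FUN = sd)
```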
Tip
I’d recommend R for reproducible analyses.
comparison | estimate | conf.low | conf.high | p.value |
---|---|---|---|---|
ctrl.vs.trt1 | 0.371 | -0.2875162 | 1.0295162 | 0.2503825 |
ctrl.vs.trt2 | -0.494 | -0.9828721 | -0.0051279 | 0.0478993 |
trt1.vs.trt2 | -0.865 | -1.4809144 | -0.2490856 | 0.0092984 |
These p.values are not correct
For three groups, there are three pairwise comparisons.
But each t-test calculates the probability for a single pairwise comparison!
Running multiple t-tests on your data inflates the Type I error rate above the nominal P<0.05
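To see why: with three independent tests each run at \(\alpha = 0.05\), the chance of at least one false positive is \(1 - (1 - 0.05)^3 \approx 0.14\), nearly three times the nominal rate. A one-line check:

```r
# Family-wise error rate for m independent tests, each at alpha = 0.05
alpha <- 0.05
m     <- 3
1 - (1 - alpha)^m   # ~0.143
```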
One solution: multiple test correction
Bonferroni, Benjamini-Hochberg, etc.
We adjust our threshold for significance.
Which comparisons are significant at \(P=0.05\) for a single comparison, when Bonferroni corrected for three comparisons? (i.e. at a threshold of \(P = 0.05/3 \approx 0.0167\))
comparison | estimate | conf.low | conf.high | p.value |
---|---|---|---|---|
ctrl.vs.trt1 | 0.371 | -0.2875162 | 1.0295162 | 0.2503825 |
ctrl.vs.trt2 | -0.494 | -0.9828721 | -0.0051279 | 0.0478993 |
trt1.vs.trt2 | -0.865 | -1.4809144 | -0.2490856 | 0.0092984 |
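Equivalently, R can scale the p-values up instead of scaling the threshold down; the sketch below applies `p.adjust` to the three p-values from the table above.

```r
# Bonferroni adjustment of the three pairwise p-values shown above
p_raw <- c(ctrl.vs.trt1 = 0.2503825,
           ctrl.vs.trt2 = 0.0478993,
           trt1.vs.trt2 = 0.0092984)
p.adjust(p_raw, method = "bonferroni")
# ~0.751, ~0.144, ~0.028: only trt1.vs.trt2 stays below 0.05
```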
Especially if you have data in three or more groups, use ANOVA
In R, ANOVA is no more difficult to apply than a t-test
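A minimal sketch with the same PlantGrowth data:

```r
# One-way ANOVA: a single test across all three groups
model <- aov(weight ~ group, data = PlantGrowth)
summary(model)   # one p-value for the overall group effect
```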
estimate | conf.low | conf.high | p.value |
---|---|---|---|
-0.865 | -1.48091 | -0.24909 | 0.0093 |
Characteristic | p-value |
---|---|
group | 0.008 |
Student’s t-test assumes equal variances
estimate | conf.low | conf.high | p.value |
---|---|---|---|
-0.865 | -1.469 | -0.261 | 0.008 |
ANOVA on two groups is a pairwise Student’s t-test
Characteristic | p-value |
---|---|
group | 0.008 |
estimate | conf.low | conf.high | p.value |
---|---|---|---|
-0.865 | -1.48091 | -0.24909 | 0.0093 |
ANOVA with unequal variance is a pairwise Welch’s t-test
p.value | method |
---|---|
0.0093 | One-way analysis of means (not assuming equal variances) |
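These equivalences can be verified directly in R; a sketch on the trt1/trt2 subset of PlantGrowth:

```r
two <- droplevels(subset(PlantGrowth, group %in% c("trt1", "trt2")))

# Student's t-test and equal-variance one-way ANOVA give the same p-value
t.test(weight ~ group, data = two, var.equal = TRUE)
oneway.test(weight ~ group, data = two, var.equal = TRUE)

# Welch's t-test (R's default) matches the unequal-variance ANOVA
t.test(weight ~ group, data = two)
oneway.test(weight ~ group, data = two)   # "not assuming equal variances"
```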
All pairwise comparisons with ANOVA
Use Tukey’s HSD (Honest Significant Difference)
group1 | group2 | estimate | conf.low | conf.high | p.adj |
---|---|---|---|---|---|
ctrl | trt1 | -0.371 | -1.062 | 0.320 | 0.391 |
ctrl | trt2 | 0.494 | -0.197 | 1.185 | 0.198 |
trt1 | trt2 | 0.865 | 0.174 | 1.556 | 0.012 |
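A sketch reproducing the table above from the fitted ANOVA:

```r
# Tukey's HSD on the one-way ANOVA: all pairwise comparisons, with
# p-values already adjusted for the family of three tests
model <- aov(weight ~ group, data = PlantGrowth)
TukeyHSD(model)
```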
This is important when your experiment includes both sexes
But also if there are other batch effects to account for
Let’s look at penguins!
species | sex | body_mass_g |
---|---|---|
Adelie | male | 3750 |
Adelie | female | 3800 |
Adelie | female | 3250 |
Adelie | female | 3450 |
Adelie | male | 3650 |
Adelie | female | 3625 |
Characteristic | p-value |
---|---|
species | <0.001 |
group1 | group2 | estimate | conf.low | conf.high | p.adj |
---|---|---|---|---|---|
Adelie | Chinstrap | 26.924 | -132.353 | 186.201 | 0.916 |
Adelie | Gentoo | 1386.273 | 1252.290 | 1520.255 | 0.000 |
Chinstrap | Gentoo | 1359.349 | 1194.430 | 1524.267 | 0.000 |
Characteristic | p-value |
---|---|
species | <0.001 |
sex | <0.001 |
group1 | group2 | estimate | conf.low | conf.high | p.adj |
---|---|---|---|---|---|
Adelie | Chinstrap | 26.924 | -82.515 | 136.363 | 0.831 |
Adelie | Gentoo | 1386.273 | 1294.213 | 1478.332 | 0.000 |
Chinstrap | Gentoo | 1359.349 | 1246.033 | 1472.664 | 0.000 |
female | male | 667.458 | 599.193 | 735.722 | 0.000 |
Characteristic | p-value |
---|---|
species | <0.001 |
sex | <0.001 |
species * sex | <0.001 |
There are significant effects due to species and sex
And also an interaction between species and sex
(i.e. the influence of sex varies from species to species)
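Assuming the penguins data come from the palmerpenguins package (as in the standard examples), the two-way model with interaction is a one-line change from the one-way case:

```r
library(palmerpenguins)   # assumed source of the penguins data

# Two-way ANOVA: main effects of species and sex, plus their interaction
model <- aov(body_mass_g ~ species * sex, data = penguins)
summary(model)   # rows for species, sex, and species:sex
```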
Can use R’s regression tools to extract more information
Characteristic | Beta | 95% CI¹ | p-value |
---|---|---|---|
species | |||
Adelie | — | — | |
Chinstrap | 158 | 32, 285 | 0.014 |
Gentoo | 1,311 | 1,204, 1,418 | <0.001 |
sex | |||
female | — | — | |
male | 675 | 574, 775 | <0.001 |
species * sex | |||
Chinstrap * male | -263 | -442, -84 | 0.004 |
Gentoo * male | 130 | -20, 281 | 0.089 |
¹ CI = Confidence Interval
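The coefficients behind this table come from an ordinary linear model; a sketch (again assuming palmerpenguins):

```r
library(palmerpenguins)   # assumed source of the penguins data

# The same interaction model as a regression: coefficients relative to
# the reference levels (Adelie, female)
fit <- lm(body_mass_g ~ species * sex, data = penguins)
coef(summary(fit))   # Beta estimates and p-values per term
confint(fit)         # the 95% CI column
```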
Using interactions and regression can be more informative
group1 | group2 | estimate | conf.low | conf.high | p.adj |
---|---|---|---|---|---|
Adelie | Chinstrap | 26.924 | -82.515 | 136.363 | 0.831 |
Adelie | Gentoo | 1386.273 | 1294.213 | 1478.332 | 0.000 |
Chinstrap | Gentoo | 1359.349 | 1246.033 | 1472.664 | 0.000 |
female | male | 667.458 | 599.193 | 735.722 | 0.000 |
Characteristic | Beta | 95% CI¹ | p-value |
---|---|---|---|
species | |||
Adelie | — | — | |
Chinstrap | 158 | 32, 285 | 0.014 |
Gentoo | 1,311 | 1,204, 1,418 | <0.001 |
sex | |||
female | — | — | |
male | 675 | 574, 775 | <0.001 |
species * sex | |||
Chinstrap * male | -263 | -442, -84 | 0.004 |
Gentoo * male | 130 | -20, 281 | 0.089 |
¹ CI = Confidence Interval
But NC3Rs EDA power calculations only cover pairwise t-tests!
Other tools are available:
R
G*Power
SPSS
Stata
G*Power supports ANOVA power calculation
Statistical Power
Type II Error, \(\beta\): the probability of a false negative (missing a true positive result)
Power, \(1 - \beta\): the probability that you won’t miss a true positive result (assuming that there is one)
Statistical Threshold
Type I Error, \(\alpha\): the probability of a false positive (calling a positive result when the true result is negative)
This is the P-value threshold you set for your hypothesis tests
For 3Rs we want to minimise individuals used
You need to know:
Effect size definition
Effect size is an interpretable number that quantifies the difference between the data and a hypothesis.
Multiple measures for this
G*Power uses Cohen’s \(f\) for ANOVA
Our experiment
A priori power calculation (ANOVA): main effects and interactions
Test family: F tests
Statistical test: ANOVA, Fixed effects, special, main effects and interactions
Type of power analysis: A priori: Compute required sample size - given \(\alpha\), power, and effect size

Input parameters:
Effect size \(f\) = 0.4
\(\alpha\) err prob = 0.05
Power (\(1 - \beta\)) = 0.8
Numerator df = 1
Number of groups = 4
Effect size
A value of Cohen’s \(f\) that represents the effect size we want to be able to detect:
the contribution of that effect to the overall variation in the dataset.
Important
Output parameters:
Noncentrality parameter \(\lambda\) = 8.32
Critical F = 4.043
Denominator df = 48
Total sample size = 52
Actual power = 0.807
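This G*Power result can be cross-checked in base R: for an F test the noncentrality parameter is \(\lambda = f^2 N\), and the achieved power follows from the noncentral F distribution (values as reported above).

```r
f      <- 0.4          # Cohen's f: the effect size to detect
alpha  <- 0.05         # Type I error threshold
N      <- 52           # total sample size
groups <- 4
df1    <- 1            # numerator df for the tested effect
df2    <- N - groups   # denominator df = 48

lambda <- f^2 * N                     # noncentrality parameter = 8.32
Fcrit  <- qf(1 - alpha, df1, df2)     # critical F ~= 4.043
1 - pf(Fcrit, df1, df2, ncp = lambda) # achieved power ~= 0.807
```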
Caution
G*Power will let you plot how sample size trends with desired power

Use NC3Rs EDA to formalise your design
Use ANOVA (where appropriate)
If using ANOVA, G*Power can calculate the sample size required for a desired power
AWERB 3Rs day 2023