8 Eliminating batch effects
It may take a couple of minutes to install the necessary packages in WebR
- please be patient.
This page preloads the datasets as tissue
and catheter
for use in R
cells.
8.1 Introduction
In Chapter 7 we saw how to use a linear modelling approach to estimate directly the influence of gene knockout and other interventions on the amount of viable bacteria recovered (and so infer a difference in “stickiness” or adherence to a substrate).
This is an extremely powerful method not just because it lets us identify the strengths of multiple effects directly and simultaneously from our data (rather than testing for “a difference” as with ANOVA/t-tests), but because we can also use them to exclude the effects of interfering influences such as batch effects.
In this section, we will see how to use the R
package lme4
to account for and remove the effects of those unwanted influences.
lme4
is a linear modelling package that extends the capability of R
’s built-in (and still extremely powerful) linear modelling function lm()
to enable mixed-effects models. This extension is what allows us to account for batch effects in our experiment.
8.2 The Model
We take as our starting point the model we had in Chapter 7:
\[ \textrm{measured logCFU} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \textrm{complement} + \epsilon \tag{8.1}\]
We want to extend Equation 8.1 to account for the influence of each individual batch, separately. That is, we want to represent a situation where all measurements in batch 1 are under a similar influence, but that influence is different to the one affecting batches 2-8, and so on.
In mathematical notation, we could modify Equation 8.1 to represent the measurement for a row \(i\) in the table as:
\[ \textrm{measured logCFU}_i = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \textrm{complement} + \textrm{batch}_i + \epsilon \tag{8.2}\]
where \(\textrm{batch}_i\) represents “influence due to the corresponding batch for measurement \(i\).” This is quite a small adjustment on paper (or the screen!), but it can have very wide-ranging effects.
We’re taking some liberties with formal mathematical notation here and leaving out some components of the equation to make the connection between biology and the equation clearer.
8.3 The Model in R
The change we need to make to our model representation in R
is similarly small. The previous model looked like this:
~ KO + empty + complement logCFU
and the modified model requires the addition of only one term:
~ KO + empty + complement + (1 | batch) logCFU
So, what does (1 | batch)
mean?
8.3.1 Fixed and random effects
The model we have built is now what’s called a “mixed model” that incorporates fixed effects and random effects. But what are they and why do they differ?
Fixed effects are the influences that we focus our practical experiments on. In our experiments we make a specific, intentional change to the system (e.g. knocking out a gene) that we expect to have a consistent effect on the outcome. The term “fixed” in “fixed effect” refers to the expectation of this effect being constant across individual runs of the experiment.
Random effects are influences on the outcome that can vary between individual runs (such as the “measurement error \(\epsilon\) which reflects small random variations in each measurement), or between groups of runs - such as the batch-level differences we saw in Chapter 7. We expect that every measurement in the same batch is affected in approximately the same way, with some random error, but that the effect can differ between batches.
The part of the R
model expressed by (1 | batch)
represents this random effect due to a measurement being made in a particular batch.
As before, we will walk through the modelling process for the catheter data, but modelling the tissue data is left as an exercise for you to solve.
8.4 Fitting the catheter model
We will use the lme4
function lmer()
to fit the model in Equation 8.2 to our catheter
dataset in the WebR
cells below.
8.4.1 Load the data
We have preloaded the data for you as the dataframes tissue
and catheter
.
Use the WebR
cell below to confirm the catheter data is loaded
Note that the data contains three columns: KO
, empty
, and complement
that contain either a 1
or 0
value, and a batch
column that indicates the batch to which the measurement belongs
The KO
, empty
, and complement
columns describe whether the logCFU measurement in the row corresponds to a line that is affected by a gene knockout (KO
= 1
), the presence of a plasmid vector (empty
= 1
), or the reintroduced gene (complement
= 1
). These columns allow us to use the lmer()
function to estimate the influence of each biological intervention.
The batch
column contains a number representing the batch to which the measurement belongs. The R
model uses this to estimate (and remove) the effect on measurement that appears to be due only to it belonging to that batch.
We again assume that measurements of the bacteria all share the same (wild-type) baseline.
8.4.2 Define the model
As noted above, R
’s model syntax lets us include the influence of batches as a random effect.
Here, our measured value is in the column logCFU
, and is assumed to be influenced by factors in the columns KO
, empty
, and complement
(where they apply/are equal to one), and also by a random influence shared by all members of the same batch
number. We represent this with the R
statement below:
~ KO + empty + complement + (1 | batch) logCFU
which reads: “logCFU
is influenced by (~
) the sum of effects of KO
, empty
, and complement
(where they apply), and by a batch-specific effect corresponding to batch
.”
8.4.3 Fit the model
To fit our data to this model, we use this model definition in the lmer()
function as below (specifying that the dataframe we’re using is catheter
):
<- lmer(logCFU ~ KO + empty + complement + (1 | batch), data=catheter) catheter_model
Fit the catheter data to this model in the WebR
cell below.
Use the R
code:
<- lmer(logCFU ~ KO + empty + complement + (1 | batch), data=catheter) catheter_mixed_model
8.5 Interpreting the catheter model
The fitted model is stored in the variable catheter_mixed_model
and, just as before, we can obtain useful information from it in a number of ways.
8.5.1 Coefficients and confidence intervals
The coefficients that the model reports are the estimated effects of each factor of interest. To obtain the coefficients, use the R
code below to produce a summary of the fitted model:
summary(catheter_mixed_model)
We use summary()
instead of coef()
because the mixed-effects model is more complex and the output is not quite the same.
The confidence intervals reported by the model are the range of values that the model thinks are most likely for the coefficients. These work the same way as coefficients for t-tests and similar statistical methods: the range of values bounded by the 2.5% and 97.5% confidence limits is a 95% confidence interval. We would expect that the true value of the coefficient being estimated would lie in this range 95% of the time.
To find the confidence intervals for the coefficients of the model, use the R
code below:
confint(catheter_mixed_model)
Find the estimated coefficients and corresponding confidence intervals for the catheter model in the WebR
cell below.
Use the R
code:
summary(catheter_mixed_model)
confint(catheter_mixed_model)
8.5.2 Reading the output
You should see output that resembles the data below:
Linear mixed model fit by REML ['lmerMod']
Formula: logCFU ~ KO + empty + complement + (1 | batch)
Data: catheter
REML criterion at convergence: 18.2
Scaled residuals:
Min 1Q Median 3Q Max
-2.4164 -0.5003 0.1475 0.5460 1.9160
Random effects:
Groups Name Variance Std.Dev.
batch (Intercept) 0.11587 0.3404
Residual 0.06385 0.2527
Number of obs: 40, groups: batch, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.3383 0.2536 24.992
KO -0.3213 0.1130 -2.843
empty -0.1466 0.3587 -0.409
complement 0.3987 0.1130 3.528
Correlation of Fixed Effects:
(Intr) KO empty
KO -0.223
empty -0.637 -0.158
complement 0.000 0.000 -0.158
Computing profile confidence intervals ...
2.5 % 97.5 %
.sig01 0.1163306 0.6012999
.sigma 0.1981686 0.3153956
(Intercept) 5.8858767 6.7906885
KO -0.5424385 -0.1001904
empty -0.7863870 0.4932101
complement 0.1775290 0.6197772
8.5.2.1 Coefficients
The output of summary(catheter_mixed_model)
shows more information than for the output of lm()
with coef()
, but we can focus on the important section, which is in the section titled Fixed Effects
:
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.3383 0.2536 24.992
KO -0.3213 0.1130 -2.843
empty -0.1466 0.3587 -0.409
complement 0.3987 0.1130 3.528
(Intercept)
is again the wild-type (WT
)/control logCFU value coefficient estimate. Here this is6.34
, which looks to be a reasonable value by visual comparison with Figure 6.1.- The
KO
coefficient estimate is-0.32
, which indicates that knocking out the etpD gene reduces the logCFU value by 0.32 units (as this is a log scale, that corresponds approximately to halving the number of recovered CFUs). - The
empty
coefficient estimate is-0.15
, which indicates a further reduction in bacterial recovery is seen due to inserting the plasmid. - The
complement
coefficient estimate is0.40
- of the same order as the reduction in logCFU seen for the gene knockout. This implies that returning the gene also restores bacterial recovery to the original, wild-type level.
8.5.2.2 Confidence intervals
The confint(catheter_mixed_model)
output lists the expected 95% confidence intervals for the coefficients.
2.5 % 97.5 %
.sig01 0.1163306 0.6012999
.sigma 0.1981686 0.3153956
(Intercept) 5.8858767 6.7906885
KO -0.5424385 -0.1001904
empty -0.7863870 0.4932101
complement 0.1775290 0.6197772
We can ignore the .sig01
and .sigma
rows to focus on our fixed effects of interest.
- The
(Intercept)
/WT
/baseline logCFU is likely to lie between5.89
and6.79
units - The effect of knocking out the gene (
KO
) is likely to lie between-0.54
and0.010
units. The interval does not include zero so we can be confident that this is a negative effect. - Introduction of the plasmid vector changes logCFU by between
-0.79
and0.49
units which includes zero, and so we can’t rule out that there is no effect on logCFU due to presence of the plasmid - Complementing the etpD gene (
complement
) increases logCFU by between0.18
and0.62
units, enhancing recovery of bacteria. This interval does not include zero so we can be confident that this is a positive effect.
Compare the estimated effects (coefficients) for our fixed effects when fitting both models (the model above, and that in Chapter 7). How different are they and what can we conclude about the relative usefulness of each model?
WT
/(Intercept)
KO
empty
complement
> coef(catheter_model)
(Intercept) KO empty complement 6.3382829 -0.3213144 -0.1465880 0.3986531
> summary(catheter_mixed_model)
:
Fixed effects
Estimate Std. Error t value6.3383 0.2536 24.992
(Intercept) -0.3213 0.1130 -2.843
KO -0.1466 0.3587 -0.409
empty 0.3987 0.1130 3.528 complement
The estimated coefficients for the (Intercept)
/WT
, KO
, empty
, and complement
fixed effects are essentially identical to four decimal places.
There is no difference between the two models in their estimates of effect sizes. This suggests that both models converge to consistent estimates which gives confidence in the result.
Compare the confidence intervals for the estimated effects (coefficients) of our fixed effects when fitting both models (the model above, and that in Chapter 7). How different are they and what can we conclude about the relative usefulness of each model?
WT
/(Intercept)
KO
empty
complement
> confint(catheter_model)
2.5 % 97.5 %
6.10862926 6.567936607
(Intercept) -0.64609377 0.003464903
KO -0.47136735 0.178191328
empty 0.07387378 0.723432458
complement
> confint(catheter_mixed_model)
2.5 % 97.5 %
5.8858767 6.7906885
(Intercept) -0.5424385 -0.1001904
KO -0.7863870 0.4932101
empty 0.1775290 0.6197772 complement
- The confidence interval for
(Intercept)
/WT
is wider for the mixed model, but both models estimate an interval that does not include zero (so is confidently positive), and they estimate the samelogCFU
of6.3383
. - The confidence interval for
KO
includes zero in the original model (catheter_model
) and so cannot confidently be assumed to be negative. The mixed model (catheter_mixed_model
) confidence interval does not include zero, and is narrower. We can be more confident that the data shows a reduction inlogCFU
when correcting for batch effects. - The confidence interval for
empty
is wider for the mixed model, but both models estimate an interval that includes zero, so we cannot exclude that there is no effect onlogCFU
of introducing the plasmid, in either model. Both models estimate the samelogCFU
of-0.1466
. - The confidence interval for
complement
is positive and does not include zero in both model results, so we can be confident that the data shows complementation increases thelogCFU
. The confidence interval is narrower for the mixed model suggesting that our estimate is more precise, though both models estimate the same effect size.
The mixed model that accounts for systematic errors introduced by batching has not affected our estimate of effect sizes for each fixed effect (e.g. knocking out the etpD gene), but it has affected our confidence in each estimate.
Specifically, by taking batches into account we have made the estimates of the effect of knocking out and complementing the etpD gene more precise (i.e. there is less uncertainty about whether the effect is nonzero, and about the size of the effect).
8.6 Fitting and interpreting the tissue model
Fit the tissue data to the mixed-effects model, using the WebR
cell below.
Use the same functions as for the catheter model, but be sure to use the tissue
dataset instead
- Use
lmer()
to fit the model - Use
summary()
andconfint()
to find the coefficients for theWT
,KO
,empty
, andcomplement
factors of interest
Use the R
code below to fit the model
<- lmer(logCFU ~ KO + empty + complement + (1 | batch), data=tissue) tissue_mixed_model
Use the R
code below to see the estimated coefficients and confidence intervals
summary(tissue_mixed_model)
confint(tissue_mixed_model)
8.6.1 Interpreting the results
What are the coefficients of the four factors of interest, and what do they imply about the effects of each factor? How do they compare to the estimates made by the model that did not take batching into account?
WT
/(Intercept)
KO
empty
complement
> summary(tissue_mixed_model)
:
Fixed effects
Estimate Std. Error t value6.54455 0.14185 46.137
(Intercept) -0.30835 0.06142 -5.020
KO -0.06357 0.20061 -0.317
empty -0.05806 0.06142 -0.945 complement
- The
(Intercept)
value is6.54
, which is consistent with the baseline recovery from the catheter experiment, and the previous model. - The
KO
value is-0.31
, which is negative, so consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion, and the previous model. - The
empty
value is-0.06
which is close to zero and consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector, and the previous model. - The
complement
value is-0.06
which is also small and close to zero, consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene, and also the previous model.
What are the confidence intervals for the coefficients of the four factors of interest, and what do they imply about the effects of each factor? How do they compare to those of the previous model?
WT
/(Intercept)
KO
empty
complement
> confint(tissue_model)
2.5 % 97.5 %
6.2729898 6.81611414
(Intercept) -0.4286512 -0.18804344
KO -0.4476160 0.32047790
empty -0.1783650 0.06224276 complement
- The
(Intercept)
value lies between6.27
and6.81
, which is consistent with the baseline recovery from the catheter experiment, and slightly wider than the original model without batching. - The
KO
value lies between-0.42
and-0.18
, which are both negative so this is consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion. It is also much narrower than the original model’s interval, implying a more precise estimate. - The
empty
confidence interval includes zero and is consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector. It is wider than the interval estimated by the original model. - The
complement
confidence interval includes zero and is consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene. It is also much narrower than the original model’s interval, implying a more precise estimate.
The mixed model that accounts for systematic errors introduced by batching has, again, not affected our estimate of effect sizes for each fixed effect (e.g. knocking out the etpD gene), but it has once more affected our confidence in each estimate.
Specifically, by taking batches into account we have made the estimates of the effect of knocking out and complementing the etpD gene more precise (i.e. there is less uncertainty about whether the effect is nonzero, and about the size of the effect).
Here, the results give us confidence that knocking out etpD results in reduced logCFU
, but that restoring/complementing the gene does not affect logCFU
. This is an intriguing result that requires explanation and, because we have been able to account for systematic sources of experimental error, i.e. batch effects, we can be confident in our conclusions and that there is a new question to be answered:
Why does knocking out etpD reduce recovery from catheter material and human tissue, but only restore recovery from catheter material?
8.7 Summary
In this section taken real experimental data through a fairly advanced statistical analysis to eliminate unwanted sources of error, and improved your confidence in your results. You have:
- understood how to build up a statistical model that represents multiple experimental factors of interest in
R
- used
R
to model experimental data with mixed effects (fixed and random) - obtained estimates of the sizes of effects due to multiple experimental interventions
- obtained estimates of confidence in the sizes of those effects, including whether the effects are really different from zero
Now proceed to ?sec-conclusions to wrap up the workshop.