7 Modelling Experimental Data
It may take a couple of minutes to install the necessary packages in WebR - please be patient.
This page preloads the datasets as tissue and catheter for use in R cells.
7.1 Introduction
In Chapter 6 we used boxplots and scatterplots to visualise the data obtained from the adherence experiments with the wild-type (WT), knockout (KO), empty vector (empty), and complemented gene (complement) lines.
We saw that there appeared to be an effect of knocking out the etpD gene on the adherence (stickiness) of bacteria to both human tissue and to the catheter material. However, restoring the etpD gene appeared only to make a difference to adherence to the catheter material.
We also observed an apparent batch effect in both sets of experiments, where the results obtained appeared to be strongly affected by which experimental batch they were obtained in.
We should like to remove the influence of these batch effects, as they make our results a less reliable representation of the biological influence of the etpD gene (we will do this in the next section, Chapter 8).
All our conclusions so far have been drawn from visual inspection of graphs, but we can do better than this by modelling the influence of biological factors of interest on our outcome (i.e. logCFU/mL).
By using statistical modelling approaches, we can estimate the quantitative influence of (i) knocking out the gene, (ii) introducing an empty plasmid, and (iii) reintroducing the gene as a complement. We shall do this below.
7.2 The Model
We can consider the measured “stickiness” or adherence of the bacteria to be composed of the combined influence of a number of factors. We can use our experimental results and a statistical method called linear modelling (or linear regression) to estimate the amount of influence each factor has.
First though, we start with a baseline level of “stickiness…”
7.2.1 The baseline: wild-type adherence (WT)
The wild-type (control) line is expected to display the baseline level of adherence for the bacteria.
So, when we measure the WT line we are establishing the natural baseline “stickiness” of the bacterium, measured as logCFU recovered from the substrate. In statistical terms we might make this into an equation like:
\[ \textrm{measured logCFU} = \textrm{baseline} \tag{7.1}\]
and Equation 7.1 says “the logCFU we measure for any wild-type bacterium is that of the baseline, wild-type organism.”
In reality though, we always expect some completely random experimental measurement error, which we can represent with the symbol \(\epsilon\) (epsilon) as Equation 7.2:
\[ \textrm{measured logCFU} = \textrm{baseline} + \epsilon \tag{7.2}\]
i.e. “the logCFU we measure for any wild-type bacterium is that of the baseline, wild-type organism, plus or minus some measurement error.”
7.2.2 The first intervention: knocking out a gene of interest (KO)
If we don’t knock out a gene from a bacterium it is just the same as the wild type. We’d expect the logCFU we recover from the substrate to be the baseline logCFU.
But, if we do knock out a gene, we can represent the measured logCFU for that bacterium by Equation 7.3:
\[ \textrm{measured logCFU} = \textrm{baseline} + \textrm{knockout} + \epsilon \tag{7.3}\]
which says “the logCFU we measure for the bacterium is the baseline level, plus the effect of knocking out the gene (plus or minus some measurement error).”
Essentially the difference in measured logCFU between the WT and KO groups is taken to be due to the effect of knocking out the gene of interest (plus or minus a bit of measurement error).
7.2.3 Introducing the empty vector (empty)
Introducing an empty plasmid vector puts strain on a bacterium and might affect its “stickiness.” As we only introduce the empty vector into a knockout strain, the measured logCFU for that bacterium is that of the baseline plus the effect of the knockout, and any effect of including the empty vector. We can describe this in Equation 7.4:
\[ \textrm{measured logCFU} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \epsilon \tag{7.4}\]
which says “the logCFU we measure for the bacterium is the baseline level, plus the effect of knocking out the gene of interest, and the effect of introducing an empty vector (plus or minus some measurement error).”
7.2.4 Complementing the gene (complement)
Finally, reintroducing the gene on the plasmid vector (complementing the gene) is expected to change the logCFU recovered. The effect of introducing the gene is added on to the effect of knocking it out and introducing an empty plasmid, and is described in Equation 7.5:
\[ \textrm{measured logCFU} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \textrm{complement} + \epsilon \tag{7.5}\]
7.2.5 So what?
By now you might be thinking: “we’ve made an equation that represents the quantitative influence of knocking out a gene, introducing an empty plasmid, and complementing the gene, but so what? - how do we get numbers for this?”
To get those numbers, we can use a statistical technique called linear modelling (aka linear regression). It is very powerful and useful, and remarkably straightforward to use in R.
We will walk through the modelling process for the catheter data, but modelling the tissue data is left as an exercise for you to solve.
7.3 Fitting the catheter model
We will use the R built-in function lm() to fit the model in Equation 7.5 to our catheter dataset in the WebR cells below.
7.3.1 Load the data
We have preloaded the data for you as the dataframes tissue and catheter.
Use the WebR cell below to confirm the catheter data is loaded.
Note that the data contains three columns: KO, empty, and complement, each of which contains either a 1 or a 0 value.
These columns describe whether the logCFU measurement in the row corresponds to a line that is affected by a gene knockout (KO = 1), the presence of a plasmid vector (empty = 1), or the reintroduced gene (complement = 1). These columns allow us to use the lm() function to estimate the influence of each biological intervention.
We assume that measurements of the bacteria all share the same (wild-type) baseline.
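One way to see how the 0/1 columns connect to Equations 7.2 to 7.5 is to write a single equation in which each effect is multiplied by the value in its column. This is a sketch in our own notation (multiplying an effect by a column name is an illustrative convention, not something taken from the dataset itself):

\[ \textrm{measured logCFU} = \textrm{baseline} + (\textrm{knockout} \times \texttt{KO}) + (\textrm{empty} \times \texttt{empty}) + (\textrm{complement} \times \texttt{complement}) + \epsilon \]

With all three columns equal to 0 the effects are switched off and we recover Equation 7.2 (the wild-type). Setting only KO to 1 gives Equation 7.3, setting KO and empty to 1 gives Equation 7.4, and setting all three to 1 gives Equation 7.5.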
7.3.2 Define the model
R uses a specific syntax for defining a model, where factors of interest influence a measured value.
Here, our measured value is in the column logCFU, and is assumed to be influenced by factors in the columns KO, empty, and complement (where they apply/are equal to one). We represent this with the R statement below:
logCFU ~ KO + empty + complement
which reads: “logCFU is influenced by (~) the sum of effects of KO, empty, and complement (where they apply).”
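If it helps to see what R does with this definition, the sketch below builds a small stand-in table and asks R which columns it would construct from the formula. The dataframe toy and its logCFU values are invented for illustration, and the 0/1 pattern for each line is our assumption based on Equations 7.2 to 7.5, not a copy of the real catheter data.

```r
# Toy stand-in for the real data: one row per line (WT, KO, empty, complement).
# The logCFU values here are invented purely for illustration.
toy <- data.frame(
  logCFU     = c(6.3, 6.0, 5.9, 6.3),
  KO         = c(0, 1, 1, 1),
  empty      = c(0, 0, 1, 1),
  complement = c(0, 0, 0, 1)
)

# The model definition is an ordinary R object of class "formula"
model_formula <- logCFU ~ KO + empty + complement
class(model_formula)

# model.matrix() shows the columns R builds from the formula.
# Note the extra "(Intercept)" column of 1s: this is the shared baseline.
design <- model.matrix(model_formula, data = toy)
design
```

The "(Intercept)" column is added automatically, which is why the baseline never has to be written into the formula.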
7.3.3 Fit the model
To fit our data to this model, we use this model definition in the lm() function as below (specifying that the dataframe we’re using is catheter):
catheter_model <- lm(logCFU ~ KO + empty + complement, data=catheter)
Fit the catheter data to this model in the WebR cell below.
Use the R code:
catheter_model <- lm(logCFU ~ KO + empty + complement, data=catheter)
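To build some intuition for what lm() is estimating here, the following sketch fits the same formula to simulated data. Everything in it is hypothetical: the dataframe sim, the line column, the number of replicates, and the "true" effects are made up and are not the catheter results. With four lines and four coefficients, the fitted coefficients should simply be the WT mean followed by the step from each line's mean to the next.

```r
set.seed(42)  # make the simulated noise reproducible

# Simulated stand-in for the catheter data: 5 replicates of each line.
# The "true" effects used below are made up for this sketch.
sim <- data.frame(
  line       = rep(c("WT", "KO", "empty", "complement"), each = 5),
  KO         = rep(c(0, 1, 1, 1), each = 5),
  empty      = rep(c(0, 0, 1, 1), each = 5),
  complement = rep(c(0, 0, 0, 1), each = 5)
)
sim$logCFU <- 6.3 - 0.3 * sim$KO - 0.1 * sim$empty + 0.4 * sim$complement +
  rnorm(nrow(sim), sd = 0.2)

sim_model <- lm(logCFU ~ KO + empty + complement, data = sim)

# Mean logCFU for each line, in the order WT, KO, empty, complement
group_means <- sapply(split(sim$logCFU, sim$line), mean)[
  c("WT", "KO", "empty", "complement")
]

# WT mean, then the step from each line's mean to the next line's mean
steps <- c(group_means[1], diff(group_means))

# TRUE if the fitted coefficients match those steps
all.equal(unname(coef(sim_model)), unname(steps))
```

This is why the KO coefficient can be read as "KO compared with WT", while the empty and complement coefficients are each read relative to the line before them.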
7.4 Interpreting the catheter model
The fitted model is stored in the variable catheter_model, and we can obtain useful information from it in a number of ways.
7.4.1 Coefficients and confidence intervals
The coefficients that the model reports are the estimated effects of each factor of interest. To obtain the coefficients from the fitted model, use the R code below:
coef(catheter_model)
The confidence intervals reported by the model are the range of values that the model considers most plausible for the coefficients. These work the same way as confidence intervals for t-tests and similar statistical methods: the range of values bounded by the 2.5% and 97.5% confidence limits is a 95% confidence interval. If we repeated the experiment many times, we would expect intervals constructed this way to contain the true value of the coefficient about 95% of the time.
To find the confidence intervals for the coefficients of the model, use the R code below:
confint(catheter_model)
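For the curious, the sketch below shows where these limits come from. For a model fitted with lm(), confint() builds each interval as the estimate plus or minus a critical t value times the standard error. The data here (treated, response, toy_model) are invented for illustration and are unrelated to the catheter experiment.

```r
set.seed(1)  # reproducible made-up data

# Made-up two-group example (not the catheter data)
treated  <- rep(c(0, 1), each = 10)
response <- 6 + 0.5 * treated + rnorm(20, sd = 0.3)
toy_model <- lm(response ~ treated)

# Estimates and standard errors from the model summary
est <- coef(summary(toy_model))[, "Estimate"]
se  <- coef(summary(toy_model))[, "Std. Error"]

# Critical t value for a 95% interval, using the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(toy_model))

# Hand-built lower and upper limits
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

# TRUE if the hand-built intervals match confint()
all.equal(unname(manual_ci), unname(confint(toy_model)))
```

Larger standard errors (noisier data or fewer replicates) therefore give wider intervals.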
Find the estimated coefficients and corresponding confidence intervals for the catheter model in the WebR cell below.
Use the R code:
coef(catheter_model)
confint(catheter_model)
7.4.2 Reading the output
You should see output that resembles the data below:
(Intercept)          KO       empty  complement
  6.3382829  -0.3213144  -0.1465880   0.3986531
                  2.5 %      97.5 %
(Intercept)  6.10862926 6.567936607
KO          -0.64609377 0.003464903
empty       -0.47136735 0.178191328
complement   0.07387378 0.723432458
7.4.2.1 Coefficients
The output of coef(catheter_model) shows an estimated value for each factor of interest, plus a value for “(Intercept).”
(Intercept)          KO       empty  complement
  6.3382829  -0.3213144  -0.1465880   0.3986531
- (Intercept) is the coefficient estimate for the wild-type (WT)/control logCFU value. Here this is 6.34, which looks to be a reasonable value by visual comparison with Figure 6.1.
- The KO coefficient estimate is -0.32, which indicates that knocking out the etpD gene reduces the logCFU value by 0.32 units (as this is a log scale, that corresponds approximately to halving the number of recovered CFUs).
- The empty coefficient estimate is -0.15, which indicates a further reduction in bacterial recovery is seen due to inserting the plasmid.
- The complement coefficient estimate is 0.40, of the same order as the reduction in logCFU seen for the gene knockout. This implies that returning the gene also restores bacterial recovery to approximately the original, wild-type level.
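As a quick check on this reading, the sketch below turns the reported coefficients into a modelled logCFU for each line and into approximate fold changes. The fold-change step assumes that logCFU is a base-10 logarithm, which the text does not state explicitly; the variable names are ours.

```r
# Coefficient estimates as reported above for the catheter model
catheter_coef <- c("(Intercept)" = 6.3382829, KO = -0.3213144,
                   empty = -0.1465880, complement = 0.3986531)

# Because each line carries all the effects before it, the running total
# gives the modelled logCFU for the WT, KO, empty, and complement lines
predicted <- cumsum(catheter_coef)
names(predicted) <- c("WT", "KO", "empty", "complement")
round(predicted, 2)

# ASSUMPTION: logCFU is log10(CFU/mL). Under that assumption, 10^coefficient
# is the fold change in recovered CFUs attributed to each intervention
fold_change <- 10^catheter_coef[c("KO", "empty", "complement")]
round(fold_change, 2)
```

Under that assumption the KO fold change is roughly 0.48, in line with the "approximately halving" description, and the modelled complement value (about 6.27) sits close to, though slightly below, the wild-type value (about 6.34).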
7.4.2.2 Confidence intervals
The confint(catheter_model) output lists the expected 95% confidence intervals for the coefficients.
                  2.5 %      97.5 %
(Intercept)  6.10862926 6.567936607
KO          -0.64609377 0.003464903
empty       -0.47136735 0.178191328
complement   0.07387378 0.723432458
This is the more useful output for deciding whether there is a real effect due to a factor of interest.
The point estimate of a coefficient may be non-zero, implying a negative or positive effect of that factor on the measured outcome, but if the confidence interval includes zero it is not reasonable to exclude the possibility that there is actually no effect.
In this experiment, the confidence intervals tell us the following:
- The (Intercept)/WT/baseline logCFU is likely to lie between 6.11 and 6.57 units.
- The effect of knocking out the gene (KO) is likely to lie between -0.646 and 0.003 units, which appears to be a negative effect, but as the interval includes zero we cannot strictly rule out that there is no effect.
- Introduction of the plasmid vector (empty) changes logCFU by between -0.47 and 0.18 units, which includes zero, and so we can’t rule out that there is no effect.
- Complementing the etpD gene (complement) increases logCFU by between 0.074 and 0.72 units, enhancing recovery of bacteria.
These results taken together suggest that knocking out the etpD gene diminishes recovery of bacteria (implying reduced adherence) and that complementing the gene restores the recovery level. However, the statistics are equivocal because the KO confidence interval includes zero.
We also have no clear evidence that addition of the empty vector influences logCFU, as its confidence interval includes zero.
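The "does the interval include zero?" check can also be done in code. The sketch below re-enters the catheter limits reported above by hand (in a live session you could use confint(catheter_model) directly); the names ci and includes_zero are ours.

```r
# 95% confidence limits as reported above for the catheter model
ci <- rbind(
  "(Intercept)" = c(6.10862926, 6.567936607),
  KO            = c(-0.64609377, 0.003464903),
  empty         = c(-0.47136735, 0.178191328),
  complement    = c(0.07387378, 0.723432458)
)
colnames(ci) <- c("2.5 %", "97.5 %")

# An interval includes zero when its lower limit is at or below zero
# and its upper limit is at or above zero
includes_zero <- ci[, "2.5 %"] <= 0 & ci[, "97.5 %"] >= 0
includes_zero
```

Here the KO and empty intervals include zero, while the (Intercept) and complement intervals do not. Note how narrowly the KO interval does so: its upper limit is only just above zero.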
7.5 Fitting and interpreting the tissue model
Fit the tissue data to this model, using the WebR cell below.
Use the same functions as for the catheter model, but be sure to use the tissue dataset instead:
- Use lm() to fit the model
- Use coef() and confint() to find the coefficients and confidence intervals for the WT, KO, empty, and complement factors of interest
Use the R code below to fit the model:
tissue_model <- lm(logCFU ~ KO + empty + complement, data=tissue)
Use the R code below to see the estimated coefficients and confidence intervals:
coef(tissue_model)
confint(tissue_model)
7.5.1 Interpreting the results
What are the coefficients of the four factors of interest, and what do they imply about the effects of each factor?
- WT/(Intercept)
- KO
- empty
- complement
> coef(tissue_model)
(Intercept)          KO       empty  complement
 6.54455199 -0.30834731 -0.06356898 -0.05806111
- The (Intercept) value is 6.54, which is consistent with the baseline recovery from the catheter experiment.
- The KO value is -0.31, which is negative, so consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion.
- The empty value is -0.06, which is close to zero and consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector.
- The complement value is -0.06, which is also small and close to zero, consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene.
What are the confidence intervals for the coefficients of the four factors of interest, and what do they imply about the effects of each factor?
- WT/(Intercept)
- KO
- empty
- complement
> confint(tissue_model)
                 2.5 %     97.5 %
(Intercept)  6.4070598  6.6820442
KO          -0.5027906 -0.1139040
empty       -0.2580123  0.1308744
complement  -0.2525044  0.1363822
- The (Intercept) value lies between 6.41 and 6.68, which is consistent with the baseline recovery from the catheter experiment.
- The KO value lies between -0.50 and -0.11. Both limits are negative, so this is consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion.
- The empty confidence interval includes zero and is consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector.
- The complement confidence interval includes zero and is consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene.
7.6 But wait!
Back in Chapter 6 we saw that the experimental data was affected by batch effects - systematic influences on the measured values that seemed to come from the way the experiment was conducted, not the biological factors whose influence we actually wanted to measure.
Maybe the results we have just obtained aren’t as correct, or at least as accurate, as they seem because of these systematic batch effects?
We should try to account for systematic batch effects as much as possible, and we’ll do this in the next section.