7  Modelling Experimental Data

Note

It may take a couple of minutes to install the necessary packages in WebR - please be patient.

This page preloads the datasets as tissue and catheter for use in R cells.

7.1 Introduction

In Chapter 6 we used boxplots and scatterplots to visualise the data obtained from the adherence experiments with the wild-type (WT), knockout (KO), empty vector (empty), and complemented gene (complement) lines.

We saw that there appeared to be an effect of knocking out the etpD gene on the adherence (stickiness) of bacteria to both human tissue and to the catheter material. However, restoring the etpD gene appeared only to make a difference to adherence to the catheter material.

Important

We also observed an apparent batch effect in both sets of experiments, where the results obtained appeared to be strongly affected by which experimental batch they were obtained in.

We should like to remove the influence of these batch effects as they make our results less reliable as a representation of the biological influence of the etpD gene.

(We will do this in the next chapter, Chapter 8.)

Caution

All our conclusions so far have been drawn from visual inspection of graphs, but we can do better than this by modelling the influence of biological factors of interest on our outcome (i.e. logCFU/mL).

By using statistical modelling approaches, we can estimate the quantitative influence of (i) knocking out the gene, (ii) introducing an empty plasmid, and (iii) reintroducing the gene as a complement. We shall do this below.

7.2 The Model

We can consider the measured “stickiness” or adherence of the bacteria to be composed of the combined influence of a number of factors. We can use our experimental results and a statistical method called linear modelling (or linear regression) to estimate the amount of influence each factor has.

First though, we start with a baseline level of “stickiness…”

7.2.1 The baseline: wild-type adherence (WT)

The wild-type (control) line is expected to display the baseline level of adherence for the bacteria.

So, when we measure the WT line we are establishing the natural baseline “stickiness” of the bacterium, measured as logCFU recovered from the substrate. In statistical terms we might make this into an equation like:

\[ \textrm{measured logCFU} = \textrm{baseline} \tag{7.1}\]

and Equation 7.1 says “the logCFU we measure for any wild-type bacterium is that of the baseline, wild-type organism.”

In reality though, we always expect some completely random experimental measurement error, which we can represent with the symbol \(\epsilon\) (epsilon) as in Equation 7.2:

\[ \textrm{measured logCFU} = \textrm{baseline} + \epsilon \tag{7.2}\]

i.e. “the logCFU we measure for any wild-type bacterium is that of the baseline, wild-type organism, plus or minus some measurement error.”

7.2.2 The first intervention: knocking out a gene of interest (KO)

If we don’t knock out a gene from a bacterium it is just the same as the wild type. We’d expect the logCFU we recover from the substrate to be the baseline logCFU.

But, if we do knock out a gene, we can represent the measured logCFU for that bacterium by Equation 7.3:

\[ \textrm{measured logCFU} = \textrm{baseline} + \textrm{knockout} + \epsilon \tag{7.3}\]

which says “the logCFU we measure for the bacterium is the baseline level, plus the effect of knocking out the gene (plus or minus some measurement error).”

Essentially the difference in measured logCFU between the WT and KO groups is taken to be due to the effect of knocking out the gene of interest (plus or minus a bit of measurement error).

7.2.3 Introducing the empty vector (empty)

Introducing an empty plasmid vector puts strain on a bacterium and might affect its “stickiness.” As we only introduce the empty vector into a knockout strain, the measured logCFU for that bacterium is that of the baseline plus the effect of the knockout, and any effect of including the empty vector. We can describe this in Equation 7.4:

\[ \textrm{measured logCFU} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \epsilon \tag{7.4}\]

which says “the logCFU we measure for the bacterium is the baseline level, plus the effect of knocking out the gene of interest, and the effect of introducing an empty vector (plus or minus some measurement error).”

7.2.4 Complementing the gene (complement)

Finally, reintroducing the gene on the plasmid vector (complementing the gene) is expected to change the logCFU recovered. The effect of introducing the gene is added on to the effect of knocking it out and introducing an empty plasmid, and is described in Equation 7.5:

\[ \textrm{measured logCFU} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \textrm{complement} + \epsilon \tag{7.5}\]
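Equations 7.2 to 7.5 can be read as one equation in which each effect is switched on (1) or off (0) for a given line. The sketch below illustrates that idea in R. The effect sizes are made-up values chosen only for illustration (they are not estimates from the experiment), and the object names `effects`, `design` and `expected` are hypothetical.

```r
# Hypothetical effect sizes (illustrative only, not fitted values)
effects <- c(baseline = 6.0, knockout = -0.3, empty = -0.1, complement = 0.4)

# One row per line; 1 means the effect applies to that line, 0 means it does not
design <- rbind(
  WT         = c(baseline = 1, knockout = 0, empty = 0, complement = 0),
  KO         = c(baseline = 1, knockout = 1, empty = 0, complement = 0),
  empty      = c(baseline = 1, knockout = 1, empty = 1, complement = 0),
  complement = c(baseline = 1, knockout = 1, empty = 1, complement = 1)
)

# Expected logCFU for each line, before any measurement error is added
expected <- drop(design %*% effects)
print(expected)
```

Each row of `design` corresponds to one of the equations above: the WT row is Equation 7.2 without the error term, and the complement row is Equation 7.5 without the error term.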

7.2.5 So what?

By now you might be thinking: “we’ve made an equation that represents the quantitative influence of knocking out a gene, introducing an empty plasmid, and complementing the gene, but so what? - how do we get numbers for this?”

To get those numbers, we can use a statistical technique called linear modelling (aka linear regression). It is very powerful and useful, and remarkably straightforward to use in R.

Note

We will walk through the modelling process for the catheter data, but modelling the tissue data is left as an exercise for you to solve.

7.3 Fitting the catheter model

We will use the R built-in function lm() to fit the model in Equation 7.5 to our catheter dataset in the WebR cells below.

7.3.1 Load the data

Important

We have preloaded the data for you as the dataframes tissue and catheter.

Use the WebR cell below to confirm the catheter data is loaded.

Note that the data contains three columns: KO, empty, and complement that contain either a 1 or 0 value.

These columns describe whether the logCFU measurement in the row corresponds to a line that is affected by a gene knockout (KO = 1), the presence of a plasmid vector (empty = 1), or the reintroduced gene (complement = 1). These columns allow us to use the lm() function to estimate the influence of each biological intervention.
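As a rough guide to what these columns look like, the sketch below builds a small, made-up data frame with the same kind of 0/1 indicator columns and inspects it. The name `toy`, the `line` column and the logCFU values are assumptions for illustration, and the coding assumes that the effects accumulate as in Equations 7.3 to 7.5; the real catheter data may contain other columns.

```r
# Toy data frame (made-up values) mimicking the indicator columns described above
toy <- data.frame(
  line       = c("WT", "KO", "empty", "complement"),
  KO         = c(0, 1, 1, 1),   # knockout applies to KO, empty and complement lines
  empty      = c(0, 0, 1, 1),   # plasmid vector is present in empty and complement lines
  complement = c(0, 0, 0, 1),   # reintroduced gene is present only in the complement line
  logCFU     = c(6.3, 6.0, 5.9, 6.3)
)

# Look at the first rows and the column types
print(head(toy))
str(toy)

# In a WebR cell, the same checks can be run on the preloaded data:
# head(catheter)
# str(catheter)
```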

Note

We assume that measurements of the bacteria all share the same (wild-type) baseline.

7.3.2 Define the model

R uses a specific syntax for defining a model, where factors of interest influence a measured value.

Here, our measured value is in the column logCFU, and is assumed to be influenced by factors in the columns KO, empty, and complement (where they apply/are equal to one). We represent this with the R statement below:

logCFU ~ KO + empty + complement

which reads: “logCFU is influenced by (~) the sum of effects of KO, empty, and complement (where they apply).”
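To see what R does with this formula, we can ask for the model matrix it builds. The sketch below uses a made-up data frame called `toy` (an assumption for illustration, not the real data). Note that R adds an `(Intercept)` column of ones automatically; this plays the role of the baseline in Equation 7.5.

```r
# Toy indicator values for the four lines (made-up logCFU values)
toy <- data.frame(
  logCFU     = c(6.3, 6.0, 5.9, 6.3),
  KO         = c(0, 1, 1, 1),
  empty      = c(0, 0, 1, 1),
  complement = c(0, 0, 0, 1)
)

# Expand the formula into the columns that lm() would use for fitting
X <- model.matrix(logCFU ~ KO + empty + complement, data = toy)
print(X)
```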

7.3.3 Fit the model

To fit our data to this model, we use this model definition in the lm() function as below (specifying that the dataframe we’re using is catheter):

catheter_model <- lm(logCFU ~ KO + empty + complement, data=catheter)

Challenge

Fit the catheter data to this model in the WebR cell below.

Use the R code:

catheter_model <- lm(logCFU ~ KO + empty + complement, data=catheter)
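One way to build confidence in this approach is to fit the same model to simulated data where we already know the answer. The sketch below is an illustration under stated assumptions: the "true" effects, the noise level and the number of replicates are all made up, and `sim` and `sim_model` are hypothetical names. It is not a reanalysis of the catheter data.

```r
set.seed(1)  # make the simulated noise reproducible

# Made-up "true" effects used to simulate data (not values from the experiment)
truth <- c(baseline = 6.0, KO = -0.3, empty = -0.1, complement = 0.4)

# 50 simulated measurements per line, using the cumulative 0/1 coding
n <- 50
sim <- data.frame(
  KO         = rep(c(0, 1, 1, 1), each = n),
  empty      = rep(c(0, 0, 1, 1), each = n),
  complement = rep(c(0, 0, 0, 1), each = n)
)

# Equation 7.5: baseline plus the effects that apply, plus random error (epsilon)
sim$logCFU <- truth[["baseline"]] +
  truth[["KO"]] * sim$KO +
  truth[["empty"]] * sim$empty +
  truth[["complement"]] * sim$complement +
  rnorm(nrow(sim), mean = 0, sd = 0.2)

# Fit the same model as in the text and compare the estimates with the truth
sim_model <- lm(logCFU ~ KO + empty + complement, data = sim)
print(round(coef(sim_model), 2))
print(truth)
```

The estimates should land close to, but not exactly on, the values in `truth`, because of the random error we added.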

7.4 Interpreting the catheter model

The fitted model is stored in the variable catheter_model, and we can obtain useful information from it in a number of ways.

7.4.1 Coefficients and confidence intervals

The coefficients that the model reports are the estimated effects of each factor of interest. To obtain the coefficients, use the R code below, which extracts them from the fitted model:

coef(catheter_model)

The confidence intervals reported by the model are the ranges of values that are most plausible for the coefficients, given the data. These work the same way as confidence intervals for t-tests and similar statistical methods: the range of values bounded by the 2.5% and 97.5% confidence limits is a 95% confidence interval. If we repeated the experiment many times and calculated an interval each time, we would expect about 95% of those intervals to contain the true value of the coefficient.

To find the confidence intervals for the coefficients of the model, use the R code below:

confint(catheter_model)

Challenge

Find the estimated coefficients and corresponding confidence intervals for the catheter model in the WebR cell below.

Use the R code:

coef(catheter_model)
confint(catheter_model)
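For the curious, the 95% limits that confint() reports for a linear model are each estimate plus or minus a multiple of its standard error, where the multiple comes from the t distribution. The sketch below reproduces this by hand on made-up data; `toy`, `toy_model` and `manual` are hypothetical names, and the simulated values are not from the experiment.

```r
set.seed(2)  # reproducible toy data

# Made-up data: 10 measurements per line with the cumulative 0/1 coding
toy <- data.frame(
  KO         = rep(c(0, 1, 1, 1), each = 10),
  empty      = rep(c(0, 0, 1, 1), each = 10),
  complement = rep(c(0, 0, 0, 1), each = 10)
)
toy$logCFU <- 6 - 0.3 * toy$KO - 0.1 * toy$empty + 0.4 * toy$complement +
  rnorm(nrow(toy), mean = 0, sd = 0.3)

toy_model <- lm(logCFU ~ KO + empty + complement, data = toy)

# Ingredients of the interval: estimates, standard errors, t critical value
est   <- coef(toy_model)
se    <- sqrt(diag(vcov(toy_model)))
tcrit <- qt(0.975, df = df.residual(toy_model))

# Estimate plus or minus tcrit standard errors
manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
print(manual)

# Compare with the built-in calculation
print(confint(toy_model))
```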

7.4.2 Reading the output

You should see output that resembles the data below:

(Intercept)          KO       empty  complement 
  6.3382829  -0.3213144  -0.1465880   0.3986531 

                  2.5 %      97.5 %
(Intercept)  6.10862926 6.567936607
KO          -0.64609377 0.003464903
empty       -0.47136735 0.178191328
complement   0.07387378 0.723432458

7.4.2.1 Coefficients

The output of coef(catheter_model) is a named vector of estimates: one value for each factor of interest, plus a value for “(Intercept).”

(Intercept)          KO       empty  complement 
  6.3382829  -0.3213144  -0.1465880   0.3986531 
  • (Intercept) is the estimated logCFU for the wild-type (WT)/control line. Here this is 6.34, which looks to be a reasonable value by visual comparison with Figure 6.1.
  • The KO coefficient estimate is -0.32, which indicates that knocking out the etpD gene reduces the logCFU value by 0.32 units (as this is a log scale, that corresponds approximately to halving the number of recovered CFUs).
  • The empty coefficient estimate is -0.15, which indicates a further reduction in bacterial recovery is seen due to inserting the plasmid.
  • The complement coefficient estimate is 0.40, which is of similar size to the reduction in logCFU seen for the gene knockout but in the opposite direction. This implies that returning the gene restores bacterial recovery to approximately the original, wild-type level.
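The arithmetic behind these statements can be checked with a short calculation using the coefficients reported above. This is a sketch: the names `catheter_coef` and `predicted` are hypothetical, and the fold-change step assumes that logCFU is a base-10 logarithm.

```r
# Coefficients as reported by coef(catheter_model) above
catheter_coef <- c(`(Intercept)` = 6.3382829, KO = -0.3213144,
                   empty = -0.1465880, complement = 0.3986531)

# Effects accumulate from line to line:
# WT, then KO, then KO + empty, then KO + empty + complement
predicted <- cumsum(catheter_coef)
names(predicted) <- c("WT", "KO", "empty", "complement")
print(round(predicted, 2))

# Fold change in CFU implied by the KO effect (assuming a log10 scale);
# a value near 0.5 corresponds to roughly halving the recovered CFUs
print(round(10^catheter_coef[["KO"]], 2))
```

The estimated logCFU for the complemented line comes out close to the wild-type estimate, which is what "restoring recovery" means in terms of the model.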

7.4.2.2 Confidence intervals

The confint(catheter_model) output lists the expected 95% confidence intervals for the coefficients.

                  2.5 %      97.5 %
(Intercept)  6.10862926 6.567936607
KO          -0.64609377 0.003464903
empty       -0.47136735 0.178191328
complement   0.07387378 0.723432458

Tip

This is the more useful output for deciding whether there is a real effect due to a factor of interest.

Warning

The point estimate of a coefficient may be non-zero, implying a negative or positive effect of that factor on the measured outcome. However, if the confidence interval includes zero, it is not reasonable to exclude the possibility that there is actually no effect.

In this experiment, the confidence intervals tell us the following:

  • The (Intercept)/WT/baseline logCFU is likely to lie between 6.11 and 6.57 units
  • The effect of knocking out the gene (KO) is likely to lie between -0.646 and 0.003 units which appears to be a negative effect, but as the interval includes zero we cannot strictly rule out that there is no effect
  • Introduction of the plasmid vector changes logCFU by between -0.47 and 0.18 units which includes zero, and so we can’t rule out that there is no effect
  • Complementing the etpD gene (complement) increases logCFU by between 0.074 and 0.72 units, enhancing recovery of bacteria

Caution

These results taken together suggest that knocking out the etpD gene diminishes recovery of bacteria (implying reduced adherence) and that complementing the gene restores the recovery level. However, the statistics are equivocal because the KO confidence interval includes zero.

For the empty vector, the confidence interval also includes zero, so these data give no clear evidence that adding the vector influences logCFU.
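The "does the interval include zero?" check can also be done in code rather than by eye. The sketch below types in the confidence limits reported above; `catheter_ci` and `includes_zero` are hypothetical names. With a fitted model available, the same check could be applied to the output of confint(catheter_model) directly.

```r
# 95% confidence limits as reported by confint(catheter_model) above
catheter_ci <- rbind(
  `(Intercept)` = c(6.10862926, 6.567936607),
  KO            = c(-0.64609377, 0.003464903),
  empty         = c(-0.47136735, 0.178191328),
  complement    = c(0.07387378, 0.723432458)
)
colnames(catheter_ci) <- c("2.5 %", "97.5 %")

# An interval includes zero when its lower limit is below zero
# and its upper limit is above zero
includes_zero <- catheter_ci[, "2.5 %"] < 0 & catheter_ci[, "97.5 %"] > 0
print(includes_zero)
```

Here the check flags KO and empty as intervals that include zero, matching the interpretation given in the text.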

7.5 Fitting and interpreting the tissue model

Challenge

Fit the tissue data to this model, using the WebR cell below.

Use the same functions as for the catheter model, but be sure to use the tissue dataset instead

  • Use lm() to fit the model
  • Use coef() and confint() to find the coefficients and confidence intervals for the WT, KO, empty, and complement factors of interest

Use the R code below to fit the model

tissue_model <- lm(logCFU ~ KO + empty + complement, data=tissue)

Use the R code below to see the estimated coefficients and confidence intervals

coef(tissue_model)
confint(tissue_model)

7.5.1 Interpreting the results

Questions

What are the coefficients of the four factors of interest, and what do they imply about the effects of each factor?

  • WT/(Intercept)
  • KO
  • empty
  • complement
> coef(tissue_model)
(Intercept)          KO       empty  complement 
 6.54455199 -0.30834731 -0.06356898 -0.05806111 
  • The (Intercept) value is 6.54, which is consistent with the baseline recovery from the catheter experiment
  • The KO value is -0.31, which is negative, so consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion
  • The empty value is -0.06 which is close to zero and consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector
  • The complement value is -0.06 which is also small and close to zero, consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene

Questions

What are the confidence intervals for the coefficients of the four factors of interest, and what do they imply about the effects of each factor?

  • WT/(Intercept)
  • KO
  • empty
  • complement
> confint(tissue_model)
                 2.5 %     97.5 %
(Intercept)  6.4070598  6.6820442
KO          -0.5027906 -0.1139040
empty       -0.2580123  0.1308744
complement  -0.2525044  0.1363822
  • The (Intercept) value lies between 6.41 and 6.68, which is consistent with the baseline recovery from the catheter experiment
  • The KO value lies between -0.50 and -0.11, which are both negative so this is consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion
  • The empty confidence interval includes zero and is consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector
  • The complement confidence interval includes zero and is consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene
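One way to compare the two experiments is to add up the KO, empty and complement coefficients for each model, which gives the estimated difference between the complemented line and the wild type. The sketch below uses the coefficients reported above; the object names are hypothetical, and it uses point estimates only, so it ignores the uncertainty described by the confidence intervals.

```r
# Coefficients as reported above for the two fitted models
catheter_coef <- c(KO = -0.3213144, empty = -0.1465880, complement = 0.3986531)
tissue_coef   <- c(KO = -0.30834731, empty = -0.06356898, complement = -0.05806111)

# The complemented line carries all three effects, so they are summed to give
# its estimated difference in logCFU from the wild type
net_vs_wt <- c(catheter = sum(catheter_coef), tissue = sum(tissue_coef))
print(round(net_vs_wt, 2))
```

A value near zero suggests the complemented line behaves like the wild type; a clearly negative value suggests that recovery was not restored. This matches the pattern we saw by eye in Chapter 6, where restoring the gene appeared to make a difference only for the catheter material.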

7.6 But wait!

Back in Chapter 6 we saw that the experimental data was affected by batch effects - systematic influences on the measured values that seemed to come from the way the experiment was conducted, not the biological factors whose influence we actually wanted to measure.

Caution

Maybe the results we have just obtained aren’t as correct, or at least as accurate, as they seem because of these systematic batch effects?

We should try to account for systematic batch effects as much as possible, and we’ll do this in the next section.