7  Modelling Experimental Data

7.1 Introduction

In Chapter 6 we used boxplots and scatterplots to visualise the data obtained from the adherence experiments with the wild-type (WT), knockout (KO), empty vector (empty), and complemented gene (complement) lines.

We saw that knocking out the etpD gene appeared to affect the adherence (stickiness) of bacteria to both human tissue and the catheter material. However, restoring the etpD gene appeared to make a difference only for adherence to the catheter material.

Important

We also observed an apparent batch effect in both sets of experiments, where the results obtained appeared to be strongly affected by which experimental batch they were obtained in.

We should like to remove the influence of these batch effects, as they would be expected to make our results less reliable as a representation of the biological influence of the etpD gene.

(We will do this later, in Chapter 9.)

Caution

All our conclusions so far have been drawn from visual inspection of graphs, but we can do better than this by modelling the influence of biological factors of interest on our outcome (i.e. logCFU/mL) statistically.

By using statistical modelling approaches, we can estimate the quantitative influence of (i) knocking out the gene, (ii) introducing an empty plasmid, and (iii) reintroducing the gene as a complement. We shall do this below.

7.2 The Model

We can consider the measured “stickiness” or adherence of the bacteria to be composed of the combined influence of a number of factors. We can use our experimental results and a statistical method called linear modelling (or linear regression) to estimate the amount of influence each factor has.

First though, we start with a baseline level of “stickiness…”

7.2.1 The baseline: wild-type adherence (WT)

The wild-type (control) line is expected to display a baseline level of adherence for the bacteria: how “sticky” the bacterium is when we don’t interfere with it.

So, when we measure the WT line we are establishing the natural baseline “stickiness” of the bacterium, measured as logCFU recovered from the substrate. In statistical terms we might make this into an equation like:

\[ \textrm{measured logCFU (WT)} = \textrm{baseline} \tag{7.1}\]

and Equation 7.1 says “the logCFU we measure for any wild-type bacterial sample is representative of the baseline, wild-type organism.”

Important

In reality though, we always expect some completely random experimental measurement error, which we can represent with the symbol \(\epsilon\) (epsilon) as Equation 7.2:

\[ \textrm{measured logCFU (WT)} = \textrm{baseline} + \epsilon \tag{7.2}\]

i.e. “the logCFU we measure for any wild-type bacterial sample is representative of the baseline, wild-type organism, plus or minus some measurement error (\(\epsilon\)).”

(statistics is all about estimating the representative value, in the presence of errors like this)

7.2.2 The first intervention: knocking out a gene of interest (KO)

If we don’t knock out a gene from a bacterium it is just the same as the wild type, and we’d expect the logCFU we recover from the substrate to be the baseline logCFU.

But, if we do knock out a gene, we can represent the measured logCFU for that bacterium by Equation 7.3:

\[ \textrm{measured logCFU (KO)} = \textrm{baseline} + \textrm{knockout} + \epsilon \tag{7.3}\]

which says “the logCFU we measure for the bacterium is the baseline level, plus the effect of knocking out the gene (plus random error).”

Essentially the difference in measured logCFU between the WT and KO groups is taken to be due to the effect of knocking out the gene of interest (plus or minus a bit of measurement error).
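As a worked illustration of that statement, suppose the random errors average out to roughly zero across many measurements (an assumption made here for the sake of the example). Subtracting the average of Equation 7.2 from the average of Equation 7.3 then leaves only the knockout term:

\[ \textrm{mean logCFU (KO)} - \textrm{mean logCFU (WT)} \approx (\textrm{baseline} + \textrm{knockout}) - \textrm{baseline} = \textrm{knockout} \]

So a negative value for the knockout term would correspond to the KO line being less "sticky" than the wild type, and a positive value to it being more "sticky."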

7.2.3 Introducing the empty vector (empty)

Introducing an empty plasmid vector puts strain on a bacterium and might affect its “stickiness,” even if only indirectly. We only introduce the empty vector into the knockout strain as a control for complementation, so the measured logCFU for that bacterium is that of the baseline plus the effect of the knockout, and any effect of including the empty vector (plus random error). We can describe this in Equation 7.4:

\[ \textrm{measured logCFU (empty)} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \epsilon \tag{7.4}\]

which says “the logCFU we measure for the bacterium is the baseline level, plus the effect of knocking out the gene of interest, plus the effect of introducing an empty vector (plus or minus some measurement error).”

7.2.4 Complementing the gene (complement)

Finally, reintroducing the gene on a plasmid vector (complementing the gene) is expected to change the logCFU recovered. The effect of introducing the gene is added on to the effect of knocking it out and introducing an empty plasmid, and is described in Equation 7.5:

\[ \textrm{measured logCFU (complement)} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \textrm{complement} + \epsilon \tag{7.5}\]
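One way to see how Equations 7.2 to 7.5 fit together is to write them as a single equation with on/off switches. The switch symbols \(x_{\textrm{KO}}\), \(x_{\textrm{empty}}\) and \(x_{\textrm{comp}}\) are notation introduced here for illustration only; each takes the value 1 if that intervention was applied to the line and 0 if it was not:

\[ \textrm{measured logCFU} = \textrm{baseline} + x_{\textrm{KO}} \cdot \textrm{knockout} + x_{\textrm{empty}} \cdot \textrm{empty} + x_{\textrm{comp}} \cdot \textrm{complement} + \epsilon \]

Under this notation the WT line has switches (0, 0, 0), the KO line (1, 0, 0), the empty line (1, 1, 0), and the complement line (1, 1, 1). Setting the switches for each line recovers Equations 7.2, 7.3, 7.4 and 7.5 respectively.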

7.2.5 So what?

By now you might be thinking: “we’ve made an equation that represents the quantitative influence of knocking out a gene, introducing an empty plasmid, and complementing the gene, but so what? How do we get numbers for this?”

To get those numbers, we can use a statistical technique called linear modelling (aka linear regression). It is very powerful and useful, and remarkably straightforward to use in R.
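To give a flavour of what this can look like, here is a minimal sketch using simulated data. Everything in it is an assumption made for illustration: the "true" effect sizes are invented, the column names (`knockout`, `empty`, `complement`, `logCFU`) are hypothetical, and the 0/1 coding is just one way to mirror Equations 7.2 to 7.5. It is not necessarily how the real data are modelled in later chapters.

```r
# Simulated data only: every number below is invented for illustration
set.seed(1)
line_name <- rep(c("WT", "KO", "empty", "complement"), each = 10)

# 0/1 columns recording which interventions each line has received,
# mirroring the cumulative structure of Equations 7.2 to 7.5
dat <- data.frame(
  line       = line_name,
  knockout   = as.numeric(line_name %in% c("KO", "empty", "complement")),
  empty      = as.numeric(line_name %in% c("empty", "complement")),
  complement = as.numeric(line_name == "complement")
)

# Hypothetical "true" values used to generate the fake measurements
true_baseline   <- 6.0
true_knockout   <- -1.0
true_empty      <- 0.1
true_complement <- 0.8

# measured logCFU = baseline + effects of interventions + random error
dat$logCFU <- true_baseline +
  true_knockout   * dat$knockout +
  true_empty      * dat$empty +
  true_complement * dat$complement +
  rnorm(nrow(dat), mean = 0, sd = 0.2)  # epsilon

# Fit the linear model; the intercept plays the role of the baseline
fit <- lm(logCFU ~ knockout + empty + complement, data = dat)
coef(fit)
```

Because the data were simulated from known values, the estimated coefficients can be compared against the values that generated them, which is a useful sanity check before applying the same idea to real measurements.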

Note

We will walk through the linear regression fitting process for the catheter data in Chapter 8, but modelling the tissue data is left as an exercise for you to solve.