8 Fitting Experimental Data

Note

It may take a couple of minutes to install the necessary packages in WebR - please be patient.

This page preloads the datasets as tissue and catheter for use in R cells.

8.1 Fitting the catheter model

We will use the R built-in function lm() to fit the model in Equation 8.1 (see Chapter 7) to our catheter dataset in the WebR cells below.

\[ \textrm{measured logCFU (complement)} = \textrm{baseline} + \textrm{knockout} + \textrm{empty} + \textrm{complement} + \epsilon \tag{8.1}\]

8.1.1 Load the data

Important

We have preloaded the data for you as the dataframes tissue and catheter.

Use the WebR cell below to confirm that the catheter data is loaded.

Note that the data contains three columns: KO, empty, and complement that contain either a 1 or 0 value.

These columns describe whether the logCFU measurement in the row corresponds to a line that is affected by a gene knockout (KO = 1), the presence of a plasmid vector (empty = 1), or the reintroduced gene (complement = 1). These columns allow us to use the lm() function to estimate the influence of each biological intervention.

Note

We assume that measurements of the bacteria all share the same (wild-type) baseline. This will be represented by the intercept in our linear model fit.

8.1.2 Define the model

R uses a specific syntax for defining a model, where factors of interest influence a measured value.

Here, our measured value is in the column logCFU, and is assumed to be influenced by factors in the columns KO, empty, and complement (where they apply/are equal to one). We represent this with the R statement below:

logCFU ~ KO + empty + complement

which reads: “logCFU is influenced by (~) the sum of the effects of KO, empty, and complement (where they apply).”

Note

In statistical jargon, logCFU is the dependent variable, because its value is assumed to depend on the values of KO, empty, and complement.

The values of KO, empty, and complement are under experimenter control and not dependent on anything else, so these are called independent variables.
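To see what the formula asks R to do, it can help to look at the design matrix R builds from it. The sketch below uses a hypothetical four-row data frame, `toy_lines`, with one row per bacterial line; it is not the course data. It also assumes the indicator columns are cumulative, so that the complemented line has KO, empty, and complement all set to 1, as Equation 8.1 suggests.

```r
# Hypothetical toy data: one row per bacterial line (NOT the course dataset).
# Assumption: indicators are cumulative, so the complemented line also has
# KO = 1 and empty = 1.
toy_lines <- data.frame(
  line       = c("WT", "KO", "KO+empty", "KO+complement"),
  KO         = c(0, 1, 1, 1),
  empty      = c(0, 0, 1, 1),
  complement = c(0, 0, 0, 1)
)

# model.matrix() expands the right-hand side of the formula into columns.
# The "(Intercept)" column of ones is added automatically and plays the
# role of the shared wild-type baseline.
design <- model.matrix(~ KO + empty + complement, data = toy_lines)
rownames(design) <- toy_lines$line
print(design)
```

Each row shows which effects are summed to predict logCFU for that line: only the intercept for WT, and all four terms for the complemented line.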

8.1.3 Fit the model

To fit our data to this model, we use this model definition in the lm() function as below (specifying that the dataframe we’re using is catheter) to create a result that is stored in catheter_model:

catheter_model <- lm(logCFU ~ KO + empty + complement, data=catheter)

Challenge

Fit the catheter data to this model in the WebR cell below.

Use the R code:

catheter_model <- lm(logCFU ~ KO + empty + complement, data=catheter)
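One way to build confidence in what lm() is doing is to fit the same model to simulated data where the answer is known in advance. The sketch below is only an illustration: the data frame `sim`, the values in `true_effects`, the noise level, and the cumulative coding of the indicator columns are all assumptions, not properties of the catheter dataset.

```r
set.seed(42)  # make the simulated noise reproducible

# Hypothetical "true" effects, chosen only for illustration.
true_effects <- c(baseline = 6.3, KO = -0.3, empty = -0.1, complement = 0.4)

# 50 simulated measurements for each of four lines, with cumulative
# indicators (assumed coding: the complemented line also has KO = 1
# and empty = 1).
n_per_line <- 50
sim <- data.frame(
  KO         = rep(c(0, 1, 1, 1), each = n_per_line),
  empty      = rep(c(0, 0, 1, 1), each = n_per_line),
  complement = rep(c(0, 0, 0, 1), each = n_per_line)
)

# Build logCFU as baseline + effects + random noise (the epsilon term).
sim$logCFU <- true_effects[["baseline"]] +
  true_effects[["KO"]] * sim$KO +
  true_effects[["empty"]] * sim$empty +
  true_effects[["complement"]] * sim$complement +
  rnorm(nrow(sim), mean = 0, sd = 0.1)

# Same model definition as for the catheter data.
sim_model <- lm(logCFU ~ KO + empty + complement, data = sim)
print(round(coef(sim_model), 2))
```

The fitted coefficients should land close to the values in `true_effects`, with the intercept recovering the baseline.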

8.2 Interpreting the catheter model

The fitted model is stored in the variable catheter_model, and we can obtain useful information from it in a number of ways.

8.2.1 Coefficients and confidence intervals

The coefficients that the model reports are the estimated effects of each factor of interest, i.e. the intercept and each independent variable. To obtain the coefficients, use the R code below to extract them from the fitted model:

coef(catheter_model)

The confidence intervals reported by the model are the ranges of values that are most plausible for each coefficient, given the data. These work the same way as confidence intervals for t-tests and similar statistical methods: the range of values bounded by the 2.5% and 97.5% confidence limits is a 95% confidence interval. If we repeated the experiment many times and calculated an interval each time, we would expect the true value of the coefficient to lie within the interval in about 95 out of 100 cases.

To find the confidence intervals for the coefficients of the model, use the R code below:

confint(catheter_model)

Challenge

Find the estimated coefficients and corresponding confidence intervals for the catheter model in the WebR cell below.

Use the R code:

coef(catheter_model)
confint(catheter_model)
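For a model fitted with lm(), the 95% limits from confint() are the estimate plus or minus a t-distribution critical value multiplied by the standard error of the estimate. The sketch below checks this on a small simulated dataset; the data frame `toy` and its effect sizes are hypothetical and are not the course data.

```r
set.seed(1)  # reproducible noise

# Hypothetical toy data: 10 wild-type and 10 knockout measurements.
toy <- data.frame(KO = rep(c(0, 1), each = 10))
toy$logCFU <- 6.3 - 0.3 * toy$KO + rnorm(nrow(toy), mean = 0, sd = 0.2)

fit <- lm(logCFU ~ KO, data = toy)

est   <- coef(fit)                          # point estimates
se    <- sqrt(diag(vcov(fit)))              # standard errors
tcrit <- qt(0.975, df = df.residual(fit))   # critical value for 95% limits

# Lower and upper limits calculated by hand.
manual <- cbind(est - tcrit * se, est + tcrit * se)

# Compare with confint(), ignoring row and column labels.
matches <- all.equal(unname(manual), unname(confint(fit)))
print(matches)
```

Wider intervals therefore come from larger standard errors (noisier data or fewer measurements), not from the size of the estimate itself.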

8.2.2 Reading the output

You should see output that resembles the output below:

(Intercept)          KO       empty  complement 
  6.3382829  -0.3213144  -0.1465880   0.3986531 

                  2.5 %      97.5 %
(Intercept)  6.10862926 6.567936607
KO          -0.64609377 0.003464903
empty       -0.47136735 0.178191328
complement   0.07387378 0.723432458

8.2.2.1 Coefficients

The output of coef(catheter_model) shows an estimated value for each factor of interest, plus a value for “(Intercept).”

(Intercept)          KO       empty  complement 
  6.3382829  -0.3213144  -0.1465880   0.3986531

(Intercept) is the wild-type (WT)/control logCFU value coefficient estimate. Here this is 6.34, which looks to be a reasonable value by visual comparison with Figure 6.1.
The KO coefficient estimate is -0.32, which indicates that knocking out the etpD gene reduces the logCFU value by 0.32 units (as this is a log scale, that corresponds approximately to halving the number of recovered CFUs).
The empty coefficient estimate is -0.15, which indicates a further reduction in bacterial recovery is seen due to inserting the empty plasmid, even though it doesn’t contain any genes we would think are active in adherence.
The complement coefficient estimate is 0.40 - of the same order of magnitude as the reduction in logCFU seen for the gene knockout. This implies that returning the gene also restores bacterial recovery to approximately the original, wild-type level.
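Because the coefficients are on a log scale, they can be converted into fold changes in recovered CFUs. The sketch below assumes that logCFU is a base-10 logarithm, which this chapter does not state explicitly, and copies the estimates from the output above into a hypothetical vector called `estimates`.

```r
# Coefficient estimates copied from the coef(catheter_model) output above.
estimates <- c(KO = -0.3213144, empty = -0.1465880, complement = 0.3986531)

# Assumption: logCFU is log10(CFU), so an effect of x log units
# multiplies the CFU count by 10^x.
fold_change <- 10^estimates
print(round(fold_change, 2))

# Net shift of the complemented line relative to wild type, assuming it
# carries all three effects (KO + empty + complement), as in Equation 8.1.
net_shift <- sum(estimates)
print(round(net_shift, 3))
```

Under these assumptions the KO effect corresponds to roughly half as many recovered CFUs, and the three effects nearly cancel for the complemented line.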

8.2.2.2 Confidence intervals

The confint(catheter_model) output lists the 95% confidence intervals for the coefficients.

                  2.5 %      97.5 %
(Intercept)  6.10862926 6.567936607
KO          -0.64609377 0.003464903
empty       -0.47136735 0.178191328
complement   0.07387378 0.723432458

Tip

This is the more useful output for deciding whether there is a real effect due to a factor of interest.

Warning

The point estimate of a coefficient may be non-zero, implying a negative or positive effect of that factor on the measured outcome. However, if the confidence interval includes zero, it is not reasonable to exclude the possibility that there is, in reality, no effect on the outcome.

In this experiment, the confidence intervals tell us the following:

The (Intercept)/WT/baseline logCFU is likely to lie between 6.11 and 6.57 units.
The effect of knocking out the etpD gene (KO) is likely to lie between -0.65 and 0.003 units. This appears to be a negative effect but, as the interval includes zero, we cannot strictly rule out that there is no effect.
Introduction of the plasmid vector (empty) changes logCFU by between -0.47 and 0.18 units. This range includes zero, and so we can’t rule out that there is no effect due to introducing the empty vector.
Complementing the etpD gene (complement) increases logCFU by between 0.07 and 0.72 units, enhancing recovery of bacteria.
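The "does the interval include zero?" check can also be done in code. The sketch below types the limits from the confint(catheter_model) output above into a hypothetical matrix called `ci`; with the fitted model available, the same test could be applied directly to the result of confint().

```r
# Confidence limits copied from the confint(catheter_model) output above.
ci <- rbind(
  "(Intercept)" = c(6.10862926, 6.567936607),
  KO            = c(-0.64609377, 0.003464903),
  empty         = c(-0.47136735, 0.178191328),
  complement    = c(0.07387378, 0.723432458)
)
colnames(ci) <- c("2.5 %", "97.5 %")

# An interval excludes zero when both limits lie on the same side of zero.
excludes_zero <- ci[, "2.5 %"] > 0 | ci[, "97.5 %"] < 0
print(excludes_zero)
```

Only the intercept and complement intervals exclude zero here; the KO interval misses doing so by a very small margin at its upper limit.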

Caution

These results taken together suggest that knocking out the etpD gene diminishes recovery of bacteria (implying reduced adherence) and that complementing the gene restores the recovery level. However, the statistics are equivocal because the KO confidence interval includes zero and the observed change may just be due to happenstance. This may be surprising.

The results are, however, consistent with addition of the empty vector having no strong influence on logCFU, as its confidence interval includes zero.

8.3 Fitting and interpreting the tissue model

Challenge

Fit the tissue data to this model, using the WebR cell below.

I need a hint

Use the same functions as for the catheter model, but be sure to use the tissue dataset instead.

Use lm() to fit the model
Use coef() and confint() to find the coefficients and confidence intervals for the WT, KO, empty, and complement factors of interest

Help, I’m stuck!

Use the R code below to fit the model

tissue_model <- lm(logCFU ~ KO + empty + complement, data=tissue)

Use the R code below to see the estimated coefficients and confidence intervals

coef(tissue_model)
confint(tissue_model)

8.3.1 Interpreting the results

Questions

What are the coefficients of the four factors of interest, and what do they imply about the effects of each factor?

WT/(Intercept)
KO
empty
complement

Answers

> coef(tissue_model)
(Intercept)          KO       empty  complement 
 6.54455199 -0.30834731 -0.06356898 -0.05806111

The (Intercept) value is 6.54, which is consistent with the baseline recovery from the catheter experiment
The KO value is -0.31, which is negative, so consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion
The empty value is -0.06 which is close to zero and consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector
The complement value is -0.06 which is also small and close to zero, consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene

Questions

What are the confidence intervals for the coefficients of the four factors of interest, and what do they imply about the effects of each factor?

WT/(Intercept)
KO
empty
complement

Answers

> confint(tissue_model)
                 2.5 %     97.5 %
(Intercept)  6.4070598  6.6820442
KO          -0.5027906 -0.1139040
empty       -0.2580123  0.1308744
complement  -0.2525044  0.1363822

The (Intercept) value lies between 6.41 and 6.68, which is consistent with the baseline recovery from the catheter experiment
The KO value lies between -0.50 and -0.11. The entire range is negative so this is consistent with etpD deletion resulting in reduced recovery of bacteria/reduced adhesion
The empty confidence interval includes zero and is consistent with there being no strong effect on logCFU due to incorporation of the plasmid vector
The complement confidence interval includes zero and so is consistent with there being no strong effect on logCFU due to reintroduction of the etpD gene

8.4 But wait!

Back in Chapter 6 we saw that the experimental data was affected by batch effects - systematic influences on the measured values that seemed to come from the way the experiment was conducted, not the biological factors whose influence we actually wanted to measure.

Caution

Maybe the results we have just obtained aren’t as correct, or at least as accurate, as they seem because of these systematic batch effects?

We should try to account for systematic batch effects as much as possible, and we’ll do this in the next section, Chapter 9.