1 What Does Statistics Do, Anyway?

When researchers carry out experiments, they typically collect data. This data usually involves at least two measurements of each of a number of individuals (or experimental units).

Sometimes “individuals” means actual individual people, but it might also mean individual medications, groups of people, or something else, depending on experimental context.

Often, one of the measurements is treated as a response variable - some kind of outcome that the researcher is interested in. For instance, if we were to measure pupil dilation as a result of administering a medication, we would treat the amount of dilation as the response variable. In terms of experimental design, the researcher is attempting to determine how the response variable depends on - or is explained by - the other measured variables. The response variable is sometimes called the dependent variable because it depends on the other variables in the experiment.

The measured variables that might explain the behaviour of the response variable (i.e. those variables on which the dependent variable might depend) are called explanatory variables. For instance, the pupil dilation in our example might depend on: the amount of drug given; the sex of the patient; the age of the patient; whether the drug was delivered orally or intravenously; and so on. Explanatory variables in an experiment might be measured (i.e. properties of the individual that are observed, like patient sex) or assigned (e.g. the researcher gives a drug or a placebo to different groups of individuals). As the explanatory variables are considered not to depend on the other variables in the experiment, you may see them called independent variables.

  • Independent variable: equivalent to explanatory variable. Often under the experimenter’s direct control, and typically plotted on the \(x\)-axis.
  • Dependent variable: equivalent to response variable. Not under the experimenter’s direct control, and usually considered an outcome of an experiment whose value is influenced by one or more independent variables. Typically plotted on the \(y\)-axis.

Explanatory variables explain response variables.

Dependent variables depend on independent variables.

Figure 1.1: Data from an experiment assessing the effect of two treatment conditions on crop yield, in comparison to a ‘control’ (no treatment). weight is the response variable, group is the (assigned) explanatory variable.
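
If you would like to produce a plot of this kind yourself, here is a minimal sketch, assuming data similar to R's built-in PlantGrowth dataset (dried plant weight under a control and two treatment conditions); it is an illustration, not the code used to draw Figure 1.1.

library(ggplot2)

# boxplot of the response variable (weight) for each level of the
# assigned explanatory variable (group); PlantGrowth ships with R
ggplot(PlantGrowth, aes(x=group, y=weight)) +
  geom_boxplot(fill="cornflowerblue") +
  xlab("group (explanatory variable)") +
  ylab("weight (response variable)")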

1.1 “Understanding Questions” and “Prediction Questions”

Statistics allows us to calculate answers to many kinds of questions. These often belong to one of two key types: understanding questions, and prediction questions.

  • understanding questions might take the form of determining whether changing the value of an explanatory variable affects the value of the response variable. The question might be asking whether changing the explanatory variable affects the response variable at all or, if it does, how much of an effect there is.
  • prediction questions also need to determine a similar relationship between explanatory and response variables, but they then seek to use those relationships to predict future behaviour that hasn’t been measured yet.
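
As a minimal sketch of the distinction (with made-up sunlight/plant-height data, anticipating the example used later in this section), a single fitted linear model can serve both kinds of question: its coefficients address the understanding question, while predict() addresses the prediction question.

set.seed(2)
sunlight = runif(30, 2, 12)                       # explanatory variable (hours of sunlight, made up)
height   = 4 + 1.5 * sunlight + rnorm(30, sd=2)   # response variable (plant height, made up)

fit = lm(height ~ sunlight)                       # model the relationship as a straight line

# understanding: is there an effect of sunlight on height, and how large is it?
summary(fit)$coefficients

# prediction: what heights would we expect for sunlight values not yet measured?
predict(fit, newdata=data.frame(sunlight=c(5, 9)))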

1.2 Models

In all questions, whether for understanding or prediction, we rely on modelling: modelling the behaviours of the variables we measure, and modelling the interactions between those variables - such as the way that explanatory variables influence the response variable value. Modelling is a broad, complex field, but for the purpose of this workshop we can consider models to be simplified representations of a system, including its data and relationships between elements of the system.

1.2.1 Representing data with a model

Firstly, although we measure data in our experiments, some statistical approaches (called parametric methods) do not use the raw data directly to calculate their result. Instead, they use the data to calculate values that are parameters of a distribution (e.g. mean and standard deviation). The corresponding distribution is an idealised model of the measurements, and it simplifies our calculations to represent our data with this model instead of with the actual individual datapoints.

If you have ever performed a t-test or similar statistical analysis, based on the means and standard deviations of two datasets, you were using models of your data to calculate the statistical result.

The means and standard deviations you used in your t-test were parameters that described an idealised model of your data, in place of using the raw measurements themselves.
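
As a minimal sketch (with made-up data), the classical two-sample t statistic can be computed from the group means, standard deviations, and sample sizes alone, without revisiting the raw datapoints:

set.seed(42)
x = rnorm(20, mean=10, sd=3)               # 'measured' values, group 1 (made up)
y = rnorm(20, mean=12, sd=3)               # 'measured' values, group 2 (made up)

m1 = mean(x); s1 = sd(x); n1 = length(x)   # parameters describing group 1
m2 = mean(y); s2 = sd(y); n2 = length(y)   # parameters describing group 2

# pooled standard deviation, then the t statistic, from the parameters only
sp = sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
t_stat = (m1 - m2) / (sp * sqrt(1/n1 + 1/n2))

t_stat
t.test(x, y, var.equal=TRUE)$statistic     # same value, computed by R from the raw data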

Click to toggle a wee bit of historical context

Statistical approaches can be computationally expensive. So, before computational approaches were common, we used to rely on books of statistical tables that contained precomputed values for statistical distributions. Being able to model our rough, raw data as these distributions meant that we could then use the values from the statistical tables to represent our data “cleanly” and quickly. This simplified matters very much and, naturally, the teaching of statistics (especially applied statistics) focused on methods that could use the values in these tables. These methods are - broadly - the “classical” statistical tests you were probably first taught. In essence, these tests could be performed readily, so they were taught first.

Today, computationally expensive approaches like Monte Carlo methods and Bayesian statistics have become much more accessible; they can use the original data directly, and are readily applied to more complex datasets and questions than the classical methods could handle.

A key concern in parametric methods is ensuring that the choice of model distribution to represent an experimental dataset is appropriate; a supplementary notebook explores how common distributions represent data. Figure 1.2, below, shows a set of 200 measured datapoints and the frequency with which particular values were seen to occur, as a histogram. There is also a curve describing the expected distribution of frequencies of values (in this case, a Normal distribution).

Notice that the sampled data does not follow the shape of the curve exactly.

Statistical methods that use a model (like a Normal distribution) to represent the data are called parametric methods. Those which use data directly and do not represent it as an idealised model are called nonparametric methods.
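
As a minimal sketch with made-up data, R provides both kinds of test for comparing two groups: t.test() represents each group with a Normal model (parametric), while wilcox.test() works from the ranks of the data rather than a fitted distribution (nonparametric).

set.seed(1)
x = rnorm(15, mean=10, sd=3)   # made-up measurements, group 1
y = rnorm(15, mean=13, sd=3)   # made-up measurements, group 2

t.test(x, y)        # parametric: models each group by a Normal distribution (mean, sd)
wilcox.test(x, y)   # nonparametric: rank-based, no distributional model of the data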


Figure 1.2: Two hundred ‘measured’ data values (histogram) and the corresponding Normal distribution model (curve) that generated those values. The modelled distribution is entirely described by the parameters: mean=10 (µ) and standard deviation=3 (σ). We can see here that the frequencies of the real data values don’t match the curve exactly; the model simplifies our representation of the ‘noisy’ data. The dashed line represents the mean value of the model, and the dotted lines represent values at +/- one and two standard deviations from the mean.

Please take some time to explore a statistical distribution interactively in your browser, at the link below.

In this session, you can explore the way in which the parameters of a statistical distribution alter its shape, and how well the distribution appears to represent measured data, depending on how much data you are able to collect.

BEWARE OF HISTOMANCY!

One strategy for choosing a distribution to represent your data is to plot a histogram of that data and, by looking at it, choose an appropriately-shaped distribution that looks “similar”. In his book Statistical Rethinking, Richard McElreath refers to this as “Histomancy”, by analogy to haruspicy, necromancy and other fake magic arts that claimed to disclose hidden truths by looking at entrails, or speaking with the dead.

As you will have seen from the interactive session, even data that is drawn specifically from a Normal Distribution does not always look like a Normal Distribution when presented as a histogram. In general, it may very well be that a given choice of distribution is good or poor, but an appropriate choice cannot be determined only by examining histograms, or by guesswork. We need to apply sound principles to our choice of statistical methods.
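
A minimal sketch of this point: small samples drawn directly from a Normal distribution can produce histograms that look distinctly non-Normal, which is why histogram-gazing alone is unreliable.

set.seed(7)
par(mfrow=c(1, 3))                                 # three histograms side by side
for (n in c(15, 50, 500)) {
  hist(rnorm(n, mean=10, sd=3),                    # n draws from a Normal(mean=10, sd=3)
       main=paste("n =", n), xlab="value")
}
par(mfrow=c(1, 1))                                 # reset the plotting layout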

One of the supplementary notebooks describes how the way you collect and measure data determines the appropriate statistical distribution.

Remember, even an apparently simple process, like describing a dataset by its mean/average value and a standard deviation, is modelling the dataset.

1.2.2 Representing a relationship with a model

The actual relationship between explanatory and response variables (for example, increasing the value of the explanatory variable increases the value of the response variable - like putting more tins of beans in a bag increases the weight of that bag) can often be described by a simplified representation of that relationship. This simplified relationship is also a model.

For example, if we suspect that the amount of sunlight received by a plant influences its growth, we might try to fit a straight line describing the relationship between measured values of “plant height” (response variable) and “hours of sunlight received” (explanatory variable). Here, the line of best-fit is a model of the relationship between those variables, reflecting our belief in a linear relationship between the two variables. The inferred values of intercept and gradient (or slope) for the straight line are parameters of the model describing the relationship.

A researcher’s choice of model (and of statistical method) embodies a set of beliefs about the processes that generated the data. In effect, they represent the researchers’ beliefs about the natural world or, at least, about the experiment that was performed.


Figure 1.3: Thirty datapoints indicating a relationship between an explanatory variable and a response variable. Datapoints are shown as blue dots. The relationship is modelled by the orange straight line (obtained with linear regression), which has the gradient and intercept shown in the figure. The model simplifies our representation of the noisy relationship between the explanatory and response variables, and reflects the researcher's belief that this relationship is best described by a straight line.

Much of statistical analysis of relationships between variables is about estimating (or fitting) the values of parameters for these models, and exploring the implications of the fitted parameters that result.

For example, if we fit a straight-line relationship to explain the change in weight of a bag filled with tins of beans, as we fill it with different numbers of tins, we would obtain a value for the gradient, and the intercept. The gradient might be interpreted as the average mass of a single tin of beans; the intercept might be interpreted as the mass of the empty bag. The parameters of the model are interpretable as real-world concepts.
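
As a minimal sketch (with made-up weights), fitting this straight line in R and reading off the parameters might look like:

set.seed(3)
n_tins     = 0:10                                       # explanatory variable: tins in the bag
bag_weight = 150 + 400 * n_tins + rnorm(11, sd=20)      # response: grams (made up: ~150 g bag, ~400 g per tin)

fit = lm(bag_weight ~ n_tins)
coef(fit)   # intercept ~ mass of the empty bag; gradient ~ average mass of one tin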


Figure 1.4: Model parameters can be very helpful if they relate to real-world concepts or phenomena.

Please take some time to explore a statistical relationship interactively in your browser, at the link below.

In this session, you can explore the way in which the parameters of a statistical model alter how well it appears to fit measured data, what we mean by a “good” and “bad” fit, and how we represent that.

Whenever we fit a curve to our data, the curve is a model and we are representing the relationship between the data as a model. The parameters (e.g. intercept and slope for a linear regression; exponent for an exponential curve) are then estimated parameters for the model. A “good” model will have parameters that are understandable in terms of the real-world, or the system under study.
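
As a minimal sketch (with made-up data), the same idea applies to a non-linear model: fitting an exponential curve y = a * exp(b * x) with nls() estimates the parameters a and b.

set.seed(4)
x = seq(0, 5, length.out=40)                  # explanatory variable
y = 2 * exp(0.6 * x) + rnorm(40, sd=2)        # response with measurement noise (made up)

fit = nls(y ~ a * exp(b * x), start=list(a=1, b=0.5))   # fit the exponential model
coef(fit)   # estimated parameters: a (scale) and b (exponent / growth rate)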

2 R code used in this notebook

Click to toggle R code for Figure 1.2

The R code below plots the Normal distribution from Figure 1.2, with a random sample of 200 “measured” values drawn from that distribution, using ggplot2.

library(ggplot2)                 # load the plotting library

figbg = "white"                  # figure background colour (assumed value; set in the notebook's setup elsewhere)

mu = 10                          # define mean of distribution
sd = 3                           # define standard deviation of distribution
breaks = seq(0, 20, by=0.4)      # set up the breakpoints between bars in the histogram

dfm_hist = data.frame(vals=rnorm(200, mean=mu, sd=sd))               # create a data frame

p = ggplot(dfm_hist, aes(x=vals))                                    # set up the ggplot with data
p = p + geom_histogram(aes(y=..density..),                           # add a histogram layer
                       breaks=breaks,
                       fill="cornflowerblue")
p = p + stat_function(fun=dnorm, args=list(mean=mu, sd=sd),          # add Normal curve layer
                      geom="line")
p = p + annotate("segment",                                          # show mean as dashed line
                 x=mu, xend=mu, y=0, yend=0.2,
                 colour="darkorange1", size=1, linetype="dashed")
p = p + annotate("segment",                                          # show std devs as dotted lines
                 x=c(mu-2*sd, mu-sd, mu+sd, mu+2*sd),
                 xend=c(mu-2*sd, mu-sd, mu+sd, mu+2*sd),
                 y=c(0, 0, 0, 0), yend=c(0.2, 0.2, 0.2, 0.2),
                 colour="goldenrod", size=1, linetype="dotted")
p = p + annotate("text",                                             # annotate the lines
                 x=c(mu-2*sd, mu-sd, mu, mu+sd, mu+2*sd),
                 y=c(0.21, 0.21, 0.21, 0.21, 0.21),
                 colour="darkorange3",
                 label=c("µ - 2σ", "µ - σ", "µ", "µ + σ", "µ + 2σ"))
p = p + xlim(mu - 3 * sd, mu + 3 * sd)                               # set x-axis limits
p = p + xlab("measured variable") + ylab("frequency")                # add axis labels
p = p + theme(plot.background = element_rect(fill = figbg,           # colour background
                                             color = figbg))
p                                                                    # show plot

Click to toggle R code for Figure 1.3

The R code below plots the linear relationship from Figure 1.3, with a random sample of 30 “measured” values, using ggplot2.

library(dplyr)       # provides %>% and mutate() (ggplot2 and figbg are set up in the previous code block)

n_samples = 30       # number of measured samples

# parameters for linear relationship
slope = 1.7          # slope of relationship
intcp = 5            # intercept of relationship

# parameters for measurement noise; assumed to be Normally distributed
# note, our representation of noise is also a model
mu = 0               # mean measurement error
std = 8              # standard deviation of measurement error

# generate x values
data = data.frame(x=runif(n_samples, 0.4, 20))
# generate "true" y values
data = data %>% mutate(y=(x * slope) + intcp)
# generate "measured" y values
data = data %>% mutate(ym = y + rnorm(n_samples, mu, std))

# predict linear relationship
fitlm = lm(ym ~ x, data=data)       # fit a straight line
m_coeff = fitlm$coefficients[2]     # gradient
c_coeff = fitlm$coefficients[1]     # intercept
data$predlm = predict(fitlm)        # add datapoints corresponding to line
predslm = predict(fitlm, interval="confidence")  # obtain confidence intervals
data = cbind(data, predslm)         # complete dataset

# plot relationships
p = ggplot(data, aes(x=x, y=ym))
p = p + geom_point(colour="cornflowerblue")
#p = p + geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.15)
p = p + geom_line(aes(y=predlm), size=1, colour="darkorange3")
p = p + annotate("text",                                             # annotate the model
                 x=2,
                 y=40,
                 hjust=0,
                 colour="darkorange3",
                 label=paste("gradient = ", format(round(m_coeff, 2), nsmall=2),
                             ", intercept = ", format(round(c_coeff, 2), nsmall=2)))
p = p + xlab("measured variable") + ylab("response variable")        # add axis labels
p = p + theme(plot.background = element_rect(fill = figbg,           # colour background
                                             color = figbg))

p