1 Introduction

Much of practical statistics in the literature still revolves around the idea of the Null Hypothesis Significance Test. On the face of it, this is quite similar to Karl Popper’s idea that falsifiability is an important characteristic of a useful experiment.

1.1 Falsificationism

The idea behind falsificationism and hypothesis testing is straightforward:

  • Your experiment should be able to disprove your hypothesis, not merely confirm it.

This is stated formally as:

  • We propose a hypothesis \(H\) (e.g. eating fish once a week for a year always increases IQ by at least 20 points)
  • We show that hypothesis \(H\) necessarily implies the existence of observation \(D\)
  • We look for the existence of \(D\) (e.g. we take people who have never eaten fish and measure their IQ; we then give them fish once per week for a year, and measure their IQ after the process)

If we look for \(D\) and don’t find it (i.e. the people who ate fish do not show an increase in IQ of at least 20 points), we can then state that \(H\) is false.

If we look for \(D\) and find it (i.e. we do see the increase we’re looking for in those who ate fish), this does not imply that \(H\) is true! \(D\) might also be the result of other hypotheses that are not compatible with \(H\).

Failing to refute a hypothesis leads to provisional acceptance of the hypothesis, not proof of the hypothesis.

A classic example of this kind of logic is the Black Swan.

Until Australia was discovered, no-one in Western Europe knew of the existence of black swans. It was therefore reasonable to hold the hypothesis (the null hypothesis \(H_0\)) that “all swans are white.”

\[H_0: \textrm{All swans are white}\]

On arriving in Australia, explorers observed swans with black feathers, immediately disproving hypothesis \(H_0\). But before travelling to Australia, no amount of observations of white swans could prove that hypothesis to be true (indeed, it was false - black swans existed regardless of whether Western explorers had seen them!). However, it only took one observation of a black swan to prove the hypothesis to be false.

It’s tempting to think that this mode of logic - pose questions as hypotheses to be falsified absolutely - is a reliable basis for making scientific progress. But, although it is important, it has significant limitations.

  1. The experiments we conduct are represented as models when we test them; these models may not correspond exactly to the data or to the experimental system. The imperfection of that correspondence may be important.
  2. Most of the questions we ask as scientists are not “logically discrete.” They cannot be framed as hypotheses that can be contradicted by a single example.
  3. Observations are prone to error.
  4. Quantitative hypotheses do not always concern total presence or absence.

For example, many scientific hypotheses do not resemble “all swans are white” but are more like:

\[H_0: \textrm{80% of swans are white}\]

or

\[H_0: \textrm{Black swans are rare}\]

How are we to falsify these hypotheses? How do we even agree on what “rare” means? It has suddenly become much more difficult than finding a single black swan.

With our fish-eating IQ example, maybe our hypothesis should be something like: “eating fish once per week for a year usually increases IQ”? But what does “usually” mean in that case?

It might be argued that these are not good hypotheses. But if that is so, most important scientific hypotheses are also not good hypotheses.

  1. Can you think of an interesting and useful scientific hypothesis that is easy to falsify?
  2. Can you think of an interesting and useful scientific hypothesis that is difficult to falsify?

1.2 Competing hypotheses

As we cannot absolutely falsify most hypotheses, we often instead define competing hypotheses \(H_1\) (e.g. eating fish has no effect on IQ) and \(H_2\) (eating fish increases IQ), and design experiments that can differentiate between them. But it is rare that this differentiation is absolute and able to exclude one possibility entirely. We can only make probabilistic statements about which of the competing hypotheses is most likely to be true, to some level of confidence.

In other words, we need statistics to tell us which hypothesis is most likely.

This approach is summarised formally as Neyman-Pearson Decision Theory:

  1. Set up two statistical hypotheses \(H_1\) and \(H_2\)
  2. Decide on error rates - how willing you are to accept the wrong answer (i.e. false positives or false negatives) - and a sample size, based on a cost-benefit analysis. For instance, tighter control of the error rates requires more datapoints, which may be more expensive or ethically harder to justify, and this constrains how certain we can be about our statistical result. These error rates define a rejection region for \(H_1\) and \(H_2\) (a worked sketch of this planning step follows below).
  3. If the data falls in the rejection region for \(H_1\), we accept \(H_2\); otherwise, we accept \(H_1\).

Here:

  • false positive means accepting \(H_2\) in error
  • false negative means accepting \(H_1\) in error

When we “accept” a hypothesis it doesn’t mean that we believe it, only that we behave as if it were true.
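As a concrete sketch of the planning step (step 2), R’s power.t.test() function (from the built-in stats package) translates chosen error rates into a required sample size. The effect size and standard deviation below are invented for illustration only:

```r
# A minimal sketch of the Neyman-Pearson planning step, using base R.
# The effect size (5 IQ points) and standard deviation (15 points) are
# invented for illustration, not taken from any real study.
power.t.test(delta = 5,          # smallest difference we care to detect
             sd = 15,            # assumed standard deviation in each group
             sig.level = 0.05,   # acceptable false-positive rate
             power = 0.80,       # 1 - acceptable false-negative rate
             type = "two.sample")
# The output includes n, the number of datapoints needed per group,
# making the cost-benefit trade-off explicit before any data are collected.
```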

You may be more familiar with the same idea expressed as Fisher’s Null Hypothesis Testing:

  1. Set up a statistical null hypothesis \(H_0\) (this need not be a nil hypothesis, i.e. zero difference between control and treatment, but often experiments are set up that way).
  2. Report the exact level of statistical significance of your result, given the null hypothesis (e.g. \(p=0.051\) or \(p=0.049\)). Do not arbitrarily use a 5% threshold, and do not talk about accepting or rejecting a hypothesis.
  3. Only use this procedure if you know very little about the problem at hand.

Another approach, which does not match either of these approaches exactly, is characterised as The Null Ritual:

  1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
  2. Use 5% as an arbitrary convention for rejecting the null hypothesis. If the p-value is “significant”, accept your research hypothesis, whatever it is. Report the result as p<0.05, p<0.01, or p<0.001 (whichever threshold lies just above the p-value you actually obtained).
  3. Always perform this procedure.

The Null Ritual is frequently observed in publications where Null Hypothesis Significance Testing (NHST) has been performed. It is a bad habit in science.

Scientists and statisticians often argue, sometimes quite loudly, about the “correct” way to conduct statistical testing. It is a complex and nuanced topic, with many contrasting opinions.

  1. How would you describe the differences between the Neyman-Pearson Decision Theory approach, and Fisher’s Hypothesis Testing?
  2. Can you think of a scientific question that would be more suitable for one of these approaches than the other?

2 Confidence Intervals

So far, when we have estimated the values of model parameters, such as the mean value in a set of datapoints, we have usually stated a point estimate: a single numerical value. It is more appropriate to acknowledge that our estimate is uncertain, and provide instead a range of plausible values that the parameter could take and still describe the data well.

You have seen an example of this in the interactive notebook “Exploring a Statistical Relationship”

2.1 Empirical Confidence Interval

Let’s say that we have sampled 250 datapoints from a uniform distribution of integers with lower limit -40 and upper limit +60 (see figure 2.1). We know that the mean of this population is exactly 10.

Figure 2.1: Histogram of 250 datapoints sampled from a uniform distribution with limits (-40, 60), showing a point estimate of the mean.

We can repeat this sampling 1000 times, to obtain 1000 different estimates of the mean, and calculate quantiles (specifically, percentiles) for these values by ordering all of the estimates and reporting those at percentage intervals from the smallest to the largest estimate.

Percentile   Estimate of the mean
0%           4.09
2.5%         6.59
5%           7.15
10%          7.73
25%          8.86
50%          10.07
75%          11.31
90%          12.32
95%          12.92
97.5%        13.52
100%         15.39

The smallest and largest estimates of the mean are 4.09 and 15.39 (the 0% and 100% percentiles), so we know that all our estimates lie between those two limits (inclusive). As 100% of estimates lie in this range, we might call this our 100% Confidence Interval: 4.09-15.39 - we are absolutely confident that the true mean lies in this range.

In practice, the convention is more often to report 95%, 90%, or 50% confidence intervals. These are the quantile/percentile ranges (2.5%-97.5%), (5%-95%), and (25%-75%). They are listed below for this data, and shown in Figure 2.2.

  • 95% C.I.: 6.59-13.52
  • 90% C.I.: 7.15-12.92
  • 50% C.I.: 8.86-11.31

Note some features of these intervals:

  • As the confidence level gets larger (50% to 90% to 95%), the confidence interval gets wider.
  • In this case, each interval contains what we know to be the true mean of the population: 10.
  • Each confidence interval is approximately symmetric about the median estimate (10.07).
  • The more datapoints you have, the narrower (more precise) your confidence intervals will be.

These are empirical confidence intervals, because they are determined directly from our list of estimated means.
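A minimal sketch of this resampling procedure in R (the exact percentiles will differ slightly from those tabulated above, because each run draws different random samples):

```r
# Empirical confidence intervals by repeated sampling.
set.seed(1)

# One estimate of the mean: sample 250 integers uniformly from -40..60
one_estimate <- function() {
  mean(sample(-40:60, size = 250, replace = TRUE))
}

# Repeat the sampling 1000 times to obtain 1000 estimates of the mean
estimates <- replicate(1000, one_estimate())

# Empirical percentiles of those estimates
quantile(estimates, probs = c(0, 0.025, 0.05, 0.10, 0.25,
                              0.50, 0.75, 0.90, 0.95, 0.975, 1))
```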

Figure 2.2: Confidence intervals from 1000 different estimates of the mean of a uniform distribution with limits (-40, 60), shown against a histogram of all estimates.

2.2 Normal Distribution Confidence Intervals

We can define similar Confidence Intervals when we model our data using a parametric distribution, like a Normal Distribution. As this distribution is completely described by an equation parameterised by \(\mu\) and \(\sigma\) (the mean and standard deviation), it is straightforward to calculate the corresponding quantiles/percentiles.

Figure 2.3: Normal Distribution with mean 0 and standard deviation 1, showing 50%, 90%, and 95% Confidence Intervals

As with empirical confidence intervals:

  • as the confidence level gets larger, the limits of the confidence interval extend further from the central value (which, for a Normal Distribution, is the mean, median, and mode).
  • all confidence intervals contain, and are symmetric about, the central value.

In the context of a Normal Distribution, the Confidence Interval may be interpreted in a number of ways, and you will probably see them all in the literature. For instance:

  • The CI represents the “central mass” of the distribution. For example, 95% of all values in the distribution can be found in the 95% CI. Likewise, 50% of all values in the distribution can be found in the 50% CI.
    • So, if we were to make 100 observations we should expect 95% of those observations to lie within the 95% confidence interval. This means that we can predict likely future values drawn from the same distribution.
  • If our distribution represents a sampled/observed dataset, we can interpret a CI to be our confidence that the mean of the distribution lies in that range
    • For example, we would be “95% confident” that the mean value of our data lay in the 95% CI for our observations.

One way you might see the confidence interval used is to report a researcher’s confidence that their estimate of the mean of their dataset is accurate:

We sampled 30 lobsters from the site and measured claw size. We estimated the mean claw length to be 5.3cm (95%CI: 4.3-6.3)

This implies that the researchers calculated a mean and standard deviation for their dataset, and used these to find the 2.5% and 97.5% percentiles for the corresponding Normal Distribution.

Back in the old days, we used books of statistical tables which had these numbers pre-printed in a form we could use for calculations. Today, packages like R make generating these numbers trivial.
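For example, qnorm() returns exactly the percentiles that the printed tables provided. The standard deviation of roughly 0.51cm used below is an assumption chosen to be consistent with the interval quoted for the lobster example, not a value the researchers reported:

```r
# Percentiles of the standard Normal Distribution (mean 0, sd 1),
# as a statistical table would have printed them:
qnorm(c(0.025, 0.05, 0.25, 0.75, 0.95, 0.975))
# approx. -1.96 -1.64 -0.67  0.67  1.64  1.96

# For the lobster example: with a mean of 5.3cm and an assumed standard
# deviation of about 0.51cm, the 2.5% and 97.5% percentiles reproduce
# the quoted 95% CI of roughly 4.3-6.3cm.
qnorm(c(0.025, 0.975), mean = 5.3, sd = 0.51)
```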

Figure 2.4: Statistical table of Normal Distribution percentiles

3 Hypothesis Testing (and \(P\)-Values)

The concept of Confidence Intervals is central to classical statistical Hypothesis Testing. Suppose our researchers above have a hypothesis like:

\[H_0: \textrm{The average length of lobster claws is 5cm}\]

and they assume that lobster claw lengths are Normally-distributed.

Statistical analyses always come with assumptions. These may or may not be clearly stated in the literature.

Even for the simple hypothesis above, we make the assumption that our data are Normally distributed. This is reasonable here because we are making multiple measurements of “the same thing” in a population (see Notebook 02-03: “Where Do Statistical Distributions Come From?”) but, if we were counting spots on lobsters, this would be an inappropriate assumption, because that data would be better represented by a Poisson Distribution.

Having made our measurements, the researchers calculate a mean claw length of 5.7cm, and a standard deviation of 0.4cm. This parametrises the distribution shown in Figure 3.1.

Figure 3.1: Normal Distribution obtained from lobster claw measurements having mean 5.7cm and standard deviation of 0.4cm, showing 50%, 90%, and 95% Confidence Intervals, and our hypothesis mean \(H_0=5\) (orange line)

  1. Should we accept hypothesis \(H_0\) (i.e. is the average length of a lobster claw 5cm)?
  2. How did you decide that your answer to question 1 is correct?

3.1 P-values

The lobster claw example in Figure 3.1 illustrates many of the difficulties with probabilistic statements about hypotheses. We might reasonably take any of a number of positions on our answer:

  • The mean of our measurements was 5.7, and this is not the same as 5, so we should not accept \(H_0\)
  • The 95% confidence interval for our measurements includes the value 5, so we should (provisionally) accept \(H_0\)
  • The 90% confidence interval for our measurements does not include the value 5, so we should not accept \(H_0\) (see the sketch below)
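A quick sketch in R of the two interval claims, using the reported mean (5.7cm) and standard deviation (0.4cm):

```r
# Confidence intervals for the fitted claw-length distribution:
qnorm(c(0.025, 0.975), mean = 5.7, sd = 0.4)   # 95% CI: approx. 4.92-6.48, contains 5
qnorm(c(0.05,  0.95),  mean = 5.7, sd = 0.4)   # 90% CI: approx. 5.04-6.36, excludes 5
```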

But why should we pick either 90% or 95%? It’s not obvious that one is automatically better than the other.

You will probably have encountered ideas like “P<0.05”, “95% significance”, “5% significance”, and so on, and seen them used as if they were a universal threshold for statistical “truth.” There is, in fact, no universal threshold for statistical significance. We are free to make our own choice, but we must state our choice and our reasons for it.

The two different approaches to hypothesis testing we encountered above would suggest two different approaches to making this choice.

The Neyman-Pearson approach would have us decide, before the experiment, how willing we are to incorrectly say that we do not accept \(H_0\). If, after weighing the cost-benefit of the number of samples against the consequences of getting the “wrong” answer, we decided that we would only accept \(H_0\) if the hypothesised value fell within the 90% confidence interval, we would declare this before conducting the experiment. We would then call our result once we had calculated the mean and standard deviation from the experimental sample.

This is what should be happening in papers where authors report values like “the result was significant because P<0.05.” They should have decided before the experiment that P<0.05 was an important threshold, and explained their reasoning. Unfortunately, that is not often the case, in practice.

The alternative Fisher’s Null Hypothesis approach would have us report the calculated P-value, and leave it at that. We would not choose to accept or reject the hypothesis, but the P-value would be a piece of evidence to help us argue our case in the paper that the mean claw length was (or was not) 5cm, and to help the reader decide whether they believed our argument.

This is what is happening in papers where the authors report P-values as precise numbers, e.g. “P=0.0314”.

3.1.1 So what is a P-value?

P-values are closely related to Confidence Intervals. They are effectively two ways of looking at the same thing.

To see this, consider how we defined both the empirical and Normal 95% Confidence Intervals:

  • both CIs span the middle 95% of their respective datasets, with 2.5% of each dataset to the left of the interval, and 2.5% to the right.
  • to find the 95% CI for the empirical dataset, we looked at the 2.5% percentile, and the 97.5% percentile, and took these as the boundaries of the confidence interval.

Figure 3.2: The 95% CI for a Normal Distribution with mean 0 and standard deviation equal to 1 contains 95% of the data and extends from the 2.5% to 97.5% percentiles.

Conversely, the areas outside the 95% CI each contain 2.5% of the data (one portion to the left, one to the right).

Each of these excluded regions is called a tail.

Figure 3.3: The 95% CI for a Normal Distribution excludes 5% of the data in the distribution: 2.5% to the left, and 2.5% to the right

As we have noted, Confidence Intervals for a distribution can be interpreted in a number of different ways. The way we want to focus on now is this:

  • If we randomly sampled numbers from this distribution over and over again, we would expect that 95% of the numbers we sampled would be found within the 95% Confidence Interval

The converse of this is:

  • If we randomly sampled numbers from this distribution over and over again, we would expect that 5% of the numbers we sampled would be found outwith the 95% Confidence Interval

For the 95% CI, the P-value threshold is 100% - 95% = 5%. We could say, “any number found in the two tails of Figure 3.3 has less than 5% probability of being sampled,” and we would express this as “P<0.05” of sampling one of those numbers.

So far we have only considered P-value thresholds, but each number we might choose to check against the distribution will have a P-value associated with it. The P-value is the probability with which that number or a more extreme value would be expected to be drawn from the distribution.

If it helps, you can think of this as what the P-value threshold would be if the boundaries of the tails were such that one fell at that number.

A small P-value suggests that the observed value would be surprising if the hypothesis were true - but it may just mean that an unlikely event occurred. There are no firm rules for how small “small” is, in this context.

3.1.1.1 Using the P-value

Considering Figure 3.3, imagine two experimenters \(A\) and \(B\) each independently select a number and want to determine the probability that their number came from the distribution. Their (null) hypothesis is:

\[H_0: \textrm{My number comes from the distribution in the figure}\]

Experimenter \(A\) selected the number -2.5.

  • This value falls in the left tail, so if we were using a P<0.05 threshold, we would conclude that this number was unlikely to come from the distribution, and we might choose to reject \(H_0\).
  • If we were using the actual P-value (P=0.0124) we would conclude that, because this is a small value, there is reasonably strong evidence against this number being drawn from the distribution in the figure.

Experimenter \(B\), however selected the number 1.5.

  • This value falls within the 95% CI, so if we were using a P<0.05 threshold, we would conclude that this number could have come from the distribution, so there are no strong grounds to reject the distribution, and we might choose to accept \(H_0\).
  • If we were using the actual P-value (P=0.1336) we would conclude that, because this is a moderately large value, there is relatively little evidence against this number being drawn from the distribution in the figure.
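Both of these two-tailed P-values can be reproduced with a one-line calculation in R, using pnorm() (the cumulative probability function of the Normal Distribution):

```r
# Two-tailed P-values: the probability of drawing a value at least as
# extreme (in either direction) as the one selected, from a standard
# Normal Distribution (mean 0, sd 1).
2 * pnorm(-abs(-2.5))   # Experimenter A: 0.0124
2 * pnorm(-abs(1.5))    # Experimenter B: 0.1336
```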

Except for the detail that neither A nor B actually performed an experiment, this is essentially the same process that goes on when you perform an experiment and test the value you measure against a (null) hypothesis.

  • Both A and B had a number.
  • Both A and B had a (null) hypothesis: the distribution in Figure 3.3
  • Both A and B
    • checked the number against the P<0.05 threshold to determine whether their number was likely to fall within the corresponding confidence interval
    • calculated the exact P-value that their number would be drawn from the distribution

In an experimental situation, your hypothesis might be that application of a painkiller reduces perceived pain. So, you might measure the difference in subjective pain for a set of patients before and after administration of a painkiller, giving you a distribution of reported differences (like that in Figure 3.3). You could then ask the question: “Is the value zero (corresponding to no change in pain) found within the appropriate confidence interval for my data?” If the value zero lay within that range, then you could not exclude the possibility that the painkiller has no effect on the perception of pain.
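A minimal sketch of that check in R, with invented before/after pain scores (the sample size and values below are made up purely for illustration):

```r
# Invented example: perceived pain before and after a painkiller,
# for 20 hypothetical patients.
set.seed(42)
pain_before <- rnorm(20, mean = 6.0, sd = 1.5)
pain_after  <- rnorm(20, mean = 5.2, sd = 1.5)
differences <- pain_after - pain_before

# Does zero (no change in pain) fall within the 95% confidence
# interval for the mean difference?
t.test(differences, conf.level = 0.95)$conf.int
```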

3.1.2 One-tailed vs Two-tailed Distributions

We have looked at Confidence Intervals, which are symmetric about the mean of the distribution and result in two tails, one on either side of the distribution. This represents a situation where we want to know the probability that a number in the distribution is close to the mean, or far away from it. Numbers within the CI are close to the mean; numbers not in the CI are far from the mean, or extreme.

In this situation, we consider extreme values as belonging to either of the two tails and, when we test to see if a value is in one of those tails, we call it a two-tailed test.

But what if we only want to know the probability of a number being extremely low, or extremely high (with P<0.05)? In this case, we might want to consider only a single tail containing the lowest (Figure 3.4) or highest (Figure 3.5) 5% of values for the distribution. We find this by looking at the corresponding percentile of values in the distribution (here, 5% or 95%).

Figure 3.4: The 5% left-tail of the Normal Distribution with mean 0 and standard deviation equal to 1.

Figure 3.5: The 5% right-tail of the Normal Distribution with mean 0 and standard deviation equal to 1.

In this situation, we consider extreme values as belonging to only one of the two tails and, when we test to see if a value is in one of those tails, we call it a one-tailed test.

3.1.2.1 Using the P-value

Consider the situation in Figure 3.5. Imagine two experimenters \(A\) and \(B\) each independently select a number and want to determine the probability that their number is larger than would be expected for any value in the distribution. Their hypothesis is:

\[H_0: \textrm{My number is small enough that it could reasonably come from the distribution in the figure}\]

In this case, the test is a one-tailed test because we only consider a value to be extreme if it is too large.

Experimenter \(A\) selected the number -2.5.

  • This value does not fall in the right-hand tail, so if we were using a P<0.05 threshold, we would conclude that this number is not too large to have come from the distribution, and we might choose to accept \(H_0\).
  • If we were using the actual P-value (P=0.9938) we would conclude that, because this value is close to 1, there is no evidence that this number is larger than would be expected for values drawn from the distribution in the figure.

Experimenter \(B\), however selected the number 2.

  • This value falls within the right-tail, so if we were using a P<0.05 threshold, we would conclude that this number is unlikely to have come from the distribution, so we might choose to reject \(H_0\).
  • If we were using the actual P-value (P=0.0228) we would conclude that, because this is a fairly small value, there is a moderate amount of evidence that this number is larger than would be expected for values drawn from the distribution in the figure.
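These one-tailed P-values correspond to the area in the right-hand tail beyond each number; a minimal sketch in R:

```r
# One-tailed (right-tail) P-values: the probability of drawing a value
# at least as LARGE as the one selected, from a standard Normal Distribution.
1 - pnorm(-2.5)                  # Experimenter A: 0.9938
1 - pnorm(2)                     # Experimenter B: 0.0228
pnorm(2, lower.tail = FALSE)     # equivalent form for Experimenter B
```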

Many statistical tests - such as the Student’s t-test, and Chi-squared (\(\chi^2\)) test - involve generation of a test statistic (the \(t\) in \(t\)-test is a test statistic), which is then compared against a statistical distribution to determine a P-value. This works in a similar manner to the examples in figures 3.3 and 3.5.
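As a hedged illustration of that mechanism (with invented data and an arbitrary hypothesised mean of 5), the sketch below computes a one-sample \(t\) statistic by hand, converts it to a P-value using the \(t\) distribution, and confirms the result with t.test():

```r
# Invented measurements, and a hypothesised mean of 5.
set.seed(7)
x <- rnorm(30, mean = 5.4, sd = 1)

# The test statistic: how far the sample mean is from the hypothesised
# mean, in units of standard error.
t_stat <- (mean(x) - 5) / (sd(x) / sqrt(length(x)))

# Compare the statistic against the t distribution for a two-tailed P-value...
2 * pt(-abs(t_stat), df = length(x) - 1)

# ...which matches what t.test() reports.
t.test(x, mu = 5)$p.value
```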

4 Examples

The examples below describe how we might write down the hypotheses for an experiment when carrying out statistical hypothesis testing.

4.1 Testing the mean rate of absorption of a drug - 1

We need to test whether the mean rate of absorption is close to a particular value: 1.5mg/h (perhaps because we want to be certain about the rate of uptake in a patient). We measure the rate of absorption, and assume that (i) the data is Normally Distributed, and (ii) no alternative mean has been suggested.

In this case, we can set our hypothesis to be:

\[H_0: \textrm{The mean rate of absorption is 1.5mg/h}\]

and then perform a two-tailed \(t\)-test against an assumed mean value of 1.5, because we want to check if our data is extremely high or extremely low.
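A minimal sketch of this test in R; the sample size, mean, and spread of the invented measurements below are assumptions for illustration only:

```r
# Invented absorption rates (mg/h) for 12 hypothetical patients.
set.seed(1)
rates <- rnorm(12, mean = 1.6, sd = 0.2)

# Two-tailed one-sample t-test against the hypothesised mean of 1.5 mg/h.
t.test(rates, mu = 1.5, alternative = "two.sided")
```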

4.2 Testing the mean rate of absorption of a drug - 2

We need to test whether the mean rate of absorption is slower than a particular value: 1.5mg/h (because faster absorption may result in kidney or liver damage, for example). We measure the rate of absorption, and assume that (i) the data is Normally Distributed, and (ii) no alternative mean has been suggested.

In this case, we can set our hypothesis to be:

\[H_0: \textrm{The mean rate of absorption is greater than 1.5mg/h}\]

and then perform a one-tailed \(t\)-test against an assumed mean value of 1.5, because we want to check if our data is extremely low (i.e. whether the observed rate is convincingly slower than 1.5mg/h).
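A minimal sketch of the one-tailed version in R, again with invented measurements; the alternative hypothesis here is that the mean rate is less than 1.5mg/h:

```r
# Invented absorption rates (mg/h) for 12 hypothetical patients.
set.seed(2)
rates <- rnorm(12, mean = 1.4, sd = 0.2)

# One-tailed one-sample t-test: is the mean rate convincingly LESS than 1.5 mg/h?
t.test(rates, mu = 1.5, alternative = "less")
```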

4.3 Testing the effectiveness of a painkilling drug

We need to test whether our new drug A is equivalent to the current best treatment available: drug B. We will be measuring pain on a continuous scale, assuming that the data is Normally-distributed. We have the same number of patients, matched for backgrounds and characteristics, being treated with each drug. Each patient receives one and only one drug.

Here, we can set our hypothesis to be:

\[H_0: \textrm{There is no difference between the effectiveness of the two treatments}\]

and then perform a two-tailed, two-sample t-test on the two datasets. In this case we would be checking to see if the difference between the two sample means is extremely different from zero (in either the positive or negative direction).
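A minimal sketch of this comparison in R, with invented pain scores for two hypothetical groups of 25 patients each:

```r
# Invented pain scores for the two treatment groups.
set.seed(3)
pain_A <- rnorm(25, mean = 3.1, sd = 1.0)   # patients receiving drug A
pain_B <- rnorm(25, mean = 3.4, sd = 1.0)   # patients receiving drug B

# Two-tailed, two-sample t-test: is the difference between the group
# means extreme in either direction?
t.test(pain_A, pain_B, alternative = "two.sided")
```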