Much of practical statistics in the literature still revolves around the idea of the Null Hypothesis Significance Test. This is on the face of it quite similar to Karl Popper’s idea of falsification being an important characteristic of a useful experiment.
The idea behind falsificationism and hypothesis testing is straightforward: if a hypothesis \(H\) is true, then we expect to make a particular observation \(D\). For example, if \(H\) is “eating fish once per week for a year increases IQ by at least 20 points,” then \(D\) is an observed increase of at least 20 IQ points in the people who ate fish.
This is stated formally as:
\[H \implies D\]
\[\neg D \implies \neg H\]
If we look for \(D\) and don’t find it (i.e. the people who ate fish do not show an increase in IQ of at least 20 points), we can then state that \(H\) is false.
If we look for \(D\) and find it (i.e. we do see the increase we’re looking for in those who ate fish), this does not imply that \(H\) is true! \(D\) might also be the result of other hypotheses that are not compatible with \(H\).
Failing to refute a hypothesis leads to provisional acceptance of the hypothesis, not proof of the hypothesis.
A classic example of this kind of logic is the Black Swan.
Until Australia was discovered, no-one in Western Europe knew of the existence of black swans. It was therefore reasonable to hold the hypothesis (the null hypothesis \(H_0\)) that “all swans are white.”
\[H_0: \textrm{All swans are white}\]
On arriving in Australia, explorers observed swans with black feathers, immediately disproving hypothesis \(H_0\). But before travelling to Australia, no amount of observations of white swans could prove that hypothesis to be true (indeed, it was false - black swans existed regardless of whether Western explorers had seen them!). However, it only took one observation of a black swan to prove the hypothesis to be false.
It’s tempting to think that this mode of logic - pose questions as hypotheses to be falsified absolutely - is a reliable basis for making scientific progress. But, although it is important, it has significant limitations.
For example, many scientific hypotheses do not resemble “all swans are white” but are more like:
\[H_0: \textrm{80% of swans are white}\]
or
\[H_0: \textrm{Black swans are rare}\]
How are we to falsify these hypotheses? How do we even agree on what “rare” means? It has suddenly become much more difficult than finding a single black swan.
With our fish-eating IQ example, maybe our hypothesis should be something like: “eating fish once per week for a year usually increases IQ”? But what does “usually” mean in that case?
It might be argued that these are not good hypotheses. But if that is so, most important scientific hypotheses are also not good hypotheses.
As we cannot absolutely falsify most hypotheses, we often instead define competing hypotheses \(H_1\) (e.g. eating fish has no effect on IQ) and \(H_2\) (eating fish increases IQ), and design experiments that can differentiate between them. But it is rare that this differentiation is absolute and able to exclude one possibility entirely. We can only make probabilistic statements about which of the competing hypotheses is most likely to be true, to some level of confidence.
In other words, we need statistics to tell us which hypothesis is most likely.
This approach is summarised formally as Neyman-Pearson Decision Theory:

1. Before collecting any data, set up two competing statistical hypotheses, \(H_1\) and \(H_2\), and decide on \(\alpha\), \(\beta\), and the sample size, based on a cost-benefit analysis of the consequences of each kind of error.
2. Carry out the experiment. If the data fall in the rejection region of \(H_1\), accept \(H_2\); otherwise accept \(H_1\).

Here:

- \(\alpha\) is the probability of rejecting \(H_1\) when it is actually true (a false positive)
- \(\beta\) is the probability of accepting \(H_1\) when \(H_2\) is actually true (a false negative)

When we “accept” a hypothesis it doesn’t mean that we believe it, only that we behave as if it were true.
You may be more familiar with the same idea expressed as Fisher’s Null Hypothesis Testing:

1. Set up a statistical null hypothesis (it need not be a hypothesis of “no effect”).
2. Report the exact P-value obtained (e.g. P=0.023). Do not use a fixed conventional threshold, and do not talk about “accepting” or “rejecting” hypotheses.
3. Treat the P-value as one piece of evidence among others when drawing conclusions.
Another approach, which does not match either of these approaches exactly, is characterised as The Null Ritual:

1. Set up a null hypothesis of “no effect” or “no difference,” without specifying the predictions of your own research hypothesis.
2. Use 5% (P<0.05) as a universal threshold for rejecting the null hypothesis and, if the result is significant, accept your research hypothesis.
3. Always perform this procedure.
The Null Ritual is frequently observed in publications where Null Hypothesis Significance Testing (NHST) has been performed. It is a bad habit in science.
Scientists and statisticians often argue, sometimes quite loudly, about the “correct” way to conduct statistical testing. It is a complex and nuanced topic, with many contrasting opinions.
So far, when we have estimated the values of model parameters, such as the mean value in a set of datapoints, we have usually stated a point estimate: a single numerical value. It is more appropriate to acknowledge that our estimate is uncertain, and provide instead a range of plausible values that the parameter could take and still describe the data well.
You have seen an example of this in the interactive notebook “Exploring a Statistical Relationship”.
Let’s say that we have sampled 250 datapoints from a uniform distribution of integers with lower limit -40 and upper limit +60 (see Figure 2.1). We know that the mean of this population is exactly 10.
We can repeat this sampling 1000 times to obtain 1000 different estimates of the mean, and calculate quantiles (specifically, percentiles) for these estimates by ordering them and reporting the values at percentage intervals from the smallest to the largest.
| Percentile | Estimate of the mean |
|---|---|
| 0% | 4.09 |
| 2.5% | 6.59 |
| 5% | 7.15 |
| 10% | 7.73 |
| 25% | 8.86 |
| 50% | 10.07 |
| 75% | 11.31 |
| 90% | 12.32 |
| 95% | 12.92 |
| 97.5% | 13.52 |
| 100% | 15.39 |
The smallest and largest estimates of the mean are 4.09 and 15.39 (the 0% and 100% percentiles), so we know that all our estimates lie between those two limits (inclusive). As 100% of estimates lie in this range, we might call this our 100% Confidence Interval: 4.09-15.39 - we are absolutely confident that the true mean lies in this range.
In practice, the convention is more often to report 95%, 90%, or 50% confidence intervals. These are the quantile/percentile ranges (2.5%-97.5%), (5%-95%), and (25%-75%). They are tabulated below for this data, and shown in Figure 2.2.

| Confidence Interval | Percentile range | Range of estimates |
|---|---|---|
| 95% | 2.5%-97.5% | 6.59-13.52 |
| 90% | 5%-95% | 7.15-12.92 |
| 50% | 25%-75% | 8.86-11.31 |
These are empirical confidence intervals, because they are determined directly from our list of estimated means.
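As a concrete illustration, the whole procedure - repeated sampling, estimation of the mean, and calculation of empirical percentiles - takes only a few lines of R. The exact values will differ slightly from the table above because the samples are random; the seed below is arbitrary and used only to make the sketch reproducible.

```r
# Draw 250 integers uniformly from -40..+60, 1000 times,
# estimating the mean of each sample
set.seed(42)                          # arbitrary seed, for reproducibility only
estimates <- replicate(1000, mean(sample(-40:60, size = 250, replace = TRUE)))

# Empirical percentiles of the 1000 estimates (cf. the table above)
quantile(estimates, probs = c(0, 0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975, 1))

# Empirical 95%, 90% and 50% Confidence Intervals
quantile(estimates, probs = c(0.025, 0.975))   # 95% CI
quantile(estimates, probs = c(0.05, 0.95))     # 90% CI
quantile(estimates, probs = c(0.25, 0.75))     # 50% CI
```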
We can define similar Confidence Intervals when we model our data using a parametric distribution, like a Normal Distribution. As this distribution is completely described by an equation that is parameterised by \(\mu\) and \(\sigma\) (the mean, and standard deviation), it is straightforward to calculate the corresponding quantiles/percentiles.
As with empirical confidence intervals, the 95%, 90%, and 50% Confidence Intervals of a Normal Distribution are the ranges between its (2.5% and 97.5%), (5% and 95%), and (25% and 75%) percentiles, respectively.
In the context of a Normal Distribution, the Confidence Interval may be interpreted in a number of ways, and you will probably see them all in the literature. For instance:
One way you might see the confidence interval used is to report a researcher’s confidence that their estimate of the mean of their dataset is accurate:
We sampled 30 lobsters from the site and measured claw size. We estimated the mean claw length to be 5.3cm (95%CI: 4.3-6.3)
This implies that the researchers calculated a mean and standard deviation for their dataset, and used these to find the 2.5% and 97.5% percentiles for the corresponding Normal Distribution.
Back in the old days, we used to use books of statistical tables which had these numbers pre-printed in a form we could use for calculations. Packages like R make generating these numbers trivial today.
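For example, `qnorm()` in base R returns Normal Distribution percentiles directly. Taking the lobster report above, and assuming a standard deviation of roughly 0.51cm (not stated in the report, but consistent with the quoted interval), the 95% CI can be recovered as follows; this is a sketch of the calculation, not the researchers’ actual analysis.

```r
# 2.5% and 97.5% percentiles of a Normal Distribution with mean 5.3cm and
# an assumed (illustrative) standard deviation of 0.51cm
qnorm(c(0.025, 0.975), mean = 5.3, sd = 0.51)   # approx. 4.3 and 6.3
```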
The concept of Confidence Intervals is central to classical statistical Hypothesis Testing. Suppose our researchers above have a hypothesis like:
\[H_0: \textrm{The average length of lobster claws is 5cm}\]
and they assume that lobster claw lengths are Normally-distributed.
Statistical analyses always come with assumptions. These may or may not be clearly stated in the literature.
Even for the simple hypothesis above, we make the assumption that our data are Normally distributed. This is reasonable here because we are making multiple measurements of “the same thing” in a population (see Notebook 02-03: “Where Do Statistical Distributions Come From?”) but, if we were counting spots on lobsters, this would be an inappropriate assumption, because that data would be better represented by a Poisson Distribution.
Having made our measurements, the researchers calculate a mean claw length of 5.7cm, and a standard deviation of 0.4cm. This parametrises the distribution shown in Figure 3.1.
The lobster claw example in Figure 3.1 illustrates many of the difficulties with probabilistic statements about hypotheses. We might reasonably take any of a number of positions on our answer:

- the value 5cm lies within the 95% Confidence Interval of the fitted distribution, so we cannot reject \(H_0\)
- the value 5cm lies outside the 90% Confidence Interval of the fitted distribution, so we reject \(H_0\)
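Both positions can be checked directly in R from the reported mean (5.7cm) and standard deviation (0.4cm); a minimal sketch, using the fitted Normal Distribution as in Figure 3.1:

```r
# 95% CI of the fitted Normal Distribution: 5cm lies inside this range
qnorm(c(0.025, 0.975), mean = 5.7, sd = 0.4)    # approx. 4.92 to 6.48

# 90% CI of the fitted Normal Distribution: 5cm lies outside this range
qnorm(c(0.05, 0.95), mean = 5.7, sd = 0.4)      # approx. 5.04 to 6.36
```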
But why should we pick either 90% or 95%? It’s not obvious that one is automatically better than the other.
You will probably have encountered ideas like “P<0.05”, “95% significance”, “5% significance”, and so on, and seen them used as if they were a universal threshold for statistical “truth.” There is, in fact, no universal threshold for statistical significance. We are free to make our own choice, but we must state our choice and our reasons for it.
The two different approaches to hypothesis testing we encountered above would suggest two different approaches to making this choice.
The Neyman-Pearson approach would have us decide, before the experiment, how willing we are to reject \(H_0\) when it is actually true. If, after weighing the costs and benefits of the number of samples and the consequences of getting the “wrong” answer, we decided that we would only accept \(H_0\) if the hypothesised value fell within the 90% confidence interval, we would declare this before conducting the experiment. We would then call our result once we had calculated the mean and standard deviation from the experimental sample.
This is what should be happening in papers where authors report values like “the result was significant because P<0.05.” They should have decided before the experiment that P<0.05 was an important threshold, and explained their reasoning. Unfortunately, that is not often the case, in practice.
The alternative Fisher’s Null Hypothesis approach would have us report the calculated P-value, and leave it at that. We would not choose to accept or reject the hypothesis, but the P-value would be a piece of evidence to help us argue our case in the paper that the mean claw length was (or was not) 5cm, and to help the reader decide whether they believed our argument.
This is what is happening in papers where the authors report P-values as precise numbers, e.g. “P=0.0314”.
P-values are closely related to Confidence Intervals. They are effectively two ways of looking at the same thing.
To see this, consider how we defined both the empirical and Normal 95% Confidence Intervals: in each case, the 95% CI is the range between the 2.5% and 97.5% percentiles, and so contains the central 95% of the values.
Conversely, the areas outside the 95% CI each contain 2.5% of the data (one portion to the left, one to the right).
Each of these excluded regions is called a tail.
As we have noted, Confidence Intervals for a distribution can be interpreted in a number of different ways. The way we want to focus on now is this: a number drawn at random from the distribution has a 95% probability of falling within the 95% Confidence Interval.
The converse of this is: a number drawn at random from the distribution has only a 5% probability of falling outside the 95% Confidence Interval, i.e. in one of the two tails.
For the 95% CI, the P-value threshold is 100% - 95% = 5%. We could say, “any number found in the two tails of Figure 3.3 has less than 5% probability of being sampled,” and we would express this as “P<0.05” of sampling one of those numbers.
So far we have only considered P-value thresholds, but each number we might choose to check against the distribution will have a P-value associated with it. The P-value is the probability with which that number or a more extreme value would be expected to be drawn from the distribution.
If it helps, you can think of this as what the P-value threshold would be if the boundaries of the tails were such that one fell at that number.
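Returning to the lobster claw example (mean 5.7cm, standard deviation 0.4cm), the two-tailed P-value for the hypothesised value of 5cm can be sketched in R with `pnorm()`; as elsewhere in this section, the fitted distribution itself is used as the reference.

```r
# Probability of drawing a value at least as far below the mean as 5cm, doubled
# because an equally extreme value above the mean counts as "extreme" too
2 * pnorm(5, mean = 5.7, sd = 0.4)   # approx. 0.08: greater than 0.05, less than 0.10
```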
A small P-value suggests that the observed value would be unlikely if the hypothesis were true - but it may just mean that an unlikely event occurred. There are no firm rules for how small “small” is, in this context.
Considering Figure 3.3, imagine two experimenters \(A\) and \(B\) each independently select a number and want to determine the probability that their number came from the distribution. Their (null) hypothesis is:
\[H_0: \textrm{My number comes from the distribution in the figure}\]
Experimenter \(A\) selected the number -2.5. This value falls in the lower tail of the distribution in Figure 3.3, outside the 95% Confidence Interval, so the probability of drawing a value at least this extreme is less than 5% (P<0.05) and \(A\) rejects \(H_0\).
Experimenter \(B\), however, selected the number 1.5. This value falls within the 95% Confidence Interval, so \(B\) cannot reject \(H_0\) at that threshold.
Except for the detail that neither \(A\) nor \(B\) actually performed an experiment, this is essentially the same process that goes on when you perform an experiment and test the value you measure against a (null) hypothesis.
In an experimental situation, your hypothesis might be that application of a painkiller reduces perceived pain. So, you might measure the difference in subjective pain for a set of patients before and after administration of a painkiller, giving you a distribution of reported differences (like that in Figure 3.3). You could then ask the question: “Is the value zero (corresponding to no change in pain) found within the appropriate confidence interval for my data?” If the value zero lay within that range, then you could not exclude the possibility that the painkiller has no effect on the perception of pain.
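A sketch of that check in R, using made-up before-and-after pain scores (the patient data and variable names here are purely illustrative):

```r
# Hypothetical pain scores for ten patients, before and after the painkiller
before <- c(7.1, 6.5, 8.0, 5.9, 7.4, 6.8, 7.7, 6.2, 7.0, 6.6)
after  <- c(5.9, 6.1, 6.8, 5.7, 6.3, 6.5, 6.9, 5.8, 6.4, 6.0)
diffs  <- after - before                        # negative values = pain reduced

# 95% CI of a Normal Distribution fitted to the differences
ci95 <- qnorm(c(0.025, 0.975), mean = mean(diffs), sd = sd(diffs))
ci95

# Is zero (no change in pain) within this interval? If TRUE, the possibility
# that the painkiller has no effect cannot be excluded
ci95[1] <= 0 & 0 <= ci95[2]
```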
We have looked at Confidence Intervals, which are symmetric about the mean of the distribution and result in two tails, one on either side of the distribution. This represents a situation where we want to know the probability that a number in the distribution is close to the mean, or far away from it. Numbers within the CI are close to the mean; numbers not in the CI are far from the mean, or extreme.
In this situation, we consider extreme values as belonging to either of the two tails and, when we test to see if a value is in one of those tails, we call it a two-tailed test.
But what if we only want to know the probability of a number being extremely low, or extremely high (with P<0.05)? In this case, we might want to consider only a single tail containing the lowest (Figure 3.4) or highest (Figure 3.5) 5% of values for the distribution. We find this by looking at the corresponding percentile of values in the distribution (here, 5% or 95%).
In this situation, we consider extreme values as belonging to only one of the two tails and, when we test to see if a value is in one of those tails, we call it a one-tailed test.
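For a standard Normal Distribution (used here purely for illustration), the cutoffs for the two kinds of test look like this:

```r
# Two-tailed: the central 95%, with 2.5% excluded in each tail
qnorm(c(0.025, 0.975))   # approx. -1.96 and +1.96

# One-tailed: the lowest 5% of values (as in Figure 3.4)
qnorm(0.05)              # approx. -1.645

# One-tailed: the highest 5% of values (as in Figure 3.5)
qnorm(0.95)              # approx. +1.645
```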
Consider the situation in Figure 3.5. Imagine two experimenters \(A\) and \(B\) each independently select a number and want to determine whether their number is larger than would reasonably be expected for a value drawn from the distribution. Their hypothesis is:
\[H_0: \textrm{My number is small enough that it could reasonably come from the distribution in the figure}\]
In this case, the test is a one-tailed test because we only consider a value to be extreme if it is too large.
Experimenter \(A\) selected the number -2.5. This value is certainly not too large - it does not fall in the upper tail of the distribution - so \(A\) cannot reject \(H_0\).
Experimenter \(B\), however, selected the number 2. This value falls in the upper tail containing the highest 5% of values, so \(B\) rejects \(H_0\) with P<0.05.
Many statistical tests - such as the Student’s t-test, and Chi-squared (\(\chi^2\)) test - involve generation of a test statistic (the \(t\) in \(t\)-test is a test statistic), which is then compared against a statistical distribution to determine a P-value. This works in a similar manner to the examples in figures 3.3 and 3.5.
The examples below describe how we might write down the hypotheses for an experiment, for statistical hypothesis testing.
We want to test whether the mean rate of absorption is close to a particular value: 1.5mg/h (perhaps because we want to be certain about the rate of uptake in a patient). We measure the rate of absorption, and assume that (i) the data is Normally distributed, and (ii) no alternative mean has been suggested.
In this case, we can set our hypothesis to be:
\[H_0: \textrm{The mean rate of absorption is 1.5mg/h}\]
and then perform a two-tailed \(t\)-test against an assumed mean value of 1.5, because we want to check if our data is extremely high or extremely low.
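In R, this is a single call to `t.test()`; the measurements below are invented purely to make the sketch runnable.

```r
# Hypothetical measured absorption rates (mg/h) - illustrative values only
rates <- c(1.62, 1.48, 1.55, 1.71, 1.43, 1.58, 1.66, 1.52)

# Two-tailed, one-sample t-test against the hypothesised mean of 1.5 mg/h
t.test(rates, mu = 1.5, alternative = "two.sided")
```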
We want to test whether the mean rate of absorption is slower than a particular value: 1.5mg/h (because faster absorption may result in kidney or liver damage, for example). We measure the rate of absorption, and assume that (i) the data is Normally distributed, and (ii) no alternative mean has been suggested.
In this case, we can set our hypothesis to be:
\[H_0: \textrm{The mean rate of absorption is greater than 1.5mg/h}\]
and then perform a one-tailed \(t\)-test against an assumed mean value of 1.5, because we want to check if our data is extremely low - only an unusually low sample mean would be evidence against \(H_0\).
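A corresponding sketch in R (again with invented measurements): the `alternative = "less"` direction means that only a sample mean well below 1.5mg/h produces a small P-value and counts as evidence against \(H_0\).

```r
# Hypothetical measured absorption rates (mg/h) - illustrative values only
rates <- c(1.31, 1.42, 1.28, 1.39, 1.45, 1.33, 1.36, 1.40)

# One-tailed, one-sample t-test: evidence against H0 only if the sample mean
# is extremely low relative to 1.5 mg/h
t.test(rates, mu = 1.5, alternative = "less")
```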
We want to test whether our new drug A is equivalent to the current best treatment available: drug B. We will be measuring pain on a continuous scale, assuming that the data is Normally distributed. We have the same number of patients, matched for background and characteristics, being treated with each drug. Each patient receives one and only one drug.
Here, we can set our hypothesis to be:
\[H_0: \textrm{There is no difference between the effectiveness of the two treatments}\]
and then perform a two-tailed, two-sample \(t\)-test on the two datasets. In this case we would be checking to see whether the difference between the two dataset means is extremely different from zero (in either the positive or negative direction).
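A sketch of this comparison in R, with invented pain scores for the two groups of patients (names and values are illustrative only); note that R’s `t.test()` applies the Welch variant by default, which does not assume equal variances in the two groups.

```r
# Hypothetical pain scores for patients on each treatment - illustrative only
drug_A <- c(3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.2)
drug_B <- c(3.0, 3.4, 2.9, 3.1, 3.6, 2.8, 3.2, 3.3)

# Two-tailed, two-sample t-test of the difference between the group means
t.test(drug_A, drug_B, alternative = "two.sided")
```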