Statistical distributions, such as the Normal, Binomial, and Poisson, are often presented without an explanation of their origins, or of what they represent. This is unfortunate, as knowing the processes that generate a distribution gives insight into how, why, and when those distributions are useful and appropriate for an analysis. If your data is generated in a way that naturally leads to a particular kind of distribution, then you should probably consider using tests appropriate for that kind of distribution.
This notebook summarises common statistical distributions, and links out to interactive sessions that illustrate how statistical distributions are generated from simple processes. Specifically:
This notebook provides static information, and you will get the most out of it if you follow the links to the interactive examples to see how each distribution can be generated from a simple process.
The Normal Distribution (also known as the Gaussian Distribution or Bell Curve) is probably the most common statistical distribution you will meet. Many statistical methods assume that the data being fed into them is Normally-distributed (or, at least, that the errors involved in measuring the data that are fed into them are Normally-distributed). Due to the many ways in which Normal distributions can arise when we make measurements in experiments, this is often a reasonable assumption to make. Among the ways in which data tends to form a Normal distribution are:
Please take some time to explore how the Normal distribution arises naturally out of estimating the mean value of a population, using the link below.
In this session, you can explore the way in which the variation in how we measure a value (or estimate a mean) affects the Normal distribution that results.
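If you can't get to the interactive session, the underlying idea can be sketched in a few lines of Python. The population, sample size, and number of repeats below are all illustrative choices, not values taken from the session:

```python
import numpy as np

# Illustrative sketch: repeated estimates of a population mean pile up
# in a bell shape, even when the population itself is not bell-shaped.
rng = np.random.default_rng(42)

# A decidedly non-Normal population: uniform on [0, 10)
population = rng.uniform(0, 10, size=100_000)

# Estimate the mean 2000 times, each time from a sample of 50 points
sample_means = np.array(
    [rng.choice(population, size=50).mean() for _ in range(2000)]
)

# The estimates cluster symmetrically around the true mean (about 5),
# with much less spread than the population itself
print(sample_means.mean(), sample_means.std(), population.std())
```

Plotting a histogram of `sample_means` shows the familiar bell shape emerging, even though the population we sampled from is flat.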
You met a Normal distribution in Notebook 02 (“What Does Statistics Do, Anyway?”), and it’s reproduced here in Figure 2.1.
Important features of the Normal distribution include:
The earlier interactive session allowed you to explore the way in which the parameters of a statistical distribution alter its shape, and how well it appears to represent observed data.
Not all data is Normally-distributed.
You should check all necessary assumptions before performing a statistical test - this often means testing your dataset for “Normality.”
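As a sketch of what such a check might look like in Python, one common option is the Shapiro-Wilk test (`scipy.stats.shapiro`); the two simulated datasets here are purely illustrative:

```python
import numpy as np
from scipy import stats

# Simulated data: one Normally-distributed set, one clearly skewed set
rng = np.random.default_rng(0)
normal_data = rng.normal(loc=10, scale=2, size=100)
skewed_data = rng.exponential(scale=2, size=100)

# Shapiro-Wilk: H0 is that the data were drawn from a Normal
# distribution; a small p-value (say < 0.05) is evidence against it.
_, p_normal = stats.shapiro(normal_data)
_, p_skewed = stats.shapiro(skewed_data)

print(f"Normal data: p = {p_normal:.3f}")
print(f"Skewed data: p = {p_skewed:.3g}")
```

A Q-Q plot alongside a formal test is often more informative than the p-value alone, especially for small samples.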
Many other natural random processes (and experimental observations) are not continuous, but instead have discrete outcomes - their values are constrained. For instance, a coin toss can be only Heads or Tails - there are no intermediate values. Many experiments are set up in a similar way, to have an outcome that is either “success” (Heads/survival) or “failure” (Tails/elimination). The statistics of coin tosses are a good way to understand the statistics of these kinds of experiments.
Coin tosses are an example of a Bernoulli Trial; a fixed number of independent Bernoulli Trials makes up a Binomial (two-outcome) Experiment. These kinds of processes have the following properties:
There are two parameters for this process: the number of trials \(n\), and the probability of success on any individual trial, \(p\).
For a fair coin, heads and tails are equally likely, so \(p = q = 0.5\) (where \(q = 1 - p\) is the probability of failure). We could flip our coin in three groups of \(n=30\) times, to obtain three different runs of trials, writing T for Tails and H for Heads:
Each of these runs is independent, and gives us a potentially different number of successes (Heads), each time.
The Binomial Distribution is the pattern we get if we repeat an infinite number of runs, having a set number \(n\) of coin tosses with a set probability \(p\) of “Heads”, and count the “Heads” (successes) in each run. Figure 3.1 shows the Binomial distribution for 30 tosses of a fair coin (\(n=30, p=0.5\), orange dots), and the pattern we get from 200 of those runs (blue bars).
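A run-counting process like the one behind Figure 3.1 can be sketched in Python; the seed and number of runs are arbitrary choices:

```python
import numpy as np

# 200 runs of n=30 fair coin tosses, counting Heads (successes) in
# each run; each count is one draw from the Binomial distribution.
rng = np.random.default_rng(1)
n, p, runs = 30, 0.5, 200

heads_per_run = rng.binomial(n=n, p=p, size=runs)

print(heads_per_run[:10])    # whole-number counts, mostly near n*p
print(heads_per_run.mean())  # close to the expected value n*p = 15
```

A histogram of `heads_per_run` gives the blue bars of Figure 3.1; increasing `runs` makes it hug the orange theoretical dots ever more closely.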
The Binomial Distribution can vary in shape, is not usually symmetrical, and the mean, median and mode of the distribution can take different values in the same distribution.
The Binomial Distribution is discrete - only whole numbers (integers) of successes are produced.
The Negative Binomial distribution is generated by exactly the same coin-toss Bernoulli Trial process as the Binomial distribution. But this time, instead of counting the successes we observe in a given number of trials, we count the number of successes we observe before a prescribed number of failures \(n\) is seen. As before, the probability of success is represented by the value \(p\).
There are two parameters for this process: the number of failures we’re waiting for \(n\), and the probability of success on any individual trial, \(p\).
The expected mean and mode values for this distribution are different, and the distribution is usually not symmetrical.
Statistical texts vary in how they define the Negative Binomial Distribution. Some define \(p\) as the probability of success; others define it as the probability of failure. Some define \(n\) as the number of successes; others as the number of failures. There are also other differences in how this distribution can be derived (beyond the scope of this workshop).
Take care when reading other texts that you understand their choice of parametrisation.
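NumPy is a concrete example of this parametrisation trap: its `negative_binomial` method counts failures before \(n\) successes, so to match the convention used here (successes before \(n\) failures, with success probability \(p\)) we must swap the roles of success and failure. A sketch, with illustrative values:

```python
import numpy as np

# Count Heads (successes) before n = 5 Tails (failures) are seen.
# NumPy's negative_binomial uses the opposite convention (failures
# before n successes), so we pass 1 - p to swap success and failure.
rng = np.random.default_rng(2)
n, p = 5, 0.5  # wait for 5 failures; P(success) = 0.5

successes = rng.negative_binomial(n, 1 - p, size=1000)

# For this parametrisation the expected mean is n * p / (1 - p) = 5
print(successes.mean())
```

Checking the simulated mean against the formula for your chosen parametrisation is a quick way to confirm you are using a library's convention correctly.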
The Negative Binomial Distribution can take a number of shapes, and the mean, median and mode can be quite different, even in the same distribution.
The Negative Binomial Distribution is discrete - only whole numbers (integers) of successes are produced.
Exponential distributions are generated by Poisson processes. “Poisson process” is a bit of jargon, but it’s really a model for many natural processes you may be familiar with, including:
A Poisson process is a mathematical approximation to these real-life events, and has a precise definition:
Each event is discrete (it’s a definable “moment” or “thing”), but the distribution of waiting time between events is continuous and random, with an average value (the “average rate”).
The Exponential Distribution describes the waiting time between events in a Poisson process.
There is a single parameter for any Poisson process: the rate at which events occur, \(\lambda\). (You can think of this as the average number of events per unit time.)
The “tail” of the Exponential distribution can be quite long. Even if the distribution has a small average waiting time, the longest waiting time seen might be very large.
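This long tail is easy to see by simulation. A sketch of drawing waiting times in Python, where the rate \(\lambda = 2\) is an arbitrary illustrative choice:

```python
import numpy as np

# Waiting times between events in a Poisson process with rate
# lambda = 2 events per unit time follow an Exponential distribution
# with mean 1/lambda.
rng = np.random.default_rng(3)
lam = 2.0

waits = rng.exponential(scale=1 / lam, size=10_000)

# The mean waiting time is short, but the longest observed wait
# is many times larger - the long "tail" of the distribution.
print(waits.mean())  # close to 1/lambda = 0.5
print(waits.max())
```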
The Poisson Distribution is - as you might have guessed - related to the Exponential distribution. It is produced by the same kind of processes - Poisson processes - but instead of describing the waiting times between successive events, it describes the number of events we would expect to occur in a unit of time.
There is a single parameter for this process, and it is the same one as for the Exponential distribution: the average rate at which events occur, \(\lambda\). (You can think of this as the average number of events per unit time.)
The rate at which events occur, \(\lambda\), is a continuous variable, but the Poisson Distribution is discrete and describes the counts of events in a unit of time.
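The link between the two distributions can be sketched by simulation: build event times by accumulating Exponential gaps, count the events falling in each unit of time, and compare with direct Poisson draws. All values here are illustrative:

```python
import numpy as np

# One Poisson process, two views: Exponential gaps between events,
# and Poisson counts of events per unit of time (rate lambda = 2).
rng = np.random.default_rng(4)
lam = 2.0

# Event times over 1000 time units, built from Exponential gaps
arrival_times = np.cumsum(rng.exponential(scale=1 / lam, size=5000))
arrival_times = arrival_times[arrival_times < 1000]
counts_from_gaps = np.histogram(arrival_times, bins=np.arange(0, 1001))[0]

# Direct draws from the Poisson distribution with the same rate
counts_direct = rng.poisson(lam=lam, size=1000)

# Both give whole-number counts with mean close to lambda
print(counts_from_gaps.mean(), counts_direct.mean())
```

Histograms of the two sets of counts should look essentially identical: counting events per unit of time in a Poisson process is exactly what the Poisson Distribution describes.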