Statistical distributions, such as the Normal, Binomial, and Poisson distributions, are often presented in statistical texts without an explanation of their origins or of what they represent. This is unfortunate, as knowing the processes that generate a distribution gives insight into how, why, and when those distributions are useful and appropriate for an analysis.
This notebook summarises common statistical distributions, and links out to interactive sessions that illustrate how statistical distributions are generated from simple processes. Specifically:
This notebook provides static information, and you will get the most out of it if you follow the links to the interactive examples to see how each distribution can be generated from a simple process.
The Normal Distribution (also known as the Gaussian Distribution or Bell Curve) is probably the most common statistical distribution you will meet. Many statistical methods assume that the data being fed into them are Normally-distributed (or, at least, that the errors involved in measuring those data are Normally-distributed). Due to the many ways in which Normal distributions can arise when we make measurements in experiments, this is often a reasonable assumption to make. Among the ways in which data tends to form a Normal distribution are:
Please take some time to explore how the Normal distribution arises naturally out of estimating the mean value of a population, using the link below.
In this session, you can explore the way in which the variation in how we measure a value (or estimate a mean) affects the Normal distribution that results.
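If you cannot run the interactive session just now, the short sketch below illustrates the same idea. It is only a rough stand-in (it assumes numpy is available, and the exponential “population” and sample sizes are arbitrary choices): repeatedly estimating the mean of a decidedly non-Normal population still gives estimates that pile up into a Normal-looking bell shape.

```python
import numpy as np

# A deliberately non-Normal "population": exponentially-distributed values.
rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)

# Estimate the population mean many times, each time from a small sample.
sample_means = [rng.choice(population, size=30).mean() for _ in range(2_000)]

# The raw values are strongly skewed, but the sample means cluster
# symmetrically around the true mean, in an approximately Normal bell shape.
print(f"True population mean:       {population.mean():.3f}")
print(f"Mean of the sample means:   {np.mean(sample_means):.3f}")
print(f"Spread of the sample means: {np.std(sample_means):.3f}")
```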
You met a Normal distribution in Notebook 01 (“Why Do We Do Statistics”), and it’s reproduced here in Figure 2.1.
Important features of the Normal distribution include:
The earlier interactive session allowed you to explore the way in which the parameters of a statistical distribution alter its shape, and how well it appears to represent observed data.
Not all data is Normally-distributed.
You should check all necessary assumptions before performing a statistical test - this often means testing your dataset for “Normality.”
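As a hedged illustration of one way to do this (not part of the interactive sessions, and only one of several possible tests), the sketch below applies scipy’s Shapiro-Wilk test of Normality to two made-up datasets; the datasets themselves are assumptions for demonstration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(loc=10, scale=2, size=200)
skewed_data = rng.exponential(scale=2, size=200)

# Shapiro-Wilk test: the null hypothesis is that the data are Normally-distributed,
# so a small p-value is evidence *against* Normality.
for name, data in [("normal_data", normal_data), ("skewed_data", skewed_data)]:
    statistic, p_value = stats.shapiro(data)
    print(f"{name}: W = {statistic:.3f}, p = {p_value:.3g}")
```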
Many other natural random processes (and experimental observations) are not continuous, but instead have discrete outcomes. For instance, a coin toss can be only Heads or Tails - there are no intermediate values. Many experiments are set up in a similar way, to have an outcome that is either “success” (Heads) or “failure” (Tails). The statistics of coin tosses are a good substitute for the statistics of these kinds of experiments.
Coin tosses are an example of a Bernoulli Trial; a fixed number of such trials makes up a Binomial Experiment. These kinds of processes have the following properties:
There are two parameters for this process: the number of trials \(n\), and the probability of success on any individual trial, \(p\).
For a fair coin, \(p = q = 0.5\) (where \(q = 1 - p\) is the probability of failure), and we can flip it \(n=30\) times to obtain three different runs of trials, calling T for Tails, and H for Heads:
Each of these runs is independent, and gives us a potentially different number of successes (Heads), each time.
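A minimal sketch of how such runs could be simulated is shown below (assuming numpy; the random seed and the H/T string layout are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5   # probability of Heads (success) on each toss of a fair coin
n = 30    # number of tosses in each run

# Three independent runs of 30 tosses, written out as H/T strings.
for run in range(3):
    tosses = rng.random(n) < p
    print("".join("H" if t else "T" for t in tosses), "->", int(tosses.sum()), "Heads")
```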
The Binomial Distribution is the pattern we get if we repeat runs with this number of coin tosses with this fair coin an infinite number of times, and count successes. Figure 3.1 shows the binomial distribution for 30 tosses of a fair coin (orange dots), and the pattern we get from 200 runs of tossing such a coin 30 times (blue bars).
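The sketch below makes the same comparison numerically, assuming numpy and scipy are available; the 200 runs and 30 tosses match Figure 3.1, but the tabular print-out is just one possible way to look at the result.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, p, runs = 30, 0.5, 200

# Number of Heads (successes) in each of 200 runs of 30 tosses.
successes = rng.binomial(n=n, p=p, size=runs)
observed = np.bincount(successes, minlength=n + 1) / runs

# Compare the observed proportions with the theoretical Binomial probabilities.
for k in range(n + 1):
    expected = binom.pmf(k, n, p)
    if expected > 0.01:  # only print the non-negligible part of the distribution
        print(f"{k:2d} successes: observed {observed[k]:.3f}, expected {expected:.3f}")
```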
The Binomial Distribution can vary in shape, and the mean, median and mode of the distribution can take different values in the same distribution.
The Binomial Distribution is discrete - only whole numbers (integers) of successes are produced.
The Negative Binomial distribution is generated by exactly the same coin-toss Bernoulli Trial process as the Binomial distribution. But this time, instead of counting the successes we observe in a given number of trials, we count the number of successes we observe before a prescribed number of failures \(n\) is seen. As before, the probability of success is represented by the value \(p\).
There are two parameters for this process: the number of failures we’re waiting for, \(n\), and the probability of success on any individual trial, \(p\).
The expected mean and mode of this distribution are, in general, different from each other.
Statistical texts vary in how they define the Negative Binomial Distribution. Some define \(p\) as the probability of success; others define it as the probability of failure. Some define \(n\) as the number of successes; others as the number of failures. There are also other differences in how this distribution can be derived.
Take care when reading other texts that you understand their choice of parametrisation.
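As a concrete illustration of that warning, the sketch below (assuming numpy and scipy; the helper function is hypothetical and just for demonstration) simulates the definition used in this notebook and compares it with scipy.stats.nbinom, whose own parametrisation counts failures before the \(n\)-th success, so success and failure have to be swapped in the call.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(0)
p = 0.5          # probability of success (Heads) on each toss
n_failures = 5   # stop once this many failures (Tails) have been seen

def successes_before_n_failures(rng, p, n_failures):
    """Toss until n_failures Tails are seen; return how many Heads came first."""
    successes = failures = 0
    while failures < n_failures:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return successes

counts = [successes_before_n_failures(rng, p, n_failures) for _ in range(10_000)]

# scipy's nbinom counts failures *before the n-th success*, so to match the
# parametrisation used here we swap the roles of success and failure:
# pass n = n_failures and probability 1 - p.
k = np.arange(15)
observed = np.bincount(counts, minlength=k.size)[: k.size] / len(counts)
theory = nbinom.pmf(k, n_failures, 1 - p)
print("observed:", np.round(observed, 3))
print("theory:  ", np.round(theory, 3))
```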
The Negative Binomial Distribution can take a number of shapes, and the mean, median and mode can be quite different, even in the same distribution.
The Negative Binomial Distribution is discrete - only whole numbers (integers) of successes are produced.
Exponential distributions are generated by Poisson processes. These are processes for which:
Many natural processes may approximate Poisson processes, including:
Each event is discrete, but the distribution of waiting time between events is continuous and random, with an average value (the โaverage rateโ).
The exponential distribution describes the waiting time between events in a Poisson process.
There is a single parameter for this process: the rate at which events occur, \(\lambda\). (You can think of this as the average number of events per unit time.)
The “tail” of the Exponential distribution can be quite long. A distribution with a small average waiting time may still contain several representatives with very long waiting times.
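A small sketch of this behaviour, assuming numpy is available (the rate and sample size are arbitrary): the mean waiting time sits near \(1/\lambda\), but the longest waiting time in the sample can be many times larger.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0  # lambda: the average number of events per unit time

# Waiting times between events in a Poisson process with this rate.
# numpy parametrises the Exponential by its scale (the mean waiting time), 1 / rate.
waits = rng.exponential(scale=1 / rate, size=10_000)

print(f"Mean waiting time:    {waits.mean():.3f}  (expected 1/rate = {1 / rate:.3f})")
print(f"Longest waiting time: {waits.max():.3f}")  # the long tail in action
```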
The Poisson Distribution is - as you might have guessed - related to the Exponential distribution. It is produced by the same kind of processes - Poisson processes - but instead of describing the waiting times between successive events, it describes the number of events we would expect to occur in a unit of time.
There is a single parameter for this process, and it is the same one as for the Exponential distribution: the average rate at which events occur, \(\lambda\). (You can think of this as the number of events per unit time.)
The rate at which events occur, \(\lambda\), is a continuous variable, but the Poisson Distribution is discrete and describes the counts of events in a unit of time.
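To illustrate the link between the two distributions, the sketch below (assuming numpy and scipy; the rate and number of windows are arbitrary) builds Poisson counts out of Exponential waiting times and compares them with scipy’s Poisson probabilities.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
rate = 2.0  # lambda: the average number of events per unit time

# Build each unit-time window out of Exponential waiting times, then count
# how many events landed inside the window.
counts = []
for _ in range(5_000):
    elapsed, events = 0.0, 0
    while True:
        elapsed += rng.exponential(scale=1 / rate)
        if elapsed > 1.0:
            break
        events += 1
    counts.append(events)

observed = np.bincount(counts, minlength=8)[:8] / len(counts)
print("observed:", np.round(observed, 3))
print("Poisson :", np.round(poisson.pmf(np.arange(8), rate), 3))
```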
Poisson processes are more general than just intervals of time
We talk about Poisson processes as being a rate per unit time, but Poisson processes can also be spatial and occur over a unit length (like distances between cars in a lane of traffic), area (like counts of tourists in George Square at a given point in time), or volume (like the number of raisins in a Christmas pudding).