Statistical distributions, such as the Normal, Binomial, and Poisson, are often presented without an explanation of their origins, or of what they represent. This is unfortunate, as knowing the processes that generate a distribution gives insight into how, why, and when those distributions are useful and appropriate for an analysis. If your data is generated in a way that naturally leads to a particular kind of distribution, then you should probably consider using tests appropriate for that kind of distribution.
This notebook summarises common statistical distributions, and links out to interactive sessions that illustrate how statistical distributions are generated from simple processes. Specifically:
This notebook provides static information, and you will get the most out of it if you follow the links to the interactive examples to see how each distribution can be generated from a simple process.
The Normal Distribution (also known as the Gaussian Distribution or Bell Curve) is probably the most common statistical distribution you will meet. Many statistical methods assume that the data being fed into them is Normally-distributed (or, at least, that the errors involved in measuring the data that are fed into them are Normally-distributed). Due to the many ways in which Normal distributions can arise when we make measurements in experiments, this is often a reasonable assumption to make. Among the ways in which data tends to form a Normal distribution are:
Please take some time to explore how the Normal distribution arises naturally out of estimating the mean value of a population, using the link below.
In this session, you can explore the way in which the variation in how we measure a value (or estimate a mean) affects the Normal distribution that results.
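If you can't get to the interactive session, the underlying idea can be sketched in a few lines of Python. The population, sample size, and number of repeats below are all illustrative choices, not values taken from the session:

```python
import numpy as np

# Illustrative sketch: repeated estimates of a population mean pile up
# in a bell shape, even when the population itself is not bell-shaped.
rng = np.random.default_rng(42)

# A decidedly non-Normal population: uniform on [0, 10)
population = rng.uniform(0, 10, size=100_000)

# Estimate the mean 2000 times, each time from a sample of 50 points
sample_means = np.array(
    [rng.choice(population, size=50).mean() for _ in range(2000)]
)

# The estimates cluster symmetrically around the true mean (about 5),
# with much less spread than the population itself
print(sample_means.mean(), sample_means.std(), population.std())
```

Plotting a histogram of `sample_means` shows the familiar bell shape emerging, even though the population we sampled from is flat.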
You met a Normal distribution in Notebook 02 (“What Does Statistics Do, Anyway?”), and it’s reproduced here in Figure 2.1.
Important features of the Normal distribution include:
The earlier interactive session allowed you to explore the way in which the parameters of a statistical distribution alter its shape, and how well it appears to represent observed data.
Not all data is Normally-distributed.
You should check all necessary assumptions before performing a statistical test - this often means testing your dataset for “Normality.”
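As a sketch of what such a check might look like in Python, one common option is the Shapiro-Wilk test (`scipy.stats.shapiro`); the two simulated datasets here are purely illustrative:

```python
import numpy as np
from scipy import stats

# Simulated data: one Normally-distributed set, one clearly skewed set
rng = np.random.default_rng(0)
normal_data = rng.normal(loc=10, scale=2, size=100)
skewed_data = rng.exponential(scale=2, size=100)

# Shapiro-Wilk: H0 is that the data were drawn from a Normal
# distribution; a small p-value (say < 0.05) is evidence against it.
_, p_normal = stats.shapiro(normal_data)
_, p_skewed = stats.shapiro(skewed_data)

print(f"Normal data: p = {p_normal:.3f}")
print(f"Skewed data: p = {p_skewed:.3g}")
```

A Q-Q plot alongside a formal test is often more informative than the p-value alone, especially for small samples.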
Many other natural random processes (and experimental observations) are not continuous, but instead have discrete outcomes - their values are constrained. For instance, a coin toss can be only Heads or Tails - there are no intermediate values. Many experiments are set up in a similar way, to have an outcome that is either “success” (Heads/survival) or “failure” (Tails/elimination). The statistics of coin tosses are a good way to understand the statistics of these kinds of experiments.
Coin tosses are an example of a Bernoulli Trial; a fixed number of independent Bernoulli Trials makes up a Binomial (two-outcome) Experiment. These kinds of processes have the following properties:
There are two parameters for this process: the number of trials \(n\), and the probability of success on any individual trial, \(p\).
For a fair coin, heads and tails are equally likely, so \(p = q = 0.5\) (where \(q = 1 - p\) is the probability of failure). We could flip our coin in three groups of \(n=30\) times, to obtain three different runs of trials, writing T for Tails and H for Heads:
Each of these runs is independent, and gives us a potentially different number of successes (Heads), each time.
The Binomial Distribution is the pattern we get if we repeat an infinite number of runs, having a set number \(n\) of coin tosses with a set probability \(p\) of “Heads”, and count the “Heads” (successes) in each run. Figure 3.1 shows the Binomial distribution for 30 tosses of a fair coin (\(n=30, p=0.5\), orange dots), and the pattern we get from 200 of those runs (blue bars).
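A run-counting process like the one behind Figure 3.1 can be sketched in Python; the seed and number of runs are arbitrary choices:

```python
import numpy as np

# 200 runs of n=30 fair coin tosses, counting Heads (successes) in
# each run; each count is one draw from the Binomial distribution.
rng = np.random.default_rng(1)
n, p, runs = 30, 0.5, 200

heads_per_run = rng.binomial(n=n, p=p, size=runs)

print(heads_per_run[:10])    # whole-number counts, mostly near n*p
print(heads_per_run.mean())  # close to the expected value n*p = 15
```

A histogram of `heads_per_run` gives the blue bars of Figure 3.1; increasing `runs` makes it hug the orange theoretical dots ever more closely.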
The Binomial Distribution can vary in shape, is not usually symmetrical, and the mean, median and mode of the distribution can take different values in the same distribution.
The Binomial Distribution is discrete - only whole numbers (integers) of successes are produced.
The Negative Binomial distribution is generated by exactly the same coin-toss Bernoulli Trial process as the Binomial distribution. But this time, instead of counting the successes we observe in a given number of trials, we count the number of successes we observe before a prescribed number of failures \(n\) is seen. As before, the probability of success is represented by the value \(p\).
There are two parameters for this process: the number of failures we’re waiting for \(n\), and the probability of success on any individual trial, \(p\).
The expected mean and mode values for this distribution are different, and the distribution is usually not symmetrical.
Statistical texts vary in how they define the Negative Binomial Distribution. Some define \(p\) as the probability of success; others define it as the probability of failure. Some define \(n\) as the number of successes; others as the number of failures. There are also other differences in how this distribution can be derived (beyond the scope of this workshop).
Take care when reading other texts that you understand their choice of parametrisation.
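NumPy is a concrete example of this parametrisation trap: its `negative_binomial` method counts failures before \(n\) successes, so to match the convention used here (successes before \(n\) failures, with success probability \(p\)) we must swap the roles of success and failure. A sketch, with illustrative values:

```python
import numpy as np

# Count Heads (successes) before n = 5 Tails (failures) are seen.
# NumPy's negative_binomial uses the opposite convention (failures
# before n successes), so we pass 1 - p to swap success and failure.
rng = np.random.default_rng(2)
n, p = 5, 0.5  # wait for 5 failures; P(success) = 0.5

successes = rng.negative_binomial(n, 1 - p, size=1000)

# For this parametrisation the expected mean is n * p / (1 - p) = 5
print(successes.mean())
```

Checking the simulated mean against the formula for your chosen parametrisation is a quick way to confirm you are using a library's convention correctly.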
The Negative Binomial Distribution can take a number of shapes, and the mean, median and mode can be quite different, even in the same distribution.
The Negative Binomial Distribution is discrete - only whole numbers (integers) of successes are produced.
Exponential distributions are generated by Poisson processes. “Poisson process” is a bit of jargon, but it’s really a model for many natural processes you may be familiar with, including:
A Poisson process is a mathematical approximation to these real-life events, and has a precise definition:
Each event is discrete (it’s a definable “moment” or “thing”), but the distribution of waiting time between events is continuous and random, with an average value (the “average rate”).
The Exponential Distribution describes the waiting time between events in a Poisson process.
There is a single parameter for any Poisson process: the rate at which events occur, \(\lambda\). (You can think of this as the average number of events per unit time.)
The “tail” of the Exponential distribution can be quite long. Even if the distribution has a small average waiting time, the longest waiting time seen might be very large.
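This long tail is easy to see by simulation. A sketch of drawing waiting times in Python, where the rate \(\lambda = 2\) is an arbitrary illustrative choice:

```python
import numpy as np

# Waiting times between events in a Poisson process with rate
# lambda = 2 events per unit time follow an Exponential distribution
# with mean 1/lambda.
rng = np.random.default_rng(3)
lam = 2.0

waits = rng.exponential(scale=1 / lam, size=10_000)

# The mean waiting time is short, but the longest observed wait
# is many times larger - the long "tail" of the distribution.
print(waits.mean())  # close to 1/lambda = 0.5
print(waits.max())
```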
The Poisson Distribution is - as you might have guessed - related to the Exponential distribution. It is produced by the same kind of processes - Poisson processes - but instead of describing the waiting times between successive events, it describes the number of events we would expect to occur in a unit of time.
There is a single parameter for this process, and it is the same one as for the Exponential distribution: the average rate at which events occur, \(\lambda\). (You can think of this as the average number of events per unit time.)
The rate at which events occur, \(\lambda\), is a continuous variable, but the Poisson Distribution is discrete and describes the counts of events in a unit of time.
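The link between the two distributions can be sketched by simulation: build event times by accumulating Exponential gaps, count the events falling in each unit of time, and compare with direct Poisson draws. All values here are illustrative:

```python
import numpy as np

# One Poisson process, two views: Exponential gaps between events,
# and Poisson counts of events per unit of time (rate lambda = 2).
rng = np.random.default_rng(4)
lam = 2.0

# Event times over 1000 time units, built from Exponential gaps
arrival_times = np.cumsum(rng.exponential(scale=1 / lam, size=5000))
arrival_times = arrival_times[arrival_times < 1000]
counts_from_gaps = np.histogram(arrival_times, bins=np.arange(0, 1001))[0]

# Direct draws from the Poisson distribution with the same rate
counts_direct = rng.poisson(lam=lam, size=1000)

# Both give whole-number counts with mean close to lambda
print(counts_from_gaps.mean(), counts_direct.mean())
```

Histograms of the two sets of counts should look essentially identical: counting events per unit of time in a Poisson process is exactly what the Poisson Distribution describes.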